Sorry for the delay.

I do read in the whole file at once, pass it as a variable to XML::Twig, and make three consecutive passes. Here are the first, second, and final passes. The first pass only collects some information; the second pass actually starts processing the elements. There are a few map and grep functions used in the various handlers (not listed):

...
my $xml = XML::Twig->new(
    pretty_print => 'nsgmls',   # nsgmls for parsability
    output_encoding => 'UTF-8',
    twig_roots => { 'office:automatic-styles' => 1 },
    twig_handlers =>
    {
        'style:style[@style:family="text"]/style:text-properties'
            => \&handler_style_collector,
        'style:style' => \&handler_paragraph_style_collector,
    },
);

# $content is not saved from the first pass, it only builds some hashes
$xml->parse($content);
$xml->dispose;

$xml = XML::Twig->new(
    pretty_print => 'nsgmls',   # nsgmls for parsability
    output_encoding => 'UTF-8',
    twig_roots => { 'office:body' => 1 },
    twig_handlers =>
    {
        # link anchors (text:bookmark) must be handled before
        # processing the internal links (text:a)
        '*[text:bookmark]' => \&handler_bookmark,
        'text:note[@text:note-class="footnote"]/text:note-body'
            => \&handler_footnotes,
        'text:note-citation' => \&handler_citation,  # only some kinds
        'text:span' => \&handler_span,               # typographic markup
        'text:list-item' => \&handler_list_item,     # all lists become unordered
        'table:table-header-rows' => \&handler_table_header_rows,
        'table:table-row' => \&handler_table_row,
        'table:table' => \&handler_table,            # primitive table support
        'text:line-break' => \&handler_line_break,
        'text:table-of-content' => sub { $_->delete },
        'text:index-body' => sub { $_->delete },
        'text:alphabetical-index' => sub { $_->delete },
    },
);

$xml->parse($content);
$content = $xml->sprint;
$xml->dispose;

$xml = XML::Twig->new(
    pretty_print => 'nsgmls',
    empty_tags => 'html',
    output_encoding => 'UTF-8',
    twig_roots => { 'office:body' => 1 },
    twig_handlers =>
    {
        # links (text:a) must be handled after the link targets (text:bookmark)
        'text:a' => \&handler_links,

        'text:h' => \&handler_h,
        'text:p' => \&handler_p,
        'draw:frame' => \&handler_draw_frame,
        'office:annotation' => sub { $_->delete },
        'office:annotation-end' => sub { $_->delete },
        'text:sequence-decls' => sub { $_->delete },
        'text:tracked-changes' => sub { $_->delete },
        'text:table-of-content' => sub { $_->delete },
        'office:forms' => sub { $_->delete },
        'text:list' => \&handler_lift_up,
        'text:section' => \&handler_lift_up,
        'office:body' => \&handler_lift_up,
        'office:text' => \&handler_lift_up,
    },
);

$xml->parse($content);
$content = $xml->sprint;
$xml->dispose;

. . .

The first pass is necessary to collect some information about typographical markup before actual processing begins. Some manipulations then have to be kept separate in the second and third passes (link anchors before links, for example). That turns out to help, because spreading the other handlers across the two passes seems to speed up processing.
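For anyone following along, here is a minimal sketch of what a collecting first pass could look like. It is not my actual handler: the sample XML, the %text_style hash, and collect_text_style are made up for illustration, and it assumes XML::Twig is installed.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input; a real ODF content.xml is much larger.
my $content = <<'XML';
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <office:automatic-styles>
    <style:style style:name="T1" style:family="text">
      <style:text-properties fo:font-weight="bold"/>
    </style:style>
    <style:style style:name="T2" style:family="text">
      <style:text-properties fo:font-style="italic"/>
    </style:style>
  </office:automatic-styles>
  <office:body/>
</office:document-content>
XML

my %text_style;    # style name => { attribute name => value }

# Hypothetical collector: remember the text properties of each text style.
sub collect_text_style {
    my ($twig, $props) = @_;
    my $name = $props->parent->att('style:name');
    $text_style{$name} = { %{ $props->atts } };    # copy the attribute hash
}

my $twig = XML::Twig->new(
    twig_roots    => { 'office:automatic-styles' => 1 },
    twig_handlers => {
        'style:style[@style:family="text"]/style:text-properties'
            => \&collect_text_style,
    },
);
$twig->parse($content);    # nothing is printed or saved, only the hash is built

for my $name (sort keys %text_style) {
    my $props = $text_style{$name};
    print "$name: ", join(', ', map { "$_=$props->{$_}" } sort keys %$props), "\n";
}
```

The later passes can then look up a span's style name in the hash to decide, say, whether it should become bold or italic.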

Afterward there is some regex tidying of the Markdown stored in $content. It seemed best to do that with regexes rather than in the handlers.
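To give an idea of the kind of tidying meant here, a small sketch in core Perl follows. The rules and the tidy_markdown name are assumptions chosen for illustration; the real substitutions are not shown in this post.

```perl
use strict;
use warnings;

# Hypothetical cleanup rules, for illustration only.
sub tidy_markdown {
    my ($text) = @_;
    $text =~ s/[ \t]+$//mg;      # strip trailing spaces and tabs on each line
    $text =~ s/\n{3,}/\n\n/g;    # collapse runs of blank lines into one
    $text =~ s/\A\n+//;          # drop leading blank lines
    $text =~ s/\n*\z/\n/;        # end with exactly one newline
    return $text;
}

my $content = "\n# Title  \n\n\n\nSome text.\t\nMore text.\n\n\n";
print tidy_markdown($content);
```

Whitespace fixes like these span element boundaries, which is why they are easier to apply to the finished string than inside a twig handler.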

It's a bit moot at this point, though: while the script does the job quite nicely on the use-cases I've tried it on, the person I wrote it for has access to additional data (for his specific use-case), so he was inspired to write a second version which is more tightly coupled with that data.

In reply to Re^2: Speed comparison of foreach vs grep + map by mldvx4
in thread Speed comparison of foreach vs grep + map by mldvx4
