Actually, here's what I ended up with that one time. It transforms certain HTML documents to LaTeX. (I just wanted to print the documents out, but web browsers have essentially no page-break algorithm.) Thought you might want to see a sample of the module in action.

use Mojo::DOM;

# $html holds the page source; filter() is my own helper (not shown here).
my $dom  = Mojo::DOM->new($html);
my $body = $dom->at('.article-bodycopy');

$body->find('p, table')->each(sub {
    my $node = shift;

    # Note: newer Mojolicious releases return the tag name from
    # $node->tag rather than $node->type.
    # The "// ''" avoids an undef warning on nodes without a class.
    if (($node->{class} // '') eq 'SubHead') {
        print '\subsection{' . $node->text . "}";
        return;
    }
    elsif ($node->type eq "table") {
        # Tables wrap a figure: one image plus a caption cell.
        my $img = $node->find('img')->[0]->{src};
        my $cap = filter($node->find('.Figure1')->[0]);
        $img =~ s/\.gif/\.png/;
        print join("\n",
            '\begin{Figure}',
            '\centering',
            '\includegraphics[width=0.65\linewidth,'
                . 'height=0.85\textheight,keepaspectratio]{' . $img . '}',
            '\captionof{figure}{' . $cap . '}',
            '\end{Figure}');
        return;
    }

    if ($node->children->size == 0) {
        print filter($node);
    }
    else {    # node has sub-tags
        $node->children->each(sub {
            my $n   = shift;
            my $tag = $n->type;
            if ($tag eq 'b') {
                # Swap <b>...</b> for the LaTeX bold equivalent.
                $n->replace('{\bf ' . $n->text . '}');
            }
            else {
                print STDERR "UNHANDLED MARKUP TYPE: " . $n->type . "\n";
            }
        });
        print filter($node);
    }
});
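The filter() helper isn't shown above. For anyone following along, here is a rough sketch of the kind of escaping such a helper would probably need. The name latex_escape and the exact character set are assumptions for illustration, not the original code. It leaves backslashes and braces alone, on the assumption that markup like {\bf ...} has already been injected into the text by the time it is called.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): escape characters
# that LaTeX treats specially in running text. Backslashes and braces
# are deliberately left untouched so that markup inserted earlier,
# such as {\bf ...}, survives.
sub latex_escape {
    my ($text) = @_;
    $text =~ s/([&%\$#_])/\\$1/g;             # & % $ # _  ->  \& \% \$ \# \_
    $text =~ s/~/\\textasciitilde{}/g;        # ~ is a non-breaking space in LaTeX
    $text =~ s/\^/\\textasciicircum{}/g;      # ^ is only valid in math mode
    return $text;
}

print latex_escape('Profit & loss: 50% of $10 in section #3_a'), "\n";
```

A real filter() would presumably call something like this on the node's text, e.g. latex_escape($node->text), but that wiring is a guess.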

In reply to Re^4: extracting data from HTML by Anonymous Monk
in thread extracting data from HTML by Jurassic Monk
