Here's a bit of code that parses one of your HTML files into a hash of hashes, as an example. I used HTML::TokeParser::Simple because I like how it gives me one token (a start tag, end tag or piece of text) at a time — just like one would read one line at a time from a text file.

The code may look confusing at first because I've interwoven the token-fetching loop with a conditional that uses the .. (flip-flop) operator. That neatly lets me collect several consecutive tokens from the HTML, for example everything between a start tag and its matching end tag. It wouldn't work as neatly with nested tags of the same type, such as nested divs or tables: there you would have to count how deep the nesting goes to decide whether you've reached the end. Luckily, that isn't the case here.
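To show the .. trick in isolation, here is a minimal sketch that uses plain strings as stand-ins for parser tokens, so it needs no HTML module. The token lists and the names text_between_th and text_in_outer_div are made up for illustration. The second sub shows the depth counting you would need for nested tags of the same type.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, pre-tokenized input standing in for the parser's token stream.
my @tokens = ('<tr>', '<th>', 'Scroll', ' of colors', '</th>',
              '<td>', 'red', '</td>', '</tr>');

# Collect the text between <th> and </th> using the flip-flop operator.
sub text_between_th {
    my @toks = @_;
    my $text = '';
    my @flags;
    for my $tok (@toks) {
        # In scalar context, .. is false outside the range and returns a
        # sequence number (1, 2, 3, ...) inside it. The last number of a
        # range has "E0" appended, which marks the closing tag.
        if (my $f = ($tok eq '<th>') .. ($tok eq '</th>')) {
            push @flags, $f;
            if    ($f == 1)   { $text = '' }        # start tag: reset
            elsif ($f =~ /E/) { }                   # end tag: nothing to add
            else              { $text .= $tok }     # in between: accumulate
        }
    }
    return ($text, @flags);
}

# For nested tags of the same type, count the depth instead: the block
# ends only when the counter drops back to zero.
sub text_in_outer_div {
    my @toks = @_;
    my ($depth, $text) = (0, '');
    for my $tok (@toks) {
        if    ($tok eq '<div>')  { $depth++ }
        elsif ($tok eq '</div>') { last if --$depth == 0 }
        elsif ($depth > 0)       { $text .= $tok }
    }
    return $text;
}

my ($text, @flags) = text_between_th(@tokens);
print "text:  $text\n";
print "flags: @flags\n";    # flags: 1 2 3 4E0
print "outer: ", text_in_outer_div('<div>', 'a', '<div>', 'b', '</div>',
                                   'c', '</div>', 'd'), "\n";
```

With the nested input, a flip-flop would already stop at the first closing div, while the depth counter keeps going until the outer one closes.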

The total code is 40-50 lines long, which isn't that bad, I suppose.

Enjoy.

```perl
#! perl -w
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new('Scroll_of_colors.html');
$p->get_tag('table');       # skip ahead to the first <table> tag
my %hash;                   # result: heading => { field => value, ... }
my($key, @table);           # "globals" that have to stick across loops

while(my $t = $p->get_token) {
    # heading cell: its text becomes the key of a new section
    if(my $f = $t->is_start_tag('th') .. $t->is_end_tag('th')) {
        if($f == 1) {
            if($key) {      # end of previous section
                $hash{$key} = { @table };
            }
            $_ = '';
            @table = ();
        } elsif($t->is_text) {
            $_ .= $t->as_is;
        } elsif($f =~ /E/) {
            s/\s+/ /g;
            s/^ //;
            s/ $//;
            $key = $_;
        }
    # data cell: its text is alternately a field name and a value
    } elsif($f = $t->is_start_tag('td') .. $t->is_end_tag('td') || $t->is_end_tag('tr')) {
        if($f == 1) {
            $_ = "";
            my $colspan = $t->get_attr('colspan');
            if($colspan) {
                push @table, $colspan == 2 ? '*' : '=';   # fake attribute names
            }
        } elsif ($f =~ /E/) {
            s/\s+/ /g;
            s/^ //;
            s/[ :]+$//;
            push @table, $_;    # key or value
        } elsif ($t->is_text) {
            $_ .= $t->as_is;
        }
    } elsif($t->is_end_tag('table')) {
        # end of last section
        if($key && @table) {
            $hash{$key} = { @table };
        }
        last;
    }
}

use Data::Dumper;
print Dumper \%hash;
```
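As a rough illustration of what ends up in the hash of hashes: the loop collects a flat list of cell texts per section and turns it into a hash reference with { @table }. The heading and cell texts below are invented for the example, not taken from the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical cell texts, in the order the loop would have pushed them.
# A cell carrying colspan=2 gets the fake field name '*' in front of its
# text; a cell with any other colspan gets '='.
my @table = (
    'Level',  '3',
    'School', 'Illusion',
    '*',      'A note spanning two columns',
);

my %hash;
$hash{'Example heading'} = { @table };   # flat list becomes field => value pairs

$Data::Dumper::Sortkeys = 1;             # stable key order for printing
print Dumper \%hash;
```

Two things follow from this design: a section with several colspan cells of the same width keeps only the last one, because they share a fake key, and an odd number of collected cells triggers an "Odd number of elements" warning under -w.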
P.S. I used one-letter names for the most frequently used variables. There are only a few of them, and longer names would have made the code longer without making it more readable.

Legend:

$p: parser
$t: token
$f: flag

In reply to Re: Parsing HTML into various files by bart
in thread Parsing HTML into various files by Lady_Aleena
