in reply to Parsing HTML into various files
Now, the code itself might look somewhat confusing because I've interwoven the loop that fetches the next token with a conditional using the .. (flip-flop) operator, which neatly lets me extract multiple consecutive tokens from the HTML, for example everything between a start tag and its associated end tag. That wouldn't work as neatly if you had nested tags of the same type, for example nested divs or tables: in that case you'd have to count how deep the nesting is to decide whether you've reached the end of it. Luckily, that isn't the case here.
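To make that a bit more concrete, here is a minimal sketch of both cases. It uses made-up lists of plain strings in place of real parser tokens, so @tokens and @nested (and their contents) are purely hypothetical and not taken from the HTML file. The first loop shows the .. operator in scalar context, including the sequence number it returns; the second shows the depth counting you would need for nested tags of the same type.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a token stream; real tokens would come from
# an HTML tokenizer, not from a hand-written list of strings.
my @tokens = ('<tr>', '<th>', 'Red ', 'Scroll', '</th>',
              '<td>', 'ignored', '</td>', '</tr>');

# Case 1: flip-flop. In scalar context, .. is true from the token where
# the left test succeeds through the token where the right test succeeds.
# It returns a sequence number (1, 2, 3, ...), with "E0" appended on the
# last token of the range.
my ($text, @headers);
for my $t (@tokens) {
    if (my $f = ($t eq '<th>') .. ($t eq '</th>')) {
        if    ($f == 1)     { $text = '' }             # opening tag: reset buffer
        elsif ($f =~ /E0$/) { push @headers, $text }   # closing tag: store result
        else                { $text .= $t }            # anything in between
    }
}
print "headers: @headers\n";

# Case 2: nested tags of the same type. A flip-flop would switch off at
# the first '</div>', so we count the nesting depth instead.
my @nested = ('<div>', 'a', '<div>', 'b', '</div>', 'c', '</div>', 'd');

my $depth = 0;
my @inside;
for my $t (@nested) {
    if    ($t eq '<div>')  { $depth++ }
    elsif ($t eq '</div>') { $depth-- }
    elsif ($depth > 0)     { push @inside, $t }   # text inside any div level
}
print "inside: @inside\n";
```

In the second loop, 'c' is still collected because the depth is back at 1 rather than 0 after the inner closing tag; a plain flip-flop would already have stopped there.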
The total code is 40-50 lines long, which isn't that bad, I suppose.
Enjoy.
p.s. I used one-letter names for the frequently used variables because, well, there aren't many of them and I didn't feel like typing long names over and over again. Longer names would only make the code longer, not more readable.

    #! perl -w
    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new('Scroll_of_colors.html');
    $p->get_token('table');
    my($key, @table);   # "globals" that have to stick across loops
    while(my $t = $p->get_token) {
        if(my $f = $t->is_start_tag('th') .. $t->is_end_tag('th')) {
            if($f == 1) {
                if($key) {   # end of previous section
                    $hash{$key} = { @table };
                }
                $_ = '';
                @table = ();
            } elsif($t->is_text) {
                $_ .= $t->as_is;
            } elsif($f =~ /E/) {
                s/\s+/ /g;
                s/^ //;
                s/ $//;
                $key = $_;
            }
        } elsif($f = $t->is_start_tag('td') .. $t->is_end_tag('td') || $t->is_end_tag('tr')) {
            if($f == 1) {
                $_ = "";
                my $colspan = $t->get_attr('colspan');
                if($colspan) {
                    push @table, $colspan == 2 ? '*' : '=';   # fake attribute names
                }
            } elsif ($f =~ /E/) {
                s/\s+/ /g;
                s/^ //;
                s/[ :]+$//;
                push @table, $_;   # key or value
            } elsif ($t->is_text) {
                $_ .= $t->as_is;
            }
        } elsif($t->is_end_tag('table')) {   # end of last section
            if($key && @table) {
                $hash{$key} = { @table };
            }
            last;
        }
    }
    use Data::Dumper;
    print Dumper \%hash;