The code itself might look somewhat confusing at first, because I've interwoven the loop that fetches the next token with a conditional using the .. operator. In scalar context .. acts as a flip-flop: it becomes true when its left test matches and stays true until its right test matches, which neatly lets me extract multiple consecutive tokens from the HTML, for example everything between a start tag and its associated end tag. That won't work as neatly if you have nested tags of the same type, for example nested divs or tables. In that case you would have to count how deep the nesting is to decide whether you have reached the matching end tag. Luckily that isn't the case here.
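To make the flip-flop behaviour concrete, here is a minimal sketch that needs no HTML module at all. The token list is hypothetical (plain strings standing in for parser tokens), and the second loop shows the depth counting you would need for nested tags, which the real code does not use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical token stream; plain strings stand in for parser tokens.
my @tokens = ('<tr>', '<th>', 'Red', 'Blue', '</th>', '<td>', 'x', '</td>', '</tr>');

my (@trace, @cell);
for my $tok (@tokens) {
    # Scalar-context .. is false until the left test matches, then counts
    # 1, 2, 3, ... and appends "E0" on the token where the right test matches.
    if (my $f = ($tok eq '<th>') .. ($tok eq '</th>')) {
        push @trace, "$tok=$f";
        if    ($f == 1)   { @cell = () }          # start tag: reset the buffer
        elsif ($f =~ /E/) { }                     # end tag: the cell is complete
        else              { push @cell, $tok }    # content in between
    }
}
print "@trace\n";   # <th>=1 Red=2 Blue=3 </th>=4E0
print "@cell\n";    # Red Blue

# With nested tags of the same type, a flip-flop would stop at the first
# </div>, so a depth counter is needed instead.
my @nested = ('<div>', 'a', '<div>', 'b', '</div>', 'c', '</div>', 'd');
my ($depth, @inside) = (0);
for my $tok (@nested) {
    $depth++ if $tok eq '<div>';
    push @inside, $tok if $depth > 0;   # still inside the outermost <div>
    $depth-- if $tok eq '</div>';
}
print "@inside\n";  # <div> a <div> b </div> c </div>
```

The tests for 1 and for /E/ are the same ones the real code uses to detect the first and last token of a range.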
The total code is 40-50 lines long, which isn't that bad, I suppose.
Enjoy.
p.s. I used one-letter names for the frequently used variables because, well, there aren't many of them and I didn't feel like typing long names over and over again. Longer names would only make the code much longer, not more readable.

    #! perl -w
    use HTML::TokeParser::Simple;

    my $p = HTML::TokeParser::Simple->new('Scroll_of_colors.html');
    $p->get_tag('table');    # skip ahead to the first <table>

    my($key, @table, %hash); # "globals" that have to stick across loops
    while(my $t = $p->get_token) {
        if(my $f = $t->is_start_tag('th') .. $t->is_end_tag('th')) {
            if($f == 1) {
                if($key) {   # end of previous section
                    $hash{$key} = { @table };
                }
                $_ = '';
                @table = ();
            } elsif($t->is_text) {
                $_ .= $t->as_is;
            } elsif($f =~ /E/) {
                # end of the <th>: normalize whitespace, keep the text as section key
                s/\s+/ /g;
                s/^ //;
                s/ $//;
                $key = $_;
            }
        } elsif($f = $t->is_start_tag('td') .. $t->is_end_tag('td') || $t->is_end_tag('tr')) {
            if($f == 1) {
                $_ = "";
                my $colspan = $t->get_attr('colspan');
                if($colspan) {
                    push @table, $colspan == 2 ? '*' : '=';  # fake attribute names
                }
            } elsif ($f =~ /E/) {
                s/\s+/ /g;
                s/^ //;
                s/[ :]+$//;
                push @table, $_;   # key or value
            } elsif ($t->is_text) {
                $_ .= $t->as_is;
            }
        } elsif($t->is_end_tag('table')) {   # end of last section
            if($key && @table) {
                $hash{$key} = { @table };
            }
            last;
        }
    }

    use Data::Dumper;
    print Dumper \%hash;
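In case the { @table } part looks odd: the flat list of alternating cell texts is simply turned into a hash of key/value pairs, and a cell with a colspan gets a fake key pushed in front of it so the pairing stays intact. A minimal sketch of just that step, where the cell data and the section name are made up for illustration and are not taken from the actual page:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical cells as [text, colspan]; not taken from the real HTML file.
my @cells = (
    ['Type:', undef], ['Scroll', undef],
    ['A note across both columns', 2],
    ['Cost:', undef], ['10 gp', undef],
);

my @table;
for my $cell (@cells) {
    my ($text, $colspan) = @$cell;
    # A spanning cell has no label cell of its own, so invent a key for it.
    if ($colspan) { push @table, $colspan == 2 ? '*' : '=' }
    $text =~ s/[ :]+$//;          # same trailing-colon cleanup as in the parser loop
    push @table, $text;           # alternately a key, then a value
}

# An even-length list in { } becomes an anonymous hash of key => value pairs.
my %hash = ('Example section' => { @table });

for my $k (sort keys %{ $hash{'Example section'} }) {
    print "$k => $hash{'Example section'}{$k}\n";
}
```

Note that, being hash keys, two spanning cells with the same fake key in one section would overwrite each other.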
In reply to Re: Parsing HTML into various files
by bart
in thread Parsing HTML into various files
by Lady_Aleena