in reply to Parsing HTML into various files

Here's a bit of code that parses one of your HTML files into a hash of hashes, as an example. I used HTML::TokeParser::Simple because I like how it gives me one token (a start tag, end tag or piece of text) at a time — just like one would read one line at a time from a text file.

Now the code itself might look somewhat confusing because I've interwoven the token-fetching loop with a conditional that uses the .. operator (the range operator in scalar context, often called the flip-flop). That neatly allows me to extract multiple consecutive tokens from the HTML, for example everything between a start tag and its associated end tag. It won't work as neatly if you have nested tags of the same type, for example nested divs or tables. In that case you would have to count how deep the nesting is to decide whether you have reached the end of it. But luckily that isn't the case here.
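For anyone who hasn't met .. in scalar context before, here is a minimal sketch of how it behaves, separate from the parser. The START and END marker strings and the flipflop_trace name are made up for this illustration. The operator is false outside a range, counts 1, 2, 3, ... inside it, and appends "E0" to the last value of the range, which is why the script tests $f == 1 for the start and $f =~ /E/ for the end.

```perl
use strict;
use warnings;

# Record the flip-flop value for every token that falls between a START
# marker and the next END marker (both markers included).
# The marker strings are hypothetical, chosen only for this demo.
sub flipflop_trace {
    my @tokens = @_;
    my @trace;
    for my $tok (@tokens) {
        # False outside a range; inside it the value is a sequence
        # number, and the final value of the range ends in "E0".
        my $f = ($tok eq 'START') .. ($tok eq 'END');
        push @trace, "$tok=$f" if $f;
    }
    return @trace;
}

my @trace = flipflop_trace(qw(a START b c END d));
print "@trace\n";    # prints: START=1 b=2 c=3 END=4E0
```

The tokens a and d are skipped because the operator is false for them.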

The total code is 40-50 lines long, which isn't that bad, I suppose.

Enjoy.

#! perl -w
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new('Scroll_of_colors.html');
$p->get_token('table');
my($key, @table);   # "globals" that have to stick across loops
while(my $t = $p->get_token) {
    if(my $f = $t->is_start_tag('th') .. $t->is_end_tag('th')) {
        if($f == 1) {
            if($key) {   # end of previous section
                $hash{$key} = { @table };
            }
            $_ = '';
            @table = ();
        } elsif($t->is_text) {
            $_ .= $t->as_is;
        } elsif($f =~ /E/) {
            s/\s+/ /g;
            s/^ //;
            s/ $//;
            $key = $_;
        }
    } elsif($f = $t->is_start_tag('td') .. $t->is_end_tag('td') || $t->is_end_tag('tr')) {
        if($f == 1) {
            $_ = "";
            my $colspan = $t->get_attr('colspan');
            if($colspan) {
                push @table, $colspan == 2 ? '*' : '=';   # fake attribute names
            }
        } elsif ($f =~ /E/) {
            s/\s+/ /g;
            s/^ //;
            s/[ :]+$//;
            push @table, $_;   # key or value
        } elsif ($t->is_text) {
            $_ .= $t->as_is;
        }
    } elsif($t->is_end_tag('table')) {   # end of last section
        if($key && @table) {
            $hash{$key} = { @table };
        }
        last;
    }
}
use Data::Dumper;
print Dumper \%hash;

P.S. I used one-letter names for the frequently used variables because, well, there aren't many of them and I didn't feel like typing long names over and over again. Longer names would only make the code longer, not more readable.

Legend:

$p  parser
$t  token
$f  flag

Re^2: Parsing HTML into various files
by Lady_Aleena (Priest) on Aug 25, 2010 at 03:01 UTC

    Quick question: does adding use strict; and declaring my %hash; change the basic makeup of the script? I am getting the following error after those two changes:

    Can't call method "get_token" on an undefined value at C:\Documents and Settings\ME\My Documents\fantasy\files\perl\parser.pl line 11.

    Line 11 in my copy of the script is $parser->get_token('table'); Yes, I expanded the variables in the script. :)

    Have a cookie and a very nice day!
    Lady Aleena

      From the error message it looks like $parser is undef. So it is probably the previous line

      my $parser = HTML::TokeParser::Simple->new(...);
      which fails. Check whether $parser is defined and whether the file name is valid. I don't think that HTML::TokeParser::Simple->new returns an error message, so the most likely cause is an invalid file name.
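
      A minimal sketch of such a check, using only Perl's built-in file tests so it runs without the module. The helper name input_problem is hypothetical, and the commented-out constructor check at the end assumes that new returns undef when the file cannot be opened, which is what the error above suggests.

      ```perl
      use strict;
      use warnings;

      # Return a short description of why $file cannot be used as input,
      # or an empty string if it looks usable. Hypothetical helper.
      sub input_problem {
          my ($file) = @_;
          return "no such file: $file"     unless -e $file;
          return "not a plain file: $file" unless -f _;   # _ reuses the last stat
          return "not readable: $file"     unless -r _;
          return '';
      }

      my $file = 'Scroll_of_colors.html';
      if (my $problem = input_problem($file)) {
          print "Cannot parse: $problem\n";
      }

      # With the module loaded, the constructor result can be tested too
      # (assumption: it returns undef on failure):
      #   my $parser = HTML::TokeParser::Simple->new($file)
      #       or die "Can't open $file: $!";
      ```

      Failing early like this reports the bad file name directly, instead of a "Can't call method" error one line later.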

      Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

      psini is absolutely right: adding use strict and declaring all variables will not change how the script works at all.

      So the only explanation I can think of is that it can't read the file. BTW in my case I downloaded the file from the URL and put it right next to the script. Did you forget that? If the file is elsewhere, you have to adjust the file path.

        ACK! It was the file name, I had a small typo in it that I didn't catch earlier. So now I ran it, but there are still some issues which I can't pin down.

        The errors and output

        Odd number of elements in anonymous hash at C:\..\perl\tokeparser.pl line 17.
        Odd number of elements in anonymous hash at C:\..\perl\tokeparser.pl line 17.
        Odd number of elements in anonymous hash at C:\..\perl\tokeparser.pl line 17.
        Odd number of elements in anonymous hash at C:\..\perl\tokeparser.pl line 17.
        Odd number of elements in anonymous hash at C:\..\perl\tokeparser.pl line 17.
        Odd number of elements in anonymous hash at C:\..\perl\tokeparser.pl line 17.
        Odd number of elements in anonymous hash at C:\..\perl\tokeparser.pl line 51.
        $VAR1 = {
          'Saving Throw:' => {
            'None' => undef
          },
          'Casting Time:' => {
            '2' => undef
          },
          'Area of Effect:' => {
            '10 yds./level' => undef
          },
          'Range:' => {
            '0' => undef
          },
          'Duration:' => {
            '5 rds./level' => undef
          },
          'Level:' => {
            '2' => undef
          },
          'Components:' => {
            'V, S, M' => undef
          }
        };

        line 17

        $hash{$key} = { @table };

        line 51

        $hash{$key} = { @table };
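
        For what it's worth, that warning is raised whenever the list inside { ... } holds an odd number of elements, since Perl pairs them up as key, value, key, value. The dump above suggests @table held a single element per section. A minimal sketch of the effect, where pairs_to_hash is a hypothetical helper that is not part of the posted script:

        ```perl
        use strict;
        use warnings;

        # Build a hash ref from a flat list of key/value pairs, as { @table }
        # does, but pad an odd-length list explicitly so no warning is raised.
        # Hypothetical helper for illustration only.
        sub pairs_to_hash {
            my @list = @_;
            push @list, undef if @list % 2;   # odd count: last key gets undef
            return { @list };
        }

        my $even = pairs_to_hash('Range', '0', 'Level', '2');
        my $odd  = pairs_to_hash('None');     # one element, like each section in the dump

        printf "even: %d keys\n", scalar keys %$even;
        printf "odd:  %d key, value defined: %s\n",
            scalar keys %$odd, defined $odd->{None} ? 'yes' : 'no';
        ```

        The lone element ends up as a key whose value is undef, which matches entries such as 'None' => undef in the output.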

        I still haven't had that A-HA! moment where I get how this works.

        Have a cookie and a very nice day!
        Lady Aleena