tanger has asked for the wisdom of the Perl Monks concerning the following question:

Hi-

I'm trying to parse this HTML, which I fetch with LWP, and I'm having no luck.

I just started learning LWP and I'm stuck on parsing the HTML below. (I decided to do this for fun, as a learning project.)
    <td><div class="banner"><span id="fulldescription" class="text11g"><b>Description:</b></span></div>
    </td>
    </tr>
    </table>
    <div align="center" style="padding-top:13">
    <table width="98%" border="0" cellspacing="0" cellpadding="0">
    <tr>
    <td>
    <div class="text12">
    Season one of the ladder contest runs from July 1, 2004 to September 30, 2004. During this time, all solo and 2v2 games on the Lordaeron, Azeroth, Kalimdor, and Northrend Gateways will be tracked by Blizzard. The players with the most experience in ladder play will then be matched against each other in a series of tournaments to determine the ultimate winner in both the Solo and 2v2 formats.
    </div>
    </td>


It took me about 10 minutes to find HTML like this on a gaming website.

Basically, I want my script to extract the description paragraph ("Season one of the ladder contest...", etc.). I've been playing with the LWP code and am having no luck.

Normally, a simple check for the div class="text12" tag would be enough to retrieve the description. For example:

    if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{'class'} || '') eq 'text12') {
        print $stream->get_trimmed_text('/div');
    }


But the thing is, there is more than one div class="text12" tag on the HTML page I'm retrieving with LWP, so I have to narrow my code down so that it only parses inside that area.

I tried this, but no luck. Any ideas?
    while (my $token = $stream->get_token) {
        if ($token->[0] eq 'S' and $token->[1] eq 'span' and ($token->[2]{'id'} || '') eq 'fulldescription') {
            #found the <span class="fulldescription"> tag
            if ($token->[0][0] eq 'S' and $token->[0][1] eq 'div' and ($token->[0][2]{'class'} || '') eq 'text12') {
                print $stream->get_trimmed_text('/div');
            }
        }
    }
The code above prints nothing, and I know I'm writing the nested check wrong. I'm not sure how to parse the token sequence so that it narrows things down further.

Bobby

Replies are listed 'Best First'.
Re: Parsing this HTML description
by tachyon (Chancellor) on Oct 27, 2004 at 10:11 UTC

    You just need to maintain state. I use HTML::Parser directly. There are two flags. One flip-flops as we enter and leave a text12 chunk. The other counts text12 chunks, so you can pick the Nth one. You could just as easily trigger on the <span id='fulldescription'> tag.

        use strict;
        use warnings;
        use HTML::Parser ();

        my $CHUNK = 2;    # get Nth instance of text12

        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [ \&start, "self, tagname, attr" ],
            end_h       => [ \&end,   "self, tagname" ],
            text_h      => [ \&text,  "self, dtext" ],
        );
        $p->parse_file(*DATA);

        # Entering a text12 div: raise the flag and bump the chunk counter.
        sub start {
            my ( $self, $tagname, $attr ) = @_;
            if ( $tagname eq 'div' and ( $attr->{class} || '' ) eq 'text12' ) {
                $self->{text12}++;
                $self->{text12_item}++;
            }
        }

        # Any closing div lowers the flag (nested divs are not tracked).
        sub end {
            my ( $self, $tagname ) = @_;
            $self->{text12} = 0 if $tagname eq 'div';
        }

        # Print text only while inside the chosen chunk.
        sub text {
            my ( $self, $text ) = @_;
            print $text if $self->{text12} and $self->{text12_item} == $CHUNK;
        }

        __DATA__
        <div class="text12"> Chunk 1 </div>
        <div class="text12"> Chunk 2 </div>
        <div class="text12"> Chunk 3 </div>
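    The same state idea can be applied to the HTML::TokeParser loop from the question. In that loop, $token still holds the span token when the inner test runs, and $token->[0] is the string 'S' rather than an array reference, so $token->[0][0] can never match a div. A minimal sketch, assuming the wanted div is the first text12 div after the span; the sub name description_after_span and the abbreviated sample HTML are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text of the first <div class="text12">
# that follows <span id="fulldescription">, or undef if there is none.
sub description_after_span {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";

    my $seen_span = 0;    # state flag: set once the marker span has gone by

    while ( my $token = $stream->get_token ) {
        next unless $token->[0] eq 'S';    # only start tags are of interest
        my ( $tag, $attr ) = @{$token}[ 1, 2 ];

        if ( $tag eq 'span' and ( $attr->{id} || '' ) eq 'fulldescription' ) {
            $seen_span = 1;
        }
        elsif ( $seen_span
            and $tag eq 'div'
            and ( $attr->{class} || '' ) eq 'text12' )
        {
            return $stream->get_trimmed_text('/div');
        }
    }
    return;
}

# Abbreviated sample input, not the real page.
my $sample = <<'HTML';
<div class="text12">Some other text12 block</div>
<span id="fulldescription" class="text11g"><b>Description:</b></span>
<div class="text12"> Season one of the ladder contest. </div>
HTML

my $description = description_after_span($sample);
print defined $description ? "$description\n" : "no description found\n";
```

    Each token is examined once, on its own pass through the loop, and the flag carries the "already saw the span" fact forward to later tokens.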

    cheers

    tachyon

Re: Parsing this HTML description
by skillet-thief (Friar) on Oct 27, 2004 at 11:21 UTC

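    This probably isn't what you want to learn, but with HTML::Tree (and HTML::Element) it's easy (though this is untested pseudo-code):

        use strict;
        use warnings;
        use HTML::TreeBuilder;    # the parser class in the HTML::Tree distribution

        # $html_from_lwp is assumed to already hold the page content fetched with LWP
        my $tree = HTML::TreeBuilder->new_from_content($html_from_lwp);

        # In scalar context, look_down returns the first matching element
        my $description = $tree->look_down( "_tag", "div", "class", "text12" );
        my $trimmed_text = $description ? $description->as_text() : '';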

    This assumes that you want the first <div> where class = text12.
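
    If the first text12 div is not the right one, the tree approach can be narrowed in a similar way. look_down also accepts a code reference as a criterion and visits elements in document order, so a flag can record when the span has been passed. This is a sketch only, assuming the wanted div comes after <span id="fulldescription">; the sub name description_from_tree and the sample HTML are hypothetical:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: text of the first <div class="text12"> found after
# <span id="fulldescription"> in document order, or undef if none exists.
sub description_from_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my $seen_span = 0;    # set once the marker span has been visited
    my $div = $tree->look_down(
        sub {
            my ($el) = @_;
            $seen_span = 1 if ( $el->attr('id') || '' ) eq 'fulldescription';
            return $seen_span
                && $el->tag eq 'div'
                && ( $el->attr('class') || '' ) eq 'text12';
        }
    );

    my $text = $div ? $div->as_text : undef;
    $tree->delete;        # release the tree once the text is extracted
    return unless defined $text;

    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $text;
}

# Abbreviated sample input, not the real page.
my $sample = <<'HTML';
<html><body>
<div class="text12">Some other text12 block</div>
<span id="fulldescription" class="text11g"><b>Description:</b></span>
<div class="text12"> Season one of the ladder contest. </div>
</body></html>
HTML

my $description = description_from_tree($sample);
print defined $description ? "$description\n" : "no description found\n";
```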