HTML TokeParser - help with using get_text, get_trimmed_text

tanger has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

I'm trying to parse this HTML using HTML::TokeParser.

HTML:

<div class="text12">This text I don't need.</div>
<div class="text12">This long test that is at least 201 characters in length is what I want to parse.
<br><br>
Here is a list that I want to retrieve in this "div" tag.<br>
List:<br>
<li>item 1
<li>item 2
<li>item 4
</div>

Since I don't want to retrieve both div tags, I use an if statement to check that the text is more than 200 characters long.

My problem is that the code below retrieves only the text, not the HTML, so the <li> tags are dropped and replaced with spaces. Once I have the value, I can't format the retrieved text back into an HTML list.

Is there any way to work around this? Perhaps another method that lets TokeParser return the HTML tags within the 'div' tags rather than ignoring them?

My code:

while (my $token = $stream->get_token) {
    # Look for <div class="text12"> start tags
    if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{'class'} || '') eq 'text12') {
        # get_trimmed_text returns text only; the tags before </div> are dropped
        my $description = $stream->get_trimmed_text('/div');
        my $num = length($description);

        if ($num > 200) {
            print "$description<br><br>";
        }
    }
}

The above just prints the text, which looks like this:

#description value:
Here is a list that I want to retrieve in this "div" tag.
List:item 1 item 2 item 4

#I want to take the description value and print it out as HTML, but since the <li> tags aren't intact, I can't properly display the list in a browser.

Thanks!
Tanger


Re: HTML TokeParser - help with using get_text, get_trimmed_text by tachyon (Chancellor) on Nov 10, 2004 at 00:59 UTC
Here is an HTML::Parser API2 example that plonks all the stuff between text12 divs into an array - one element per div. You can easily modify it to drop tags you don't want but retain the list items.

use HTML::Parser;

{
    package MyParser;
    use base 'HTML::Parser';

    sub start {
        my($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if ( $tagname eq 'div' and $attr->{class} eq 'text12' ) {
            push @{$self->{text12}}, '';   # start a new text 12 collection
            $self->{in_text12} = 1;
        }
        else {
            $self->{text12}->[-1] .= $origtext if $self->{in_text12};
        }
    }

    sub end {
        my($self, $tagname, $origtext) = @_;
        # 'eq' compares; a plain '=' here would assign and end the collection on any closing tag
        $self->{in_text12} = 0 if $tagname eq 'div';
        $self->{text12}->[-1] .= $origtext if $self->{in_text12};
    }

    sub text {
        my($self, $origtext, $is_cdata) = @_;
        $self->{text12}->[-1] .= $origtext if $self->{in_text12};
    }
}

my $p = MyParser->new;
$p->parse_file(*DATA);
print "Got: $_\n\n" for @{$p->{text12}};

__DATA__
<div class="text12">This text I don't need.</div>
<div class="text12">This long test that is at least 201 characters in length is what I want to parse.
<br><br>
Here is a list that I want to retrieve in this "div" tag.<br>
List:<br>
<li>item 1
<li>item 2
<li>item 4
</div>

cheers

tachyon
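One way the "drop tags you don't want" modification might look is sketched below. It is only an illustration: the package name KeepListParser, the helper filter_text12, the allow-list of li and br, and the <b> tag in the sample input are all assumptions that do not appear in the thread. Tags outside the allow-list are skipped, while their text content is still collected.

```perl
use strict;
use warnings;
use HTML::Parser;

{
    package KeepListParser;    # hypothetical name, not from the thread
    use parent 'HTML::Parser';

    # Assumed allow-list: tags copied through unchanged inside a text12 div.
    my %keep = map { $_ => 1 } qw(li br);

    sub start {
        my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if ($tagname eq 'div' and ($attr->{class} || '') eq 'text12') {
            push @{ $self->{text12} }, '';    # start a new collection
            $self->{in_text12} = 1;
        }
        elsif ($self->{in_text12} and $keep{$tagname}) {
            $self->{text12}[-1] .= $origtext;    # keep allowed start tags
        }
    }

    sub end {
        my ($self, $tagname, $origtext) = @_;
        $self->{in_text12} = 0 if $tagname eq 'div';
        $self->{text12}[-1] .= $origtext
            if $self->{in_text12} and $keep{$tagname};
    }

    sub text {
        my ($self, $origtext) = @_;
        $self->{text12}[-1] .= $origtext if $self->{in_text12};
    }
}

# Hypothetical helper: returns one string per text12 div.
sub filter_text12 {
    my ($html) = @_;
    my $p = KeepListParser->new;
    $p->parse($html);
    $p->eof;
    return @{ $p->{text12} || [] };
}

my $sample = <<'HTML';
<div class="text12">Short text.</div>
<div class="text12">Intro<br>
<b>List:</b><br>
<li>item 1
<li>item 2
</div>
HTML

print "Got: $_\n" for filter_text12($sample);
```

Like the original, this closes the collection at the first </div> it sees, so nested divs inside a text12 div would need extra depth tracking.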
Re: HTML TokeParser - help with using get_text, get_trimmed_text by steves (Curate) on Nov 10, 2004 at 00:30 UTC
HTML::Parser, while harder to use, will give you control over what tags are parsed and how that parsing is handled per tag if needed. Since HTML::TokeParser is breaking all tags down for you, you have to examine every token it gives you and put those "back together" that you want to output. Each token has enough information to reconstruct the data for output.
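A minimal sketch of that "put the tokens back together" approach with HTML::TokeParser follows. It relies on the documented token layouts, in which the original source text is the last element of start, end and processing-instruction tokens and the second element of text, comment and declaration tokens. The helper name div_html and the shortened sample input are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the raw inner HTML of every
# <div class="$class">, one string per div.
sub div_html {
    my ($html, $class) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @chunks;

    while (my $token = $stream->get_token) {
        next unless $token->[0] eq 'S'
            and $token->[1] eq 'div'
            and ($token->[2]{class} || '') eq $class;

        my $raw = '';
        while (my $inner = $stream->get_token) {
            my $type = $inner->[0];

            # Stop at the first closing div (nested divs are not handled).
            last if $type eq 'E' and $inner->[1] eq 'div';

            # Append the token's original source text.
            $raw .= $type eq 'S'  ? $inner->[4]    # ["S", tag, attr, attrseq, text]
                  : $type eq 'E'  ? $inner->[2]    # ["E", tag, text]
                  : $type eq 'PI' ? $inner->[2]    # ["PI", token0, text]
                  :                 $inner->[1];   # ["T"|"C"|"D", text, ...]
        }
        push @chunks, $raw;
    }
    return @chunks;
}

my $html = <<'HTML';
<div class="text12">This text I don't need.</div>
<div class="text12">Here is a list.<br>
List:<br>
<li>item 1
<li>item 2
<li>item 4
</div>
HTML

print "Got: $_\n" for div_html($html, 'text12');
```

The length check from the question could then be applied to each returned string before printing, keeping in mind that the length now includes the markup as well as the text.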