Skipping HTML tags with HTML::TokeParser

patgas has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(ichimunki) Re: Skipping HTML tags with HTML::TokeParser by ichimunki (Priest) on Jul 31, 2001 at 22:02 UTC
I recommend using a different HTML::TokeParser method. get_tag is attractive, but insufficient for your task. The following rips through the document grabbing all URLs. If it encounters an ignore comment as shown, it discards tokens until it has passed the next anchor end tag.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

my $sample_HTML =
      "<a href=\"http://www.foobar1.com/\">link 1</a> "
    . "<!--ignore--><a href=\"http://www.foobar2.com/\">link 2</a> "
    . "<a href=\"http://www.foobar3.com/\">link 3</a> ";

my $p = HTML::TokeParser->new( \$sample_HTML );
my $token;
my $link_count = 1;

while ( $token = $p->get_token() ) {
    # Start tag <a ...>: report its href attribute.
    if ( $token->[0] eq 'S' && $token->[1] eq 'a' ) {
        my $text = $token->[2]->{'href'};
        print "Found link $link_count: $text\n";
        $link_count++;
    }
    # Comment <!--ignore-->: discard tokens up to and including the next </a>.
    if ( $token->[0] eq 'C' && $token->[1] eq '<!--ignore-->' ) {
        until ( $token->[0] eq 'E' && $token->[1] eq 'a' ) {
            $token = $p->get_token() or last;    # stop at end of document
        }
    }
}
```

Output:

```
%perl ignore_some.pl
Found link 1: http://www.foobar1.com/
Found link 2: http://www.foobar3.com/
```
Skipping HTML tags using the "get_token" method with HTML::TokeParser module by Anonymous Monk on Mar 14, 2002 at 10:04 UTC
I don't know how to use the "get_token" method in the HTML::TokeParser module to skip the following set of HTML tags.

```
<DD> <A NAME="394893"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex1>changing dates <a href="chview.htm#1052431">1</a>, <a href="chcncpt.htm#1052501">2</a> </zindex1>
<DD> <A NAME="394896"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex2> jump to <a href="chcncpt.htm#1046200">1</a> </zindex2>
```

I want to extract the text between `<zindex1>` and `</zindex1>` and the links inside it (`<a href=...>1</a>`), and likewise the text between the sub-index tags `<zindex2>` and `</zindex2>` and the link inside it (`<a href=..>1</a>`). So the ideal result would be like the following: `changing dates chview.htm#1052431 1 chcncpt.htm#1052501 2 jump to chcncpt.htm#1046200 1`

I have not been able to do it with the get_tag() method, and I don't know how to skip the `<DD> <A NAME...></A><FONT FACE=...>` tags using the "get_token" method. Any suggestions?
(ichi) Re: Skipping HTML tags using the "get_token" method with HTML::TokeParser module by ichimunki (Priest) on Mar 14, 2002 at 19:14 UTC
Same as above, only instead of `while( $token = $p->get_token() ) { if( $token->[0] eq 'S' && $token->[1] eq 'a' ) {` use something like `while( $token = $p->get_token() ) { if( $token->[0] eq 'S' && $token->[1] =~ /zindex/ ) {`. I've never heard of the zindex# tags in HTML, though, so I can't say whether HTML::Parser catches them.
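To make that suggestion concrete, here is a minimal sketch applied to the markup from the question. It assumes HTML::Parser reports unknown tags such as `zindex1` like any other tag (to my knowledge it does, since it is not DTD-driven). The helper name `extract_index` and the `$depth` counter are hypothetical, and stripping commas from the text is an assumption based on the desired output shown in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup copied from the question above.
my $html = <<'HTML';
<DD> <A NAME="394893"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex1>changing dates <a href="chview.htm#1052431">1</a>, <a href="chcncpt.htm#1052501">2</a> </zindex1>
<DD> <A NAME="394896"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex2> jump to <a href="chcncpt.htm#1046200">1</a> </zindex2>
HTML

# extract_index (hypothetical name): returns the text and "href label"
# pairs found inside <zindexN>...</zindexN>, ignoring everything else.
sub extract_index {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new( \$doc ) or die "Cannot create parser";
    my @out;
    my $depth = 0;    # greater than 0 while inside a <zindexN> element

    while ( my $token = $p->get_token ) {
        my ( $type, $name ) = @$token;

        if ( $type eq 'S' && $name =~ /^zindex\d+$/ ) {
            $depth++;
        }
        elsif ( $type eq 'E' && $name =~ /^zindex\d+$/ ) {
            $depth-- if $depth;
        }
        elsif ( !$depth ) {
            next;     # <DD>, <A NAME=...>, <FONT ...> and friends are skipped
        }
        elsif ( $type eq 'S' && $name eq 'a' && defined $token->[2]{href} ) {
            # Read the link label up to (but not including) the </a> tag.
            my $label = $p->get_trimmed_text('/a');
            push @out, "$token->[2]{href} $label";
        }
        elsif ( $type eq 'T' ) {
            my $text = $token->[1];
            $text =~ s/^[\s,]+|[\s,]+$//g;    # drop surrounding spaces and commas
            push @out, $text if length $text;
        }
    }
    return @out;
}

print "$_\n" for extract_index($html);
```

The point of the counter is that nothing outside a zindex element is ever inspected, so the unwanted `<DD>`, `<A NAME=...>` and `<FONT ...>` tags need no special handling.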
Re: Skipping HTML tags with HTML::TokeParser by voyager (Friar) on Jul 31, 2001 at 20:27 UTC
I'm not really familiar with HTML::TokeParser, but this is easily done with HTML::Parser. You would have a comment handler that sets a variable ($in_ignore_section). Your start-tag handler can then skip any A tag it sees while that flag is set. You will also need to take care of resetting the flag. Presumably you have a similar comment that says "end-ignore-section". Otherwise you'll have to clear it based on something else, e.g., the start or end of some other tag.
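The flag approach described here might look roughly like the following sketch, which uses the HTML::Parser version 3 handler interface. The `<!--end-ignore-->` closing comment and the sample links are assumptions for illustration; the real page may mark the end of an ignored section differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Sample input; the <!--end-ignore--> marker is an assumed convention.
my $html = join ' ',
    '<a href="http://www.foobar1.com/">link 1</a>',
    '<!--ignore--><a href="http://www.foobar2.com/">link 2</a><!--end-ignore-->',
    '<a href="http://www.foobar3.com/">link 3</a>';

my $in_ignore_section = 0;    # set by the comment handler
my @links;

my $parser = HTML::Parser->new(
    api_version => 3,

    # Comment handler: 'text' passes the raw comment, delimiters included.
    comment_h => [
        sub {
            my ($text) = @_;
            if    ( $text =~ /^<!--\s*ignore\s*-->$/ )     { $in_ignore_section = 1 }
            elsif ( $text =~ /^<!--\s*end-ignore\s*-->$/ ) { $in_ignore_section = 0 }
        },
        'text',
    ],

    # Start-tag handler: collect hrefs unless we are in an ignored section.
    start_h => [
        sub {
            my ( $tagname, $attr ) = @_;
            return if $in_ignore_section;
            push @links, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "Found link: $_\n" for @links;
```

If there is no closing comment in the real markup, the flag could instead be cleared in an end-tag handler (`end_h`) when the `</a>` goes by.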
Re: Skipping HTML tags with HTML::TokeParser by mexnix (Pilgrim) on Jul 31, 2001 at 20:32 UTC
You should take a look at the HTML::TokeParser Tutorial by crazyinsomniac. Although I haven't actually used the module myself, you could try something like what crazy did in his recent CPAN fetcher gig and use a chain of if tests. Hope this helps.