(ichimunki) Re: Skipping HTML tags with HTML::TokeParser

I recommend a different HTML::TokeParser method: get_token. get_tag is attractive, but insufficient for your task, because it returns only tags and silently passes over comments. The following rips through the document grabbing all URLs. If it encounters an ignore comment as shown, it discards tokens until it has passed the next anchor end tag.

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

my $sample_HTML = 
  "<a href=\"http://www.foobar1.com/\">link 1</a> " .
  "<!--ignore--><a href=\"http://www.foobar2.com/\">link 2</a> " .
  "<a href=\"http://www.foobar3.com/\">link 3</a> ";

my $p = HTML::TokeParser->new( \$sample_HTML );
my $token;
my $link_count = 1;

while( $token = $p->get_token() ) {
  if( $token->[0] eq 'S' && 
      $token->[1] eq 'a' )
  {
    my $text = $token->[2]->{'href'};
    print "Found link $link_count: $text\n";
    $link_count++;
  }
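  # Comment token: $token->[1] holds the full comment text.
  if( $token->[0] eq 'C' &&
      $token->[1] eq '<!--ignore-->' )
  {
    # Discard tokens up to and including the next </a>.
    # get_token() returns undef at end of input, which also ends the loop.
    while ( $token = $p->get_token() ) {
      last if $token->[0] eq 'E' &&
              $token->[1] eq 'a';
    }
  }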
}

__output__

% perl ignore_some.pl

Found link 1: http://www.foobar1.com/
Found link 2: http://www.foobar3.com/
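
For comparison, here is a minimal sketch of the get_tag approach, to show why it falls short here. The sub name hrefs_via_get_tag is made up for illustration. Since get_tag only ever hands back tags, the ignore comment never reaches your code and all three links come out:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect every href using get_tag only.
# get_tag never returns comment tokens, so <!--ignore--> has no effect.
sub hrefs_via_get_tag {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my @hrefs;
    # get_tag('a') skips ahead to the next <a ...> start tag.
    while ( my $tag = $p->get_tag('a') ) {
        my $href = $tag->[1]{href};    # $tag->[1] is the attribute hash
        push @hrefs, $href if defined $href;
    }
    return @hrefs;
}

my $sample_HTML =
    '<a href="http://www.foobar1.com/">link 1</a> ' .
    '<!--ignore--><a href="http://www.foobar2.com/">link 2</a> ' .
    '<a href="http://www.foobar3.com/">link 3</a> ';

print "$_\n" for hrefs_via_get_tag($sample_HTML);
```

With get_token the comment arrives as its own 'C' token, which is what makes the skip possible.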

Skipping HTML tags using the "get_token" method with HTML::TokeParser module by Anonymous Monk on Mar 14, 2002 at 10:04 UTC
I don't know how to use the "get_token" method in the HTML::TokeParser module to skip the following set of HTML tags:

`<DD> <A NAME="394893"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex1>changing dates <a href="chview.htm#1052431">1</a>, <a href="chcncpt.htm#1052501">2</a> </zindex1> <DD> <A NAME="394896"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex2> jump to <a href="chcncpt.htm#1046200">1</a> </zindex2>`

I want to extract the text between `<zindex1>` and `</zindex1>` together with the links inside it (`<a href=...>1</a>`), and likewise the text between the sub-index tags `<zindex2>` and `</zindex2>` together with the link inside it (`<a href=..>1</a>`). So the ideal result would look like the following:

`changing dates chview.htm#1052431 1 chcncpt.htm#1052501 2 jump to chcncpt.htm#1046200 1`

I have not been able to do this with the get_tag() method, and I don't know how to skip the `<DD> <A NAME...></A><FONT FACE=...>` tags using the "get_token" method. Any suggestions?
(ichi) Re: Skipping HTML tags using the "get_token" method with HTML::TokeParser module by ichimunki (Priest) on Mar 14, 2002 at 19:14 UTC
Same as above, only instead of

`while( $token = $p->get_token() ) { if( $token->[0] eq 'S' && $token->[1] =~ /zindex/ ) {`

in place of

`while( $token = $p->get_token() ) { if( $token->[0] eq 'S' && $token->[1] eq 'a' ) {`

Although I've never heard of the zindex# tag for HTML, so I can't say whether HTML::Parser catches it.
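
A rough sketch of how that might look in full, under a few assumptions: that HTML::Parser reports start and end tokens for tag names it does not recognize (as far as I know it does not validate tag names, and it lowercases them), and that each piece of text or link belongs on its own output line. The sub name extract_zindex is invented for this example. Everything outside a zindex element, including the `<DD>`, `<A NAME=...>` and `<FONT ...>` tags, is simply never looked at:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns one line per text fragment and per link
# found between <zindexN> and </zindexN>.
sub extract_zindex {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my @lines;
    my $in_zindex = 0;
    while ( my $token = $p->get_token ) {
        my $type = $token->[0];
        if ( $type eq 'S' && $token->[1] =~ /^zindex\d+$/ ) {
            $in_zindex = 1;
        }
        elsif ( $type eq 'E' && $token->[1] =~ /^zindex\d+$/ ) {
            $in_zindex = 0;
        }
        elsif ( !$in_zindex ) {
            next;    # outside the index entry: skip tags and text alike
        }
        elsif ( $type eq 'S' && $token->[1] eq 'a'
                && defined $token->[2]{href} ) {
            # Read the link label up to (not including) the </a> tag.
            my $label = $p->get_trimmed_text('/a');
            push @lines, "$token->[2]{href} $label";
        }
        elsif ( $type eq 'T' ) {
            my $text = $token->[1];
            $text =~ s/^[\s,]+|[\s,]+$//g;    # trim spaces and separator commas
            push @lines, $text if length $text;
        }
    }
    return @lines;
}

my $sample_HTML =
    '<DD> <A NAME="394893"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1">' .
    '<zindex1>changing dates <a href="chview.htm#1052431">1</a>, ' .
    '<a href="chcncpt.htm#1052501">2</a> </zindex1> ' .
    '<DD> <A NAME="394896"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1">' .
    '<zindex2> jump to <a href="chcncpt.htm#1046200">1</a> </zindex2>';

print "$_\n" for extract_zindex($sample_HTML);
```

The flag approach is used instead of nested loops so that a missing end tag cannot make the parser run off the end of the document inside an inner loop.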