patgas has asked for the wisdom of the Perl Monks concerning the following question:

I'm using the following bit of code to list out all the links in an HTML page, but I'd like it to ignore <a> tags that are immediately preceded by something like <!--ignore-->.

So <a href="http://perlmonks.org"> would get listed, but <!--ignore--><a href="http://perlmonks.org"> would not. Any suggestions?

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    -e $ARGV[0] or die "File does not exist: $ARGV[0]\n";

    my $p = HTML::TokeParser->new( shift );

    while ( my $token = $p->get_tag("a") ) {
        my $url  = $token->[1]{href} || "-";
        my $text = $p->get_trimmed_text("/a");
        print "$url\n";
    }

Replies are listed 'Best First'.
(ichimunki) Re: Skipping HTML tags with HTML::TokeParser
by ichimunki (Priest) on Jul 31, 2001 at 22:02 UTC
    I recommend using a different HTML::TokeParser method. get_tag is attractive, but insufficient for your task, because it never shows you comments. The following uses get_token to walk every token and print each anchor's URL. If it encounters an ignore comment as shown, it discards tokens until it has passed the next anchor end tag.
        #!/usr/bin/perl -w
        use strict;
        use HTML::TokeParser;

        my $sample_HTML =
            "<a href=\"http://www.foobar1.com/\">link 1</a> "
          . "<!--ignore--><a href=\"http://www.foobar2.com/\">link 2</a> "
          . "<a href=\"http://www.foobar3.com/\">link 3</a> ";

        my $p = HTML::TokeParser->new( \$sample_HTML );
        my $token;
        my $link_count = 1;

        while ( $token = $p->get_token() ) {
            # Start tag <a ...>: report its href.
            if ( $token->[0] eq 'S' && $token->[1] eq 'a' ) {
                my $text = $token->[2]->{'href'};
                print "Found link $link_count: $text\n";
                $link_count++;
            }
            # Ignore comment: discard tokens up to and including the next </a>.
            # Stop at end of input too, so a missing </a> cannot loop forever.
            if ( $token->[0] eq 'C' && $token->[1] eq '<!--ignore-->' ) {
                while ( $token = $p->get_token() ) {
                    last if $token->[0] eq 'E' && $token->[1] eq 'a';
                }
            }
        }

        __output__
        %perl ignore_some.pl
        Found link 1: http://www.foobar1.com/
        Found link 2: http://www.foobar3.com/
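If "immediately preceded" should be taken literally, one alternative is to remember whether the last significant token was the ignore comment, rather than skipping ahead to the end tag. This is only a sketch under assumptions: the names $ignore_next and @urls are made up for illustration, and treating whitespace-only text between the comment and the anchor as "still immediately preceded" is a choice of this sketch, not something from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $sample_HTML =
    "<a href=\"http://www.foobar1.com/\">link 1</a> "
  . "<!--ignore--><a href=\"http://www.foobar2.com/\">link 2</a> "
  . "<a href=\"http://www.foobar3.com/\">link 3</a> ";

my $p = HTML::TokeParser->new( \$sample_HTML );

my $ignore_next = 0;    # hypothetical flag: previous token was <!--ignore-->
my @urls;               # hypothetical list of links that were kept

while ( my $token = $p->get_token ) {
    my $type = $token->[0];

    # A comment sets the flag only when it is exactly the ignore marker.
    if ( $type eq 'C' ) {
        $ignore_next = $token->[1] eq '<!--ignore-->' ? 1 : 0;
        next;
    }

    # Assumption: whitespace-only text does not break "immediately".
    next if $type eq 'T' && $token->[1] !~ /\S/;

    if ( $type eq 'S' && $token->[1] eq 'a' ) {
        my $href = $token->[2]{href};
        push @urls, $href if defined $href && !$ignore_next;
    }

    # Any other token ends the "immediately preceded" window.
    $ignore_next = 0;
}

print "$_\n" for @urls;
```

With this approach an ignore comment affects only the very next anchor, so a stray comment elsewhere in the page cannot swallow unrelated markup.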
      I don't know how to use the "get_token" method in the HTML::TokeParser module to skip the following set of html tags.
      <DD> <A NAME="394893"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex1>changing dates
      <a href="chview.htm#1052431">1</a>,
      <a href="chcncpt.htm#1052501">2</a>
      </zindex1>
      <DD> <A NAME="394896"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex2> jump to
      <a href="chcncpt.htm#1046200">1</a>
      </zindex2>
      I want to extract the text between <zindex1> and </zindex1> along with the link in each <a href=...>1</a>, and also the text between the sub-index <zindex2> and </zindex2> along with the link in its <a href=..>1</a>. So the ideal result would be like the following:

      changing dates chview.htm#1052431 1 chcncpt.htm#1052501 2 jump to chcncpt.htm#1046200 1
      But I have not been able to do it with the get_tag() method, and I don't know how to skip the <DD> <A NAME...></A><FONT FACE=...> tags using the "get_token" method. Any suggestions?
        Same as above, only instead of
        while( $token = $p->get_token() ) { if( $token->[0] eq 'S' && $token->[1] eq 'a' ) {
        use something like
        while( $token = $p->get_token() ) { if( $token->[0] eq 'S' && $token->[1] =~ /zindex/ ) {
        Although I've never heard of the zindex# tag for HTML, so I can't say whether HTML::Parser catches it.
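To make the suggestion above concrete, here is a minimal sketch that runs get_token over the sample from the question and collects only what sits inside a zindex element. It is an illustration under assumptions, not a tested solution for the real pages: the names $in_zindex and @out are invented, the pattern /^zindex\d+$/ assumes every index tag is "zindex" plus digits, and dropping text chunks that contain no word characters (such as the ", " between links) is a choice of this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'HTML';
<DD> <A NAME="394893"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex1>changing dates <a href="chview.htm#1052431">1</a>, <a href="chcncpt.htm#1052501">2</a> </zindex1>
<DD> <A NAME="394896"></A><FONT FACE="helvetica, arial, sans-serif" SIZE="-1"><zindex2> jump to <a href="chcncpt.htm#1046200">1</a> </zindex2>
HTML

my $p = HTML::TokeParser->new( \$html );

my $in_zindex = 0;    # hypothetical flag: currently inside <zindexN> ... </zindexN>
my @out;              # hypothetical list of collected entries

while ( my $token = $p->get_token ) {
    my ( $type, $tag ) = @$token;

    if ( $type eq 'S' && $tag =~ /^zindex\d+$/ ) {
        $in_zindex = 1;
    }
    elsif ( $type eq 'E' && $tag =~ /^zindex\d+$/ ) {
        $in_zindex = 0;
    }
    elsif ( $in_zindex && $type eq 'S' && $tag eq 'a' ) {
        # Anchor inside an index entry: record its href and link text.
        my $href = $token->[2]{href};
        next unless defined $href;
        my $text = $p->get_trimmed_text('/a');
        push @out, "$href $text";
    }
    elsif ( $in_zindex && $type eq 'T' ) {
        # Plain text inside an index entry: trim it, skip bare punctuation.
        my $text = $token->[1];
        $text =~ s/^\s+|\s+$//g;
        push @out, $text if $text =~ /\w/;
    }
    # Everything outside a zindex element (<DD>, <A NAME>, <FONT>) is
    # never recorded, which is how those tags get skipped.
}

print "$_\n" for @out;
```

Nothing special is needed to skip the <DD>, <A NAME> and <FONT> tags: they arrive as tokens like any other and are simply not recorded while the flag is off.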
Re: Skipping HTML tags with HTML::TokeParser
by voyager (Friar) on Jul 31, 2001 at 20:27 UTC
    I'm not really familiar with HTML::TokeParser, but it is easily done with HTML::Parser.

    You would have a comment handler that sets a flag ($in_ignore_section). Then, whenever an A tag appears while that flag is set, your start-tag handler skips it.

    You will also need to take care of resetting the flag. Presumably you have a similar comment that says "end-ignore-section". Otherwise you'll have to clear it based on something else, e.g., the start or end of some other tag.
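The handler approach described above could look roughly like the following sketch. It assumes the version 3 handler API of HTML::Parser and, since the question shows no closing marker, it clears the flag at the next </a>. That reset rule is an assumption of this sketch. The sample HTML and the @links array are for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $html =
    "<a href=\"http://www.foobar1.com/\">link 1</a> "
  . "<!--ignore--><a href=\"http://www.foobar2.com/\">link 2</a> "
  . "<a href=\"http://www.foobar3.com/\">link 3</a> ";

my $in_ignore_section = 0;    # set by the comment handler
my @links;                    # hypothetical list of links that were kept

my $parser = HTML::Parser->new(
    api_version => 3,

    # 'text' passes the comment exactly as it appears in the source.
    comment_h => [
        sub {
            my ($text) = @_;
            $in_ignore_section = 1 if $text eq '<!--ignore-->';
        },
        'text',
    ],

    # Record the href of each <a> unless we are in an ignore section.
    start_h => [
        sub {
            my ( $tagname, $attr ) = @_;
            return unless $tagname eq 'a' && defined $attr->{href};
            push @links, $attr->{href} unless $in_ignore_section;
        },
        'tagname, attr',
    ],

    # Assumption: the ignore section ends with the next </a>.
    end_h => [
        sub {
            my ($tagname) = @_;
            $in_ignore_section = 0 if $tagname eq 'a';
        },
        'tagname',
    ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @links;
```

If the pages do carry an explicit closing comment, the reset would move from the end-tag handler into the comment handler instead.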

Re: Skipping HTML tags with HTML::TokeParser
by mexnix (Pilgrim) on Jul 31, 2001 at 20:32 UTC
    You should take a look at the HTML::TokeParser Tutorial by crazyinsomniac. Although I haven't actually used the module myself, you could try something like what crazy did in his recent CPAN fetcher gig and use some crazy if'ing. Hope this helps.
