in reply to Stripping a-href tags from an HTML document

Thanks to Ovid, you can do this:

#!/usr/bin/perl
use strict;
use HTML::TokeParser::Simple;

warn "Strip HREF\n";
my $p = HTML::TokeParser::Simple->new($ARGV[0]);
while ( my $token = $p->get_token ) {
    # skip opening and closing <a> tags, print everything else unchanged
    next if $token->is_start_tag('a') || $token->is_end_tag('a');
    print $token->as_is;
}

warn "Extract HREF\n";
# second pass over the same file; reuse $p rather than redeclaring it
$p = HTML::TokeParser::Simple->new($ARGV[0]);
while ( my $token = $p->get_token ) {
    next unless $token->is_start_tag('a');
    print $token->return_attr->{href}, "\n";
}

HTH, Valerio

Re: Re: Is there a faster / more efficient / quicker or easier way to do this ?
by PodMaster (Abbot) on Jan 09, 2003 at 23:15 UTC
    I like anyone who likes that module ;)

    I'd use next if $token->is_tag('a'); instead, but you really wanna combine your snippets, something like

    use strict;
    use HTML::TokeParser::Simple;

    for (@ARGV) {
        my $p = HTML::TokeParser::Simple->new($_);
        my $hrefCount = 0;
        print "STRIPPING A-HREF TAGS in '$_'. PLEASE WAIT..\n";
        open(TEMPO, ">$_.tempo") or die "couldn't create $_.tempo($!)";
        while ( defined( my $t = $p->get_token() ) ) {
            if ( $t->is_start_tag('a') ) {
                my $attr = $t->return_attr;
                if ( exists $attr->{href} ) {
                    $hrefCount++;
                    print "\nHREF TAG-->[$hrefCount]-->", $attr->{href}, "\n\n";
                }
                next;    # drop the opening <a> tag
            }
            elsif ( $t->is_end_tag('a') ) {
                next;    # drop the closing </a> tag
            }
            else {
                print TEMPO $t->as_is;
            }
        }
        close(TEMPO);
        rename "$_.tempo", $_ or warn "couldn't rename '$_.tempo' to '$_'";
    }


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      In trying to rewrite it to satisfy my sense of Fewer Indentation Levels Are Better, I rephrased the loop like this:
      while ( defined( my $t = $p->get_token() ) ) {
          print(TEMPO $t->as_is), next unless $t->is_tag('a');
          my $attr = $t->return_attr;
          print( "\nHREF TAG-->[", ++$hrefCount, "]-->", $attr->{href}, "\n\n" )
              if exists $attr->{href};
      }

      Doing so, it occurred to me that it will discard A NAME too. Fixing that is not entirely trivial, as you need to keep track of whether the start tag was dropped or kept when you come across a closing /A.

      Update: this should work. Untested, but you get the idea.

      my @stack;    # one entry per open <a>: true if that start tag was dropped
      while ( defined( my $t = $p->get_token() ) ) {
          if ( $t->is_start_tag('a') ) {
              my $attr = $t->return_attr;
              push @stack, exists $attr->{href};
              print( "\nHREF TAG-->[", ++$hrefCount, "]-->", $attr->{href}, "\n\n" ), next
                  if $stack[-1];
          }
          # drop a closing </a> only if its matching start tag was dropped
          next if $t->is_end_tag('a') and pop @stack;
          print TEMPO $t->as_is;
      }
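
For anyone who wants to try the stack idea without touching files, here is a minimal sketch that works on a string and returns the result instead of printing it. The sub name strip_href_anchors and its return values are made up for illustration and are not part of HTML::TokeParser::Simple. It assumes the module is installed and that new() accepts a reference to a string, as the underlying HTML::TokeParser does. It sticks with return_attr as used in this thread; newer releases of the module may prefer a differently named accessor, so check the docs for your installed version.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: removes <a href=...> ... </a> wrappers but keeps
# anchors without an href (e.g. <a name="...">) intact.
# Returns the rewritten HTML and an array ref of the hrefs that were found.
sub strip_href_anchors {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( \$html );

    my @stack;        # one flag per open <a>: true if its start tag was dropped
    my @hrefs;        # href values in document order
    my $out = '';

    while ( defined( my $t = $p->get_token ) ) {
        if ( $t->is_start_tag('a') ) {
            my $attr = $t->return_attr;
            push @stack, exists $attr->{href} ? 1 : 0;
            if ( $stack[-1] ) {
                push @hrefs, $attr->{href};
                next;                     # drop the <a href=...> tag itself
            }
        }
        # drop </a> only when the matching start tag was dropped
        next if $t->is_end_tag('a') and pop @stack;
        $out .= $t->as_is;
    }
    return ( $out, \@hrefs );
}

my ( $clean, $found ) = strip_href_anchors(
    '<p><a name="top">Top</a> <a href="x.html">link</a></p>'
);
print "$clean\n";
print "href: $_\n" for @$found;
```

Returning the text rather than printing to a filehandle makes it easy to check the A NAME case by hand before pointing the loop at real files.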

      Makeshifts last the longest.