keyDemun has asked for the wisdom of the Perl Monks concerning the following question:

Could somebody please show me some different approaches to this kind of text processing?
# Script name: srtiptag.pl
# Purpose: Strips html a-href tags from a document
# Author: keyDemun
# http://c0depalace.netfirms.com/cgi-bin/cforum.pl

print "STRIPPING A-HREF TAGS. PLEASE WAIT..\n";

open(FO,"@ARGV[0]") or print "File not found!\n";
while(<FO>) {
    $fc=$fc.$_;
}
close(FO);

for($i=0;$i<length($fc);$i++) {
    $str=substr($fc,$i,1);
    if($str eq "<") {
        $st=substr($fc,$i,2);
        if($st eq "<a" || $st eq "<A") {
            while(substr($fc,$i,1) ne ">") {
                @hRef[$cnt]=@hRef[$cnt].substr($fc,$i,1);
                $i++;
            }
            @hRef[$cnt]=@hRef[$cnt]."></a>";
            $cnt++;
        }
    }
}

for($j=0;$j<$cnt;$j++) {
    print "\nHREF TAG-->[$j]-->@hRef[$j]\n\n";
}

Title edit by tye

Replies are listed 'Best First'.
Re: Is there a faster / more efficient / quicker or easier way to do this ?
by Hofmator (Curate) on Jan 09, 2003 at 12:47 UTC
    See also HTML::LinkExtractor.

    And this:

    while(<FO>) { $fc=$fc.$_; }

    can be written better as:

    $fc = do { local $/; <FO> };

    -- Hofmator
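
The slurp idiom Hofmator mentions can be sketched as a small standalone script. This is only an illustration: the helper name slurp_file, the lexical filehandle and the three-argument open are assumptions added here, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return a file's whole content.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open '$path': $!";
    # Localizing $/ leaves it undefined inside this block, so a single
    # <$fh> read returns the entire file instead of one line.
    my $content = do { local $/; <$fh> };
    close($fh);
    return $content;
}

# Only read a file when one is named on the command line.
if (@ARGV) {
    my $fc = slurp_file($ARGV[0]);
    print length($fc), " characters read\n";
}
```

Because $/ is localized, its normal value is restored as soon as the do block ends, so later line-by-line reads elsewhere in the program are unaffected.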

Re: Is there a faster / more efficient / quicker or easier way to do this ?
by Zaxo (Archbishop) on Jan 09, 2003 at 12:41 UTC

    See HTML::Parser.

    Definitely easier and more effective.

    After Compline,
    Zaxo
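
Zaxo gives no code, so here is a rough sketch of what link extraction with HTML::Parser might look like. It assumes the module is installed from CPAN and uses its version 3 handler API; the function name extract_hrefs and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): collect the href value of
# every <a> start tag found in a string of HTML.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # The handler receives the tag name and a hash ref of attributes.
        # By default both tag and attribute names arrive lowercased.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                push @hrefs, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @hrefs;
}

my $sample = '<p><a href="one.html">1</a> <A HREF="two.html">2</A> '
           . '<a name="x">x</a></p>';
print "$_\n" for extract_hrefs($sample);
```

Unlike the character-by-character scan in the question, the parser copes with mixed-case tags and skips anchors that carry no href attribute.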

Re: Is there a faster / more efficient / quicker or easier way to do this ?
by valdez (Monsignor) on Jan 09, 2003 at 12:58 UTC

    Thanks to Ovid, you can do this:

    #!/usr/bin/perl
    use HTML::TokeParser::Simple;
    use strict;

    warn "Strip HREF\n";
    my $p = HTML::TokeParser::Simple->new($ARGV[0]);
    while ( my $token = $p->get_token ) {
        next if ($token->is_start_tag('a') || $token->is_end_tag('a'));
        print $token->as_is;
    }

    warn "Extract HREF\n";
    my $p = HTML::TokeParser::Simple->new($ARGV[0]);
    while ( my $token = $p->get_token ) {
        next unless ($token->is_start_tag('a'));
        print $token->return_attr->{href}, "\n";
    }

    HTH, Valerio

      I like anyone who likes that module ;)

      I'd use next if $token->is_tag('a'); instead, but you really wanna combine your snippets, something like

      use HTML::TokeParser::Simple;
      use strict;

      for(@ARGV){
          my $p = HTML::TokeParser::Simple->new($_);
          my $hrefCount = 0;
          print "STRIPPING A-HREF TAGS in '$_'. PLEASE WAIT..\n";
          open(TEMPO,">$_.tempo") or die "couldn't create $_.tempo ($!)";
          while(defined( my $t = $p->get_token() )){
              if( $t->is_start_tag('a') ){
                  my $attr = $t->return_attr;
                  if(exists $attr->{href}) {
                      $hrefCount++;
                      print "\nHREF TAG-->[$hrefCount]-->",
                            $t->return_attr->{href},"\n\n";
                  }
                  next;
              } elsif( $t->is_end_tag('a') ) {
                  next;
              } else {
                  print TEMPO $t->as_is;
              }
          }
          close(TEMPO);
          rename "$_.tempo", $_
              or warn "couldn't rename '$_.tempo' to '$_'";
      }


      MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
      ** The Third rule of perl club is a statement of fact: pod is sexy.

        In trying to rewrite it to satisfy my sense of Fewer Indentation Levels Are Better, I rephrased the loop like this:
        while(defined(my $t = $p->get_token())){
            print(TEMPO $t->as_is), next unless $t->is_tag('a');
            my $attr = $t->return_attr;
            print(
                "\nHREF TAG-->[", ++$hrefCount, "]-->",
                $attr->{href}, "\n\n"
            ) if exists $attr->{href};
        }

        Doing so, it occurred to me that it will discard A NAME too, and fixing that is not entirely trivial, as you need to keep track of whether the start tag was dropped or kept when you come across a closing /A.

        Update: this should work. Untested, but you get the idea.

        my @stack;
        while(defined(my $t = $p->get_token())){
            if($t->is_start_tag('a')) {
                my $attr = $t->return_attr;
                push @stack, exists $attr->{href};
                print(
                    "\nHREF TAG-->[", ++$hrefCount, "]-->",
                    $attr->{href}, "\n\n"
                ), next if $stack[-1];
            }
            next if $t->is_end_tag('a') and pop @stack;
            print TEMPO $t->as_is;
        }

        Makeshifts last the longest.

Re: Is there a faster / more efficient / quicker or easier way to do this ?
by jdporter (Paladin) on Jan 09, 2003 at 16:33 UTC
    This is the kind of job for which HTML::TreeBuilder was designed.
    use HTML::TreeBuilder;

    sub strip_a_elements {
        my( $html, $preserve_formatting ) = @_;
        my $t = HTML::TreeBuilder->new;
        if ( $preserve_formatting ) {
            $t->no_space_compacting(1);
            $t->ignore_ignorable_whitespace(0);
            $t->store_comments(1);
            $t->store_declarations(1);
            $t->store_pis(1);
        }
        $t->parse( $html )->eof;
        # we've parsed; now do the desired transformation:
        $_->replace_with_content for $t->find_by_tag_name('a');
        # and return the resulting hunk of html:
        $t->as_HTML
    }
    Update...

    If you're sure you won't care about the formatting of the raw HTML, the above can be simplified to:
    sub strip_a_elements {
        my $t = HTML::TreeBuilder->new_from_content( $_[0] );
        $_->replace_with_content for $t->find_by_tag_name('a');
        $t->as_HTML
    }
    Another idea I have is to make a new method in HTML::TreeBuilder (actually HTML::Element) for the purpose of removing elements like that.
    # remember that HTML::TreeBuilder inherits from HTML::Element.
    sub HTML::Element::strip_elements {
        my( $e, $tag ) = @_;
        $_->replace_with_content for $e->find_by_tag_name($tag);
        $e
    }

    # now we can write our subroutine like this:
    sub strip_a_elements {
        HTML::TreeBuilder
            ->new_from_content( $_[0] )
            ->strip_elements('a')
            ->as_HTML
    }

    # and call it:
    my $html_minus_links = strip_a_elements( $html );

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.