Stripping a-href tags from an HTML document

keyDemun has asked for the wisdom of the Perl Monks concerning the following question:

Re: Is there a faster / more efficient / quicker or easier way to do this ?
by Hofmator (Curate) on Jan 09, 2003 at 12:47 UTC

Take a look at HTML::LinkExtractor.

And this loop:

while(<FO>) {
    $fc=$fc.$_;
}

can be written more simply as:

$fc = do { local $/; <FO> };
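For anyone who has not met the `local $/` idiom before, here is a minimal sketch of the same slurp wrapped in a sub with a lexical filehandle. The name `slurp_file` is made up for illustration and is not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return a file's whole content.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "couldn't open '$path': $!";
    # Undefine the input record separator for the rest of this sub only,
    # so a single <$fh> read returns everything up to end of file.
    local $/;
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

Because `$/` is localized, line-by-line reads elsewhere in the program keep working as usual once the sub returns.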

-- Hofmator

Re: Is there a faster / more efficient / quicker or easier way to do this ?
by Zaxo (Archbishop) on Jan 09, 2003 at 12:41 UTC

See HTML::Parser.

Definitely easier and more effective.
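A rough sketch of what that could look like with the HTML::Parser version 3 handler API, assuming the module is installed. The sub name `strip_a_tags` is made up for illustration, and this drops every A tag (start and end) while keeping the text between them.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not from the thread): return $html without <a> and </a> tags.
sub strip_a_tags {
    my ($html) = @_;
    my $out = '';

    # Copy the original source text of a tag unless it is an "a" tag.
    my $keep_unless_a = sub {
        my ($tagname, $text) = @_;
        $out .= $text unless $tagname eq 'a';
    };

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $keep_unless_a, 'tagname, text' ],
        end_h       => [ $keep_unless_a, 'tagname, text' ],
        # Everything else (text, comments, declarations) passes through untouched.
        default_h   => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

print strip_a_tags('<p>See <a href="x.html">this page</a>.</p>'), "\n";
```

Since the handlers append the original source text, markup other than the A tags comes out exactly as it went in.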

After Compline,
Zaxo

Re: Is there a faster / more efficient / quicker or easier way to do this ?
by valdez (Monsignor) on Jan 09, 2003 at 12:58 UTC

Thanks to Ovid, you can do this:

#!/usr/bin/perl

use HTML::TokeParser::Simple;
use strict;

warn "Strip HREF\n";
my $p = HTML::TokeParser::Simple->new($ARGV[0]);

while ( my $token = $p->get_token )
{
        next if ($token->is_start_tag('a') || $token->is_end_tag('a'));
        print $token->as_is;
}

warn "Extract HREF\n";
# $p is already declared above; just assign a fresh parser for the second pass
$p = HTML::TokeParser::Simple->new($ARGV[0]);

while ( my $token = $p->get_token )
{
        next unless ($token->is_start_tag('a'));
        print $token->return_attr->{href}, "\n";
}

HTH, Valerio

Re: Re: Is there a faster / more efficient / quicker or easier way to do this ?

by PodMaster (Abbot) on Jan 09, 2003 at 23:15 UTC

I'd use next if $token->is_tag('a'); instead, but you really wanna combine your snippets, something like


use HTML::TokeParser::Simple;
use strict;

for(@ARGV){

    my $p = HTML::TokeParser::Simple->new($_);
    my $hrefCount = 0;

    print "STRIPPING A-HREF TAGS in '$_'. PLEASE WAIT..\n";

    open(TEMPO, ">", "$_.tempo") or die "couldn't create $_.tempo ($!)";

    while(defined( my $t = $p->get_token() )){

        if( $t->is_start_tag('a') ){

            my $attr = $t->return_attr;

            if(exists $attr->{href}) {

                $hrefCount++;

                print "\nHREF TAG-->[$hrefCount]-->",
                       $t->return_attr->{href},"\n\n";
            }

            next;

        } elsif( $t->is_end_tag('a') ) {

            next;

        } else {

            print TEMPO $t->as_is;
        }
    }

    close(TEMPO);

    rename "$_.tempo", $_ or warn "couldn't rename '$_.tempo' to '$_'";
}

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
** The Third rule of perl club is a statement of fact: pod is sexy.

Re^3: Is there a faster / more efficient / quicker or easier way to do this?

by Aristotle (Chancellor) on Jan 12, 2003 at 00:12 UTC

    while(defined(my $t = $p->get_token())){
        print(TEMPO $t->as_is), next
            unless $t->is_tag('a');

        my $attr = $t->return_attr;
        print(
            "\nHREF TAG-->[",
            ++$hrefCount,
            "]-->",
            $attr->{href},
            "\n\n"
        ) if exists $attr->{href};
    }

Doing so, it occurred to me that this will discard A NAME tags too, and fixing that is not entirely trivial: you need to keep track of whether the start tag was dropped or kept when you come across a closing /A.

Update: this should work. Untested, but you get the idea.

    my @stack;
    while(defined(my $t = $p->get_token())){
        if($t->is_start_tag('a')) {
            my $attr = $t->return_attr;
            push @stack, exists $attr->{href};
            print(
                "\nHREF TAG-->[",
                ++$hrefCount,
                "]-->",
                $attr->{href},
                "\n\n"
            ), next if $stack[-1];
        }

        next if $t->is_end_tag('a') and pop @stack;

        print TEMPO $t->as_is;
    }

Makeshifts last the longest.

Re: Is there a faster / more efficient / quicker or easier way to do this ?
by jdporter (Paladin) on Jan 09, 2003 at 16:33 UTC

  use HTML::TreeBuilder;

  sub strip_a_elements
  {
    my( $html, $preserve_formatting ) = @_;
    my $t = HTML::TreeBuilder->new;
    if ( $preserve_formatting )
    {
      $t->no_space_compacting(1);
      $t->ignore_ignorable_whitespace(0);
      $t->store_comments(1);
      $t->store_declarations(1);
      $t->store_pis(1);
    }
    $t->parse( $html )->eof;
    # we've parsed; now do the desired transformation:
    $_->replace_with_content for $t->find_by_tag_name('a');
    # and return the resulting hunk of html:
    $t->as_HTML
  }

Update...

  sub strip_a_elements
  {
    my $t = HTML::TreeBuilder->new_from_content( $_[0] );
    $_->replace_with_content for $t->find_by_tag_name('a');
    $t->as_HTML
  }

  # remember that HTML::TreeBuilder inherits from HTML::Element.
  sub HTML::Element::strip_elements
  {
    my( $e, $tag ) = @_;
    $_->replace_with_content for $e->find_by_tag_name($tag);
    $e
  }

  # now we can write our subroutine like this:
  sub strip_a_elements
  {
    HTML::TreeBuilder
    ->new_from_content( $_[0] )
    ->strip_elements('a')
    ->as_HTML
  }

  # and call it:
  my $html_minus_links = strip_a_elements( $html );

jdporter
The 6th Rule of Perl Club is -- There is no Rule #6.
