Who wants to help me adjust LinkExtor::Simple?

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I know the beauty of Perl modules is that I should be able to edit them to suit my purposes, and of course I should write to the author (in this case Brian D Foy) and ask him for an update to the code. But right now, can I get a witness? I mean, can I get a hand adding one more sub to HTML::SimpleLinkExtor so that I can use it to grab remote script files? The tag is <SCRIPT> and the attribute is SRC and I swear, I've tried, I just can't figure out where to start...

($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print


Re: Who wants to help me adjust LinkExtor::Simple?
by tachyon (Chancellor) on Jun 15, 2004 at 06:01 UTC

Add this line:

    script tag

to the %AUTO_METHODS hash and it should work (untested):

%AUTO_METHODS = qw(
    background attribute
    href       attribute
    src        attribute

    a          tag
    area       tag
    base       tag
    body       tag
    img        tag
    frame      tag

    script     tag
    );
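
For illustration, once script is registered as a tag in %AUTO_METHODS (or with a release that already includes it, such as the 1.05 mentioned later in this thread), calling it might look like the sketch below. The sample HTML and the variable names are made up for the example, and it assumes the module exposes the new entry as a script method the same way it does for img and a.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Made-up sample page, only for this example.
my $html = <<'HTML';
<html>
  <head><script type="text/javascript" src="/foo.js"></script></head>
  <body><a href="/some-link"><img src="/some-image"></a></body>
</html>
HTML

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

# Assumes 'script' is known to the module as a tag method.
my @script_srcs = $extor->script;
print "$_\n" for @script_srcs;
```

If script is not registered in the installed version, the call should simply yield no links, which is a quick way to tell whether the change is in place.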

cheers

tachyon


Re^2: Who wants to help me adjust LinkExtor::Simple?

by brian_d_foy (Abbot) on Jun 15, 2004 at 20:13 UTC

HTML::SimpleLinkExtor 1.05, just released, contains this addition. I also included a note in the docs to tell people that they can do this while they wait for me to upload a fix.

Re^2: Who wants to help me adjust LinkExtor::Simple?

by Cody Pendant (Prior) on Jun 15, 2004 at 06:23 UTC

($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print

Re: Who wants to help me adjust LinkExtor::Simple?
by PodMaster (Abbot) on Jun 15, 2004 at 05:56 UTC

The tag is <SCRIPT> and the attribute is SRC and I swear, I've tried, I just can't figure out where to start...

I would just use HTML::LinkExtractor. The interface is a tad different, but it has better support for "links."

On the other hand, looking at the HTML::SimpleLinkExtor source I can see that it's using HTML::LinkExtor which decides what constitutes a link via %HTML::Tagset::linkElements.

update: Hmm, it looks like %HTML::Tagset::linkElements already has script/src in there, so I'd suggest you show the HTML and the Perl that's not working out for you.
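
One way to check this, as a rough sketch: look up the script entry in %HTML::Tagset::linkElements and run HTML::LinkExtor directly over a small page. The sample HTML and the variable names are invented for the example; the links method returns array references of the form [tag, attribute => value, ...].

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use HTML::Tagset;

# Attributes that HTML::Tagset treats as links on a <script> element.
my @script_attrs = @{ $HTML::Tagset::linkElements{script} || [] };
print "script link attributes: @script_attrs\n";

# Made-up sample page, only for this example.
my $html = <<'HTML';
<html>
  <head><script type="text/javascript" src="/foo.js"></script></head>
  <body><a href="/some-link"><img src="/some-image"></a></body>
</html>
HTML

# With no callback, HTML::LinkExtor accumulates links internally.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

my @script_srcs;
for my $link ($parser->links) {
    my ($tag, %attrs) = @$link;
    next unless $tag eq 'script' && defined $attrs{src};
    push @script_srcs, $attrs{src};
}
print "$_\n" for @script_srcs;
```

If the script src shows up here but not through HTML::SimpleLinkExtor, the gap is in the wrapper's list of tag methods rather than in the underlying extractor.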

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

Re: Who wants to help me adjust LinkExtor::Simple?
by saskaqueer (Friar) on Jun 15, 2004 at 05:45 UTC

This:

... and of course I should write to the author (in this case Brian D Foy) and ask him for an update to the code...

and this:

$ perldoc HTML::SimpleLinkExtor
<snip>

=head1 TO DO

This module doesn't handle all of the HTML tags that might 
have links.  If someone wants those, I'll add them.

That's all that anybody needs to say :)

Re^2: Who wants to help me adjust LinkExtor::Simple?

by Cody Pendant (Prior) on Jun 15, 2004 at 05:57 UTC

I did see that! : ) I was just opening it up, hoping it would say "here's where we extract the IMG SRC attributes" then go ahead with something I could munge, like an if ($tag eq 'img') { push @img_srcs, $attributes{'src'} } kind of thing, but no, it was all rather more obscure than that.

I'll write to Brian but in the meantime, anyone got any ideas for me?

The other way to go of course is to get into TokeParser or something and do it myself, but that seems a shame when this module does everything I want except that one thing.


Re: Who wants to help me adjust LinkExtor::Simple?
by saskaqueer (Friar) on Jun 15, 2004 at 06:26 UTC

Carrying on with my recent fetish for HTML::Parser, I provide this. It will grab the 'src', 'href', 'background' and any other attributes you wish from any HTML element. This could be changed very easily to limit which tag elements are recognized, etc. Enjoy.

#!perl -w

use strict;
use HTML::Parser;

# list of html attributes which contain UR[IL]s
my @ATTR = qw( src href background );

my $parser = HTML::Parser->new(
    start_h => [ \&parser_tag, 'self, attr' ]
);
$parser->parse_file( *DATA );

# fall back to an empty list so strict refs doesn't die when no links were found
print join($/, @{ $parser->{_links} || [] }), $/;


sub parser_tag {
    my ($self, $attr) = @_;

    while ( my ($attr_n, $attr_v) = each %$attr ) {
        next unless grep $_ eq $attr_n, @ATTR;
        push @{ $self->{_links} }, $attr_v;
    }
}

__DATA__
<html>
    <head>
        <title>Sample Page</title>
        <script language="JavaScript" src="/foo.js"></script>
        <style type="text/css" src="/bar.css"></style>
    </head>

    <body background="/qux.jpg">
        <a href="/some-link"><img src="/some-image" /></a>
    </body>
</html>
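
As a sketch of the "limit which tag elements are recognized" variation mentioned above: the version below asks HTML::Parser for the tag name as well and only collects attributes listed for that tag. The %WANTED table and the sample HTML are hypothetical, and the links are gathered in a plain array instead of being stored on the parser object.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical filter table: tag name => attributes to collect from it.
my %WANTED = (
    script => [ 'src' ],
    img    => [ 'src' ],
);

my @links;    # each entry is [ tag name, attribute value ]

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            my $wanted = $WANTED{$tagname} or return;    # skip other tags
            for my $name (@$wanted) {
                push @links, [ $tagname, $attr->{$name} ]
                    if defined $attr->{$name};
            }
        },
        'tagname, attr',
    ],
);

# Made-up sample page, only for this example.
$parser->parse(<<'HTML');
<html>
  <head>
    <script language="JavaScript" src="/foo.js"></script>
    <style type="text/css" src="/bar.css"></style>
  </head>
  <body background="/qux.jpg">
    <a href="/some-link"><img src="/some-image" /></a>
  </body>
</html>
HTML
$parser->eof;

print "$_->[0]: $_->[1]\n" for @links;
```

Here only the script and img sources are collected; the style, body and a attributes are ignored because their tags are not in the table.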