in reply to new patent crawler

#!/usr/bin/perl -w
# xurl - extract unique, sorted list of links from URL
use HTML::LinkExtor;
use LWP::Simple;

$base_url = shift;
$parser   = HTML::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url))->eof;
@links = $parser->links;

foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}

for (sort keys %seen) { print $_, "\n" }
The problem so far is that it just outputs the links to the screen. I think they're stored in an array; how can I access each one individually so I can do a get for each one? Thanks.

Edit by castaway - code tags

Re^2: new patent crawler
by tachyon (Chancellor) on Sep 07, 2004 at 08:39 UTC

    Did you try anything? You are looping over the links and printing them! They are stored as the keys of a hash; this has been done to remove duplicates. If you want an array you could just do @links = keys %seen;
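
    For example, here is a minimal illustration of why the hash removes duplicates (the list is made up; hash keys are unique by definition):

    my %seen;
    $seen{$_}++ for qw( a b a c b );
    print join(' ', sort keys %seen), "\n";    # prints: a b c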

    But as you are *already* looping over them, did it cross your mind to get them as well? Here I am assigning to $link instead of using the default assignment to $_, for clarity of code...

    for my $link (sort keys %seen) {
        print "Getting $link.....";
        my $html = get($link);
        if ( $html =~ m/whatever/ ) {
            print "Wohoo!\n";
        } else {
            print "Bugger\n";
        }
    }
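
    One thing to watch: get() returns undef when a fetch fails, so under -w you probably want to guard the match first. A minimal sketch of the same loop with that check added (m/whatever/ is still just a placeholder pattern):

    for my $link (sort keys %seen) {
        print "Getting $link.....";
        my $html = get($link);          # undef if the fetch fails
        unless ( defined $html ) {
            print "fetch failed\n";
            next;
        }
        print $html =~ m/whatever/ ? "Wohoo!\n" : "Bugger\n";
    }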

    cheers

    tachyon