Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'd like to scrape the titles of new electronic patents, for example from: http://www.uspto.gov/web/patents/patog/week35/OG/ElecUtilityBody.htm I could use HTML::LinkExtor to get all the links, but I need to go to each one and then read the first two lines which always have the same format. The problem I'm having is telling the program to follow each of the links extracted...

Replies are listed 'Best First'.
Re: new patent crawler
by johnnywang (Priest) on Sep 06, 2004 at 19:20 UTC
Re: new patent crawler
by Zaxo (Archbishop) on Sep 06, 2004 at 15:48 UTC

    Please show us what you've tried, and what errors you get.

    After Compline,
    Zaxo

Re: new patent crawler
by Anonymous Monk on Sep 06, 2004 at 19:00 UTC
    #!/usr/bin/perl -w
    # xurl - extract unique, sorted list of links from URL
    use HTML::LinkExtor;
    use LWP::Simple;

    $base_url = shift;
    $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse(get($base_url))->eof;
    @links = $parser->links;
    foreach $linkarray (@links) {
        my @element  = @$linkarray;
        my $elt_type = shift @element;
        while (@element) {
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            $seen{$attr_value}++;
        }
    }
    for (sort keys %seen) { print $_, "\n" }
    //The problem so far is that it just prints the links to the screen. I think they're stored in an array. How can I access each one individually so I can do a get on each one? Thanks.

    Edit by castaway - code tags

      Did you try anything? You are looping over the links and printing them! They are stored as the keys of a hash, which was done to remove duplicates. If you want an array you could just do @links = keys %seen;
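
The hash-keys point can be tried in isolation. This is only a minimal sketch: the URLs in @raw_links are made-up sample data standing in for whatever HTML::LinkExtor would return, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data (not from the USPTO page); note the duplicate.
my @raw_links = (
    'http://example.com/b.htm',
    'http://example.com/a.htm',
    'http://example.com/b.htm',
);

# Using each URL as a hash key collapses duplicates;
# the value ends up as the number of times it was seen.
my %seen;
$seen{$_}++ for @raw_links;

# keys() hands the unique URLs back as an ordinary list.
my @links = sort keys %seen;

print "$links[0]\n";        # one element, by index
print "$_\n" for @links;    # or every element in turn
```

Once the links are in @links, each one is an ordinary scalar that can be passed to get() or anything else.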

      But as you are *already* looping over them, did it cross your mind to get them as well? Here I am assigning to $link instead of using the default assignment to $_ for clarity of code.....

      for my $link (sort keys %seen) {
          print "Getting $link.....";
          my $html = get($link);    # get() returns undef on failure
          if ( defined $html && $html =~ m/whatever/ ) {
              print "Wohoo!\n";
          }
          else {
              print "Bugger\n";
          }
      }

      cheers

      tachyon
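
Putting the pieces of the thread together, following each link and reading its first two lines might look roughly like the sketch below. It rests on several assumptions: the helper names unique_links and first_lines are made up for illustration, "first two lines" is taken to mean the first two non-blank lines of text once tags are removed, and the tag stripping is a crude regex rather than a real HTML parse. The actual layout of the patent pages is not known here, so the line extraction would likely need adjusting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::Simple qw(get);

# Return the unique, sorted, absolute <a href> URLs found in $html.
# (Hypothetical helper; the name is not from the original post.)
sub unique_links {
    my ($html, $base_url) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse($html);
    $parser->eof;

    my %seen;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;          # [ tag, attr => url, ... ]
        next unless $tag eq 'a' && defined $attr{href};
        $seen{ "$attr{href}" }++;           # stringify the URL for use as a key
    }
    my @sorted = sort keys %seen;
    return @sorted;
}

# Return the first $count non-blank text lines of $html.
# Assumption: a crude regex tag strip is good enough for these pages.
sub first_lines {
    my ($html, $count) = @_;
    my $text = $html;
    $text =~ s/<(?:br|\/p|\/div|\/h\d|\/tr|\/title)\b[^>]*>/\n/gi;  # line-ending tags
    $text =~ s/<[^>]*>//g;                                           # drop all other tags
    my @lines = grep { /\S/ } split /\r?\n/, $text;
    s/^\s+|\s+$//g for @lines;              # trim surrounding whitespace
    $#lines = $count - 1 if @lines > $count;
    return @lines;
}

# Only crawl when a start URL is given on the command line.
if (@ARGV) {
    my $base_url = shift @ARGV;
    my $index    = get($base_url);
    die "Could not fetch $base_url\n" unless defined $index;

    for my $link (unique_links($index, $base_url)) {
        my $page = get($link);
        unless (defined $page) {
            warn "Skipping $link: fetch failed\n";
            next;
        }
        print "$link\n";
        print "    $_\n" for first_lines($page, 2);
    }
}
```

Saved under a name of your choosing, it would be run with the index page URL as its only argument. Passing the base URL to HTML::LinkExtor is what turns relative links into absolute ones that get() can fetch.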