(I didn't know if this was more of a Craft thang or a Cool Uses thang, so I posted my whole spiel here and just the source in Craft.)

Last week my boss approached me with a side project (I'm in Customer Service for a commercial website-- when we aren't answering e-mail, we do little content-based odd jobs for the site): she wanted me to count every occurrence of phrases A and B on our website (things like "shipping address" vs. "ship to address"). She suggested I browse the site manually to do this. I said that was ridiculous, that it was a job for a machine, not a human being. She said yeah, whatever, get it done by May 5. (There was another similar job, finding uses of C vs. D, which I took on as well, on account she was going to lay that herculean task on one of my less technophilic co-workers, who would have actually counted the occurrences by hand in order to get it done.)

So, like any lazy man, I wrote a script using LWP to do this for me. When all was said and done, it took the script roughly an hour to spyder our site (and that was on a slow day, even: I was spydering on the first day ILOVEYOU hit, so everything was grinding along like a three-legged dog under the strain of all that dumbass mail). The output, with one result ("phrase found, URL") per line, ran to 41 pages as a 10-point Times Word document. Even with a 1/8th-inch fat-stack of text-packed paper, I couldn't make it clear to my boss that this would have been a maddeningly un-fun job to do manually. C'est la vie, I guess.

At any rate, I'm insanely proud of the fact that I avoided wasting a couple of days doing something dumb that I would have hated, and instead spent a few hours doing something smart that I love-- with the added bonus that those couple of hours produced a cute little prog that did the dumb task faster and better than I could, and saved me from ever having to do any similar dumb thing again. I feel like a king with a cubicle as his throne-room.

#!/usr/local/bin/perl
use LWP::Simple;

$page = "http://www.COMPANY_HOMEPAGE.com";
&get_urls;  ## fetches and parses the starting page

foreach $url (@urls) {
    ## URLs are stored with '?' swapped to 'Q' so a literal question mark
    ## can't act as a regex metacharacter when we match against the
    ## space-joined list of pages we've already visited.
    $visit = join(' ', @visit);
    $visit =~ tr/\?/Q/;
    if ($visit !~ /($url)/i) {
        open (OUT,   ">>LOG.borders");
        open (VISIT, ">>LOG.visited.borders");
        open (LOG,   ">>LOG.urls.borders");
        $url =~ tr/Q/\?/;          ## swap the '?' back before fetching
        push(@visit, $url);
        print VISIT "$url\n";
        $page = $url;
        $print = get "$url";
        print "Getting $url...\n";
        &get_urls;
        foreach $pattern ("THING A", "THING B", "THING C", "THING D") {
            if ($print =~ /($pattern)/i) {
                print OUT "$1, $url\n";
            }
        }
        close (LOG);
        close (VISIT);
        close (OUT);
    }
}
print "\nDone!!!\n";

sub get_urls {
    ## find all links within the current page
    $doc = get "$page";
    @doc = split(/\s/, $doc);
    foreach $a (@doc) {
        if ($a =~ /href="(http:\/\/[^"]+)">/i) {
            # I needed the script to skip certain URLs (to avoid
            # unproductive spydering, among other things). The following
            # hunklet of code keeps an eye out for these.
            if ($1 !~ /BadThing1|BadThing2|BadThing3|#/i) {
                $foo = join(' ', @urls);
                $moo = "$1";
                $moo =~ tr/\?/Q/;
                $foo =~ tr/\?/Q/;
                if ($foo !~ /($moo)/i) {   ## only queue URLs we haven't seen
                    push(@urls, $moo);
                    print LOG "$moo\n";
                }
            }
        }
    }
}
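
To turn the log into the actual tallies, a quick pass over LOG.borders does the trick-- roughly something like this (just a sketch, assuming the "phrase, URL" lines the script writes to OUT):

    perl -ne '($phrase) = split /, /; $count{lc $phrase}++;
              END { print "$_: $count{$_}\n" for sort keys %count }' LOG.borders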

RE: My Little Time-Saving Spyder
by Anonymous Monk on May 11, 2000 at 16:04 UTC
    This sounds very nice. I agree with you-- I just can't believe that some people would consider doing such a task manually. Doing it manually is, of course, highly error-prone, not to mention tedious.

    Another example is from a site I was working on. It required JPEG files to be named like days of the year, e.g. 010199.jpg, 010299.jpg, and so on. I had to do this for several years' worth of files, and naturally I didn't really fancy doing it manually. Out comes a Perl script and bingo! It's done!
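
    A rough sketch of how that kind of renaming might go (the glob pattern, starting date, and filename format here are illustrative guesses, not the script I actually used):

        #!/usr/local/bin/perl
        # Rename a sorted batch of JPEGs to sequential MMDDYY.jpg names.
        use strict;
        use POSIX qw(strftime mktime);

        my $t = mktime(0, 0, 12, 1, 0, 99);          # noon on 01 Jan 1999
        foreach my $file (sort glob("*.jpg")) {
            my $new = strftime("%m%d%y.jpg", localtime($t));
            rename($file, $new) or warn "Could not rename $file: $!\n";
            $t += 24 * 60 * 60;                      # step to the next day
        }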

RE: My Little Time-Saving Spyder
by buzzcutbuddha (Chaplain) on Jun 06, 2000 at 18:47 UTC
    In a word: sweet!
      NB: I wrote a chapter on Perl web clients for "Professional Perl Development" by Wrox Press.

      You could speed up your spidering using:
      1. Threading
      2. fork()
      The first is neater; the second is easier, but you have to be careful you don't forkbomb your machine (see the rough fork() sketch below).
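
      Something along these lines for the fork() route (just a sketch-- the URL list and child cap are made up, and your fetch-and-scan logic goes where the get() call is):

          #!/usr/local/bin/perl
          # Fetch a batch of URLs in parallel, capping the number of
          # children so the machine doesn't get forkbombed.
          use strict;
          use LWP::Simple;

          my @urls     = @ARGV;      # URLs to fetch
          my $max_kids = 5;          # how many fetches to run at once
          my $kids     = 0;

          foreach my $url (@urls) {
              if ($kids >= $max_kids) {   # at the limit: wait for one child
                  wait();
                  $kids--;
              }
              my $pid = fork();
              die "fork failed: $!\n" unless defined $pid;
              if ($pid == 0) {            # child: do one fetch, then exit
                  my $page = get($url);
                  print "fetched $url (", length($page || ''), " bytes)\n";
                  exit 0;
              }
              $kids++;                    # parent: move on to the next URL
          }
          1 while wait() != -1;           # reap any remaining children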

      As for being careful about which URLs you spider, there's WWW::RobotRules to help you parse and obey robots.txt files.
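
      For instance (a minimal sketch of how it might slot into the spider above-- the agent name is made up):

          use LWP::Simple qw(get);
          use WWW::RobotRules;

          my $rules = WWW::RobotRules->new('MyLittleSpyder/1.0');

          # fetch and parse the site's robots.txt before spidering
          my $robots_url = 'http://www.COMPANY_HOMEPAGE.com/robots.txt';
          my $robots_txt = get($robots_url);
          $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

          # then, inside the spidering loop:
          # next unless $rules->allowed($url);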

      If you don't fancy buying the book, you can download the examples I wrote from here.

      --
      jodrell.uk.net