Last week my boss approached me with a side project (I'm in Customer Service for a commercial website -- when we aren't answering e-mail, we do little content-based odd jobs for the site): she wanted me to count every occurrence of phrases A and B on our website (things like "shipping address" vs. "ship to address"). She suggested I browse the site manually to do this. I said that was ridiculous -- that it was a job for a machine, not a human being. She said yeah, whatever, get it done by May 5. (There was another similar job, finding uses of C vs. D, which I took on as well, because she was otherwise going to lay that herculean task on one of my less technophilic co-workers, who would actually have counted the occurrences by hand to get it done.)
So, like any lazy man, I wrote a script using LWP to do this for me. When all was said and done, it took the script roughly an hour to spyder our site (and that was on a slow day, even -- I was spydering on the first day ILOVEYOU hit, so everything was grinding along like a three-legged dog under the strain of all that dumbass mail). The output, with one result per line ("phrase found, URL"), ran to 41 pages as a 10-point Times Word document. Even with that eighth-inch-thick stack of text-packed paper in hand, I couldn't make it clear to my boss that this would have been a maddeningly un-fun job to do manually. C'est la vie, I guess.
At any rate, I'm insanely proud of the fact that I avoided wasting a couple of days doing something dumb that I would have hated, and instead spent a few hours doing something smart that I love -- with the added bonus that those few hours produced a cute little prog that did the dumb task faster and better than I could, and saved me from ever having to do any such dumb thing again. I feel like a king with a cubicle for a throne-room.
    #!/usr/local/bin/perl
    use LWP::Simple;

    $page = "http://www.COMPANY_HOMEPAGE.com";
    &get_urls;   ## fetches and parses pages

    foreach $url (@urls) {
        $visit = join(' ', @visit);
        $visit =~ tr/\?/Q/;   ## '?' is a regex metacharacter, so it's
                              ## swapped for 'Q' before matching below
        if ($visit !~ /($url)/i) {
            open(OUT,   ">>LOG.borders");
            open(VISIT, ">>LOG.visited.borders");
            open(LOG,   ">>LOG.urls.borders");
            $url =~ tr/Q/\?/;
            push(@visit, $url);
            print VISIT "$url\n";
            $page  = $url;
            $print = get "$url";
            print "Getting $url...\n";
            &get_urls;
            foreach $pattern ("THING A", "THING B", "THING C", "THING D") {
                if ($print =~ /($pattern)/i) {
                    print OUT "$1, $url\n";
                }
            }
            close(LOG);
            close(VISIT);
            close(OUT);
        }
    }
    print "\nDone!!!\n";

    sub get_urls {   ## find all links within page
        $doc = get "$page";
        @doc = split(/\s/, $doc);
        foreach $a (@doc) {
            if ($a =~ /href="(http:\/\/[^"]+)">/i) {
                ## I needed the script to skip certain URLs
                ## (to avoid unproductive spydering, among
                ## other things.) The following hunklet of
                ## code keeps an eye out for these.
                if ($1 !~ /BadThing1|BadThing2|BadThing3|#/i) {
                    $foo = join(' ', @urls);
                    $moo = "$1";
                    $moo =~ tr/\?/Q/;
                    $foo =~ tr/\?/Q/;
                    if ($foo !~ /($1)/i) {
                        push(@urls, $moo);
                        print LOG "$moo\n";
                    }
                }
            }
        }
    }
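For comparison, here's a minimal sketch of the same idea in stricter, more modern Perl. This is my own illustration, not the script I actually ran: it assumes the same placeholder URL, phrases, and BadThing patterns as above, and swaps the tr/\?/Q/ join-and-match trick for a %visited hash, since hash lookups sidestep regex metacharacters entirely.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Placeholders, same as the script above.
    my @queue   = ('http://www.COMPANY_HOMEPAGE.com');
    my @phrases = ('THING A', 'THING B', 'THING C', 'THING D');
    my %visited;

    while (my $url = shift @queue) {
        next if $visited{$url}++;          # skip pages we've already seen
        print "Getting $url...\n";
        my $doc = get($url) or next;       # get() returns undef on failure

        # Report each phrase found on this page, one "phrase, URL" per line.
        for my $phrase (@phrases) {
            print "$phrase, $url\n" if $doc =~ /\Q$phrase\E/i;
        }

        # Queue unseen links, skipping the unproductive ones.
        while ($doc =~ /href="(http:\/\/[^"]+)"/gi) {
            my $link = $1;
            next if $link =~ /BadThing1|BadThing2|BadThing3|#/i;
            push @queue, $link unless $visited{$link};
        }
    }
    print "\nDone!!!\n";

The iterative queue also avoids the recursion into &get_urls that the original does, which keeps memory flat on a deep site.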