walkingthecow has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys, I have written a script that gets results from searches on craigslist, and then text messages me if there are any new results from the search. I run the script for two hours, kill it, start it back up again 20 minutes later. I want it to save the results from the first run and not notify me on the 2nd run of the results that were in the first run. I am currently writing results to a file if the result does not already exist in file, and I am just wondering if there is a better way to keep track of the results? If there is a new result, I am text messaged that there is a new result. Look below for code. Any critique is greatly appreciated!! Thank you fellow monks!

NOTE: I have replaced the phone number with x.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TokeParser;
use MIME::Lite;

while (1) {
    my $msg;
    my @search = ("broken hdtv","fix hdtv","does not work hdtv");
    my $connection = WWW::Mechanize->new();
    foreach my $term (@search) {
        $connection->get("http://portland.craigslist.org/");
        $connection->form_number(1);
        $connection->field("query",$term);
        $connection->click();
        my @link = $connection->find_all_links(url_regex => qr/(ele|zip)\/[0-9]+/i);
        foreach my $link (@link) {
            my $tempLink=$link->url_abs;
            chomp $tempLink;
            my $check=`grep $tempLink craigslist`;
            if ($check eq "") {
                open OUTFILE,">>craigslist" or die $!;
                print OUTFILE $tempLink . "\n";
                $msg = MIME::Lite->new(
                    To      =>'xxxxxxxxxx@vtext.com',
                    Subject =>'CL Link',
                    Type    =>'text/plain; charset="iso-8859-1"',
                    Data    =>"There is a new link on craigslist."
                );
                close OUTFILE;
            }
        }
    }
    if ( $msg ) {
        $msg->send;
    }
    sleep(300);
}


Re: Is there a better way to do this? I need to keep track of results...
by tilly (Archbishop) on Feb 20, 2009 at 05:26 UTC
    This is a perfect use case for a dbm like DB_File, DBM::Deep or even SDBM_File (that is the worst of the lot, but comes with Perl). Then you can have a hash that is transparently mirrored in a file, and accessing it will be much more efficient than your solution.

    That would look something like this (untested):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use HTML::TokeParser;
    use MIME::Lite;
    use Fcntl;
    use SDBM_File;

    # %url_seen is mirrored on disk, so it survives restarts of the script.
    tie my %url_seen, 'SDBM_File', 'urls.dbm', O_RDWR|O_CREAT, 0666
        or die "Cannot tie urls.dbm: $!";

    while (1) {
        my $msg;
        my @search = ("broken hdtv","fix hdtv","does not work hdtv");
        my $connection = WWW::Mechanize->new();
        foreach my $term (@search) {
            $connection->get("http://portland.craigslist.org/");
            $connection->form_number(1);
            $connection->field("query",$term);
            $connection->click();
            my @link = $connection->find_all_links(url_regex => qr/(ele|zip)\/[0-9]+/i);
            foreach my $link (@link) {
                my $tempLink=$link->url_abs;
                chomp $tempLink;
                # Only notify for URLs not recorded in an earlier pass or run.
                if (not $url_seen{$tempLink}) {
                    $url_seen{$tempLink} = 1;
                    $msg = MIME::Lite->new(
                        To      =>'xxxxxxxxxx@vtext.com',
                        Subject =>'CL Link',
                        Type    =>'text/plain; charset="iso-8859-1"',
                        Data    =>"There is a new link on craigslist."
                    );
                }
            }
        }
        if ( $msg ) {
            $msg->send;
        }
        sleep(300);
    }

Re: Is there a better way to do this? I need to keep track of results...
by CountZero (Bishop) on Feb 20, 2009 at 06:20 UTC
    On the search results page, it seems that the ads are sorted by date (newest first). If you save in a file the date and time of the latest message found in your previous check, it is easy to follow each link and check the date and time of each message. Once you hit a message equal to or older than your saved date and time you do not have to look any further.

    Alternatively, it looks as if you can also check the link itself. That seems to contain a very big sequence number (http://portland.craigslist.org/mlt/rid/1042365384.html) that is larger for newer messages. So if you keep the sequence number of your latest message found, you can probably check on this number. That would save you from having to follow the link to find the date and time and from doing some date-time conversions / parsing.
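
    The sequence-number idea could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names (posting_id, load_last_id, save_last_id, new_postings) and the state file name are made up, and it relies on the unverified observation above that newer postings carry larger numbers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the numeric posting ID out of a listing URL
# that ends in <digits>.html. Returns undef if the URL has no such ID.
sub posting_id {
    my ($url) = @_;
    return $url =~ m{/(\d+)\.html$} ? $1 : undef;
}

# Hypothetical helper: read the saved high-water mark (0 if none yet).
sub load_last_id {
    my ($file) = @_;
    open my $fh, '<', $file or return 0;
    my $id = <$fh>;
    close $fh;
    return 0 unless defined $id;
    chomp $id;
    return $id =~ /^\d+$/ ? $id : 0;
}

# Hypothetical helper: overwrite the state file with the new high-water mark.
sub save_last_id {
    my ($file, $id) = @_;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} "$id\n";
    close $fh or die "Cannot close $file: $!";
}

# Return the new high-water mark followed by every URL whose ID exceeds
# $last_id. Assumes (unverified) that IDs grow with posting time.
sub new_postings {
    my ($last_id, @urls) = @_;
    my $max = $last_id;
    my @new;
    for my $url (@urls) {
        my $id = posting_id($url);
        next unless defined $id && $id > $last_id;
        push @new, $url;
        $max = $id if $id > $max;
    }
    return ($max, @new);
}

# In-memory demo; the second URL is invented for illustration.
my ($demo_max, @demo_new) = new_postings(
    1042365384,
    'http://portland.craigslist.org/mlt/rid/1042365384.html',
    'http://portland.craigslist.org/ele/1042365999.html',
);
print "new: $_\n" for @demo_new;
```

    In the polling script, load_last_id('last_id.txt') would be called once per pass, the URLs from find_all_links fed to new_postings, and save_last_id called whenever the mark moves. Only one number is stored, but anything posted out of sequence would be missed, which the dbm approach avoids.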

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Is there a better way to do this? I need to keep track of results...
by dsheroh (Monsignor) on Feb 20, 2009 at 06:39 UTC
    I'll second CountZero's suggestion of just recording the last check time instead of storing all seen messages... if you're going to do this yourself.

    You may want to take a look at Google Alerts first to see whether that does what you want. Your description of what you're doing sounds pretty close to what you'd get by setting an alert on 'site:portland.craigslist.org <search terms>'.

Re: Is there a better way to do this? I need to keep track of results...
by leocharre (Priest) on Feb 20, 2009 at 14:16 UTC

    Why not store the URL as the identifier? For example: http://washingtondc.craigslist.org/mld/sks/1040593349.html

    Or.. you can wget the message and md5sum the content. That way you save md5sum, url, and timestamp of when you got it. The rest is trash.

    You could do this with a simple hash. Store it with YAML.. or even better: File::Cache !

    my %hits = (
       $md5sum => { url => $url, time => time() },
    );
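
    A fuller sketch of that idea, with some swaps that should be read as assumptions: it uses the core Storable module for persistence instead of YAML or File::Cache, Digest::MD5 in place of the md5sum command, and made-up helper names (load_hits, save_hits, record_hit). The page content is passed in as a string; fetching it is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);
use Storable qw(store retrieve);

# Hypothetical helper: load the saved hash of hits, or start with an
# empty one if the file does not exist yet.
sub load_hits {
    my ($file) = @_;
    return {} unless -e $file;
    return retrieve($file);
}

# Hypothetical helper: write the hash of hits back to disk.
sub save_hits {
    my ($hits, $file) = @_;
    store($hits, $file) or die "Cannot store hits in $file: $!";
}

# Record a posting keyed by the MD5 of its content.
# Returns 1 if the content was new, 0 if it had been seen before.
sub record_hit {
    my ($hits, $url, $content) = @_;
    # md5_hex wants bytes, so encode possibly-decoded text first.
    my $md5sum = md5_hex(encode_utf8($content));
    return 0 if exists $hits->{$md5sum};
    $hits->{$md5sum} = { url => $url, time => time() };
    return 1;
}

# In-memory demo; a real run would start from load_hits('hits.sto')
# and finish with save_hits($hits, 'hits.sto').
my $demo_hits = {};
my $demo_url  = 'http://washingtondc.craigslist.org/mld/sks/1040593349.html';
for my $content ('Broken HDTV, free to a good home', 'Broken HDTV, free to a good home') {
    my $status = record_hit($demo_hits, $demo_url, $content) ? 'new' : 'seen';
    print "$status: $demo_url\n";
}
```

    Keying on the content hash means a reposted ad with a new URL counts as already seen, while an edited ad counts as new; keying on the URL gives the opposite behaviour.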