DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I have considered writing a news retrieval application for a few weeks. Essentially, I envision it as having the following capabilities:


I am considering using WWW::Mechanize, but having never used it, I would greatly appreciate advice from those who have, or any feedback from those who might suggest an alternate course of action. Included below is my fledgling foray into this vast arena.

    use warnings;
    use strict;
    use LWP::UserAgent;

    my $agent = LWP::UserAgent->new();
    my $site = 'http://www.perl.com';
    my $response = $agent->get( $site );
    my $content = $response->content();

    if( $content =~ m/Chromosome/i ) {
        open( FH, ">>news.html" ) || die "Error : $!\n";
        print FH $content;
        close( FH ) || die "Error : $!\n";
    }
    else {
        print "Nothing!\n";
    }


Thanks,
-Katie.

Replies are listed 'Best First'.
Re: Writing a news retrieval application.
by PodMaster (Abbot) on Oct 27, 2003 at 07:03 UTC
    Getting started with WWW::Mechanize is easy (even if WWW::Mechanize::Shell is slightly behind the times)
    C:\new\WWW-Mechanize-Shell-0.29>perl -MWWW::Mechanize::Shell -e shell
    Module File::Modified not found. Automatic reloading disabled.
    >get http://perl.com/
    Retrieving http://perl.com/(200)
    http://perl.com/>open /Chromosome/
    83: A Chromosome at a Time with Perl, Part 2
    99: A Chromosome at a Time with Perl, Part 1
    http://perl.com/>open 83
    (200)
    http://www.perl.com/pub/a/2003/10/15/bioinformatics.html>content bioinformatics2.html
    http://www.perl.com/pub/a/2003/10/15/bioinformatics.html>back
    http://perl.com/>open 99
    (200)
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>content bioinformatics1.html
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>script bionformatics.pl
    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html>q

    C:\new\WWW-Mechanize-Shell-0.29>dir bio*
     Directory of C:\new\WWW-Mechanize-Shell-0.29

    10/26/2003  11:05p              38,616 bioinformatics.html
    10/26/2003  11:08p              38,617 bioinformatics1.html
    10/26/2003  11:08p              30,658 bioinformatics2.html
    10/26/2003  11:09p                 721 bionformatics.pl

    C:\new\WWW-Mechanize-Shell-0.29>cat bionformatics.pl
    #!C:\Perl\bin\perl.exe -w
    use strict;
    use WWW::Mechanize;
    use WWW::Mechanize::FormFiller;
    use URI::URL;

    my $agent = WWW::Mechanize->new();
    my $formfiller = WWW::Mechanize::FormFiller->new();
    $agent->env_proxy();

    $agent->get('http://perl.com/');
    $agent->form(1) if $agent->forms and scalar @{$agent->forms};
    $agent->follow('83');
    {
      my $filename = q{bioinformatics2.html};
      local *F;
      open F, "> $filename" or die "$filename: $!";
      binmode F;
      print F $agent->content,"\n";
      close F
    };
    $agent->back();
    $agent->follow('99');
    {
      my $filename = q{bioinformatics1.html};
      local *F;
      open F, "> $filename" or die "$filename: $!";
      binmode F;
      print F $agent->content,"\n";
      close F
    };

    C:\new\WWW-Mechanize-Shell-0.29>
    This is like the first time I've used these (I have before, but since I don't remember anything, it's like I haven't).
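
    The generated script picks links by position (83 and 99), which will change whenever the front page does. As a rough sketch only, the same job can be written against WWW::Mechanize directly, selecting links by their text instead. The sub names `matching_links` and `save_matching_pages` and the `news` file prefix are made up for this illustration; `find_all_links`, `url_abs` and `save_content` are the WWW::Mechanize calls assumed here.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # Return every link on the current page whose text matches $pattern.
    sub matching_links {
        my ($mech, $pattern) = @_;
        return $mech->find_all_links( text_regex => $pattern );
    }

    # Fetch $start_url, visit each matching link, and save each page as
    # "<prefix>1.html", "<prefix>2.html", ... Returns the files written.
    sub save_matching_pages {
        my ($start_url, $pattern, $prefix) = @_;

        # autocheck makes failed requests die instead of failing silently.
        my $mech = WWW::Mechanize->new( autocheck => 1 );
        $mech->get($start_url);

        # The links are collected once, before leaving the start page,
        # so there is no need to navigate back between downloads.
        my @saved;
        my $n = 0;
        for my $link ( matching_links($mech, $pattern) ) {
            my $file = $prefix . ++$n . '.html';
            $mech->get( $link->url_abs );
            $mech->save_content($file);
            push @saved, $file;
        }
        return @saved;
    }

    # Only touches the network when called with a URL and a search word,
    # e.g.:  perl fetch_news.pl http://www.perl.com/ Chromosome
    if (@ARGV == 2) {
        my ($url, $word) = @ARGV;
        print "saved $_\n"
            for save_matching_pages($url, qr/\Q$word\E/i, 'news');
    }
    ```

    Matching on link text rather than link number means the script keeps working when unrelated links are added to or removed from the page.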


Re: Writing a news retrieval application.
by pg (Canon) on Oct 27, 2003 at 06:54 UTC

    You can simply use the search facility of those news sites. For example, if you want to search for news about the sniper case through Yahoo, you can simply send an HTTP request for the URL "http://search.news.yahoo.com/search/news/?c=&p=sniper". Don't do it by brute force on your side.
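
    A minimal sketch of that approach, assuming the URI and LWP::UserAgent modules are installed. The sub names `news_search_url` and `fetch_search_results` are invented for this example, and the Yahoo address and its `c` and `p` parameters are taken as quoted above; they may well have changed since.

    ```perl
    use strict;
    use warnings;
    use URI;
    use LWP::UserAgent;

    # Build a search URL from a base address and an ordered list of
    # query parameters. URI takes care of escaping the values.
    sub news_search_url {
        my ($base, @params) = @_;
        my $uri = URI->new($base);
        $uri->query_form(@params);
        return $uri->as_string;
    }

    # Fetch the results page and return its body, or die on failure.
    sub fetch_search_results {
        my ($url) = @_;
        my $response = LWP::UserAgent->new->get($url);
        die "Request failed: ", $response->status_line, "\n"
            unless $response->is_success;
        return $response->decoded_content;
    }

    # Base address and parameter names as quoted in the post above.
    my $url = news_search_url(
        'http://search.news.yahoo.com/search/news/',
        c => '',
        p => 'sniper',
    );
    print "$url\n";

    # Only touches the network when run with --fetch.
    print fetch_search_results($url) if @ARGV && $ARGV[0] eq '--fetch';
    ```

    Letting the site run the query means one request per search term instead of downloading and grepping whole pages.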

Re: Writing a news retrieval application.
by Art_XIV (Hermit) on Oct 27, 2003 at 13:53 UTC

    LWP or WWW::Mechanize should work just fine for your scraping needs, but here's a hint -

    Let one of the HTML:: modules do your parsing for you. You'll be glad you did after your scraped site(s) go through a few layout changes.
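
    As one possible illustration, HTML::TokeParser (part of the HTML::Parser distribution) can pull out the links whose text matches a pattern, rather than running a regex over the raw page. The sub name `headlines_matching` and the sample markup are made up for this sketch.

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    # Return a list of { url => ..., title => ... } hashes, one for each
    # <a href="..."> element whose link text matches $pattern.
    sub headlines_matching {
        my ($html, $pattern) = @_;
        my $parser = HTML::TokeParser->new(\$html)
            or die "Cannot create parser\n";

        my @found;
        while ( my $tag = $parser->get_tag('a') ) {
            my $href = $tag->[1]{href};      # attribute hash of the <a> tag
            next unless defined $href;
            my $text = $parser->get_trimmed_text('/a');
            push @found, { url => $href, title => $text }
                if $text =~ $pattern;
        }
        return @found;
    }

    # Invented sample markup so the example runs without a network.
    my $sample =
          '<ul>'
        . '<li><a href="/pub/a/2003/10/15/bioinformatics.html">'
        . 'A Chromosome at a Time with Perl, Part 2</a></li>'
        . '<li><a href="/other.html">Something else</a></li>'
        . '</ul>';

    printf "%s => %s\n", $_->{title}, $_->{url}
        for headlines_matching($sample, qr/chromosome/i);
    ```

    Because the parser works on tags rather than on the page's exact text layout, changes in whitespace, attribute order or surrounding markup are less likely to break the match.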

Re: Writing a news retrieval application.
by chromatic (Archbishop) on Oct 28, 2003 at 01:17 UTC

    Perhaps searching RSS feeds would be simpler; they're often provided for similar purposes.
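
    A rough sketch of the RSS route, assuming the XML::RSS module from CPAN. The sub name `rss_items_matching` and the sample feed are invented for illustration; a real feed would first be downloaded (for instance with LWP::UserAgent) and its content passed in as `$xml`.

    ```perl
    use strict;
    use warnings;
    use XML::RSS;

    # Return the items of an RSS document whose title matches $pattern.
    sub rss_items_matching {
        my ($xml, $pattern) = @_;
        my $rss = XML::RSS->new;
        $rss->parse($xml);                   # dies on malformed XML
        return grep { defined $_->{title} && $_->{title} =~ $pattern }
               @{ $rss->{items} };
    }

    # Made-up two-item feed so the example runs offline.
    my $sample = join '',
        '<rss version="2.0"><channel>',
        '<title>Example feed</title><link>http://example.com/</link>',
        '<description>Sample</description>',
        '<item><title>A Chromosome at a Time with Perl, Part 2</title>',
        '<link>http://example.com/part2</link></item>',
        '<item><title>Unrelated story</title>',
        '<link>http://example.com/other</link></item>',
        '</channel></rss>';

    print "$_->{title}\n  $_->{link}\n"
        for rss_items_matching($sample, qr/chromosome/i);
    ```

    A feed already separates each story into a title and a link, so no HTML scraping is needed at all.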