I like to read news websites on my Palm.
I love sites like the BBC and the Guardian and the Onion which have PDA-friendly versions.
And when sites do RSS properly, that's also good. But some sites don't have RSS feeds, some are very stingy about what they put in their RSS, and of course some only link to a version of the page full of ads, navigation and images which isn't suitable for a PDA. So I've written a variety of scraping scripts to grab useful content from websites to my hard drive (normally in the background, using cron), then sync it to the Palm for use in a browser like the excellent Plucker.
So, I finally bit the bullet and wrote a scraper class. What it does is quite simple. It scrapes a page of headlines and downloads a selected subset of the links found to local pages, plus of course an index to those pages.
The scraper works best when the website meets these conditions:

- the headlines page has an easily identifiable "chunk" of HTML (say, a div with a known id) which contains the "good" links;
- the site has print-friendly versions of its stories;
- the URL of the print-friendly version can be arrived at by running a regex over the regular URL.
But as those conditions aren't always true:

- if the chunk can't be picked out by tag and attribute, a regex can be used to grab it instead (chunk_regex rather than chunk_spec; see the example below);
- if there are no print-friendly pages, the URL rule can simply be left out and the regular pages are fetched as-is.
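To make the first fallback concrete, here is what a rules module might look like with both options side by side. This is only a sketch: the site name, URL, div id and HTML comments are all invented for the example, but the chunk_spec and chunk_regex keys are the ones the scraper actually reads.

package PDAScraper::ExampleChunk;

# Hypothetical rules module showing the two ways of pointing
# the scraper at the block of headline links.

sub config {
    return {
        name       => 'Example Chunk Site',
        start_from => 'http://www.example.org/headlines/',

        # The Good Way: let HTML::TreeBuilder find the element.
        chunk_spec => [ "_tag", "div", "id", "topstories" ],

        # The fallback, if no tag/attribute pins the block down:
        # capture the same stretch of HTML with a regex as $1.
        # chunk_regex => qr{<!-- start stories -->(.*?)<!-- end stories -->}s,
    };
}

1;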
(There are still some sites whose print-friendly URLs can't be arrived at with a regex, because there is no discernible relationship between the regular and the print URLs. On the to-do list is "add an arbitrary number of further regexes to clean up the content" so that, if we have to, the non-PDA-friendly version can be made friendly by brute force.)
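For what it's worth, here is roughly how that to-do item might look. Nothing below exists in the module yet: the content_regexes key and the little demo around it are just a sketch of one possible approach.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical 'content_regexes' key for a rules module: an
# ordered list of [ pattern, replacement ] pairs that scrape()
# would apply to each fetched page before saving it.
my $content_regexes = [
    [ qr{<script\b.*?</script>}si, '' ],    # strip scripts
    [ qr{<img[^>]*>}i,             '' ],    # strip images
];

my $page = q{<p>Story text<img src="big.jpg"></p><script>ads();</script>};

# The loop scrape() might run just after fetching $page:
for my $pair ( @{$content_regexes} ) {
    my ( $find, $replace ) = @{$pair};
    $page =~ s/$find/$replace/g;
}

print $page, "\n";    # <p>Story text</p>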
And if there isn't any identifiable "chunk" of the headlines page which contains the "good" links? The final fallback is to grab all the links from the page and use HTML::Element to filter them by their attributes.
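Here is what such a rules module might look like. The site, URL and class value are made up, but link_spec itself is just extra criteria for HTML::Element's look_down(), which is what the scraper already passes it to.

package PDAScraper::ExampleLinks;

# Hypothetical rules module with no chunk_spec or chunk_regex:
# the whole page is parsed, and link_spec narrows the links to
# <a> tags whose class attribute is "headline", skipping
# intra-page "#" anchors.

sub config {
    return {
        name       => 'Example Links Site',
        start_from => 'http://www.example.com/news/',
        link_spec  => [
            'class', 'headline',
            sub { ( $_[0]->attr('href') || '' ) !~ /^#/ }
        ]
    };
}

1;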
PerlMonk jdporter convinced me that the best way to do it was to have a single module for each website which returns that site's rules. It's very modular that way. Hopefully, people could create their own rules modules and share them around.
I didn't code for that, but it wouldn't be hard to add.
Not at all. It's set up at the moment to just scrape to an arbitrary location on my HD at $ENV{'HOME'}/scrape/$foo/ where $foo is the (munged) name of the website in question.
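On that note, making the location configurable would be a small change. A minimal sketch, assuming an extra optional argument to the constructor (the package name and paths here are hypothetical; scrape() would then use $self->{'base_dir'} wherever it currently hard-codes "$ENV{'HOME'}/scrape"):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: a constructor that accepts an optional base
# directory, falling back to the current hard-coded default.
package PDAScraper::Configurable;    # hypothetical name

sub new {
    my ( $pkg, $rules, $base_dir ) = @_;
    bless {
        rules    => $rules,
        base_dir => defined $base_dir ? $base_dir : "$ENV{'HOME'}/scrape",
    }, $pkg;
}

package main;

my $scraper = PDAScraper::Configurable->new( 'PDAScraper::YahooTV',
                                             '/tmp/pdascrape' );
print $scraper->{'base_dir'}, "\n";    # /tmp/pdascrape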
I'm putting it up because I'm sick of staring at the damn thing, and because it does actually work, pretty much.
Please feel free to point out any and all problems. I'm far from a professional coder but don't worry, I know that.
#!/usr/bin/perl

package PDAScraper;

use strict;
use Exporter;
use vars qw($VERSION @ISA @EXPORT);

$VERSION = 0.00;
@ISA     = qw( Exporter );
@EXPORT  = qw( &scrape );

use URI::URL;
use HTML::TreeBuilder;
use HTML::Template;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
# $ua->proxy( ['http'], 'http://foo.bar:8080/' ); # If needed.

### Grab the template for the 'index.html' file. Stored
### in-module here, not necessarily the best way but it
### cuts down on external files.
my @html_array = <DATA>;

sub new {
    ### PerlMonk jdporter clued me in on how
    ### to Do This Properly. There are sub-modules
    ### which contain the rules for each website.
    ### Thanks J D.
    my ( $pkg, $rules ) = @_;
    bless { rules => $rules, }, $pkg;
}

sub scrape {
    my $self = shift;
    my $obj  = $self->{'rules'}->config();

    my $template    = undef;
    my @all_links   = ();
    my @good_links  = ();
    my $tree        = undef;
    my $chunk       = undef;
    my $file_number = 0;

    print "getting " . $obj->{'name'} . $/;

    ### get the front page which has the links
    my $response = $ua->get( $obj->{'start_from'} );
    unless ( $response->is_success() ) {
        warn "Failed to get starter page: $obj->{'start_from'}\n";
        return;
    }
    else {
        print "Got starter page: $obj->{'start_from'}\n";
    }
    my $page = $response->content();

    if ( $obj->{'chunk_spec'} ) {
        ### if we're parsing the HTML the Good Way using TreeBuilder
        $chunk = HTML::TreeBuilder->new_from_content( $page );
        $chunk->elementify();
        $chunk = $chunk->look_down( @{ $obj->{'chunk_spec'} } );
        unless ( defined $chunk ) {
            print "chunk_spec failed to match\n";
            return;
        }
    }
    elsif ( $obj->{'chunk_regex'} ) {
        ### if we're parsing the HTML the Bad Way using a regex
        $page =~ $obj->{'chunk_regex'};
        unless ( defined $1 ) {
            print "Regex failed to match\n";
            return;
        }
        $chunk = HTML::TreeBuilder->new_from_content( $1 );
        $chunk->elementify();
    }
    else {
        ### if we're just grabbing the whole page,
        ### probably not a good idea, but see
        ### link_spec below for a way of filtering links
        $chunk = HTML::TreeBuilder->new_from_content( $page );
        $chunk->elementify();
    }

    if ( defined( $obj->{'link_spec'} ) ) {
        ### If we've got a TreeBuilder link filter to grab only
        ### the links which match a certain format
        @all_links = $chunk->look_down( '_tag', 'a', @{ $obj->{'link_spec'} } );
    }
    else {
        @all_links = $chunk->look_down( '_tag', 'a' );
    }
    print "found " . scalar( @all_links ) . " links.\n";

    for ( @all_links ) {
        ### Avoid three problem conditions -- no text means
        ### we've probably got image links (often duplicates)
        ### -- "#" as the href means a JavaScript link --
        ### and tags with no HREF are also no use to us:
        next unless ( defined( $_->attr( 'href' ) )
            && $_->as_text() ne ''
            && $_->attr( 'href' ) ne '#' );

        my $href = $_->attr( 'href' );

        ### It's expected that we'll need to transform
        ### the URL from regular to print-friendly:
        if ( defined( $obj->{'url_regex'} )
            && ref( $obj->{'url_regex'}->[1] ) eq 'CODE' )
        {
            ### PerlMonk Roy Johnson is my saviour here.
            ### Solution to the problem of some url regexes
            ### needing backreferences and some not.
            $href =~ s{$obj->{'url_regex'}->[0]}
                      {${$obj->{'url_regex'}->[1]->()}}e;
        }
        elsif ( defined( $obj->{'url_regex'} ) ) {
            ### If there is a regex object at all:
            $href =~ s{$obj->{'url_regex'}->[0]}
                      {$obj->{'url_regex'}->[1]};
        }

        ### Transform the URL from relative to absolute:
        my $url     = URI::URL->new( $href, $obj->{'start_from'} );
        my $abs_url = $url->abs();

        ### Make a data structure with all the stuff we're
        ### going to get on the next pass:
        push( @good_links, { text => $_->as_text(), url => "$abs_url" } );
    }

    print "found " . scalar( @good_links ) . " 'good' links.\n";
    if ( scalar( @good_links ) == 0 ) {
        print "No 'good' links found.\n";
        return;
    }

    ### Make a foldername with no non-word chars
    ( my $foldername = $obj->{'name'} ) =~ s/\W//g;

    unless ( -e "$ENV{'HOME'}/scrape" ) {
        ### Make a scrape folder if there isn't one
        mkdir "$ENV{'HOME'}/scrape" or die "$!";
    }
    unless ( -e "$ENV{'HOME'}/scrape/$foldername" ) {
        ### Make a folder for this content if there isn't one
        mkdir "$ENV{'HOME'}/scrape/$foldername" or die "$!";
    }

    foreach ( @good_links ) {
        my $response = $ua->get( $_->{'url'} );
        unless ( $response->is_success() ) {
            warn "didn't get " . $_->{'url'} . "$!" . $/;
            return;
        }
        else {
            print "got " . $_->{'url'} . $/;
        }
        my $page = $response->content();

        ### TO DO: arbitrary number of further regexes
        ### in case users want to clean content up more?

        ### Filenames sprintf'd for neatness only:
        my $local_file = sprintf( "%03d.html", $file_number );

        ### add a localfile value to the AoH for use in the index:
        $_->{localfile} = $local_file;

        ### Print out the actual content page locally:
        open( PAGE, ">$ENV{'HOME'}/scrape/$foldername/$local_file" )
            || die "$!";
        print PAGE $page;
        close( PAGE );
        $file_number++;
    }

    ### [die_on_bad_params is off because the AoH contains
    ### one item we don't need, the original URL]
    $template = HTML::Template->new(
        arrayref          => \@html_array,
        debug             => 0,
        die_on_bad_params => 0
    );

    ### Use the name and the links array to fill out the template:
    $template->param(
        links    => \@good_links,
        sitename => $obj->{'name'}
    );

    ### Output the index page locally:
    open( INDEX, ">$ENV{'HOME'}/scrape/$foldername/index.html" )
        || die "$!";
    unless ( print INDEX $template->output() ) {
        print "Error in HTML::Template output\n";
        return;
    }
    close( INDEX );

    print "Finished scraping $obj->{'name'}\n\n";

    ### Clean up after HTML::Tree as recommended
    $chunk->delete();
}

1;

__DATA__
<html>
<head>
<title><tmpl_var name="sitename"></title>
</head>
<body>
<h1><tmpl_var name="sitename"></h1>
<ul><tmpl_loop name="links">
<li>
<a href="<tmpl_var name="localfile">"><tmpl_var name="text"></a>
</li>
</tmpl_loop></ul>
</body>
</html>
package PDAScraper::YahooTV;

# PDAScraper.pm rules for scraping the
# Yahoo TV website

sub config {
    return {
        name       => 'Yahoo TV',
        start_from => 'http://news.yahoo.com/i/763',
        chunk_spec => [ "_tag", "div", "id", "indexstories" ],
        url_regex  => [ '/[^/]*$', '&printer=1' ]
    };
}

1;
package PDAScraper::Slate;

# PDAScraper.pm rules for scraping the
# Slate website

sub config {
    return {
        name       => 'Slate',
        start_from => 'http://www.slate.com/id/2065896/view/2057069/',
        url_regex  => [
            '/id/(\d+)/',
            sub { \ "/toolbar.aspx?action=print&id=$1" }
        ],
        chunk_regex => qr{</p></td></tr></table><table border="0" width="486" cellpadding="0" cellspacing="0">(.*?)<a><img border="0" src="http://img.slate.msn.com/media/GlobalNav/Sports_462x35.gif" alt=""></a>}
    };
}

1;
package PDAScraper::NewScientist;

# PDAScraper.pm rules for scraping the
# New Scientist website

sub config {
    return {
        name       => 'New Scientist Headlines',
        start_from => 'http://www.newscientist.com/news.ns',
        chunk_spec => [ "_tag", "div", "id", "newslisting" ],
        url_regex  => [ '$', '&print=true' ]
    };
}

1;
#!/usr/bin/perl

use strict;
use warnings;

use PDAScraper;

use PDAScraper::YahooTV;
my $YM_Scraper = PDAScraper->new('PDAScraper::YahooTV') || die "$!";
$YM_Scraper->scrape();

use PDAScraper::Slate;
my $Slate_Scraper = PDAScraper->new('PDAScraper::Slate') || die "$!";
$Slate_Scraper->scrape();

use PDAScraper::NewScientist;
my $NS_Scraper = PDAScraper->new('PDAScraper::NewScientist') || die "$!";
$NS_Scraper->scrape();
package PDAScraper::Foo;

# PDAScraper.pm rules for scraping the
# Foo website

sub config {
    return {
        name => 'Foo',
        # Name of the website. Arbitrary text.

        start_from => 'http://www.foo.com/news/',
        # URL where the scraper should find the links.

        url_regex => [ '$', '&print=1' ],
        # This is the simple form of the url_regex, which
        # is used to change a regular link to a "print-friendly"
        # link. Simple because there are no backreferences
        # needed on the RHS.

        # url_regex => [
        #     '/id/(\d+)/',
        #     sub { \ "/toolbar.aspx?action=print&id=$1" }
        # ],
        # This is the complex form of the url_regex, using
        # a sub to return because it needs to evaluate a
        # backreference i.e. $1, $2 etc.

        chunk_spec => [ "_tag", "div", "id", "headlines" ],
        # A list of arguments to HTML::Element's look_down()
        # method. This one will return an HTML::Element object
        # matching the first tag having the attribute
        # "id" with value "headlines".

        # If you can't use a chunk_spec, you'll have to use a
        # chunk_regex:
        chunk_regex => qr{<table border="0" width="512">(.*?)</table>}s,
        # A regular expression which returns your desired
        # chunk of the page as $1. Using chunk_spec is better.

        link_spec => [ sub { $_[0]->as_text ne 'FULL STORY' } ]
        # All links are grabbed from the page chunk by default,
        # but link_spec allows you to add HTML::Element
        # filtering, here, for example, rejecting links in the
        # form <a href="foo">FULL STORY</a>, but you could also
        # reject them on any attribute, see HTML::Element.
    };
}

1;
($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print