Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a website scraper module. Let's say it's "Scraper.pm". It uses sets of rules for each website, for example you might have:

    $yahoo_headlines = {
        starting_page  => 'www.yahoo.com/foo/',
        some_regex     => qr/foo(.*?)bar/,
        html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ]
    }

and what I thought would be a good way to store them was in a separate module, like Scraper::Library.pm, and then export them. That way your script is just

    use Scraper;
    use Scraper::Library qw( $yahoo );
    scrape( $yahoo );

That way, users can write their own scrapers and put them into Library.pm. Is there a better way? It kind of seems wrong to me to be exporting all those references; plus, it's just as fiddly for users if they have to edit the Exporter section of Library.pm.

Should I create Scraper::Library::Yahoo and export a single sub which scrapes Yahoo, lather, rinse, repeat? Or should I create my own files, like a "Yahoo.scrape", which just contain the reference to the data structure, "require" them, and not try to do namespaces properly?



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Re: How To Store Data Structures
by jdporter (Paladin) on Jul 20, 2005 at 02:43 UTC
    Should I create Scraper::Library::Yahoo and export a single sub which scrapes Yahoo, lather, rinse, repeat?
    Yes, that. But you should not export the function; instead, it should be called as a (class) method. I.e.
        use Scraper::Library::Yahoo;
        Scraper::Library::Yahoo->scrape();
    Of course, all the generic bits involved in scraping could (should?) be in the base class.
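As a minimal sketch of that arrangement, assuming a hypothetical base class named Scraper::Library::Base (the name is not from the thread) and a scrape() that only reports what it would do instead of fetching anything:

```perl
use strict;
use warnings;

# Hypothetical base class (name assumed): holds the generic scraping logic.
package Scraper::Library::Base;

sub scrape {
    my ($class) = @_;
    # Each site-specific subclass supplies its own rules via config_data().
    my $config = $class->config_data();
    # A real scraper would fetch and parse here; this sketch only
    # reports which page it would start from.
    return "would scrape $config->{starting_page}";
}

# Site-specific class: inherits scrape(), contributes only the rules.
package Scraper::Library::Yahoo;
our @ISA = ('Scraper::Library::Base');

sub config_data {
    return {
        starting_page  => 'www.yahoo.com/foo/',
        some_regex     => qr/foo(.*?)bar/,
        html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ],
    };
}

package main;

# Called as a class method; nothing is exported into the caller's namespace.
my $result = Scraper::Library::Yahoo->scrape();
print "$result\n";
```

Adding another site would then mean adding another small package with its own config_data(), without touching an Exporter list.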

    An even better way is to let these "libraries" be strategies of the Scraper class. All the real work is done in method(s) of the main class (Scraper), but it delegates to one of the library classes for certain functions. And that could be something as simple as a function that returns some config data.

        package Scraper;

        sub new {
            my( $pkg, $strategy_class ) = @_;
            bless {
                strategy_class => $strategy_class,
            }, $pkg
        }

        sub scrape {
            my $self = shift;
            my $config_data = $self->{'strategy_class'}->config_data();
            # ... proceed to scrape using this config data
        }

        package Scraper::Yahoo;

        # as a strategy of Scraper, this class only needs to implement those methods
        # a Scraper will call.
        sub config_data {
            return(
                starting_page  => 'www.yahoo.com/foo/',
                some_regex     => qr/foo(.*?)bar/,
                html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ],
            );
        }

        . . .

        package main;

        # pass the strategy class name to the constructor:
        my $yahoo_scraper = new Scraper 'Scraper::Yahoo';
    Of course, depending on your architecture (I don't know how your Scraper really works), it might make as much sense to have individual bits of configuration returned by discrete methods:
        package Scraper;

        sub scrape {
            my $self = shift;
            my $starting_page  = $self->{'strategy_class'}->starting_page();
            my $some_regex     = $self->{'strategy_class'}->some_regex();
            my $html_tree_spec = $self->{'strategy_class'}->html_tree_spec();
            # ... proceed to scrape using this config data
        }

        package Scraper::Yahoo;

        sub starting_page  { 'www.yahoo.com/foo/' }
        sub some_regex     { qr/foo(.*?)bar/ }
        sub html_tree_spec { [ '_tag', 'div', 'id', 'headlines' ] }
    That's how I usually do it. I'm a big fan of strategy classes. :-)
      Thanks a lot! That makes tons of sense. And the object (no pun intended) is to have a scraper that handles all contingencies in the scrape() function, and to allow people to create and publish new scrapers and make them available to others.

      So going by your first example, it would work like

          my $yahoo_scraper = new Scraper 'Scraper::Yahoo';
          $yahoo_scraper->scrape();
      having read all the key information in the new() call.
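One way to read "all the key information in the new() call" is for the constructor to ask the strategy class for its configuration once and keep it in the object. The following is only a sketch under that assumption: the `config` slot, the early `can` check, and the string returned by scrape() are illustrative choices, not part of the code posted above.

```perl
use strict;
use warnings;

package Scraper;

sub new {
    my ( $pkg, $strategy_class ) = @_;
    # Fail early if the strategy lacks the one method this sketch relies on.
    die "$strategy_class does not implement config_data\n"
        unless $strategy_class->can('config_data');
    my $self = {
        strategy_class => $strategy_class,
        # Read the configuration once, at construction time (assumed design).
        config         => $strategy_class->config_data(),
    };
    return bless $self, $pkg;
}

sub scrape {
    my $self = shift;
    # Placeholder for the real work: just report the starting page.
    return "would scrape $self->{config}{starting_page}";
}

package Scraper::Yahoo;

# Returns a hash reference so the constructor can store it in a scalar slot.
sub config_data {
    return { starting_page => 'www.yahoo.com/foo/' };
}

package main;

my $yahoo_scraper = Scraper->new('Scraper::Yahoo');
my $report        = $yahoo_scraper->scrape();
print "$report\n";
```

The sketch calls `Scraper->new(...)`, which is equivalent here to the `new Scraper ...` form used in the thread.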

      Nobody's written to say that in the brave new world of RSS we don't need scrapers any more? Good...



      Just a note, it only works for me if the site-specific package is coded like this:
          sub config_data {
              return {
                  starting_page  => 'www.yahoo.com/foo/',
                  some_regex     => qr/foo(.*?)bar/,
                  html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ],
              };
          }
      With curly brackets not parentheses. Did I, or did you, get something wrong?


        Ah, good catch. Yes, I thought (mistakenly) that your original code was returning a hashref. Adjust as necessary. :-)
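The difference comes down to context. A sub that returns a parenthesised list only gives back the whole configuration when the caller assigns it to a hash or array; assigned to a scalar, the caller ends up with just the final value. A small sketch with hypothetical sub names (config_as_list and config_as_ref are not from the thread):

```perl
use strict;
use warnings;

# Returns a flat key/value list, like the parenthesised version.
sub config_as_list {
    return (
        starting_page => 'www.yahoo.com/foo/',
        some_regex    => qr/foo(.*?)bar/,
    );
}

# Returns a single reference to an anonymous hash, like the curly-bracket version.
sub config_as_ref {
    return {
        starting_page => 'www.yahoo.com/foo/',
        some_regex    => qr/foo(.*?)bar/,
    };
}

# List context: the key/value pairs populate a hash.
my %from_list = config_as_list();

# Scalar assignment of a hashref: works as the scraper code expects.
my $from_ref = config_as_ref();

# Scalar assignment of the list version: not a hash reference at all.
my $scalar = config_as_list();

print "list into hash: $from_list{starting_page}\n";
print "hashref:        $from_ref->{starting_page}\n";
print "list into scalar is a hashref? ", ( ref($scalar) eq 'HASH' ? 'yes' : 'no' ), "\n";
```

So either version can work, as long as the calling code matches: `my %config = ...` for the list form, `my $config = ...` for the hashref form.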