in reply to How To Store Data Structures

Should I create Scraper::Library::Yahoo and export a single sub which scrapes Yahoo, lather, rinse, repeat?
Yes, that. But you should not export the function; instead, it should be called as a (class) method. I.e.
    use Scraper::Library::Yahoo;
    Scraper::Library::Yahoo->scrape();
Of course, all the generic bits involved in scraping could (should?) be in the base class.
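
A minimal sketch of that base-class arrangement, assuming a hypothetical Scraper::Library base class and a made-up start_url() hook (neither is from your code). The fetching is stubbed out, so nothing here touches the network:

```perl
use strict;
use warnings;

package Scraper::Library;

# Generic part shared by every site. A real version would fetch and parse;
# this stub only reports which URL it would start from.
sub scrape {
    my $class = shift;
    my $url   = $class->start_url();    # supplied by the site-specific subclass
    return "would scrape $url";
}

# Hypothetical hook: each subclass is expected to override this.
sub start_url {
    my $class = shift;
    die "start_url() not implemented by $class\n";
}

package Scraper::Library::Yahoo;
use parent -norequire, 'Scraper::Library';    # base class lives in this same file

sub start_url { 'www.yahoo.com/foo/' }

package main;

# Called as a class method; nothing is exported.
print Scraper::Library::Yahoo->scrape(), "\n";
```

A new site then only needs its own subclass with its own start_url().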

An even better way is to let these "libraries" be strategies of the Scraper class. All the real work is done in method(s) of the main class (Scraper), but it delegates to one of the library classes for certain functions. And that could be something as simple as a function that returns some config data.

    package Scraper;

    sub new {
        my( $pkg, $strategy_class ) = @_;
        bless {
            strategy_class => $strategy_class,
        }, $pkg
    }

    sub scrape {
        my $self = shift;
        my $config_data = $self->{'strategy_class'}->config_data();
        # ... proceed to scrape using this config data
    }

    package Scraper::Yahoo;
    # as a strategy of Scraper, this class only needs to implement those methods
    # a Scraper will call.

    sub config_data {
        return(
            starting_page => 'www.yahoo.com/foo/',
            some_regex => qr/foo(.*?)bar/,
            html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ],
        );
    }

    . . .

    package main;
    # pass the strategy class name to the constructor:
    my $yahoo_scraper = new Scraper 'Scraper::Yahoo';

Of course, depending on your architecture (I don't know how your Scraper really works), it might make as much sense to have individual bits of configuration returned by discrete methods:

    package Scraper;

    sub scrape {
        my $self = shift;
        my $starting_page = $self->{'strategy_class'}->starting_page();
        my $some_regex = $self->{'strategy_class'}->some_regex();
        my $html_tree_spec = $self->{'strategy_class'}->html_tree_spec();
        # ... proceed to scrape using this config data
    }

    package Scraper::Yahoo;

    sub starting_page { 'www.yahoo.com/foo/' }
    sub some_regex { qr/foo(.*?)bar/ }
    sub html_tree_spec { [ '_tag', 'div', 'id', 'headlines' ] }

That's how I usually do it. I'm a big fan of strategy classes. :-)
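
To show why this pays off, here is a rough, self-contained sketch in which the same Scraper object works with two different strategy classes. Scraper::Example and the extract() method are made up for illustration, and extract() takes its text as an argument rather than fetching starting_page(), so treat it as a stand-in for whatever your real scrape() does:

```perl
use strict;
use warnings;

package Scraper;

sub new {
    my ( $pkg, $strategy_class ) = @_;
    return bless { strategy_class => $strategy_class }, $pkg;
}

# Stand-in for scrape(): applies the strategy's regex to text passed in,
# instead of fetching the starting page over the network.
sub extract {
    my ( $self, $text ) = @_;
    my $regex = $self->{strategy_class}->some_regex();
    return $text =~ $regex ? $1 : undef;
}

package Scraper::Yahoo;

sub starting_page { 'www.yahoo.com/foo/' }
sub some_regex    { qr/foo(.*?)bar/ }

package Scraper::Example;    # hypothetical second site

sub starting_page { 'www.example.com/' }
sub some_regex    { qr/<h1>(.*?)<\/h1>/ }

package main;

my @jobs = (
    [ 'Scraper::Yahoo',   'fooHEADLINEbar' ],
    [ 'Scraper::Example', '<h1>news</h1>' ],
);

for my $job (@jobs) {
    my ( $strategy, $text ) = @$job;
    my $scraper = Scraper->new($strategy);
    print "$strategy: ", $scraper->extract($text), "\n";
}
```

The main class never changes; supporting another site means writing one more small package with the same method names.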

Re^2: How To Store Data Structures
by Cody Pendant (Prior) on Jul 20, 2005 at 05:08 UTC
    Thanks a lot! That makes tons of sense. And the object (no pun intended) is to have a scraper that handles all contingencies in the scrape() function, and to allow people to create and publish new scrapers and make them available to others.

    So going by your first example, it would work like

        my $yahoo_scraper = new Scraper 'Scraper::Yahoo';
        $yahoo_scraper->scrape();

    having read all the key information in the new() call.

    Nobody's written to say that in the brave new world of RSS we don't need scrapers any more? Good...



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
Re^2: How To Store Data Structures
by Cody Pendant (Prior) on Jul 20, 2005 at 07:24 UTC
    Just a note, it only works for me if the site-specific package is coded like this:

        sub config_data {
            return {
                starting_page => 'www.yahoo.com/foo/',
                some_regex => qr/foo(.*?)bar/,
                html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ],
            };
        }

    With curly brackets not parentheses. Did I, or did you, get something wrong?


      Ah, good catch. Yes, I thought (mistakenly) that your original code was returning a hashref. Adjust as necessary. :-)
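
      For anyone following along, a small sketch of why the brackets matter. The package names AsList and AsRef are invented for the comparison. With parentheses the sub returns a list, and assigning a list return to a scalar keeps only its last value; with curly brackets it returns a single hash reference:

```perl
use strict;
use warnings;

package Scraper::Yahoo::AsList;    # hypothetical name: returns a plain list

sub config_data {
    return (
        starting_page  => 'www.yahoo.com/foo/',
        html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ],
    );
}

package Scraper::Yahoo::AsRef;     # hypothetical name: returns a hash reference

sub config_data {
    return {
        starting_page  => 'www.yahoo.com/foo/',
        html_tree_spec => [ '_tag', 'div', 'id', 'headlines' ],
    };
}

package main;

# Scalar assignment from a list return keeps only the last value,
# here the html_tree_spec array reference, not a hash.
my $from_list = Scraper::Yahoo::AsList->config_data();
print ref($from_list), "\n";

# A hash reference survives scalar assignment intact.
my $from_ref = Scraper::Yahoo::AsRef->config_data();
print $from_ref->{starting_page}, "\n";

# Alternatively, keep the parentheses and assign to a hash (list context).
my %config = Scraper::Yahoo::AsList->config_data();
print $config{starting_page}, "\n";
```

      So either return a hashref and keep the scalar assignment in scrape(), or keep the parentheses and assign the result to a hash.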