Hello all!
Since I'm doing web querying/scraping of tabular data in several of my scripts, I want to refactor this functionality into a convenient module, and I would appreciate some input/criticism/wisdom from the other resident monks on that topic... :)
By "query", I mean a complete recipe for downloading a document from the web and extracting relevant pieces from it using XPath expressions and/or regexes.
By "tabular data", I mean that when a query is executed, it returns a list of items where each item has the same set of pre-defined "fields".
The module's API should allow query definitions to be:
My current idea (a.k.a. "first draft") for the API is demonstrated by the following simple usage example, which would be a script for finding TV series by name:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use My::Hypothetical::Web::Query::Module qw(register_query);

    # Defining the query:
    register_query 'tvdb_series_search' => {
        url   => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',
        cache => 'tvdb_search_%s.xml',
        items => '/Data/Series',  # this is an XPath expression!
        class => 'Series',
        parse => {
            'name' => './SeriesName',
            'year' => ['./FirstAired', qr/^(\d{0,4})/],
            'desc' => './Overview'
        },
    };

    # Defining the type for the query's result items:
    package Series {
        use Moo;
        has name => (is => 'ro');
        has year => (is => 'ro');
        has desc => (is => 'ro');

        sub summary {
            my $self = shift;
            return sprintf "%s [%s]\n %s\n",
                $self->name, $self->year, substr($self->desc, 0, 64).'...';
        }
    }

    # Executing the query and iterating over its result list:
    my $it = tvdb_series_search( $ARGV[0] );
    while (my $series = $it->()) {
        print $series->summary();
    }
Sample output:
    $ ./findseries.pl cards
    House of Cards [1990]
     The PM made a deadly mistake when he passed over Francis Urquhar...
    House of Cards (US) [2013]
     House of Cards is an American political drama series developed a...
I.e. a query is defined using a hash, some of whose values (like the URL) are sprintf patterns that will get filled by the parameter(s) passed to the query each time it is executed. Each of the parsing rules (items and the child values of parse) can be given as an XPath expression or a regex, or as an array of multiple XPath expressions and regexes where each one narrows down the result of the preceding one.
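To make the "each one narrows down the result of the preceding one" part concrete, here is a rough sketch of how a rule chain might be evaluated. It uses core Perl only, so the XPath step is delegated to a callback. The names apply_rule and $xpath_cb, and the sample date in the fake node, are made up for illustration and not part of any existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: evaluate a parsing rule against a starting value.
# A rule is a single step or an array ref of steps. A step is either a
# qr// regex or an XPath string (handed to the caller-supplied callback).
sub apply_rule {
    my ($rule, $value, $xpath_cb) = @_;
    my @steps = ref($rule) eq 'ARRAY' ? @$rule : ($rule);
    for my $step (@steps) {
        last unless defined $value;    # an earlier step found nothing
        if (ref($step) eq 'Regexp') {
            # Regex step: keep the first capture group
            # (assumes the regex has one).
            $value = $value =~ $step ? $1 : undef;
        }
        else {
            # XPath step: let the callback narrow the current value.
            $value = $xpath_cb->($value, $step);
        }
    }
    return $value;
}

# Stand-in for a real XPath engine: look the expression up in a hash.
my %fake_node = ('./FirstAired' => '1990-11-18');    # made-up sample value
my $lookup    = sub { my ($node, $xpath) = @_; return $node->{$xpath} };

my $year = apply_rule(['./FirstAired', qr/^(\d{0,4})/], \%fake_node, $lookup);
print "$year\n";     # 1990

# The url and cache options are plain sprintf patterns:
my $cache = sprintf 'tvdb_search_%s.xml', 'cards';
print "$cache\n";    # tvdb_search_cards.xml
```

In a real implementation the callback would presumably wrap an XML parser's node lookup; that part is left out here on purpose.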
The register_query function takes this hash, and constructs a query subroutine from it which it dynamically injects into the program's symbol table, so that executing the query is as simple as calling that function with the desired parameters.
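A minimal sketch of the symbol-table injection, assuming the query sub is installed into the package that called register_query. The download and parsing steps are replaced by a placeholder, and demo_search and the example.com URL are invented for the demonstration.

```perl
use strict;
use warnings;

# Sketch of register_query: build a closure from the definition hash and
# install it under the given name in the caller's package.
sub register_query {
    my ($name, $def) = @_;
    my $package = caller;

    my $query = sub {
        my @params = @_;
        # Fill the sprintf pattern with this call's parameters.
        my $url = sprintf $def->{url}, @params;
        # Placeholder: a real version would download and parse here.
        return { url => $url };
    };

    no strict 'refs';    # needed to use a string as a glob name
    *{"${package}::$name"} = $query;
    return;
}

register_query 'demo_search' => { url => 'http://example.com/?q=%s' };

# The sub only exists at runtime, so it has to be called with parentheses.
my $result = demo_search('cards');
print $result->{url}, "\n";    # http://example.com/?q=cards
```

Because the sub is created at runtime, calling it without parentheses would not compile under strict; the usage example above already calls the query with parentheses, so that is not a problem in practice.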
The result items are optionally returned as objects of the type specified by the class option, which admittedly is overkill in a simple example like the above, but will be useful in more complex use-cases.
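One way the optional class could be handled, sketched with a hand-rolled stand-in for the Moo class so that it runs on core Perl. make_iterator is a hypothetical helper name, and the assumption is that each parsed item arrives as a hash ref of field values.

```perl
use strict;
use warnings;

# Minimal stand-in for the Moo class from the example (core Perl only).
package Series {
    sub new  { my ($class, %args) = @_; return bless {%args}, $class }
    sub name { $_[0]{name} }
    sub year { $_[0]{year} }
}

# Hypothetical helper: wrap parsed rows in an iterator. Rows are turned
# into $class objects if a class was given, and returned as plain hash
# refs otherwise.
sub make_iterator {
    my ($rows, $class) = @_;
    my $i = 0;
    return sub {
        return if $i >= @$rows;    # exhausted: ends a while loop
        my $row = $rows->[$i++];
        return $class ? $class->new(%$row) : $row;
    };
}

my @rows = (
    { name => 'House of Cards',      year => 1990 },
    { name => 'House of Cards (US)', year => 2013 },
);

my $it = make_iterator(\@rows, 'Series');
while (my $series = $it->()) {
    printf "%s [%s]\n", $series->name, $series->year;
}
```

Leaving out the class argument gives back the raw hash refs, which matches the "optionally" in the description above.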
Some additional planned options for register_query that are not demonstrated above:
    args => sub { ... },                      # transform query parameters before
                                              # they are passed to sprintf
    pre  => \&IO::Uncompress::Unzip::unzip,   # preprocess download
                                              # before parsing
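A rough sketch of where these two hooks might sit in the query pipeline. The calling conventions are assumptions: args takes the parameter list and returns a new one, and pre takes the downloaded content and returns the processed content. run_query, $fetch and the example.com URL are hypothetical, with $fetch standing in for the real download code.

```perl
use strict;
use warnings;

# Hypothetical pipeline step: run the optional hooks around the download.
# $fetch is a caller-supplied code ref standing in for the real HTTP code.
sub run_query {
    my ($def, $fetch, @params) = @_;

    # 'args' hook: transform the parameters before sprintf sees them.
    @params = $def->{args}->(@params) if $def->{args};
    my $url = sprintf $def->{url}, @params;

    my $content = $fetch->($url);

    # 'pre' hook: preprocess the raw download before parsing.
    $content = $def->{pre}->($content) if $def->{pre};
    return $content;
}

my %def = (
    url  => 'http://example.com/search?q=%s',
    # Replace spaces with '+' in every parameter.
    args => sub { map { my $p = $_; $p =~ tr/ /+/; $p } @_ },
    # Strip leading whitespace from the download.
    pre  => sub { my $c = shift; $c =~ s/^\s+//; $c },
);

my @seen;    # URLs the fake fetcher was asked for
my $fake_fetch = sub { push @seen, $_[0]; return "  <Data/>" };

my $out = run_query(\%def, $fake_fetch, 'house of cards');
print "$seen[0]\n";    # http://example.com/search?q=house+of+cards
print "$out\n";        # <Data/>
```

Under this convention, a function like IO::Uncompress::Unzip's unzip, whose functional interface takes separate input and output arguments, would probably need a thin wrapper sub rather than being passed directly.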