Hello all!

Since I'm doing web querying/scraping of tabular data in several of my scripts, I want to refactor this functionality into a convenient module, and I would appreciate some input/criticism/wisdom from the other resident monks on that topic... :)

By "query", I mean a complete recipe for downloading a document from the web and extracting relevant pieces from it using XPath expressions and/or regexes.
By "tabular data", I mean that when a query is executed, it returns a list of items where each item has the same set of pre-defined "fields".

The module's API should allow query definitions to be:

What I have so far:

My current idea (a.k.a. "first draft") for the API is demonstrated by the following simple usage example, which would be a script for finding TV series by name:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use My::Hypothetical::Web::Query::Module qw(register_query);

    # Defining the query:
    register_query 'tvdb_series_search' => {
        url   => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',
        cache => 'tvdb_search_%s.xml',
        items => '/Data/Series',    # this is an XPath expression!
        class => 'Series',
        parse => {
            'name' => './SeriesName',
            'year' => ['./FirstAired', qr/^(\d{0,4})/],
            'desc' => './Overview'
        },
    };

    # Defining the type for the query's result items:
    package Series {
        use Moo;
        has name => (is => 'ro');
        has year => (is => 'ro');
        has desc => (is => 'ro');
        sub summary {
            my $self = shift;
            return sprintf "%s [%s]\n %s\n",
                $self->name, $self->year, substr($self->desc, 0, 64).'...';
        }
    }

    # Executing the query and iterating over its result list:
    my $it = tvdb_series_search( $ARGV[0] );
    while (my $series = $it->()) {
        print $series->summary();
    }

Sample output:

    $ ./findseries.pl cards
    House of Cards [1990]
     The PM made a deadly mistake when he passed over Francis Urquhar...
    House of Cards (US) [2013]
     House of Cards is an American political drama series developed a...

I.e. a query is defined using a hash, some of whose values (like the URL) are sprintf patterns that will get filled by the parameter(s) passed to the query each time it is executed. Each of the parsing rules (items and the child values of parse) can be given as an XPath expression or a regex, or as an array of multiple XPath expressions and regexes where each one narrows down the result of the preceding one.
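To make the two mechanisms concrete, here is a minimal sketch of how template filling and rule narrowing might work. The helper names `fill_template` and `narrow` are hypothetical and not part of the proposed API, and only regex rules are handled so that the sketch needs no XML module; an XPath step would slot into the same loop.

```perl
use strict;
use warnings;

# Hypothetical helper: fill a sprintf-style template with the query parameters.
sub fill_template {
    my ($template, @params) = @_;
    return sprintf $template, @params;
}

# Hypothetical helper: apply a chain of regex rules, each one narrowing
# down the text produced by the preceding one.
sub narrow {
    my ($text, @rules) = @_;
    for my $re (@rules) {
        return unless defined($text) && $text =~ $re;
        # Use the first capture group if there is one, else the whole match.
        $text = defined($1) ? $1 : $&;
    }
    return $text;
}

my $url  = fill_template('http://thetvdb.com/api/GetSeries.php?seriesname=%s', 'cards');
my $year = narrow('First aired: 1990-11-18', qr/:\s*(\S+)/, qr/^(\d{0,4})/);
print "$url\n$year\n";
```

Note that nothing here URL-escapes the parameter; whether the module should do that automatically is an open design question.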

The register_query function takes this hash, and constructs a query subroutine from it which it dynamically injects into the program's symbol table, so that executing the query is as simple as calling that function with the desired parameters.
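The injection step could be sketched roughly as follows. This is not the actual implementation: the `fetch` option is a hypothetical stand-in for the download-and-parse step (so the sketch runs without network access), and `demo_search` and the example.com URL are made up for illustration.

```perl
use strict;
use warnings;

# Minimal sketch: build a query sub from the definition hash and install
# it under the query's name in the caller's package.
sub register_query {
    my ($name, $def) = @_;
    my $caller = caller;

    my $query = sub {
        my @params = @_;
        my $url    = sprintf $def->{url}, @params;
        my @items  = $def->{fetch}->($url);   # hypothetical: returns the list of items
        # Return an iterator that yields one item per call, then undef.
        return sub { return shift @items };
    };

    no strict 'refs';   # needed for the symbolic glob assignment below
    *{"${caller}::${name}"} = $query;
    return;
}

register_query 'demo_search' => {
    url   => 'http://example.com/search?q=%s',
    fetch => sub { my ($url) = @_; return ("first hit for $url", "second hit") },
};

my $it = demo_search('cards');
while (defined(my $item = $it->())) {
    print "$item\n";
}
```

Because the sub is only installed at run time, it has to be called with parentheses (as in the usage example above) unless it is predeclared.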

The result items are optionally returned as objects of the type specified by the class option, which admittedly is overkill in a simple example like the above, but will be useful in more complex use-cases.
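One way the optional `class` handling might look, as a sketch only: `make_item` is a hypothetical helper, and the `Series` class below is a plain-Perl stand-in for the Moo class from the example so that the snippet has no non-core dependencies.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap one parsed item (a hash of field values) in the
# class named by the 'class' option, or hand back a plain hash reference
# when no class was given.
sub make_item {
    my ($def, %fields) = @_;
    return \%fields unless defined $def->{class};
    return $def->{class}->new(%fields);
}

# Stand-in for the Moo class, with the same constructor calling convention.
package Series {
    sub new  { my ($class, %args) = @_; return bless {%args}, $class }
    sub name { $_[0]{name} }
    sub year { $_[0]{year} }
}

my $obj  = make_item({ class => 'Series' }, name => 'House of Cards', year => 1990);
my $hash = make_item({},                    name => 'House of Cards', year => 1990);
printf "%s [%s]\n", $obj->name, $obj->year;
printf "%s [%s]\n", $hash->{name}, $hash->{year};
```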

Some additional planned options for register_query that are not demonstrated above:

    args => sub { ... },                       # transform query parameters before
                                               # they are passed to sprintf
    pre  => \&IO::Uncompress::Unzip::unzip,    # preprocess download before parsing
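A sketch of where these two hooks might sit in the pipeline. Everything here is an assumption made for illustration: `run_pipeline` and the `$download` callback are hypothetical, and `pre` is assumed to take the raw content and return the processed content.

```perl
use strict;
use warnings;

# Hypothetical helper: 'args' rewrites the parameters before sprintf,
# 'pre' rewrites the raw download before parsing. Both are optional.
sub run_pipeline {
    my ($def, $download, @params) = @_;
    @params = $def->{args}->(@params) if $def->{args};
    my $url = sprintf $def->{url}, @params;
    my $raw = $download->($url);          # stand-in for the HTTP fetch
    $raw = $def->{pre}->($raw) if $def->{pre};
    return $raw;                          # would be handed to the parser next
}

my $def = {
    url  => 'http://example.com/search?q=%s',
    args => sub { return map { lc($_) } @_ },    # e.g. normalise case
    pre  => sub { my ($raw) = @_; $raw =~ s/^\s+|\s+$//g; return $raw },
};
my $fake_download = sub { my ($url) = @_; return "  <Data url='$url'/>\n" };

print run_pipeline($def, $fake_download, 'CARDS'), "\n";
```

Under that content-in, content-out assumption, a function such as IO::Uncompress::Unzip's `unzip`, which takes an input and an output argument and returns a status, would need a small wrapper rather than being passed in directly.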

Questions:

Thanks!

In reply to RFC: API for declarative parameterized web queries by smls
