Hello all!

Since I'm doing web querying/scraping of tabular data in several of my scripts, I want to refactor this functionality into a convenient module, and I would appreciate some input/criticism/wisdom from the other resident monks on that topic... :)

By "query", I mean a complete recipe for downloading a document from the web and extracting relevant pieces from it using XPath expressions and/or regexes.
By "tabular data", I mean that when a query is executed, it returns a list of items where each item has the same set of pre-defined "fields".

The module's API should allow query definitions to be:

• encapsulated: all the information that makes up a query (URL construction, caching, parsing rules) lives in one place
• declarative: queries are specified as plain Perl data rather than as imperative code
• reusable: a query is defined once and can then be executed any number of times with different parameters

What I have so far:

My current idea (a.k.a. "first draft") for the API is demonstrated by the following simple usage example, which would be a script for finding TV series by name:

#!/usr/bin/perl
use warnings;
use strict;

use My::Hypothetical::Web::Query::Module qw(register_query);

# Defining the query:
register_query 'tvdb_series_search' => {
    url   => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',
    cache => 'tvdb_search_%s.xml',
    items => '/Data/Series',   # this is an XPath expression!
    class => 'Series',
    parse => {
        'name' => './SeriesName',
        'year' => ['./FirstAired', qr/^(\d{0,4})/],
        'desc' => './Overview'
    },
};

# Defining the type for the query's result items:
package Series {
    use Moo;
    has name => (is => 'ro');
    has year => (is => 'ro');
    has desc => (is => 'ro');

    sub summary {
        my $self = shift;
        return sprintf "%s [%s]\n %s\n",
            $self->name, $self->year, substr($self->desc, 0, 64).'...';
    }
}

# Executing the query and iterating over its result list:
my $it = tvdb_series_search( $ARGV[0] );

while (my $series = $it->()) {
    print $series->summary();
}

Sample output:

$ ./findseries.pl cards
House of Cards [1990]
 The PM made a deadly mistake when he passed over Francis Urquhar...
House of Cards (US) [2013]
 House of Cards is an American political drama series developed a...

I.e. a query is defined using a hash, some of whose values (like the URL) are sprintf patterns that will get filled by the parameter(s) passed to the query each time it is executed. Each of the parsing rules (items and the child values of parse) can be given as an XPath expression or a regex, or as an array of multiple XPath expressions and regexes where each one narrows down the result of the preceding one.
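
To illustrate how such a chained rule (e.g. ['./FirstAired', qr/^(\d{0,4})/]) would be applied to each item node, here is a rough sketch of the idea using XML::LibXML. The helper is hypothetical, nothing like it is implemented yet:

use XML::LibXML;

# Apply one parse rule (a single XPath/regex, or an array of them)
# to an item node; each step narrows down the previous step's result.
sub apply_rule {
    my ($node, $rule) = @_;
    my @steps = ref $rule eq 'ARRAY' ? @$rule : ($rule);
    my $value = $node;
    for my $step (@steps) {
        if (ref $step eq 'Regexp') {
            my $text = ref $value ? $value->textContent : $value;
            ($value) = defined $text ? $text =~ $step : ();  # first capture group
        }
        else {
            ($value) = $value->findnodes($step);             # XPath step
        }
    }
    return ref $value ? $value->textContent : $value;
}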

The register_query function takes this hash and constructs a query subroutine from it, which it dynamically injects into the caller's symbol table, so that executing the query is as simple as calling that function with the desired parameters.
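
A rough sketch of how that could work internally (nothing here is implemented yet; _make_iterator is just a placeholder for the download-and-parse machinery):

package My::Hypothetical::Web::Query::Module;

use strict;
use warnings;

sub register_query {
    my ($name, $spec) = @_;
    my $caller = caller;

    # The query sub is a closure over the definition hash:
    my $query_sub = sub {
        my @params = @_;
        my $url = sprintf $spec->{url}, @params;
        # download $url, apply the 'items'/'parse' rules, and
        # return an iterator over the extracted rows:
        return _make_iterator($spec, $url);   # placeholder
    };

    # Inject it into the caller's symbol table, so the query can
    # then be executed as a plain function call:
    no strict 'refs';
    *{"${caller}::${name}"} = $query_sub;
}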

The result items are optionally returned as objects of the type specified by the class option, which admittedly is overkill in a simple example like the above, but will be useful in more complex use-cases.

Some additional planned options for register_query that are not demonstrated above:

args => sub { ... },                     # transform query parameters before they
                                         # are passed to sprintf

pre  => \&IO::Uncompress::Unzip::unzip,  # preprocess the download
                                         # before parsing
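
For illustration, a query definition using those options might look like this (purely hypothetical, since none of it is implemented yet):

use IO::Uncompress::Unzip ();   # for the 'pre' example below

register_query 'tvdb_series_search' => {
    url  => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',

    # Escape spaces in the search term before it is passed to sprintf:
    args => sub { map { (my $s = $_) =~ s/ /+/g; $s } @_ },

    # Unzip the downloaded document before the parsing rules run:
    pre  => \&IO::Uncompress::Unzip::unzip,

    # ... items/parse/class as in the example above ...
};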

Questions:

Thanks!

Re: RFC: API for declarative parameterized web queries
by Corion (Patriarch) on Apr 11, 2014 at 06:36 UTC

    For prior art, see Web::Scraper, which seems to do a similar thing at least on the extraction side, except with fancy syntactic sugar instead of data.

      Thanks for the pointer, Corion.

      Here's a simple example to try and compare my hypothetical API to Web::Scraper:

      Example: Get the 5 most recent tweets from someone's Twitter page

      The idea is that the demo script should take a Twitter nickname as a command-line argument, and print the time, author, and text body of the last five tweets from that person's Twitter page to STDOUT.

      The implementation with Web::Scraper:

      use URI;
      use Web::Scraper;
      use POSIX qw(strftime);

      my $tweets_url = "http://twitter.com/%s";

      my $tweets_query = scraper {
          process 'li[data-item-type="tweet"]', 'tweets[]' => scraper {
              process '*[data-name]', 'name' => '@data-name';
              process '*[data-time]', 'time' => '@data-time';
              process '.content p',   'text' => 'TEXT';
          };
      };

      my $tweets = $tweets_query->scrape( URI->new(sprintf $tweets_url, $ARGV[0]) );

      for my $tweet (@{$tweets->{tweets}}[0..4]) {
          last if !$tweet;
          my $date = strftime('%b %d', localtime $tweet->{time});
          print "\n$tweet->{name} tweeted on $date:\n $tweet->{text}\n";
      }

      The implementation with my proposed API:

      use My::Query qw(register_query);
      use POSIX qw(strftime);

      register_query 'recent_tweets' => {
          url   => "http://twitter.com/%s",
          items => '//li[@data-item-type="tweet"]',
          parse => {
              'name' => '//@data-name',
              'time' => '//@data-time',
              'text' => '//*[@class="content"]/p'
          },
      };

      my $it = recent_tweets( $ARGV[0] );

      for (0..4) {
          my $tweet = $it->() or last;
          my $date = strftime('%b %d', localtime $tweet->{time});
          print "\n$tweet->{name} tweeted on $date:\n $tweet->{text}\n";
      }

      (I haven't used the "return rows as objects" thing here, as I've decided it should be optional and it doesn't gain us anything in simple cases like this.)

      Sample output (same for both implementations):

      $ ./recent-tweets.pl TimToady

      Larry Wall tweeted on Mar 15:
       @anocelot Lemme guess, only the first word is different...

      Larry Wall tweeted on Mar 13:
       I need to ask the Guinness folks what the current world record is for number of invitations to connect on LinkedIn ignored.

      Larry Wall tweeted on Feb 14:
       @genespeth \o/

      Larry Wall tweeted on Feb 12:
       Let us not forget that the perfect is also the enemy of the bad and the ugly.

      Larry Wall tweeted on Feb 03:
       Wow. Just...wow. #sb48

      Both get the job done, and it's certainly not a night-and-day difference, but I do prefer my API even for simple cases like this, because...

      • ...of its declarative nature, i.e. instead of saying "Perform these operations, and while going along store these values", you say "I want a list of items/rows as output, and each of them should have these fields, so here are the parsing rules for extracting each of them". Which feels more elegant to me, but that's of course subjective.
      • ...it doesn't make you jump through hoops to keep all information about the query (including how to construct the URL) in one place.

      What do you think?

Re: RFC: API for declarative parameterized web queries
by Don Coyote (Hermit) on Apr 11, 2014 at 14:30 UTC

    hi smls

    Your three requirements of encapsulation, declarativeness, and reusability would suggest to me an object-oriented interface. I am not convinced that overloading the global symbol space really helps with staying encapsulated. And what you describe here sounds a lot like what Exporter does when you import the register_query routine. So are you in effect calling import for each query you register, rather than using a class interface and calling methods on each query object? The latter would satisfy the reusability and declarative requirements too.

    As the query is in a hash format, I would look at making register_query a constructor (something like my $rq = RQS->new), and then go on to load up the query object. Your module needs to do the work of defining the methods you would then call to set up each specific query, as well as the execution methods.
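
    Roughly something like this (RQS and the method names are just placeholders for illustration):

    use RQS;   # placeholder name for the query class

    # Construct and configure a query object instead of
    # registering a named sub in the caller's namespace:
    my $rq = RQS->new(
        url   => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',
        items => '/Data/Series',
        parse => {
            name => './SeriesName',
            year => ['./FirstAired', qr/^(\d{0,4})/],
            desc => './Overview',
        },
    );

    # Execute it with parameters and iterate over the result rows:
    my $it = $rq->execute( $ARGV[0] );
    while (my $row = $it->()) {
        print "$row->{name} [$row->{year}]\n";
    }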

    In summary, I think the interface is a little muddled. You start off functionally, but then create objects on top of a functional interface. The well-known module CGI is a classic in that it provides both interfaces, so it is worth a bit of study. I think you would be best off choosing either functional or object-oriented, or doing something like CGI, which offers either interface. I am not so sure about trying to implement both interfaces through one request, though.

    Others have suggested similar modules to look at (admittedly, CGI is a monolith). They may not do exactly what you are trying to do in terms of usage, but they may provide you with some implementation hints that are closer to your interface model.

    In terms of what you are describing, the use of this module would certainly help you, so there is no reason not to follow up on it; and usually, if you have a need for something, others may well too.


    My other OS is a gateway!
Re: RFC: API for declarative parameterized web queries
by Anonymous Monk on Apr 10, 2014 at 21:50 UTC
    I don't get it :) improved WWW::Search?
    sub gimme_tvdb_blah {
        use Web::Query;
        use Web::Magic;
        use Web::Scraper;

        my $object = ... ; ## xpathish data extraction using magic above

        return $object;
    }
      improved WWW::Search?

      No, this is not about search engines. A "query" targets a specific web page (usually an HTML or XML page), or a superposition of related pages that share the same structure and whose URLs differ only by some parameter.

      Whenever it is executed, the query downloads its target page and extracts information from it, where said information is logically structured as tabular data, and is returned as a sequence of hashes/objects (where each hash/object represents a "row" of the tabular data).

          use Web::Query;
          use Web::Magic;
          use Web::Scraper;
          my $object = ... ; ## xpathish data extraction using magic above
      

      Yes, it's totally possible to do it using those modules, just as it is possible to use LWP::UserAgent and XML::Parser/XML::LibXML/etc. directly as I have been doing.

      But then each query definition ends up being a block of imperative source code of some form, which of course provides maximum flexibility, but which I find inconvenient to debug and maintain when dealing with lots of queries. What I'm envisioning is a declarative approach that is tailored to the use-case of extracting tabular data as described above (and in the future maybe other regularly structured data, such as "scalar" or "flat list"), and that allows the user to specify all the information required for a query as Perl data (i.e. a hash) rather than as imperative code.

      It will have less power and flexibility than the modules you listed, but will hopefully allow query definitions to be neater and more regular, and thus easier to maintain.

      PS: Although Web::Magic does look pretty sweet...

        No, this is not about search engines. A "query" targets a specific web page ...

        Same difference :) a search engine is nothing but a specific web page, and all those modules do is return ... frabular ... data from those "web pages" using the same interface

        sub whatever { WhateverYourModuleDoesOrExports(%args) }

        A plain ol' function, no awkward OOPy registering (or whatever that is supposed to be, the part that I don't understand)