Hello all!

Since I'm doing web querying/scraping of tabular data in several of my scripts, I want to refactor this functionality into a convenient module, and I would appreciate some input/criticism/wisdom from the other resident monks on that topic... :)

By "query", I mean a complete recipe for downloading a document from the web and extracting relevant pieces from it using XPath expressions and/or regexes.
By "tabular data", I mean that when a query is executed, it returns a list of items where each item has the same set of pre-defined "fields".

The module's API should allow query definitions to be:

• encapsulated: all the information that makes up a query (URL construction, caching, parsing rules) lives in one place
• declarative: queries are specified as plain Perl data rather than as imperative code
• reusable: a query is defined once and can then be executed any number of times with different parameters

What I have so far:

My current idea (a.k.a. "first draft") for the API is demonstrated by the following simple usage example, which would be a script for finding TV series by name:

#!/usr/bin/perl
use warnings;
use strict;

use My::Hypothetical::Web::Query::Module qw(register_query);

# Defining the query:
register_query 'tvdb_series_search' => {
    url   => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',
    cache => 'tvdb_search_%s.xml',
    items => '/Data/Series',   # this is an XPath expression!
    class => 'Series',
    parse => {
        'name' => './SeriesName',
        'year' => ['./FirstAired', qr/^(\d{0,4})/],
        'desc' => './Overview'
    },
};

# Defining the type for the query's result items:
package Series {
    use Moo;
    has name => (is => 'ro');
    has year => (is => 'ro');
    has desc => (is => 'ro');

    sub summary {
        my $self = shift;
        return sprintf "%s [%s]\n %s\n",
            $self->name, $self->year, substr($self->desc, 0, 64).'...';
    }
}

# Executing the query and iterating over its result list:
my $it = tvdb_series_search( $ARGV[0] );

while (my $series = $it->()) {
    print $series->summary();
}

Sample output:

$ ./findseries.pl cards
House of Cards [1990]
 The PM made a deadly mistake when he passed over Francis Urquhar...
House of Cards (US) [2013]
 House of Cards is an American political drama series developed a...

I.e. a query is defined using a hash, some of whose values (like the URL) are sprintf patterns that will get filled by the parameter(s) passed to the query each time it is executed. Each of the parsing rules (items and the child values of parse) can be given as an XPath expression or a regex, or as an array of multiple XPath expressions and regexes where each one narrows down the result of the preceding one.
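
To illustrate how such a chained rule (e.g. ['./FirstAired', qr/^(\d{0,4})/]) would be applied to each item node, here is a rough sketch of the idea using XML::LibXML. The helper is hypothetical, nothing like it is implemented yet:

use XML::LibXML;

# Apply one parse rule (a single XPath/regex, or an array of them)
# to an item node; each step narrows down the previous step's result.
sub apply_rule {
    my ($node, $rule) = @_;
    my @steps = ref $rule eq 'ARRAY' ? @$rule : ($rule);
    my $value = $node;
    for my $step (@steps) {
        if (ref $step eq 'Regexp') {
            my $text = ref $value ? $value->textContent : $value;
            ($value) = defined $text ? $text =~ $step : ();  # first capture group
        }
        else {
            ($value) = $value->findnodes($step);             # XPath step
        }
    }
    return ref $value ? $value->textContent : $value;
}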

The register_query function takes this hash and constructs a query subroutine from it, which it dynamically injects into the caller's symbol table, so that executing the query is as simple as calling that function with the desired parameters.
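
A rough sketch of how that could work internally (nothing here is implemented yet; _make_iterator is just a placeholder for the download-and-parse machinery):

package My::Hypothetical::Web::Query::Module;

use strict;
use warnings;

sub register_query {
    my ($name, $spec) = @_;
    my $caller = caller;

    # The query sub is a closure over the definition hash:
    my $query_sub = sub {
        my @params = @_;
        my $url = sprintf $spec->{url}, @params;
        # download $url, apply the 'items'/'parse' rules, and
        # return an iterator over the extracted rows:
        return _make_iterator($spec, $url);   # placeholder
    };

    # Inject it into the caller's symbol table, so the query can
    # then be executed as a plain function call:
    no strict 'refs';
    *{"${caller}::${name}"} = $query_sub;
}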

The result items are optionally returned as objects of the type specified by the class option, which admittedly is overkill in a simple example like the above, but will be useful in more complex use-cases.

Some additional planned options for register_query that are not demonstrated above:

args => sub { ... },                     # transform query parameters before they
                                         # are passed to sprintf

pre  => \&IO::Uncompress::Unzip::unzip,  # preprocess the download
                                         # before parsing
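
For illustration, a query definition using those options might look like this (purely hypothetical, since none of it is implemented yet):

use IO::Uncompress::Unzip ();   # for the 'pre' example below

register_query 'tvdb_series_search' => {
    url  => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',

    # Escape spaces in the search term before it is passed to sprintf:
    args => sub { map { (my $s = $_) =~ s/ /+/g; $s } @_ },

    # Unzip the downloaded document before the parsing rules run:
    pre  => \&IO::Uncompress::Unzip::unzip,

    # ... items/parse/class as in the example above ...
};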

Questions:

Thanks!

Re: RFC: API for declarative parameterized web queries
by Corion (Patriarch) on Apr 11, 2014 at 06:36 UTC

    For prior art, see Web::Scraper, which seems to do a similar thing at least on the extraction side, except with fancy syntactic sugar instead of data.

      Thanks for the pointer, Corion.

      Here's a simple example to try and compare my hypothetical API to Web::Scraper:

      Example: Get the 5 most recent tweets from someone's Twitter page

      The idea is that the demo script should take a Twitter nickname as a command-line argument, and print the time, author, and text body of the last five tweets from that person's Twitter page to STDOUT.

      The implementation with Web::Scraper:

      use URI;
      use Web::Scraper;
      use POSIX qw(strftime);

      my $tweets_url = "http://twitter.com/%s";

      my $tweets_query = scraper {
          process 'li[data-item-type="tweet"]', 'tweets[]' => scraper {
              process '*[data-name]', 'name' => '@data-name';
              process '*[data-time]', 'time' => '@data-time';
              process '.content p',   'text' => 'TEXT';
          };
      };

      my $tweets = $tweets_query->scrape( URI->new(sprintf $tweets_url, $ARGV[0]) );

      for my $tweet (@{$tweets->{tweets}}[0..4]) {
          last if !$tweet;
          my $date = strftime('%b %d', localtime $tweet->{time});
          print "\n$tweet->{name} tweeted on $date:\n $tweet->{text}\n";
      }

      The implementation with my proposed API:

      use My::Query qw(register_query);
      use POSIX qw(strftime);

      register_query 'recent_tweets' => {
          url   => "http://twitter.com/%s",
          items => '//li[@data-item-type="tweet"]',
          parse => {
              'name' => '//@data-name',
              'time' => '//@data-time',
              'text' => '//*[@class="content"]/p'
          },
      };

      my $it = recent_tweets( $ARGV[0] );

      for (0..4) {
          my $tweet = $it->() or last;
          my $date = strftime('%b %d', localtime $tweet->{time});
          print "\n$tweet->{name} tweeted on $date:\n $tweet->{text}\n";
      }

      (I haven't used the "return rows as objects" thing here, as I've decided it should be optional and it doesn't gain us anything in simple cases like this.)

      Sample output (same for both implementations):

      $ ./recent-tweets.pl TimToady

      Larry Wall tweeted on Mar 15:
       @anocelot Lemme guess, only the first word is different...

      Larry Wall tweeted on Mar 13:
       I need to ask the Guinness folks what the current world record is for number of invitations to connect on LinkedIn ignored.

      Larry Wall tweeted on Feb 14:
       @genespeth \o/

      Larry Wall tweeted on Feb 12:
       Let us not forget that the perfect is also the enemy of the bad and the ugly.

      Larry Wall tweeted on Feb 03:
       Wow. Just...wow. #sb48

      Both get the job done, and it's certainly not a night-and-day difference, but I do prefer my API even for simple cases like this, because...

      • ...of its declarative nature, i.e. instead of saying "Perform these operations, and while going along store these values", you say "I want a list of items/rows as output, and each of them should have these fields, so here are the parsing rules for extracting each of them". Which feels more elegant to me, but that's of course subjective.
      • ...it doesn't make you jump through hoops to keep all information about the query (including how to construct the URL) in one place.

      What do you think?

Re: RFC: API for declarative parameterized web queries
by Don Coyote (Hermit) on Apr 11, 2014 at 14:30 UTC

    hi smls

    Your three requirements of encapsulation, declarativeness, and reusability would suggest to me an object-oriented interface. I am not convinced that overloading the global symbol space really helps with staying encapsulated. And what you describe here sounds a lot like what Exporter does when you import the register_query routine. So are you in effect calling import for each query you register, rather than using a class interface and calling methods on each query object? The latter would satisfy the reusability and declarative requirements too.

    As the query is in a hash format, I would look at making register_query a constructor (something like my $rq = RQS->new), and then go on to load up the query object. Your module needs to do the work of defining the methods you would then call to set up each specific query, as well as the execution methods.
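
    Roughly something like this (RQS and the method names are just placeholders for illustration):

    use RQS;   # placeholder name for the query class

    # Construct and configure a query object instead of
    # registering a named sub in the caller's namespace:
    my $rq = RQS->new(
        url   => 'http://thetvdb.com/api/GetSeries.php?seriesname=%s',
        items => '/Data/Series',
        parse => {
            name => './SeriesName',
            year => ['./FirstAired', qr/^(\d{0,4})/],
            desc => './Overview',
        },
    );

    # Execute it with parameters and iterate over the result rows:
    my $it = $rq->execute( $ARGV[0] );
    while (my $row = $it->()) {
        print "$row->{name} [$row->{year}]\n";
    }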

    In summary, I think the interface is a little muddled. You start off functionally, but then create objects on top of a functional interface. The well-known module CGI is a classic in that it provides both interfaces, so it is worth a bit of study. I think you would be best off choosing either functional or object-oriented, or doing something like CGI, which offers either interface. I am not so sure about trying to implement both interfaces through one request, though.

    Others have suggested similar modules to look at (admittedly, CGI is a monolith). They may not do exactly what you are trying to do in terms of usage, but they may provide you with some implementation hints that are closer to your interface model.

    In terms of what you are describing, the use of this module would certainly help you, so there is no reason not to follow up on it; and usually, if you have a need for something, others may well too.


    My other OS is a gateway!
Re: RFC: API for declarative parameterized web queries
by Anonymous Monk on Apr 10, 2014 at 21:50 UTC
    I don't get it :) improved WWW::Search?
    sub gimme_tvdb_blah {
        use Web::Query;
        use Web::Magic;
        use Web::Scraper;

        my $object = ... ; ## xpathish data extraction using magic above

        return $object;
    }
      improved WWW::Search?

      No, this is not about search engines. A "query" targets a specific web page (usually an HTML or XML page), or a superposition of related pages that share the same structure and whose URLs differ only by some parameter.

      Whenever it is executed, the query downloads its target page and extracts information from it, where said information is logically structured as tabular data, and is returned as a sequence of hashes/objects (where each hash/object represents a "row" of the tabular data).

          use Web::Query;
          use Web::Magic;
          use Web::Scraper;
          my $object = ... ; ## xpathish data extraction using magic above
      

      Yes, it's totally possible to do it using those modules, just as it is possible to use LWP::UserAgent and XML::Parser/XML::LibXML/etc. directly as I have been doing.

      But then each query definition ends up being a block of imperative source code of some form, which of course provides maximum flexibility, but which I find inconvenient to debug and maintain when dealing with lots of queries. What I'm envisioning is a declarative approach that is tailored to the use-case of extracting tabular data as described above (and in the future maybe other regularly structured data, such as "scalar" or "flat list"), and that allows the user to specify all the information required for a query as Perl data (i.e. a hash) rather than as imperative code.

      It will have less power and flexibility than the modules you listed, but will hopefully allow query definitions to be neater and more regular, and thus easier to maintain.

      PS: Although Web::Magic does look pretty sweet...

        No, this is not about search engines. A "query" targets a specific web page ...

        Same difference :) a search engine is nothing but a specific web page, and all those modules do is return ... frabular ... data from those "web pages" using the same interface

        sub whatever { WhateverYourModuleDoesOrExports(%args) }

        A plain ol' function, no awkward OOPy registering (or whatever that is supposed to be, the part that I don't understand)