I've been working on a replacement website for Plucker, and one of its features is a live pull of some of the Project Gutenberg etexts from their today.rss feed and their Top 100 list of electronic texts.

So far, this works great. I've even worked out a small caching mechanism that only queries the upstream data when it has changed.

From this data, I build an HTML table that links to several versions of each etext for our users. The table looks like this:

Place | Etext # | Book Title | Download as...
1 | 22617 | Chambers's Edinburgh Journal, No. 454 by Various | pdb html txt
2 | 22621 | The New England Magazine, Volume 1, No. 1, January 1886 by Various | pdb html txt
3 | 22610 | Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various | pdb html txt
4 | 22612 | Punch, or the London Charivari, Vol. 150, January 26, 1916 by Various | pdb html txt
5 | 22611 | The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous | pdb html txt
6 | 22609 | The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell | pdb html txt
7 | 22619 | International Copyright by George Haven Putnam | pdb html txt
8 | 22614 | A Pavorosa Illusão by Manuel Maria Barbosa du Bocage | pdb html txt
9 | 22616 | Salve, Rei! by Camilo Castelo Branco | pdb html txt
10 | 22604 | Children and Their Books by James Hosmer Penniman | pdb html txt

In the rendered table, some of the download links are struck out because that format is not available. This is done with the following snippet of code:

```perl
use LWP::UserAgent;
use HTTP::Request;

# $1 and $splitguten are set earlier in the script, outside this snippet.
my %gutentypes = (
    plucker => {
        'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
        'content-type' => 'application/prs.plucker',
        'string'       => 'Plucker',
        'format'       => 'pdb'
    },
    html => {
        'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
        'content-type' => 'text/html',
        'string'       => 'Marked-up HTML',
        'format'       => 'html'
    },
    text => {
        'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
        'content-type' => 'text/plain',
        'string'       => 'Plain text',
        'format'       => 'txt'
    },
);

# Link each format if its mirror URL answers 200, otherwise strike it out.
for my $types ( sort keys %gutentypes ) {
    my ($status, $type) = test_head($gutentypes{$types}{mirror});
    if ($status == 200) {
        $gutentypes{$types}{link} =
            qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
    } else {
        $gutentypes{$types}{link} =
            qq{<strike>$gutentypes{$types}{format}</strike>};
    }
}

# Issue a HEAD request and return (numeric status code, Content-Type).
sub test_head {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent('pps Plucker Perl Spider, v0.1.83 [rss]');

    my $request  = HTTP::Request->new(HEAD => $url);
    my $response = $ua->request($request);

    return ($response->code, $response->header('Content-Type'));
}
```

The number of items shown in the list is controlled by a scalar I set for the maximum, and an array slice: for my $line (@gutenbooks[0 .. $maximum-1]) {...}.
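One detail of that slice is worth a quick sketch: if the feed holds fewer than $maximum entries, the slice pads the list with undef. The version below clamps the upper index first. The sample data and the @shown array are made up for illustration; only @gutenbooks and $maximum come from my script.

```perl
use strict;
use warnings;

# Hypothetical stand-in data; the real @gutenbooks comes from the parsed feed.
my @gutenbooks = map { "book $_" } 1 .. 4;
my $maximum    = 10;

# Clamp the upper index so a short feed does not yield undef entries.
my $last = $maximum - 1;
$last = $#gutenbooks if $last > $#gutenbooks;

my @shown;
for my $line (@gutenbooks[0 .. $last]) {
    push @shown, $line;    # the real loop builds a table row here
}
```

With an empty feed, $#gutenbooks is -1, so the range 0 .. -1 is empty and the loop body never runs.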

The more books I want to show, the longer it takes for the page to draw, because I make 3 HEAD requests for every title (plucker, html, text) and link or strike out each format accordingly.

If I display 15 titles, that's at least 45 HEAD requests I have to make. That takes 2 to 5 seconds, depending on the latency to the mirror servers I'm pointing to, but it is still a delay. If one of those mirrors is not responding, the page load could take forever (or until the remote end or the user's browser times out).
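To put a rough number on "forever": LWP::UserAgent's documented default timeout is 180 seconds, and the requests run one after another. The back-of-the-envelope sketch below assumes every request waits out the full timeout, which is the pessimistic case. The helper name worst_case_seconds and the 5-second figure are mine, purely for illustration.

```perl
use strict;
use warnings;

# Sequential worst case: every request waits out the full timeout.
sub worst_case_seconds {
    my ($titles, $formats, $timeout) = @_;
    return $titles * $formats * $timeout;
}

my $default = worst_case_seconds(15, 3, 180);   # LWP default timeout
my $capped  = worst_case_seconds(15, 3, 5);     # assumed shorter timeout

printf "default timeout: %d s, 5 s timeout: %d s\n", $default, $capped;
```

That is 8100 seconds (135 minutes) against 225 seconds, so calling $ua->timeout(5) in test_head would at least bound the damage, although it does nothing for the normal-case latency.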

I looked into using HTTP::Lite, HTTP::GHTTP, LWP::Simple and others to try to speed it up, but straight LWP::UserAgent was far and away the fastest (by about 3x), so I'm back to the drawing board.

I also looked into using LWP::Parallel::UserAgent and/or Parallel::ForkManager, but they're a bit more complex than I'd hoped (registering the links, then passing them through a callback, etc.).
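For comparison, here is a minimal sketch of what a hand-rolled parallel check could look like using only core Perl (fork and pipe), with no callback registration. It is not either of the modules above. The name check_parallel is hypothetical, the checker is passed in as a code reference so that the sketch does not depend on the network, and it assumes a Unix-like system where fork is cheap.

```perl
use strict;
use warnings;
use POSIX ();

# Run $checker->($url) for every URL in its own child process and
# collect the returned status codes over pipes.
sub check_parallel {
    my ($checker, @urls) = @_;
    my %child;
    for my $url (@urls) {
        next if $child{$url};                   # skip duplicate URLs
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                        # child process
            close $reader;
            my $code = eval { $checker->($url) } // 0;
            print {$writer} $code;
            close $writer;
            POSIX::_exit(0);                    # skip END blocks and destructors
        }
        close $writer;                          # parent keeps only the read end
        $child{$url} = { pid => $pid, fh => $reader };
    }

    my %status;
    for my $url (keys %child) {
        my $fh   = $child{$url}{fh};
        my $code = <$fh>;                       # blocks until that child is done
        close $fh;
        waitpid $child{$url}{pid}, 0;
        $status{$url} = defined $code ? $code : 0;
    }
    return %status;
}
```

With the snippet above, the call would be something like check_parallel(sub { (test_head($_[0]))[0] }, @urls). Total time then approaches that of the slowest single request rather than the sum of all of them, but this starts one process per URL with no upper limit, so hundreds of URLs would need to be checked in batches.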

This was briefly discussed in the CB yesterday, and bart (I think; forgive me if I have the wrong monk) suggested that I check HEAD every hour/day or at some other interval, independent of any user's request for the page, store the results in a database, and have my script always query the database instead of hitting the remote URLs directly every time the page is requested. He's right to a point... 45 or 100 or 200 database queries is MUCH faster than issuing 3 new HEAD requests for each title displayed.

After thinking about this, I see a few possible problems:

  1. If I check the links at 1am in a cron(1) job on the server-side, and a user visits the page at 7pm that night, the links may be down/invalid/changed/redirected.
  2. Coupling my script to system processes (i.e. a cron job) doesn't make it as clean and portable as I'd like if I have to move it from system to system (it also doesn't easily allow me to move it to an upstream hosting provider where I may not have access to cron).
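One variation on the stored-results idea that would not need cron is to refresh each entry lazily, only when a page request finds it older than some time-to-live. The sketch below is only that, a sketch: cached_status, the one-hour $TTL and the cache file are assumptions of mine, a Storable file stands in for the database, and the checker is passed in as a code reference.

```perl
use strict;
use warnings;
use Storable qw(lock_store lock_retrieve);

my $TTL = 3600;    # assumed: re-check a link at most once an hour

# Return the cached status for $url, calling $checker->($url) only when
# there is no cache entry yet or the entry is at least $TTL seconds old.
sub cached_status {
    my ($file, $url, $checker, $now) = @_;
    $now //= time;

    my $cache = -s $file ? lock_retrieve($file) : {};
    my $entry = $cache->{$url};

    if (!$entry || $now - $entry->{checked} >= $TTL) {
        $entry = $cache->{$url} = {
            status  => $checker->($url),
            checked => $now,
        };
        lock_store($cache, $file);
    }
    return $entry->{status};
}
```

The first visitor after an entry expires still pays for the HEAD request, and two simultaneous requests can overwrite each other's updates between the read and the write, so this trades some freshness and robustness for portability.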

Another suggestion was that I use some AJAX glue, and let the end-user's browser figure out which links are dead, ONLY when they decide to click on them.

This, too, presents some problems:

  1. It limits the feature to users whose browser supports JavaScript and has it enabled (those without are, from what I understand, a shrinking minority).
  2. It does not work in text-mode browsers or for web spiders.
  3. Visually, there is no indication of which titles are available in that format or not.

Is there an easier way to do this, so the end-user experience is not so hampered?


In reply to Speeding up/parallelizing hundreds of HEAD requests by hacker