I've been working on a replacement website for Plucker. One of its features is a live pull of Project Gutenberg etexts from their today.rss feed and their Top 100 list of electronic texts.

So far, this works great. I've even worked out a simple caching mechanism that queries the upstream data only when it has changed.

From this data, I build an HTML table that links our users to several versions of each etext. The table looks like this:

Place	Etext #	Book Title	Download as...
1	22617	Chambers's Edinburgh Journal, No. 454 by Various	pdb	html	~~txt~~
2	22621	The New England Magazine, Volume 1, No. 1, January 1886 by Various	pdb	html	~~txt~~
3	22610	Punch, or the London Charivari, Vol. 150, January 19, 1916 by Various	pdb	html	~~txt~~
4	22612	Punch, or the London Charivari, Vol. 150, January 26, 1916 by Various	pdb	html	~~txt~~
5	22611	The Fox and the Geese; and The Wonderful History of Henny-Penny by Anonymous	pdb	html	~~txt~~
6	22609	The Writings of James Russell Lowell in Prose and Poetry, Volume V by James Russell Lowell	pdb	html	~~txt~~
7	22619	International Copyright by George Haven Putnam	pdb	html	~~txt~~
8	22614	A Pavorosa Illusão by Manuel Maria Barbosa du Bocage	pdb	~~html~~	~~txt~~
9	22616	Salve, Rei! by Camilo Castelo Branco	pdb	~~html~~	~~txt~~
10	22604	Children and Their Books by James Hosmer Penniman	pdb	html	~~txt~~

In the table above, you can see that some elements are ~~struck out~~. This is done with the following snippet of code:

    my %gutentypes = (   
        plucker => { 
            'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
            'content-type' => 'application/prs.plucker',
            'string'       => 'Plucker',
            'format'       => 'pdb'
        },

        html    => {
            'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
            'content-type' => 'text/html',
            'string'       => 'Marked-up HTML',
            'format'       => 'html'
        },
 
        text    => {
            'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
            'content-type' => 'text/plain',
            'string'       => 'Plain text',
            'format'       => 'txt'
        },
    );

    for my $types ( sort keys %gutentypes ) {
        my ($status, $type) = test_head($gutentypes{$types}{mirror});

        if ($status == 200) {
            $gutentypes{$types}{link} = qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
        } else {
            $gutentypes{$types}{link} = qq{<strike>$gutentypes{$types}{format}</strike>};
        }
    }

    sub test_head {
        my $url = shift;

        my $ua = LWP::UserAgent->new;
        $ua->agent('pps Plucker Perl Spider, v0.1.83 [rss]');

        # HEAD fetches only the headers, so no body is downloaded
        my $request  = HTTP::Request->new(HEAD => $url);
        my $response = $ua->request($request);

        # code() is the numeric status (e.g. 200), so no regex on status_line is needed
        my $status = $response->code;
        my $type   = $response->header('Content-Type');

        return ($status, $type);
    }

The number of items shown in the list is controlled by a scalar I set for the maximum, and an array slice: `for my $line (@gutenbooks[0 .. $maximum-1]) {...}`.
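A minimal sketch of that slice, assuming the feed may sometimes hold fewer than `$maximum` entries. The contents of `@gutenbooks` here are made-up placeholders, and the clamping step is an addition that is not in the original script; without it, an array slice that runs past the end yields undef elements.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed feed entries.
my @gutenbooks = map { "book $_" } 1 .. 7;
my $maximum    = 10;

# Clamp the upper bound so a short feed does not produce undef entries.
my $last = $maximum - 1;
$last = $#gutenbooks if $last > $#gutenbooks;

my @shown = @gutenbooks[0 .. $last];
print scalar(@shown), " titles shown\n";
```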

The more books I want to show, the longer the page takes to draw, because I'm making 3 HEAD requests for every title (plucker, html, text) and linking or striking out accordingly.

If I display 15 titles, that's at least 45 HEAD requests I have to make. That takes 2-5 seconds, depending on the latency to the mirror servers I'm pointing to, but it is still a delay. If one of those mirrors is not responding, the page load could take forever (or until the remote end or the user's browser times out).
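One way to bound the wait on an unresponsive mirror might be a client-side timeout on a single shared user agent. This is only a sketch: the 5-second value is an assumption, not something from the script above, and LWP's timeout covers inactivity on the connection rather than the total request time.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# One shared agent for all checks; the 5-second timeout is an assumed value.
my $ua = LWP::UserAgent->new(
    agent   => 'pps Plucker Perl Spider, v0.1.83 [rss]',
    timeout => 5,
);

# Returns the numeric status code and the Content-Type of a HEAD request.
# When the request cannot be completed, LWP hands back an internally
# generated error response instead of dying, so a "== 200" test in the
# caller still strikes the link out.
sub test_head {
    my ($url) = @_;
    my $response = $ua->head($url);
    return ($response->code, scalar $response->header('Content-Type'));
}
```

Reusing one `$ua` also avoids constructing a new LWP::UserAgent object for each of the 45 requests.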

I looked into using HTTP::Lite, HTTP::GHTTP, LWP::Simple and others to try to speed it up, but straight LWP::UserAgent was far and away the fastest (by about 3x), so I'm back to the drawing board.

I also looked into using LWP::Parallel::UserAgent and/or Parallel::ForkManager, but they're a bit more complex than I'd hoped (registering the links, then passing through a callback, etc.).
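For comparison, a fork-based version can be fairly short. The sketch below uses the CPAN module Parallel::ForkManager (assumed to be the fork manager meant here); `head_all`, the worker count, and the 5-second timeout are hypothetical names and values chosen for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

# Run HEAD checks for a list of URLs in up to $workers child processes.
# Returns a hash ref mapping each URL to its numeric status code.
sub head_all {
    my ($urls, $workers) = @_;
    my %status;

    my $pm = Parallel::ForkManager->new($workers);

    # Runs in the parent each time a child exits; $data is the reference
    # the child passed to finish(). A child that died sends nothing back.
    $pm->run_on_finish(sub {
        my ($pid, $exit, $ident, $signal, $core, $data) = @_;
        $status{$ident} = defined $data ? $$data : 0;
    });

    for my $url (@$urls) {
        $pm->start($url) and next;    # parent: go on to the next URL

        # child: make one request, then hand the code back to the parent
        my $ua   = LWP::UserAgent->new(timeout => 5);
        my $code = $ua->head($url)->code;
        $pm->finish(0, \$code);
    }
    $pm->wait_all_children;

    return \%status;
}
```

A call such as `my $status = head_all(\@urls, 10);` would then let the table-building loop test `$status->{$url} == 200` instead of issuing each request in sequence.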

This was briefly discussed in the CB yesterday, and bart (I think; forgive me if I have the wrong monk) suggested that I check HEAD at some interval, every hour or every day, independent of any user's request for the page, and store the results in a database. My script would then always query the database instead of hitting the remote URLs every time the page is requested. He's right, to a point: 45 or 100 or 200 database queries are MUCH faster than issuing 3 new HEAD requests for each title displayed.
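The store-and-look-up idea could be sketched roughly as follows. Note that this is a variant, not the suggestion exactly as given: instead of a scheduled job it re-checks an entry when it is older than a chosen age. `cached_status`, the `$ttl` parameter, and the shape of the cache hash are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Look up a URL's status in $cache (a hash ref: url => { status, checked }).
# If the entry is missing or older than $ttl seconds, call $check->($url)
# to refresh it. $check is whatever performs the real HEAD request.
# $now is optional and defaults to the current time.
sub cached_status {
    my ($cache, $url, $ttl, $check, $now) = @_;
    $now = time unless defined $now;

    my $entry = $cache->{$url};
    if (!$entry || $now - $entry->{checked} > $ttl) {
        $entry = $cache->{$url} = { status => $check->($url), checked => $now };
    }
    return $entry->{status};
}
```

The hash itself could be kept between requests in a database table, or with the core Storable module's `store` and `retrieve`; either way only stale entries cost a HEAD request.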

After thinking about this, I see a few possible problems:

- If I check the links at 1am in a cron(1) job on the server side, and a user visits the page at 7pm that night, the links may be down/invalid/changed/redirected.
- Coupling my script to system processes (i.e. a cron job) doesn't make it as clean and portable as I'd like if I have to move it from system to system. It also doesn't easily allow me to move it to an upstream hosting provider where I may not have access to cron.

Another suggestion was that I use some AJAX glue and let the end-user's browser figure out whether a link is dead, ONLY when they decide to click on it.

This, too, presents some problems:

- It limits the feature to those with a browser supporting Javascript (and having it enabled, from what I understand, a shrinking minority).
- It does not work in text-mode browsers or for web spiders.
- Visually, there is no indication of which formats are available for a given title.

Is there an easier way to do this, so the end-user experience is not so hampered?

In reply to Speeding up/parallelizing hundreds of HEAD requests by hacker
