I'm taking another approach to this problem... based on the comments from theorbtwo. The current code looks like this:

sub gimme_guten_tables {
   my ($decoded, $maximum) = @_;

   $decoded =~ s,<li>\n(.*?)\n</li>,$1,g;
   $decoded =~ s,(.*?)<br><description>.*?</description>,$1,g;
   $decoded =~ s,<ul>(.*?)</ul>,$1,g;
   $decoded =~ s,<li>(.*?)</li>,$1,g;
   $decoded =~ s,<\/?ol>,,g;
   $decoded =~ s,<html xmlns:rss="http://purl.org/rss/1.0/"><body><ul>,,;
   $decoded =~ s,</ul></body></html>\n.*,,;
   $decoded =~ s,^\n<a,<a,g;

   my @gutenbooks = ($decoded =~ /([^\r\n]+)(?:[\r\n]{1,2}|$)/sg);

   my $guten_tables;

   my ($link_status, $plkr_type, $html_type, $text_type);

   my $count = 1;
   for my $line (@gutenbooks[0 .. $maximum-1]) {
      if ($line && $line =~ m/href=".+\/(\d+)">(.*?)(?: \((\d+)\))?<\/a>/) {
         my $splitguten = join('/', split(/ */, $1));
         my $clipguten = substr($splitguten, -2, 2, '');

         my $readmarks = $3 ? $3 : $1;

         my $title = $2;
         $title =~ s,by (.*?)</a>,</a> by $1,g;

         my %gutentypes = (
            plucker => {
               'mirror'       => "http://www.gutenberg.org/cache/plucker/$1/$1",
               'content-type' => 'application/prs.plucker',
               'string'       => 'Plucker',
               'format'       => 'pdb'
            },

            html    => {
               'mirror'       => "http://www.gutenberg.org/dirs/$splitguten/$1/$1-h/$1-h.htm",
               'content-type' => 'text/html',
               'string'       => 'Marked-up HTML',
               'format'       => 'html'
            },

            text    => {
               'mirror'       => "http://sailor.gutenberg.lib.md.us/$splitguten/$1/$1.txt",
               'content-type' => 'text/plain',
               'string'       => 'Plain text',
               'format'       => 'txt'
            },
         );

      for my $types ( sort keys %gutentypes ) {
         my ($status, $type) = test_head($gutentypes{$types}{mirror});

         if ($status == 200) {
            $gutentypes{$types}{link} = 
               qq{<a href="$gutentypes{$types}{mirror}">$gutentypes{$types}{format}</a>\n};
         } else {
            $gutentypes{$types}{link} = 
               qq{<s>$gutentypes{$types}{format}</s>};
         }
      }
 
      $guten_tables .= qq{<tr>

         <td width="40" align="center">$count</td>   
         <td width="40" align="right">$readmarks</td>
         <td width="500">
            <a href="http://www.gutenberg.org/etext/$1">$title</a>
         </td>
         <td align="center">$gutentypes{plucker}{link}</td>
         <td align="center">$gutentypes{html}{link}</td>

         <td align="center">$gutentypes{text}{link}</td>
         </tr>\n};

         $count++;
      }
   }

   $guten_tables =~ s,\&,\&amp;,g;
   $guten_tables =~ s,>\n\s+<,><,g;

   return $guten_tables;
}

sub test_head {
   my $url = shift;

   my $ua = LWP::UserAgent->new();
   $ua->agent('Mozilla/5.0 (Windows; U; Windows NT 5.1;) Firefox/2.0.0.6');

   # HEAD asks for the headers only, so no document body is downloaded
   my $response = $ua->head($url);

   # code() is the numeric status (e.g. 200); no need to parse status_line
   return ($response->code, $response->header('Content-Type'));
}

In this code, I'm taking an array, @gutenbooks, splitting out the etext id ($1) and the etext title ($2), and creating a hash of the 3 different formats of that work (pdb, html, txt).

For each link I create, I pass it through test_head(), and check to see if it returns a '200' status or not. If the link is a '200' (i.e. exists, and is valid), I create a clickable link to it. If the link is NOT '200', then I don't link to it (i.e. I don't create a link that the user can click, to get a 404 or missing document).

What I'd like to implement is a way to take all of the links at once, pass them into some sub, parallelize the HEAD checks across them, and return answers based on those checks.

But here is where I'm stuck...

How do I take the single URLs coming out of my match function and build a hash of them?
How do I then pass that hash to "something" that can check their validity (in some random order)?
How do I keep track of the responses returned from that check, maintaining integrity, so I can link/unlink the entry in the table I'm outputting?
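
One possible shape for this, as a rough sketch rather than a tested solution: key every URL by something that identifies its table cell (here "etextid:format"), fork a limited number of children to do the checks, and read each answer back over a pipe under the same key. The check_all() sub and the stand-in checker are hypothetical names; the stand-in only exists so the sketch runs without network access, and a wrapper around test_head() would take its place. It uses plain fork and pipe from core Perl, and a CPAN module such as Parallel::ForkManager could take over the same bookkeeping.

```perl
use strict;
use warnings;
use POSIX ();

# check_all(\%urls, \&check, $max_procs)
# Calls $check->($url) once per entry in forked children, at most
# $max_procs at a time. Returns a hashref with the same keys as %urls,
# so every answer can be matched back to its table cell.
sub check_all {
    my ($urls, $check, $max_procs) = @_;
    my @keys = sort keys %$urls;
    my %status;

    while (my @batch = splice(@keys, 0, $max_procs)) {
        my (%reader, @pids);

        for my $key (@batch) {
            pipe(my $r, my $w) or die "pipe failed: $!";
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;

            if ($pid == 0) {                  # child: one check, report, leave
                close $r;
                my $answer = eval { $check->($urls->{$key}) };
                print {$w} defined $answer ? $answer : 0;
                close $w;
                POSIX::_exit(0);              # skip END blocks and copied buffers
            }

            close $w;                         # parent keeps only the read end
            $reader{$key} = $r;
            push @pids, $pid;
        }

        # One answer per key; the order in which children finish is irrelevant
        for my $key (@batch) {
            my $fh = $reader{$key};
            $status{$key} = <$fh>;
            close $fh;
        }
        waitpid($_, 0) for @pids;             # reap the children
    }
    return \%status;
}

# Hypothetical stand-in for test_head(): pretends every .txt file is missing
my $fake_head = sub { $_[0] =~ /\.txt$/ ? 404 : 200 };

# Placeholder URLs, keyed by "etextid:format"
my %urls = (
    '11:html' => 'http://example.invalid/11/11-h/11-h.htm',
    '11:text' => 'http://example.invalid/11/11.txt',
    '12:html' => 'http://example.invalid/12/12-h/12-h.htm',
);

my $status = check_all(\%urls, $fake_head, 2);
print "$_ => $status->{$_}\n" for sort keys %$status;
```

With the results keyed this way, the table loop would only need to look up $status->{"$id:html"} and so on to decide between a link and a struck-out format name.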

I have no experience with LWP::Parallel, LWP::ParallelUA, LWP::Parallel::ForkManager and the like (passing references, callbacks, etc.).

Can some monk give me a strong nudge in the right direction?

The docs for these modules assume I am just statically defining the URLs I want to check... and I can't do that; everything will be coming out of a dynamic, ever-changing array.
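
As far as I can tell, nothing requires the URL list to be static. It can be built at run time from whatever ids the match produced, and only then handed to a checker. A minimal sketch, assuming the same mirror layouts as the code above; build_url_map() and the sample ids are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: map "etextid:format" keys to candidate URLs
# for any list of etext ids, however that list was produced.
sub build_url_map {
    my @ids = @_;
    my %urls;

    for my $id (@ids) {
        # 12345 -> "1/2/3/4", the same directory split as in the loop above
        my @digits = split //, $id;
        pop @digits;
        my $dir = join '/', @digits;

        $urls{"${id}:plucker"} = "http://www.gutenberg.org/cache/plucker/$id/$id";
        $urls{"${id}:html"}    = "http://www.gutenberg.org/dirs/$dir/$id/$id-h/$id-h.htm";
        $urls{"${id}:text"}    = "http://sailor.gutenberg.lib.md.us/$dir/$id/$id.txt";
    }
    return %urls;
}

# The ids would come out of the regex match; these two are sample values
my @ids  = (12345, 678);
my %urls = build_url_map(@ids);

print "$_ => $urls{$_}\n" for sort keys %urls;
```

Because each key carries both the id and the format, the hash can grow or shrink with the array and every entry still points back to exactly one table cell.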

Thanks.

In reply to Re: Speeding up/parallelizing hundreds of HEAD requests by hacker
in thread Speeding up/parallelizing hundreds of HEAD requests by hacker
