Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I abondoned the idea. To do the indexing I envisioned justice, I would have had to download the greater majority of PM's nodes. At the mandated rate of 1 every 5 seconds+, it would require 500 hours. Split that into 2 hour chunks of connect time and it becomes untenable.

Hence I've never bothered to extend the scripts beyond their simplist form:

PMDown.pl takes a filename containing a list of PM nodeid's to download:

BewareEven with 1 thread running, this will far exceed the download rate approved.

#! perl -slw use strict; use threads qw[ yield async ]; use Thread::Queue; use LWP::Simple; $|=1; our $THREADS ||= 4; my $Qwork = new Thread::Queue; my $Qresults = new Thread::Queue; sub work{ my $tid = threads->self->tid(); sleep 1 until $Qwork->pending; while( $Qwork->pending or ( sleep(1) and $Qwork->pending ) ) { my $work = $Qwork->dequeue; print "$tid checking: $work"; if( -e "c:/perlmonks/$work.xml" and not -z _ ) { $Qresults->enqueue( "$work returned exists not fetched" ); next; } print "$tid fetching: $work"; my $rc = getstore( "http://perlmonks.com/index.pl?node_id=$work&displaytype=x +ml", "c:/perlmonks/$work.xml" ); $Qresults->enqueue( "$work returned $rc" ); } return; } my @t = map{ threads->new( \&work ) } 1 .. $THREADS; open IN, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!"; my $cResults = 0; while( <IN> ) { chomp; $Qwork->enqueue( $_ ) and $cResults++; sleep 1 while $Qwork->pending > 100; } $_->join for @t; my %summary; $Qresults->dequeue =~ m[(\S+) returned (\S+)] and push @{ $summary{ $2 } }, $1 while $Qresults->pending; for my $key ( sort keys %summary ) { print $key, scalar @{ $summary{ $key } }; } printf 'Enter to see details or ^C'; <STDIN>; for my $key ( sort keys %summary ) { print $key; my $details = join ' ', @{ $summary{ $key } }; $details =~ s[.{1,80} ][\n]g; print $details; }

ExtractWords.pl

#! perl -slw use G; my %words; while( <> ) { $words{ $_ }++ for m[\b([a-zA-Z][a-zA-Z']+[a-zA-Z])\b]g; } open WORDS, '>', 'words.dat' or die $!; print WORDS for sort keys %words; close WORDS;

IndexDocs.pl

#! perl -slw use strict; use G; $|=1; chomp( my @words = do{ open my $fh, '<words.dat'; <$fh> } ); print "loaded: " . @words . ' words'; local $/; my %index; @index{ @ARGV } = ('') x @ARGV; while( <> ) { chomp( my $file = lc ); 1+index( $file, $words[ $_ ] ) and vec( $index{ $ARGV }, $_, 1 ) = 1 for 0 .. $#words; } open INDEX, '>', 'index.dat' or die $!; print INDEX "$_(@{[ unpack '%b*', $index{ $_ } ]}) : [@{[ unpack 'b*', + $index{ $_ } ]}]" for sort keys %index; close INDEX;

Note: G.pm is Jenda's module that does wildcard ARGV expansion.

The result of processing is a file that looks like this:

.\171594.txt : all an and anonymous asked at back be better but by com + concerning contain create directories even excluding expression foll +owing for gone has have hours in index jun last list looking monks of + on over pl probably question renders replies round seekers simple th +anks that the this to want wisdom without would .\171599.txt : am and are at be being brothers but by com comes create + darkness directories doubt enlightenment etiquette help here if in i +ndex jun light list living me my no not of on order piece pl re repli +es reply seeking so someone strong sure tell that the thread to unsur +e until way weak will with without

But I manually filtered the intermediate words list.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

In reply to Re^2: Mass downloads. by BrowserUk
in thread Mass downloads. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (4)
As of 2024-04-24 06:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found