Benign Web Miner

by perlmonkey2 (Beadle)
on Sep 30, 2006 at 17:00 UTC

perlmonkey2 has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

For a while now, I've been working on a problem for my academic research center. It goes like this: we start with a list of URLs, mine the text from those pages, and then follow the links found on them. Because we mine a lot of text, disk fragmentation is a huge issue. Previously we were using Wget, but 50 Wget instances writing to disk at the same time quickly drove our NTFS mean file fragment size down to 4KB. So we planned a replacement for Wget that would have only one file being written to disk at any given moment.

The problem with existing Perl solutions is that they don't have the functionality required. For instance, they need to be able to block while other instances are writing to disk, to optionally span hosts, to accept a huge domain-don't-go-there list, to ignore certain extension types, and to limit recursion depth; i.e., most of the main features of Wget.
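Roughly, the set of options I need looks something like this (the names here are purely illustrative, not taken from any existing module):

    # Purely illustrative option set -- none of these names come from a real module.
    my %crawl_opts = (
        start_urls      => [ 'http://example.org/', 'http://example.com/' ],
        span_hosts      => 1,                     # follow links off the starting hosts
        max_depth       => 3,                     # recursion depth limit
        skip_domains    => 'dont-go-there.txt',   # huge domain blocklist, one domain per line
        skip_extensions => [qw( jpg gif png zip exe )],
        single_writer   => 1,                     # block while another instance is writing to disk
    );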

So far I've made a hash of using Win32 threads and LWP::UserAgent, as the number of threads had to stay low or perl.exe would die. In order to be nice to the servers, very few hits per minute per domain are allowed, which means even the highest number of stable Win32 threads (50) goes VERY slowly. The solution was to have the threads share the lists of URLs to get, making sure no single domain was hit too often; with a thousand domains in the lists, 50 threads could move quite quickly. This led to problems with Win32 threads::shared not coping well with extremely large data structures (I had 50 miner threads writing to a queue from which a single writer thread wrote to disk).
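For reference, the shape of what I tried was roughly this (a simplified reconstruction rather than the actual code; url_to_filename() stands in for whatever naming scheme you use, and the per-domain rate limiting and shutdown logic are omitted):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;    # recent versions will clone references onto the queue for you
    use LWP::UserAgent;

    my $work_q  = Thread::Queue->new();    # URLs waiting to be fetched
    my $write_q = Thread::Queue->new();    # fetched pages waiting to be written

    # A single writer thread, so only one file is ever being written at a time.
    my $writer = threads->create( sub {
        while ( defined( my $job = $write_q->dequeue() ) ) {
            my ( $file, $content ) = @$job;
            if ( open my $fh, '>', $file ) {
                print {$fh} $content;
                close $fh;
            }
        }
    } );

    # Fifty miner threads pull URLs and hand the results to the writer.
    my @miners = map {
        threads->create( sub {
            my $ua = LWP::UserAgent->new( timeout => 30 );
            while ( defined( my $url = $work_q->dequeue() ) ) {
                my $res = $ua->get($url);
                $write_q->enqueue( [ url_to_filename($url), $res->content ] )
                    if $res->is_success;
            }
        } );
    } 1 .. 50;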

So now I'm thinking that LWP::Parallel::UserAgent will resolve the issue, as it is single threaded yet able to fetch from many sites at the same time.
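Something along these lines is what I have in mind, going by the LWP::Parallel::UserAgent documentation (I haven't tried it yet, so treat it as a sketch; load_url_list() is just a stand-in for however the URL list gets produced):

    use strict;
    use warnings;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    my @urls = load_url_list();    # stand-in for the real URL source

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->timeout(30);
    $pua->redirect(1);      # follow redirects
    $pua->max_hosts(50);    # talk to up to 50 hosts in parallel
    $pua->max_req(1);       # but only one request per host at a time

    for my $url (@urls) {
        # register() returns an error response if the request cannot be queued
        if ( my $err = $pua->register( HTTP::Request->new( GET => $url ) ) ) {
            warn "could not register $url\n";
        }
    }

    my $entries = $pua->wait(60);    # block until everything is done or times out
    for my $entry ( values %$entries ) {
        my $res = $entry->response;
        next unless $res->is_success;
        # single process, so pages can be written to disk one at a time here
    }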

If anyone has any thoughts, ideas, or recommendations, I would appreciate it.

Replies are listed 'Best First'.
Re: Benign Web Miner
by BrowserUk (Patriarch) on Sep 30, 2006 at 18:08 UTC

    Sharing large volumes of data between threads, whether directly through threads::shared or via tools (like Thread::Queue) built on top of it, is simply not effective. It is a penalty of the ithreads architecture that even if you only want to share data between 2 individual threads, every other thread in the process also gets a copy, like it or not, use it or not.

    On the face of it, two simple changes to your Win32 threaded code might sort the problems out.

    1. Have each thread retrieve the data from the url into an in-memory, *non-shared* buffer, and use a single semaphore (i.e. a shared scalar) and threads::shared's lock() to ensure that only one thread writes to disk at a time (there is a sketch of this after the list below).
    2. Create yourself a tperl.exe with a greatly reduced stack-size reservation, as described in Use more threads. With this reduced to 64k, you should find you are able to run many, many more concurrent threads.

      I've had 3000 ithreads running and concurrently active, though they were not doing much at all. I would suggest that 100 or 200 would probably be enough to ensure that you are capable of using the full bandwidth available from your internet connection. Of course, the limiting factor is likely to be the memory in which to hold the data prior to writing it to disc, rather than connection speed.
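    Something along these lines (untested; filename_for() and the splitting of the URL list into per-thread chunks are left to you, and the reduced stack size assumes a reasonably recent threads module):

        use strict;
        use warnings;
        use threads ( stack_size => 64 * 1024 );   # small stacks => many more threads
        use threads::shared;
        use LWP::UserAgent;

        my $disk_lock :shared;    # the single semaphore guarding all disk writes

        sub mine {
            my @urls = @_;
            my $ua = LWP::UserAgent->new( timeout => 30 );
            for my $url (@urls) {
                my $res = $ua->get($url);   # fetched into this thread's own (non-shared) memory
                next unless $res->is_success;
                {
                    lock($disk_lock);       # only one thread gets past here at a time
                    if ( open my $fh, '>', filename_for($url) ) {
                        print {$fh} $res->content;
                        close $fh;
                    }
                }                           # lock released when the block is left
            }
        }

        my @url_chunks;    # fill with one array ref of URLs per worker thread
        my @workers = map { threads->create( \&mine, @{$_} ) } @url_chunks;
        $_->join for @workers;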


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Benign Web Miner
by Arunbear (Prior) on Sep 30, 2006 at 19:04 UTC
    If you don't mind a non-Perl solution, you might consider Heritrix. It is capable of large-scale crawling, is kind to the hosts it visits (if a host takes n seconds to respond, it will wait m*n seconds before hitting that host again; m is configurable but defaults to 5), and has extremely flexible crawl settings.
Re: Benign Web Miner
by grep (Monsignor) on Sep 30, 2006 at 20:35 UTC
    Is there a reason why you are not using a DB (PostgreSQL would be my pick, given the number of concurrent connections you're using) as a backend? This would alleviate the disk fragmentation problem.

    If you wanted to continue using wget, then a separate partition to hold the downloaded data temporarily, until a writer moves it into the DB, would work.
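    Roughly like this (untested; the connection details and table layout are only for illustration):

        use strict;
        use warnings;
        use DBI;

        # Illustrative schema:
        #   CREATE TABLE pages ( url TEXT PRIMARY KEY, fetched TIMESTAMP DEFAULT now(), content TEXT );
        my $dbh = DBI->connect( 'dbi:Pg:dbname=crawl;host=localhost', 'miner', 'secret',
                                { RaiseError => 1, AutoCommit => 1 } );

        my $insert = $dbh->prepare('INSERT INTO pages (url, content) VALUES (?, ?)');

        sub store_page {
            my ( $url, $content ) = @_;
            # PostgreSQL handles the concurrent writers; no per-file fragmentation to worry about
            $insert->execute( $url, $content );
        }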



    grep
    Mynd you, mønk bites Kan be pretti nasti...
Re: Benign Web Miner
by Anonymous Monk on Sep 30, 2006 at 18:26 UTC
    You may wish to look at the open-source utility cURL. The website http://curl.haxx.se/ has sample Perl programs, such as curlmirror.pl, for mirroring many files. I have not used cURL, but it looks like an interesting utility. Hope this helps.
Re: Benign Web Miner
by gam3 (Curate) on Oct 01, 2006 at 04:19 UTC
    Use BSD or Linux.
    -- gam3
    A picture is worth a thousand words, but takes 200K.
