PerlMonks  

Hello monks,

For a while now, I've been working on a problem for my academic research center. It goes like this: we start with a list of URLs, mine the text from those URLs, and then follow the links found on those pages. We mine a lot of text, so disk fragmentation is a huge issue. Previously we were using Wget, but 50 Wget instances writing to disk at the same time quickly drive the mean NTFS file fragment size down to 4KB. So I planned a replacement for Wget that would only ever have one file being written to disk at any given moment.
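To make the "one writer at a time" idea concrete, here is a minimal sketch of the pattern, written in Python purely as an illustration (it is not Perl and not the actual miner). The names `disk_writer` and `run_single_writer` and the `###` record separator are made up for the example; the only point is that every fetched page passes through one queue and one open file handle.

```python
import queue
import threading

def disk_writer(q, path):
    """Drain (url, text) records from the queue into one append-only file."""
    with open(path, "a", encoding="utf-8") as out:
        while True:
            item = q.get()
            if item is None:          # sentinel: producers are finished
                break
            url, text = item
            out.write(f"### {url}\n{text}\n")

def run_single_writer(records, path):
    """Feed records to a single writer thread; only that thread touches the disk."""
    q = queue.Queue(maxsize=100)      # bounded, so producers block if the writer lags
    writer = threading.Thread(target=disk_writer, args=(q, path))
    writer.start()
    for record in records:            # in a real crawler, many miners would call q.put()
        q.put(record)
    q.put(None)
    writer.join()
```

Because the queue is bounded, the miners block when the writer falls behind, which is roughly the "block while other instances are writing" behaviour.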

The problem with existing Perl solutions is that they don't have the functionality required. For instance, a solution needs to block while other instances are writing to disk, offer the option to span hosts, accept a huge list of domains not to visit, ignore given extension types, and set a recursion depth; i.e., most of the main features of Wget.
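The filtering part of that feature list can be sketched as a single predicate. This is an illustrative Python sketch under my own assumptions, not an existing module: `should_fetch` and all of its parameter names are hypothetical, and treating a blocked domain as also blocking its subdomains is a design guess.

```python
import posixpath
from urllib.parse import urlsplit

def should_fetch(url, depth, *, start_hosts, blocked_domains,
                 skip_extensions, max_depth, span_hosts=False):
    """Decide whether a discovered link should be queued for fetching."""
    if depth > max_depth:                       # recursion depth limit
        return False
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):   # skip mailto:, ftp:, etc.
        return False
    host = (parts.hostname or "").lower()
    if not host:
        return False
    # The don't-go-there list: check the host and each parent domain.
    labels = host.split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in blocked_domains:
            return False
    if not span_hosts and host not in start_hosts:
        return False
    # Ignore unwanted file types by path extension (case-insensitive).
    extension = posixpath.splitext(parts.path)[1].lower()
    return extension not in skip_extensions
```

Keeping `blocked_domains` and `skip_extensions` as sets means even a huge block list costs only a few hash lookups per URL.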

So far I've made a hash of it using Win32 threads and LWP::UserAgent, as the number of threads had to stay low or Perl.exe would die. To be nice to the servers, only a very few hits per minute are allowed. This means that even the highest stable number of Win32 threads (50) goes VERY slowly. The solution was to have the threads share the lists of URLs to fetch and make sure no domain was hit very often; with a thousand domains to choose from, 50 threads could then move very quickly. This led to problems with threads::shared on Win32 not working well with extremely large data structures (I had 50 miners writing to a queue from which a single disk writer wrote to disk).
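The "share the URL lists but never hit one domain too often" logic does not itself need threads. Below is a rough single-threaded sketch in Python (again an illustration, not Perl or LWP code); the class name `PoliteScheduler`, its methods, and the `min_delay` parameter are hypothetical. It keeps one queue per host plus a heap of the earliest time each host may be hit again.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteScheduler:
    """Hand out URLs so that each host is hit at most once per min_delay seconds."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self.pending = defaultdict(deque)   # host -> URLs waiting to be fetched
        self.next_ok = {}                   # host -> earliest time of its next hit
        self.ready_at = []                  # heap of (time, host) for hosts with work
        self.scheduled = set()              # hosts currently present in the heap

    def add(self, url, now):
        host = urlsplit(url).hostname or ""
        self.pending[host].append(url)
        if host not in self.scheduled:
            when = max(now, self.next_ok.get(host, now))
            heapq.heappush(self.ready_at, (when, host))
            self.scheduled.add(host)

    def next_url(self, now):
        """Return a URL whose host may be hit at time `now`, or None if all must wait."""
        if not self.ready_at or self.ready_at[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_at)
        url = self.pending[host].popleft()
        self.next_ok[host] = now + self.min_delay
        if self.pending[host]:
            heapq.heappush(self.ready_at, (self.next_ok[host], host))
        else:
            self.scheduled.discard(host)
            del self.pending[host]
        return url
```

With a thousand hosts queued, `next_url` almost always has something ready, so throughput is bounded by the number of fetchers and not by the per-host delay.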

So now I'm thinking that LWP::Parallel::UserAgent will resolve the issue, as it will be single-threaded yet able to fetch from many sites at the same time.

If anyone has any thoughts, ideas, or recommendations, I would appreciate it.


In reply to Benign Web Miner by perlmonkey2
