in reply to Re: Processing large file using threads
in thread Processing large file using threads

Let me try to explain this a little bit better. My file contains 21 million URLs that our search engine has indexed on an intranet. I have been tasked with checking each URL against a "blacklist" file to see whether the URL matches or not. I need to output a report that shows, for each host: # of URLs, # Not Blacklisted, % Not Blacklisted, # Blacklisted, and % Blacklisted.

I have an array of blacklist regular expressions and a couple of hashes for the per-host blacklisted and not-blacklisted counts.

open(MYINFILE, $URLLIST) || die "Can't open $URLLIST: $!";
while ( my $url = <MYINFILE> ) {
  chomp($url);
  if ($url ne "") {
    my ($host) = GetHost($url);              # extract the host part of the URL
    my ($blacklisted) = isBlackListed($url); # check the URL against the blacklist patterns
    if ($blacklisted) {
      $BLACKLISTED{$host}++;
    } else {
      $NOTBLACKLISTED{$host}++;
    }
  }
}
close(MYINFILE);
printReport();
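
isBlackListed() is basically just a loop of the URL over the blacklist patterns, roughly like this (sketch only; @BLACKLIST stands for the pattern array, and precompiling the patterns with qr// matters a lot at this scale):

sub isBlackListed {
  my ($url) = @_;
  # @BLACKLIST holds the blacklist regexes, ideally precompiled with qr//
  for my $re (@BLACKLIST) {
    return 1 if $url =~ $re;
  }
  return 0;
}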

I was hoping that using threads and sharing the arrays and hashes would speed up the processing of this file. I may also want to break the file up into several files, maybe one host per file, and process each file independently; at the end I would build a complete report from the individual outputs.
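
Something along these lines is what I had in mind: a work queue feeding a few worker threads, with the per-host counts kept in shared hashes. This is just a sketch; GetHost(), isBlackListed() and printReport() are the routines from above, and the worker count is a guess.

use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

# Shared per-host counters, updated by all worker threads
my %BLACKLISTED    : shared;
my %NOTBLACKLISTED : shared;

my $URLLIST = 'urls.txt';          # same input file as above
my $workers = 4;                   # guess; tune to the number of CPUs
my $queue   = Thread::Queue->new;

my @threads = map {
    threads->create(sub {
        # Each worker pulls URLs off the queue until it sees the undef sentinel
        while (defined(my $url = $queue->dequeue)) {
            my ($host) = GetHost($url);
            if (isBlackListed($url)) {
                lock(%BLACKLISTED);
                $BLACKLISTED{$host}++;
            } else {
                lock(%NOTBLACKLISTED);
                $NOTBLACKLISTED{$host}++;
            }
        }
    });
} 1 .. $workers;

open(my $in, '<', $URLLIST) or die "Can't open $URLLIST: $!";
while (my $url = <$in>) {
    chomp $url;
    $queue->enqueue($url) if $url ne '';
}
close $in;

$queue->enqueue(undef) for 1 .. $workers;   # tell each worker to finish
$_->join for @threads;

printReport();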

Re^3: Processing large file using threads
by Joost (Canon) on May 08, 2007 at 18:13 UTC
    I still don't see how this would take more than a couple of minutes (ok, maybe a couple of hours), provided your blacklist is in memory.

    Checking 21 million URLs shouldn't really take all that much time, and reading them all from a file shouldn't take that long either; that's only about a gigabyte of data. I am assuming you have significantly fewer hosts than URLs, or you'd possibly need a lot (i.e. more than 4 GB) of system memory with the algorithm you've outlined above.

    PS: why isn't all this data in a database? Provided you've already split the hosts out from the URLs, you can do this kind of query with a single SQL statement, and it'll probably be pretty fast too.
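
    Just as a sketch, assuming a table urls(host, url, blacklisted) with the blacklisted flag already set to 0/1 (SQLite via DBI here, but any DBD would do):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '',
                           { RaiseError => 1 });

    # One pass, one statement: per-host totals and blacklist counts
    my $sth = $dbh->prepare(q{
        SELECT host,
               COUNT(*)                    AS total_urls,
               SUM(blacklisted)            AS blacklisted,
               COUNT(*) - SUM(blacklisted) AS not_blacklisted
        FROM   urls
        GROUP  BY host
        ORDER  BY host
    });
    $sth->execute;

    while (my ($host, $total, $black, $ok) = $sth->fetchrow_array) {
        printf "%-40s %10d %10d (%5.1f%%) %10d (%5.1f%%)\n",
            $host, $total, $ok, 100 * $ok / $total, $black, 100 * $black / $total;
    }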

Re^3: Processing large file using threads
by clinton (Priest) on May 08, 2007 at 16:30 UTC
    It sounds like the major bottleneck in the process is going to be reading data from the disk. So the best way to speed this up would be to break the file down into a few chunks, and to run this job on separate machines.

    You can just merge the data each machine produces at the end of the process.
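
    The merge step is trivial if each machine writes a tab-separated partial report of host, blacklisted count and not-blacklisted count. A rough sketch (partial report file names are assumed to come in on the command line):

    use strict;
    use warnings;

    my (%black, %ok);
    for my $file (@ARGV) {               # one partial report per machine
        open(my $fh, '<', $file) or die "Can't open $file: $!";
        while (<$fh>) {
            chomp;
            my ($host, $b, $n) = split /\t/;
            $black{$host} += $b;
            $ok{$host}    += $n;
        }
        close $fh;
    }

    # A host may appear in only one of the partial reports, so take the union
    my %hosts = map { $_ => 1 } keys(%black), keys(%ok);
    for my $host (sort keys %hosts) {
        my $b     = $black{$host} || 0;
        my $n     = $ok{$host}    || 0;
        my $total = $b + $n;
        printf "%-40s %10d %10d %10d\n", $host, $total, $n, $b;
    }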

      I agree with this. There have been a few posts here in the past showing that the OS already optimizes file reads, so it doesn't do much good to split the file and process the chunks in different threads of the same process; disk I/O will be the bottleneck. Different machines are a good idea. Or, if you were on a fast SCSI system and could put the different chunks on different SCSI disks, that might do some good.

      I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re^3: Processing large file using threads
by renodino (Curate) on May 08, 2007 at 21:02 UTC
    Just a hunch, but...

    Is this line

    my ($host)=GetHost($url);
    doing any DNS-ish stuff? If so, I'd suspect that's your bottleneck, and you'll need to look into a smarter hostname cache solution. Threads will help, but probably not as much as you'd hope.
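
    Even a dumb in-process cache can help a lot if GetHost() really does resolve names. A rough sketch (the host-key regex is just a guess at your URL format):

    my %host_cache;
    sub get_host_cached {
        my ($url) = @_;
        # Crude cache key: the host part between the scheme and the first '/'
        my ($key) = $url =~ m{^\w+://([^/]+)}i;
        $key = $url unless defined $key;
        # Only pay for GetHost() once per distinct host
        $host_cache{$key} = GetHost($url) unless exists $host_cache{$key};
        return $host_cache{$key};
    }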

    Perl Contrarian & SQL fanboy