in reply to Processing large file using threads

Hi mjacobson,

I hope you don't literally mean days ... for 21,000,000 lines it would take you more than 57,000 years (57,494.8 years) to process the file.  Once.

Having said that, it's not clear that you would benefit from using threads.  You would still need to send the data in each line to the thread doing the processing, and then collect the result(s) from that thread afterwards, and since you've now got multiple threads each vying for the CPU, it's quite possibly going to be even slower than a single process working on it.  Depending on what exactly you're doing, of course.
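
To make that concrete, here is a rough sketch of the plumbing a threaded version needs (generic threads / Thread::Queue boilerplate, not your actual code): every line has to cross a queue into a worker, and every result has to cross another queue back out.

use threads;
use Thread::Queue;

my $work    = Thread::Queue->new();   # lines going out to the workers
my $results = Thread::Queue->new();   # results coming back

# four workers: pull a line, "process" it, push a result back
my @workers = map {
    threads->create(sub {
        while (defined(my $line = $work->dequeue())) {
            $results->enqueue(length $line);   # stand-in for the real work
        }
    });
} 1 .. 4;

$work->enqueue($_) for "one\n", "two\n", "three\n";
$work->enqueue(undef) for @workers;   # one undef per worker tells it to stop
$_->join() for @workers;

while (defined(my $r = $results->dequeue_nb())) { print "$r\n" }

All of that queueing and locking is overhead you simply don't have with a single process reading the file in a plain while loop.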

That said, can you explain a little further exactly what processing you're trying to do on each line (and how long it actually takes)?


s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

Re^2: Processing large file using threads
by mjacobson (Initiate) on May 08, 2007 at 15:32 UTC

    Let me try to explain this a little better. My file contains 21 million URLs that our search engine has indexed on an intranet. I have been tasked to check each URL against a "blacklist" file to see whether the URL matches or not. I need to output a report that shows each host, # of URLs, # Not Blacklisted, % Not Blacklisted, # Blacklisted, and % Blacklisted.

    I have an array of blacklisted regular expressions and a couple of hashes, one for not-blacklisted and one for blacklisted hosts.

    # tally blacklisted vs. not-blacklisted URLs per host
    open(MYINFILE, '<', $URLLIST) || die "Can't open $URLLIST: $!";
    while ( my $url = <MYINFILE> ) {
      chomp($url);
      if ($url ne "") {
        my ($host) = GetHost($url);
        my ($blacklisted) = isBlackListed($url);
        if ($blacklisted) {
          $BLACKLISTED{$host}++;
        } else {
          $NOTBLACKLISTED{$host}++;
        }
      }
    }
    close(MYINFILE);
    printReport();
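
    Roughly, isBlackListed just walks the array of regexes:

    # simplified sketch: @BLACKLIST holds the blacklist patterns as qr// regexes
    sub isBlackListed {
        my ($url) = @_;
        for my $re (@BLACKLIST) {
            return 1 if $url =~ $re;
        }
        return 0;
    }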

    I was hoping that using threads and sharing the arrays and hashes would speed up the processing of this file. I may want to break the file up into several files, maybe one host per file, and process each file independently. At the end, I'd build a complete report from the individual outputs.

      I still don't see how this would take more than a couple of minutes (ok, maybe a couple of hours), provided your blacklist is in memory.

      Checking 21 million URLs shouldn't really take all that much time, and reading them all from a file shouldn't take that long either; that's only about a gigabyte of data. I am assuming you have significantly fewer hosts than URLs, or you'd possibly need a lot (i.e. more than 4 GB) of system memory with the algorithm you've outlined above.

      PS: why isn't all this data in a database? Provided you've already linked/split up the hosts from the URLs, you can do this kind of query in a single line of SQL, and it'll probably be pretty fast too.
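
      For example, something along these lines (just a sketch, assuming the URLs have already been loaded into a urls(host, blacklisted) table, here in SQLite):

      use DBI;

      # sketch: assumes a table urls(host TEXT, blacklisted INTEGER)
      my $dbh  = DBI->connect('dbi:SQLite:dbname=urls.db', '', '',
                              { RaiseError => 1 });
      my $rows = $dbh->selectall_arrayref(q{
          SELECT host,
                 COUNT(*)                                     AS total,
                 SUM(CASE WHEN blacklisted THEN 1 ELSE 0 END) AS black
          FROM urls
          GROUP BY host
      });
      printf "%s: %d urls, %d blacklisted\n", @$_ for @$rows;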

      It sounds like the major bottleneck in the process is going to be reading data from the disk. So the best way to speed this up would be to break the file down into a few chunks, and to run this job on separate machines.

      You can just merge the data each machine produces at the end of the process.
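
      For instance, if each machine dumps its per-host counts as plain text, the merge is only a few lines (the counts.*.txt names and the tab-separated host/blacklisted/not-blacklisted format are made up for this sketch):

      my (%black, %notblack);
      for my $file (glob 'counts.*.txt') {
          open my $fh, '<', $file or die "$file: $!";
          while (<$fh>) {
              chomp;
              my ($host, $b, $nb) = split /\t/;
              $black{$host}    += $b;
              $notblack{$host} += $nb;
          }
      }
      my %seen = map { $_ => 1 } keys %black, keys %notblack;
      for my $host (sort keys %seen) {
          my ($b, $nb) = ($black{$host} || 0, $notblack{$host} || 0);
          my $total = $b + $nb or next;
          printf "%s\t%d urls\t%d (%.1f%%) blacklisted\n",
                 $host, $total, $b, 100 * $b / $total;
      }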

        I agree with this. There have been a few posts here in the past where it was shown that the OS optimizes the processing of files, and it doesn't do much good to split the file and process the chunks in different threads of the same process. Disk I/O will be the bottleneck. Different machines are a good idea. Or maybe, if you were on a fast SCSI system and could put the different chunks on different SCSI disks, it would do some good.

        I'm not really a human, but I play one on earth. Cogito ergo sum a bum
      Just a hunch, but...

      Is this line

      my ($host)=GetHost($url);
      doing any DNS-ish stuff? If so, I'd suspect that's your bottleneck, and you'll need to look into a smarter hostname cache solution. Threads will help, but probably not as much as you'd hope.
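
      By "smarter hostname cache" I mean something as simple as memoizing the lookup; a sketch (GetHostCached is just an illustrative wrapper name):

      my %host_cache;
      sub GetHostCached {
          my ($url) = @_;
          # crude host extraction just to key the cache; fall back to the full URL
          my ($name) = $url =~ m{^\w+://([^/:]+)};
          $name = $url unless defined $name;
          $host_cache{$name} = GetHost($url) unless exists $host_cache{$name};
          return $host_cache{$name};
      }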

      Perl Contrarian & SQL fanboy