mjacobson has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that has over 21 million lines in it. I have written a Perl script that loops over each line and does some work on it.

While this script works, it will take days to process each line. I would like to see an example of doing this with threads, if possible.

Thanks in advance for your help.

Replies are listed 'Best First'.
Re: Processing large file using threads
by liverpole (Monsignor) on May 08, 2007 at 15:10 UTC
    Hi mjacobson,

    I hope you don't literally mean days per line ... at one day per line, 21,000,000 lines would take you more than 57,000 years (57,494.8 years) to process the file.  Once.

    Having said that, it's not clear that you would benefit from using threads.  You would still need to send the data in each line to the thread doing the processing, and then collect the result(s) from that thread afterwards.  And since you've now got multiple threads each vying for the CPU, it's quite possibly going to be even slower than a single process working on it.  Depending on what exactly you're doing, of course.

    That said, can you explain a little further exactly what processing you're trying to do on each line (and how long it actually takes)?


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

      Let me try to explain this a little better. My file contains 21 million URLs that our search engine has indexed on an intranet. I have been tasked with checking each URL against a "blacklist" file to see whether the URL matches or not. I need to output a report that shows each host, # of URLs, # Not Blacklisted, % Not Blacklisted, # Blacklisted, and % Blacklisted.

      I have an array of blacklisted regular expressions and a couple of hashes for not-blacklisted and blacklisted counts.

      open(MYINFILE, '<', $URLLIST) || die "Cannot open $URLLIST: $!";
      while ( my $url = <MYINFILE> ) {
        chomp($url);
        if ($url ne "") {
          my ($host)        = GetHost($url);
          my ($blacklisted) = isBlackListed($url);
          if ($blacklisted) {
            $BLACKLISTED{$host}++;     # count blacklisted URLs per host
          } else {
            $NOTBLACKLISTED{$host}++;  # count clean URLs per host
          }
        }
      }
      close(MYINFILE);
      printReport();

      I was hoping that using threads and sharing the arrays and hashes would speed up the processing of this file. I may want to break the file up into several files, maybe one host per file, and process each file independently. At the end, I'd build a complete report from the outputs.
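
      For reference, here is one hypothetical shape of isBlackListed(), based purely on the description above (the actual code isn't shown, and @blacklist_patterns is a made-up name). Precompiling the patterns with qr// once, outside the loop, matters when the check runs 21 million times:

      # Compile the blacklist patterns once, up front.
      my @BLACKLIST = map { qr/$_/ } @blacklist_patterns;

      sub isBlackListed {
          my ($url) = @_;
          for my $re (@BLACKLIST) {
              return 1 if $url =~ $re;   # first match wins
          }
          return 0;
      }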

        I still don't see how this would take more than a couple of minutes (ok, maybe a couple of hours), provided your blacklist is in memory.

        Checking 21 million URLs shouldn't really take all that much time, and reading them all from a file shouldn't take that long either; that's only about a gigabyte of data (roughly 50 bytes per URL, on average). I am assuming you have significantly fewer hosts than URLs, or you'd possibly need a lot (i.e. more than 4 GB) of system memory with the algorithm you've outlined above.

        PS: why isn't all this data in a database? Provided you've already split the hosts out of the URLs, you can do this kind of query in a single line of SQL, and it'll probably be pretty fast too.
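
        For illustration, a minimal sketch of that query via DBI, assuming the URLs had already been loaded into a hypothetical SQLite table urls(host, url, blacklisted); the database, table, and column names are all made up:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '',
                               { RaiseError => 1 });

        # The whole per-host report is one GROUP BY.
        my $sth = $dbh->prepare(q{
            SELECT host,
                   COUNT(*)                            AS total,
                   SUM(blacklisted)                    AS blacklisted,
                   COUNT(*) - SUM(blacklisted)         AS not_blacklisted,
                   100.0 * SUM(blacklisted) / COUNT(*) AS pct_blacklisted
            FROM urls
            GROUP BY host
        });
        $sth->execute;
        while (my $row = $sth->fetchrow_hashref) {
            printf "%s\t%d\t%d\t%d\t%.1f%%\n",
                @{$row}{qw(host total blacklisted not_blacklisted pct_blacklisted)};
        }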

        It sounds like the major bottleneck in the process is going to be reading data from the disk. So the best way to speed this up would be to break the file down into a few chunks, and to run this job on separate machines.

        You can just merge the data it produces at the end of the process.

        Just a hunch, but...

        Is this line

        my ($host) = GetHost($url);
        doing any DNS-ish stuff? If so, I'd suspect that's your bottleneck, and you'll need to look into a smarter hostname-caching solution. Threads will help, but probably not as much as you'd hope.
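
        If it is, here is a minimal sketch of one caching approach, assuming GetHost() resolves the URL's host via DNS (a guess; GetHost() is from the OP's script) and that URLs on the same host share an authority part:

        # Cache GetHost() results keyed on the URL's authority part,
        # so each distinct host is resolved only once.
        my %host_cache;

        sub GetHostCached {
            my ($url) = @_;
            my ($key) = $url =~ m{^\w+://([^/]+)};
            $key = $url unless defined $key;   # fall back for odd URLs
            unless (exists $host_cache{$key}) {
                $host_cache{$key} = GetHost($url);
            }
            return $host_cache{$key};
        }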

        Perl Contrarian & SQL fanboy
Re: Processing large file using threads
by renodino (Curate) on May 08, 2007 at 15:30 UTC
    1. Create a "master" Thread (usually the root thread).
    2. Create some (possibly configurable) number of child threads
    3. (Here's the tricky part). You've got a couple of alternatives:
      • a) on threads->create(), pass the fileno() to each child thread, along with an offset and a skip count. Each child thread then does an open(INF, '<&', $fileno) on the fileno, reads and discards skip-count lines, then iteratively reads/processes/skips until EOF
      • b) alternately, create 2 Thread::Queues (one from master to children, the other from children to master). The master reads each line from the file and posts it to the downstream queue; children grab a line off the queue (in no guaranteed order), process it, then post a response to the upstream queue (see the sketch below)

    As ever, TIMTOWTDI. (b) is probably simpler, but (a) is more deterministic. Both can be mimicked using a process-based approach.

    I've successfully used (a) for ETL tools, but if your file is binary/random-access, it can get complicated to skip records. Also, if the children are writing to an output file, with (b) it's easier to let the single master thread do the writing instead of coordinating writes between children.
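
    A minimal sketch of alternative (b), assuming the OP's GetHost() and isBlackListed() functions exist as described, with $URLLIST and the worker count as placeholders:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $URLLIST     = 'urls.txt';             # placeholder path
    my $NUM_WORKERS = 4;                      # placeholder; tune to your CPU count
    my $work_q      = Thread::Queue->new;     # master -> children
    my $result_q    = Thread::Queue->new;     # children -> master

    # Children: pull a URL, classify it, post "host\tflag" upstream.
    my @workers = map {
        threads->create(sub {
            while (defined(my $url = $work_q->dequeue)) {
                my $host = GetHost($url);
                my $flag = isBlackListed($url) ? 1 : 0;
                $result_q->enqueue("$host\t$flag");
            }
        });
    } 1 .. $NUM_WORKERS;

    # Master: feed the downstream queue.  (For 21M lines you would
    # want to bound the queue's size; this is kept simple.)
    open my $in, '<', $URLLIST or die "Cannot open $URLLIST: $!";
    my $pending = 0;
    while (my $url = <$in>) {
        chomp $url;
        next if $url eq '';
        $work_q->enqueue($url);
        $pending++;
    }
    close $in;
    $work_q->enqueue(undef) for @workers;     # one terminator per child

    # Only the master touches the hashes, so nothing needs :shared.
    my (%BLACKLISTED, %NOTBLACKLISTED);
    while ($pending--) {
        my ($host, $flag) = split /\t/, $result_q->dequeue;
        $flag ? $BLACKLISTED{$host}++ : $NOTBLACKLISTED{$host}++;
    }
    $_->join for @workers;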


    Perl Contrarian & SQL fanboy
Re: Processing large file using threads
by clinton (Priest) on May 08, 2007 at 15:07 UTC
    This sounds like it is a one-off.

    Assuming that you don't need to maintain state (i.e. remember what has come before in the file) and that you are processing each line independently, it may be easier to just divide the file into (e.g.) five parts, and run the same script on each sub-file, on separate computers if need be.

    It'd certainly be a lot simpler than debugging a threaded application.
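
    For example, a rough sketch of the split step, dealing lines round-robin into five part files (the file names and part count are arbitrary); the existing script can then be run on each part and the per-host counts summed at the end:

    use strict;
    use warnings;

    my $parts = 5;
    my @out;
    for my $i (0 .. $parts - 1) {
        open $out[$i], '>', "urls.part$i" or die "urls.part$i: $!";
    }

    open my $in, '<', 'urls.txt' or die "urls.txt: $!";
    while (my $line = <$in>) {
        print { $out[$. % $parts] } $line;   # $. is the input line number
    }
    close $_ for $in, @out;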

Re: Processing large file using threads
by BrowserUk (Patriarch) on May 08, 2007 at 21:30 UTC

    When your existing script is running, what percentage of CPU does top/Task Manager show it consuming?

    Does the system where you are running this program have multiple CPUs? If not, is there a multiple-CPU system available on which it could be run?

    The answers to those questions dictate whether there is any mileage in improving the performance of your script using threads.

    It would also be much simpler to provide a threaded solution, if that is going to be beneficial, if you posted the entire script, along with a few lines showing the format of (each of) the input file(s).


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Processing large file using threads
by Joost (Canon) on May 08, 2007 at 15:03 UTC
      maybe you should split it into many small text files.