mjacobson has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that has over 21 million lines in it. I have written a Perl script that loops over each line and does some work on it.

While this script works, it will take days to process each line. I would like to see an example of doing this with threads, if possible.

Thanks in advance for your help.

Replies are listed 'Best First'.
Re: Processing large file using threads
by liverpole (Monsignor) on May 08, 2007 at 15:10 UTC
    Hi mjacobson,

    I hope you don't literally mean days per line ... at one day per line, 21,000,000 lines would take you more than 57,000 years (57,494.8 years) to process the file.  Once.

    Having said that, it's not clear that you would benefit from using threads.  You would still need to send the data in each line to the thread doing the processing, and then collect the result(s) from that thread afterwards.  And since you've now got multiple threads each vying for the CPU, it's quite possibly going to be even slower than a single process working on it.  Depending on what exactly you're doing, of course.

    That said, can you explain a little further exactly what processing you're trying to do on each line (and how long it actually takes)?


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

      Let me try to explain this a little better. My file contains 21 million URLs that our search engine has indexed on an intranet. I have been tasked with checking each URL against a "blacklist" file to see whether the URL matches or not. I need to output a report that shows each host, # of URLs, # Not Blacklisted, % Not Blacklisted, # Blacklisted, and % Blacklisted.

      I have an array of blacklisted regular expressions and a couple of hashes for not-blacklisted and blacklisted counts.

      open(MYINFILE, '<', $URLLIST) || die "Cannot open $URLLIST: $!";
      while ( my $url = <MYINFILE> ) {
        chomp($url);
        if ($url ne "") {
          my ($host)        = GetHost($url);
          my ($blacklisted) = isBlackListed($url);
          if ($blacklisted) {
            $BLACKLISTED{$host}++;     # count blacklisted URLs per host
          } else {
            $NOTBLACKLISTED{$host}++;  # count clean URLs per host
          }
        }
      }
      close(MYINFILE);
      printReport();

      I was hoping that using threads and sharing the arrays and hashes would speed up the processing of this file. I may want to break the file up into several files, maybe one host per file, and process each file independently. At the end, I'd build a complete report from the outputs.
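
      For reference, here is one hypothetical shape of isBlackListed(), based purely on the description above (the actual code isn't shown, and @blacklist_patterns is a made-up name). Precompiling the patterns with qr// once, outside the loop, matters when the check runs 21 million times:

      # Compile the blacklist patterns once, up front.
      my @BLACKLIST = map { qr/$_/ } @blacklist_patterns;

      sub isBlackListed {
          my ($url) = @_;
          for my $re (@BLACKLIST) {
              return 1 if $url =~ $re;   # first match wins
          }
          return 0;
      }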

        I still don't see how this would take more than a couple of minutes (ok, maybe a couple of hours), provided your blacklist is in memory.

        Checking 21 million URLs shouldn't really take all that much time, and reading them all from a file shouldn't take that long either; that's only about a gigabyte of data (roughly 50 bytes per URL, on average). I am assuming you have significantly fewer hosts than URLs, or you'd possibly need a lot (i.e. more than 4 GB) of system memory with the algorithm you've outlined above.

        PS: why isn't all this data in a database? Provided you've already split the hosts out of the URLs, you can do this kind of query in a single line of SQL, and it'll probably be pretty fast too.
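
        For illustration, a minimal sketch of that query via DBI, assuming the URLs had already been loaded into a hypothetical SQLite table urls(host, url, blacklisted); the database, table, and column names are all made up:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '',
                               { RaiseError => 1 });

        # The whole per-host report is one GROUP BY.
        my $sth = $dbh->prepare(q{
            SELECT host,
                   COUNT(*)                            AS total,
                   SUM(blacklisted)                    AS blacklisted,
                   COUNT(*) - SUM(blacklisted)         AS not_blacklisted,
                   100.0 * SUM(blacklisted) / COUNT(*) AS pct_blacklisted
            FROM urls
            GROUP BY host
        });
        $sth->execute;
        while (my $row = $sth->fetchrow_hashref) {
            printf "%s\t%d\t%d\t%d\t%.1f%%\n",
                @{$row}{qw(host total blacklisted not_blacklisted pct_blacklisted)};
        }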

        It sounds like the major bottleneck in the process is going to be reading data from the disk. So the best way to speed this up would be to break the file down into a few chunks, and to run this job on separate machines.

        You can just merge the data it produces at the end of the process.

        Just a hunch, but...

        Is this line

        my ($host) = GetHost($url);
        doing any DNS-ish stuff? If so, I'd suspect that's your bottleneck, and you'll need to look into a smarter hostname-caching solution. Threads will help, but probably not as much as you'd hope.
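
        If it is, here is a minimal sketch of one caching approach, assuming GetHost() resolves the URL's host via DNS (a guess; GetHost() is from the OP's script) and that URLs on the same host share an authority part:

        # Cache GetHost() results keyed on the URL's authority part,
        # so each distinct host is resolved only once.
        my %host_cache;

        sub GetHostCached {
            my ($url) = @_;
            my ($key) = $url =~ m{^\w+://([^/]+)};
            $key = $url unless defined $key;   # fall back for odd URLs
            unless (exists $host_cache{$key}) {
                $host_cache{$key} = GetHost($url);
            }
            return $host_cache{$key};
        }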

        Perl Contrarian & SQL fanboy
Re: Processing large file using threads
by renodino (Curate) on May 08, 2007 at 15:30 UTC
    1. Create a "master" Thread (usually the root thread).
    2. Create some (possibly configurable) number of child threads
    3. (Here's the tricky part). You've got a couple of alternatives:
      • a) on threads->create(), pass the fileno() to each child thread, along with an offset and a skip count. Each child thread then does an open(INF, '<&', $fileno) on the fileno, reads and discards skip-count lines, then iteratively reads/processes/skips until EOF
      • b) alternately, create 2 Thread::Queues (one from master to children, the other from children to master). The master reads each line from the file and posts it to the downstream queue; children grab a line off the queue (in no guaranteed order), process it, then post a response to the upstream queue (see the sketch below)

    As ever, TIMTOWTDI. (b) is probably simpler, but (a) is more deterministic. Both can be mimicked using a process-based approach.

    I've successfully used (a) for ETL tools, but if your file is binary/random-access, it can get complicated to skip records. Also, if the children are writing to an output file, with (b) it's easier to let the single master thread do the writing instead of coordinating writes between children.
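
    A minimal sketch of alternative (b), assuming the OP's GetHost() and isBlackListed() functions exist as described, with $URLLIST and the worker count as placeholders:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $URLLIST     = 'urls.txt';             # placeholder path
    my $NUM_WORKERS = 4;                      # placeholder; tune to your CPU count
    my $work_q      = Thread::Queue->new;     # master -> children
    my $result_q    = Thread::Queue->new;     # children -> master

    # Children: pull a URL, classify it, post "host\tflag" upstream.
    my @workers = map {
        threads->create(sub {
            while (defined(my $url = $work_q->dequeue)) {
                my $host = GetHost($url);
                my $flag = isBlackListed($url) ? 1 : 0;
                $result_q->enqueue("$host\t$flag");
            }
        });
    } 1 .. $NUM_WORKERS;

    # Master: feed the downstream queue.  (For 21M lines you would
    # want to bound the queue's size; this is kept simple.)
    open my $in, '<', $URLLIST or die "Cannot open $URLLIST: $!";
    my $pending = 0;
    while (my $url = <$in>) {
        chomp $url;
        next if $url eq '';
        $work_q->enqueue($url);
        $pending++;
    }
    close $in;
    $work_q->enqueue(undef) for @workers;     # one terminator per child

    # Only the master touches the hashes, so nothing needs :shared.
    my (%BLACKLISTED, %NOTBLACKLISTED);
    while ($pending--) {
        my ($host, $flag) = split /\t/, $result_q->dequeue;
        $flag ? $BLACKLISTED{$host}++ : $NOTBLACKLISTED{$host}++;
    }
    $_->join for @workers;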


    Perl Contrarian & SQL fanboy
Re: Processing large file using threads
by clinton (Priest) on May 08, 2007 at 15:07 UTC
    This sounds like it is a one-off.

    Assuming that you don't need to maintain state (i.e. remember what has come before in the file) and that you are processing each line independently, it may be easier to just divide the file into (e.g.) five parts, and run the same script on each sub-file, on separate computers if need be.

    It'd certainly be a lot simpler than debugging a threaded application.
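
    For example, a rough sketch of the split step, dealing lines round-robin into five part files (the file names and part count are arbitrary); the existing script can then be run on each part and the per-host counts summed at the end:

    use strict;
    use warnings;

    my $parts = 5;
    my @out;
    for my $i (0 .. $parts - 1) {
        open $out[$i], '>', "urls.part$i" or die "urls.part$i: $!";
    }

    open my $in, '<', 'urls.txt' or die "urls.txt: $!";
    while (my $line = <$in>) {
        print { $out[$. % $parts] } $line;   # $. is the input line number
    }
    close $_ for $in, @out;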

Re: Processing large file using threads
by BrowserUk (Patriarch) on May 08, 2007 at 21:30 UTC

    When your existing script is running, what percentage of CPU does top/Task Manager show it consuming?

    Does the system where you are running this program have multiple CPUs? If not, is there a multiple-CPU system available on which it could be run?

    The answers to those questions dictate whether there is any mileage in improving the performance of your script using threads.

    It would also be much simpler to provide a threaded solution, if that is going to be beneficial, if you posted the entire script, along with a few lines showing the format of (each of) the input file(s).


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Processing large file using threads
by Joost (Canon) on May 08, 2007 at 15:03 UTC
      maybe you should split it into many small text files.