Re: how to split huge file reading into multiple threads
by moritz (Cardinal) on Aug 23, 2011 at 13:03 UTC
First you need to determine where your bottleneck is. If it's disc IO speed, splitting the problem into multiple threads doesn't help at all; it might even make things worse. Then the only thing you can do is to buy faster discs.
If CPU is the bottleneck, it might make sense to investigate threads or processes.
Use Devel::NYTProf (on a reduced data sample, but please make it big enough that it doesn't fit into the buffer cache) to find out what steps take the most time. If it's readline() or so, don't even think of threads.
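For example, a profiling run on a sample file might look something like this (script and sample file names are placeholders):
perl -d:NYTProf yourscript.pl sample.txt
nytprofhtml
Then open the generated nytprof/index.html report and see which lines and subs dominate the wall-clock time.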
Re: how to split huge file reading into multiple threads
by zentara (Cardinal) on Aug 23, 2011 at 12:58 UTC
"Can we use threads in such a way that multiple threads will be acting on the million record file so that the time is reduced to a few minutes?" Probably not, since all the threads will be hitting the same bottleneck of trying to read the same file at the same time. The limiting factor is how fast your hard drive is. See How do you parallelize STDIN for large file processing? and Is Using Threads Slower Than Not Using Threads? for examples.
There may be some improvement to be gained if you could use a program like split to break your huge file into smaller chunks, place them on separate hard drives, and then let your parallel processes work on them.
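For example, something along these lines would break the file into one-million-line chunks (the chunk size and prefix are just illustrative):
split -l 1000000 a.txt chunk_
That produces chunk_aa, chunk_ab, and so on, which you could then spread across drives and feed to separate processes.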
It dawned on me that there is another way to collect the output from the threads. You can open a filehandle in the main thread, pass its fileno to the thread, then let the thread write to the dup'd filehandle. See [threads] Open a file in one thread and allow others to write to it for the technique.

Anyway, you could open one filehandle for each thread, for that thread to report results back to the main thread, and pass the fileno of that filehandle to each thread at creation time. In the main thread, set up an IO::Select object to watch all the filehandles. Have the main thread open a private filehandle for the final output file, and as IO::Select reports data from each thread, write it out to the output file. This would allow the threads to write without worrying about locking, while the main thread's select loop actually handles writing, and possibly sorting, the data out to file. I don't know how it would work speed-wise, as select will block if one thread reports a lot of data, but this might be minimized by using large filehandle buffers. That is what I would try first.
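A rough, untested sketch of that arrangement (the worker body is a stand-in for the real per-thread work, and a real version would need to think harder about buffering and partial lines):

use strict;
use warnings;
use threads;
use IO::Handle;
use IO::Select;

my $NWORKERS = 4;
my( @readers, @writers, @workers );

for my $id ( 1 .. $NWORKERS ) {
    pipe( my $rd, my $wr ) or die "pipe: $!";
    push @readers, $rd;
    push @writers, $wr;     # keep the write ends alive in the main thread
    push @workers, threads->create( \&worker, $id, fileno( $wr ) );
}

sub worker {
    my( $id, $fno ) = @_;
    open my $out, '>&', $fno or die "dup fd $fno: $!";   # private dup of the pipe
    $out->autoflush( 1 );                                # write whole lines at a time
    print {$out} "worker $id: result $_\n" for 1 .. 5;   # stand-in for the real work
    print {$out} "__DONE__\n";
    close $out;
    return;
}

# The main thread multiplexes the pipes and owns the single output file.
open my $final, '>', 'combined.out' or die $!;
my $sel  = IO::Select->new( @readers );
my $done = 0;
while( $done < $NWORKERS ) {
    for my $fh ( $sel->can_read ) {
        my $line = <$fh>;
        next unless defined $line;
        if( $line eq "__DONE__\n" ) { ++$done; $sel->remove( $fh ); }
        else                        { print {$final} $line; }
    }
}
$_->join for @workers;
close $_ for @writers, @readers, $final;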
Re: how to split huge file reading into multiple threads
by BrowserUk (Patriarch) on Aug 23, 2011 at 14:01 UTC
perl -MTime::HiRes=time -E"BEGIN{$t=time()}" -nle"++$n }{ printf qq[$n records in %f seconds\n], time-$t" 250MB.CSV
4194304 records in 2.518000 seconds
So, how about you post your code and let us help you fix it?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Thanks!
Yes, I am also under the impression that Perl, being designed for pattern extraction and reporting, would be fast and reliable for text processing.
Please see my reply above to AR, dated Aug 30, 2011 at 09:05 UTC, for what my code does.
Thank you for extending the hand. However, as of now, I really don't think the per-record processing I am doing is what consumes the time.
I don't want other monks to get misdirected by my pasting the code.
I don't want other monks to get misdirected by my pasting the code.
Let "other monks" look after themselves.
If your code is taking 2 1/2 hours to process 20 million records against 600 records stored in a hash, then it is your code that has problems. Should we try and guess what mistakes you are making?
Are you, for instance, treating the hash as an array? Or re-opening the output files for every record you write?
Post the code and we won't have to make such guesses.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: how to split huge file reading into multiple threads
by AR (Friar) on Aug 23, 2011 at 12:58 UTC
Can you show us a stripped-down copy of your single-threaded working code? I have a couple of single-threaded scripts that regularly parse gigabytes of text files in a few minutes. Maybe there's something in your current script that could be fixed.
If your script is being delayed by disk reads and writes, then multi-threading will not help you.
Here is what my code does:
I have a file with 20+ million lines/records (say a.txt).
I have another file with around 600 lines/records (say b.txt). These lines carry categories, and a category can match more than one line/record.
Now, what my code does is:
1. Create a hash out of b.txt (key = category; value = some mandatory part of the record).
2. Read every record from a.txt and check whether it matches any of the mandatory parts; if yes, create a file for that category and dump the entire line/record into it.
So every record (of 20+ million) is compared against roughly 600 records (in the worst case, when the match found is the last record).
And that is where all the processing/looping happens.
Please help: how can I expedite the process?
Please show some code. We can help you best if you show a stripped down, but working, version of your code with sample data.
Maybe your problem is that you're opening files over and over again when you should be keeping them open. I can't tell from your description of the code.
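If that is the problem, the usual fix is to open each category's output file once and cache the handle. A minimal sketch, assuming b.txt is tab-separated into category and "mandatory part" (the real field layout may well differ):

use strict;
use warnings;

my %category_of;   # mandatory part of record => category (built from b.txt)
my %fh_for;        # category => output filehandle, opened once and reused

open my $map, '<', 'b.txt' or die $!;
while ( my $entry = <$map> ) {
    chomp $entry;
    my( $category, $mandatory ) = split /\t/, $entry, 2;   # assumed layout
    $category_of{ $mandatory } = $category;
}
close $map;

open my $in, '<', 'a.txt' or die $!;
while ( my $record = <$in> ) {
    for my $mandatory ( keys %category_of ) {
        next if index( $record, $mandatory ) < 0;
        my $cat = $category_of{ $mandatory };
        $fh_for{ $cat } //= do {               # open on first use only
            open my $out, '>', "$cat.out" or die $!;
            $out;
        };
        print { $fh_for{ $cat } } $record;
        last;                                  # first match wins
    }
}
close $in;
close $_ for values %fh_for;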
As Corion suggests, it is hard to offer much by way of constructive advice without something concrete to play with. However, it may be that you can leverage regular expressions in some fashion to speed up the matching phase of the process. I can't provide much more focused advice without some information about the nature of the matching.
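For instance, if the ~600 "mandatory parts" are literal strings, folding them into one compiled alternation can beat looping over 600 separate matches for every record. A sketch with made-up data:

use strict;
use warnings;

# Maps each mandatory string to its category; in real life built from b.txt.
my %category_of = (
    'ERR-401' => 'auth',
    'ERR-500' => 'server',
    # ... roughly 600 entries
);

# Longest strings first so overlapping patterns prefer the more specific match.
my $alternation = join '|',
                  map  { quotemeta }
                  sort { length $b <=> length $a } keys %category_of;
my $match = qr/($alternation)/;

while ( my $record = <STDIN> ) {
    if ( $record =~ $match ) {
        my $category = $category_of{ $1 };
        # ... hand $record to whatever writes the per-category files
    }
}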
True laziness is hard work
Re: how to split huge file reading into multiple threads
by locked_user sundialsvc4 (Abbot) on Aug 23, 2011 at 14:17 UTC
My intuitive guess, based on your task-description, is that your algorithm is probably memory-based, and what is therefore actually happening is “classic thrashing.” In this case, threads won’t help at all.
Consider ways to use disk-based sorting to manage the files. Or, put the data into an SQLite database (disk file...) and use its indexing and querying capability. The bottom line is ... don’t do anything “in memory.” That means: no hashes, no lists, no “potentially big things in memory” at all.
An appropriate redesign should not blink at all at “millions of records.” But we do know that the classic performance-curve caused by thrashing is ... not linear, but exponential ... degradation. When you say, “2+ hours,” that’s what it fairly screams to me.
Easy test: fire up the program and use a separate system monitor to watch the swap I/O rate, and the percentage of time spent in page faults. If it is, as I suspect it will be, “huge,” then there’s your answer.
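For what it's worth, a bare-bones version of the SQLite route might look like this (the schema, file names, and the categorize() stub are all invented for illustration):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );

$dbh->do( 'CREATE TABLE IF NOT EXISTS records ( category TEXT, line TEXT )' );
$dbh->do( 'CREATE INDEX IF NOT EXISTS idx_category ON records ( category )' );

my $insert = $dbh->prepare( 'INSERT INTO records ( category, line ) VALUES ( ?, ? )' );

sub categorize {            # stand-in for the real matching logic
    my( $line ) = @_;
    return $line =~ /^(\w+)/ ? $1 : undef;
}

open my $in, '<', 'a.txt' or die $!;
while ( my $line = <$in> ) {
    my $category = categorize( $line );
    $insert->execute( $category, $line ) if defined $category;
}
close $in;
$dbh->commit;

# Later, pull the data back one category at a time instead of holding it in RAM:
my $lines = $dbh->selectcol_arrayref(
    'SELECT line FROM records WHERE category = ?', undef, 'some_category' );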
If you are on linux, try running iostat -mx 5 while your non-threaded perl script is processing the input. It will show you where the time is being spent. If most of your time is being spent in %iowait rather than %user, then you need to do something to reduce IO in order to reduce execution time, and as the previous post said, threads will not help you if this is the case.
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    0.10    0.05    0.00   99.80

Device:  rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await  svctm  %util
hda        0.00    5.80   0.20   2.00    0.00    0.03    29.82     0.01   4.45   2.36   0.52
I used "iostat -mx 5" and found that %iowait is 0.60; whereas the %user is: 51.xx.
That means its not the io stuff taking time. That means, Threads can be used and can reduce the time of execution.
Happy to get to know that.
Please suggest (based on other my comments and post; how can I use threads) ?
Re: how to split huge file (16.7 million lines in; 600 output files; 132 seconds)
by BrowserUk (Patriarch) on Sep 02, 2011 at 12:23 UTC
#! perl -slw
use strict;

# Tunable from the command line (via -s): -NBUF=lines buffered per output file, -IBUF=bytes per disk read.
our $NBUF //= 5000;
our $IBUF //= 2e6;

my $start = time;
my @outFHs;     # one output filehandle per key, opened lazily
my @outBufs;    # per-key buffers of pending output lines
my $n = 0;
my( $o, $buf ) = ( 0, '' );

open DISK, '<', $ARGV[0] or die $!;

while( read( DISK, $buf, $IBUF, $o ) ) {
    # Parse the chunk line by line through an in-memory filehandle.
    open RAM, '<', \$buf;
    while( my $line = <RAM> ) {
        unless( $line =~ /\n$/ ) {
            # Incomplete last line: keep it and append the next chunk after it.
            $buf = $line;
            $o = length $buf;
            next;
        }
        ++$n;
        my $key = substr( $line, 7, 3 ) % 600;    # keying scheme for the test data
        if( push( @{ $outBufs[ $key ] }, $line ) > $NBUF ) {
            unless( defined $outFHs[ $key ] ) {
                open $outFHs[ $key ], '>', "$key.out" or die $!;
            }
            print { $outFHs[ $key ] } @{ $outBufs[ $key ] };
            @{ $outBufs[ $key ] } = ();
        }
    }
    # If the chunk ended exactly on a newline there is nothing to carry over.
    ( $o, $buf ) = ( 0, '' ) if $buf =~ /\n$/;
}

# Flush the remaining buffers, opening any output files not yet created.
for my $key ( 0 .. $#outBufs ) {
    next unless $outBufs[ $key ] and @{ $outBufs[ $key ] };
    open $outFHs[ $key ], '>', "$key.out" or die $! unless defined $outFHs[ $key ];
    print { $outFHs[ $key ] } @{ $outBufs[ $key ] };
}
close $_ for grep defined, @outFHs;
close DISK;

printf "Took %d seconds for %d records\n", time() - $start, $n;
__END__
C:\test>Ibufd.pl 1GB.csv
Took 132 seconds for 16777216 records
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: how to split huge file reading into multiple threads
by Anonymous Monk on Aug 24, 2011 at 05:26 UTC
Maybe? It depends. If your bottleneck is purely disk I/O, then no, though things like RAID might help. On the other hand, if your bottleneck is in processing the data, you can use one thread to read in records and hand them off to threads that process the data; a rough sketch of that boss/worker setup follows.
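Something along these lines, using Thread::Queue (the worker body and the batch size are placeholders; getting the matched lines back out is the other half of the problem, e.g. per-worker temporary files or the pipe technique zentara describes above):

use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue    = Thread::Queue->new;
my $NWORKERS = 4;

# Workers: drain batches of lines from the queue until they see the undef marker.
my @workers = map {
    threads->create( sub {
        while ( defined( my $batch = $queue->dequeue ) ) {
            for my $line ( @$batch ) {
                # ... match $line against the ~600 categories here ...
            }
        }
    } );
} 1 .. $NWORKERS;

# Reader: enqueue lines in batches so queue overhead doesn't dominate.
open my $in, '<', 'a.txt' or die $!;
my @batch;
while ( my $line = <$in> ) {
    push @batch, $line;
    $queue->enqueue( [ splice @batch ] ) if @batch >= 10_000;
}
$queue->enqueue( [ splice @batch ] ) if @batch;
close $in;

$queue->enqueue( undef ) for 1 .. $NWORKERS;   # one stop marker per worker
$_->join for @workers;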