Re: how to split huge file reading into multiple threads
by moritz (Cardinal) on Aug 23, 2011 at 13:03 UTC
First you need to determine where your bottleneck is. If it's disc IO speed, splitting the problem into multiple threads doesn't help at all; it might even make things worse. Then the only thing you can do is to buy faster discs.
If CPU is the bottleneck, it might make sense to investigate threads or processes.
Use Devel::NYTProf (on a reduced data sample, but please make it big enough that it doesn't fit into the buffer cache) to find out what steps take the most time. If it's readline() or so, don't even think of threads.
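For example, a profiling run on a sample file might look something like this (script and sample file names are placeholders):
perl -d:NYTProf yourscript.pl sample.txt
nytprofhtml
Then open the generated nytprof/index.html report and see which lines and subs dominate the wall-clock time.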
Re: how to split huge file reading into multiple threads
by zentara (Cardinal) on Aug 23, 2011 at 12:58 UTC
"Can we use threads in such a way that multiple threads will be acting on the million record file so that the time is reduced to a few minutes?" Probably not, since all the threads will be hitting the same bottleneck of trying to read the same file at the same time. The limiting factor is how fast your hard drive is. See How do you parallelize STDIN for large file processing? and Is Using Threads Slower Than Not Using Threads? for examples.
There may be some improvement to be gained if you could use a program like split to break your huge file into smaller chunks, place them on separate hard drives, and then let your parallel processes work on them.
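For example, something along these lines would break the file into one-million-line chunks (the chunk size and prefix are just illustrative):
split -l 1000000 a.txt chunk_
That produces chunk_aa, chunk_ab, and so on, which you could then spread across drives and feed to separate processes.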
It dawned on me that there is another way to collect the output from the threads. You can open a filehandle in the main thread, pass its fileno to the thread, then let the thread write to the dup'd filehandle. See [threads] Open a file in one thread and allow others to write to it for the technique.

Anyway, you could open one filehandle for each thread, for that thread to report results back to the main thread, and pass the fileno of that filehandle to each thread at creation time. In the main thread, set up an IO::Select object to watch all the filehandles. Have the main thread open a private filehandle for the final output file, and as IO::Select reports data from each thread, write it out to the output file. This would allow the threads to write without worrying about locking, while the main thread's select loop actually handles writing, and possibly sorting, the data out to file. I don't know how it would work speed-wise, as select will block if one thread reports a lot of data, but this might be minimized by using large filehandle buffers. That is what I would try first.
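A rough, untested sketch of that arrangement (the worker body is a stand-in for the real per-thread work, and a real version would need to think harder about buffering and partial lines):

use strict;
use warnings;
use threads;
use IO::Handle;
use IO::Select;

my $NWORKERS = 4;
my( @readers, @writers, @workers );

for my $id ( 1 .. $NWORKERS ) {
    pipe( my $rd, my $wr ) or die "pipe: $!";
    push @readers, $rd;
    push @writers, $wr;     # keep the write ends alive in the main thread
    push @workers, threads->create( \&worker, $id, fileno( $wr ) );
}

sub worker {
    my( $id, $fno ) = @_;
    open my $out, '>&', $fno or die "dup fd $fno: $!";   # private dup of the pipe
    $out->autoflush( 1 );                                # write whole lines at a time
    print {$out} "worker $id: result $_\n" for 1 .. 5;   # stand-in for the real work
    print {$out} "__DONE__\n";
    close $out;
    return;
}

# The main thread multiplexes the pipes and owns the single output file.
open my $final, '>', 'combined.out' or die $!;
my $sel  = IO::Select->new( @readers );
my $done = 0;
while( $done < $NWORKERS ) {
    for my $fh ( $sel->can_read ) {
        my $line = <$fh>;
        next unless defined $line;
        if( $line eq "__DONE__\n" ) { ++$done; $sel->remove( $fh ); }
        else                        { print {$final} $line; }
    }
}
$_->join for @workers;
close $_ for @writers, @readers, $final;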
Re: how to split huge file reading into multiple threads
by BrowserUk (Patriarch) on Aug 23, 2011 at 14:01 UTC
perl -MTime::HiRes=time -E"BEGIN{$t=time()}" -nle"++$n }{ printf qq[$n records in %f seconds\n], time-$t" 250MB.CSV
4194304 records in 2.518000 seconds
So, how about you post your code and let us help you fix it?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Thanks!
Yes, I am also under the impression that Perl, being designed for pattern extraction and reporting, would be fast and reliable for text processing.
Please see my reply above to AR, dated Aug 30, 2011 at 09:05 UTC, for what my code does.
Thank you for extending the hand. However, as of now, I really don't think the per-record processing I am doing is what consumes the time.
I don't want other monks to get misdirected by my pasting the code.
I don't want other monks to get misdirected by my pasting the code.
Let "other monks" look after themselves.
If your code is taking 2 1/2 hours to process 20 million records against 600 records stored in a hash, then it is your code that has problems. Should we try and guess what mistakes you are making?
Are you, for instance, treating the hash as an array? Or re-opening the output files for every record you write?
Post the code and we won't have to make such guesses.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: how to split huge file reading into multiple threads
by AR (Friar) on Aug 23, 2011 at 12:58 UTC
Can you show us a stripped-down copy of your single-threaded working code? I have a couple of single-threaded scripts that regularly parse gigabytes of text files in a few minutes. Maybe there's something in your current script that could be fixed.
If your script is being delayed by disk reads and writes, then multi-threading will not help you.
Here is what my code does:
I have a file with 20+ million lines/records (say a.txt).
I have another file with around 600 lines/records (say b.txt). These lines carry categories, and a category can match more than one line/record.
Now, what my code does is:
1. Create a hash out of b.txt (key = category; value = some mandatory part of the record).
2. Read every record from a.txt and check whether it matches any of the mandatory parts; if yes, create a file for that category and dump the entire line/record into it.
So every record (of 20+ million) is compared against roughly 600 records (in the worst case, when the match found is the last record).
And that is where all the processing/looping happens.
Please help: how can I expedite the process?
Please show some code. We can help you best if you show a stripped down, but working, version of your code with sample data.
Maybe your problem is that you're opening files over and over again when you should be keeping them open. I can't tell from your description of the code.
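If that is the problem, the usual fix is to open each category's output file once and cache the handle. A minimal sketch, assuming b.txt is tab-separated into category and "mandatory part" (the real field layout may well differ):

use strict;
use warnings;

my %category_of;   # mandatory part of record => category (built from b.txt)
my %fh_for;        # category => output filehandle, opened once and reused

open my $map, '<', 'b.txt' or die $!;
while ( my $entry = <$map> ) {
    chomp $entry;
    my( $category, $mandatory ) = split /\t/, $entry, 2;   # assumed layout
    $category_of{ $mandatory } = $category;
}
close $map;

open my $in, '<', 'a.txt' or die $!;
while ( my $record = <$in> ) {
    for my $mandatory ( keys %category_of ) {
        next if index( $record, $mandatory ) < 0;
        my $cat = $category_of{ $mandatory };
        $fh_for{ $cat } //= do {               # open on first use only
            open my $out, '>', "$cat.out" or die $!;
            $out;
        };
        print { $fh_for{ $cat } } $record;
        last;                                  # first match wins
    }
}
close $in;
close $_ for values %fh_for;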
As Corion suggests, it is hard to offer much by way of constructive advice without something concrete to play with. However, it may be that you can leverage regular expressions in some fashion to speed up the matching phase of the process. I can't provide much more focused advice without some information about the nature of the matching.
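For instance, if the ~600 "mandatory parts" are literal strings, folding them into one compiled alternation can beat looping over 600 separate matches for every record. A sketch with made-up data:

use strict;
use warnings;

# Maps each mandatory string to its category; in real life built from b.txt.
my %category_of = (
    'ERR-401' => 'auth',
    'ERR-500' => 'server',
    # ... roughly 600 entries
);

# Longest strings first so overlapping patterns prefer the more specific match.
my $alternation = join '|',
                  map  { quotemeta }
                  sort { length $b <=> length $a } keys %category_of;
my $match = qr/($alternation)/;

while ( my $record = <STDIN> ) {
    if ( $record =~ $match ) {
        my $category = $category_of{ $1 };
        # ... hand $record to whatever writes the per-category files
    }
}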
True laziness is hard work
Re: how to split huge file reading into multiple threads
by locked_user sundialsvc4 (Abbot) on Aug 23, 2011 at 14:17 UTC
My intuitive guess, based on your task-description, is that your algorithm is probably memory-based, and what is therefore actually happening is “classic thrashing.” In this case, threads won’t help at all.
Consider ways to use disk-based sorting to manage the files. Or, put the data into an SQLite database (disk file...) and use its indexing and querying capability. The bottom line is ... don’t do anything “in memory.” That means: no hashes, no lists, no “potentially big things in memory” at all.
An appropriate redesign should not blink at all at “millions of records.” But we do know that the classic performance-curve caused by thrashing is ... not linear, but exponential ... degradation. When you say, “2+ hours,” that’s what it fairly screams to me.
Easy test: fire up the program and use a separate system monitor to watch the swap I/O rate, and the percentage of time spent in page faults. If it is, as I suspect it will be, “huge,” then there’s your answer.
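For what it's worth, a bare-bones version of the SQLite route might look like this (the schema, file names, and the categorize() stub are all invented for illustration):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=records.db', '', '',
                        { RaiseError => 1, AutoCommit => 0 } );

$dbh->do( 'CREATE TABLE IF NOT EXISTS records ( category TEXT, line TEXT )' );
$dbh->do( 'CREATE INDEX IF NOT EXISTS idx_category ON records ( category )' );

my $insert = $dbh->prepare( 'INSERT INTO records ( category, line ) VALUES ( ?, ? )' );

sub categorize {            # stand-in for the real matching logic
    my( $line ) = @_;
    return $line =~ /^(\w+)/ ? $1 : undef;
}

open my $in, '<', 'a.txt' or die $!;
while ( my $line = <$in> ) {
    my $category = categorize( $line );
    $insert->execute( $category, $line ) if defined $category;
}
close $in;
$dbh->commit;

# Later, pull the data back one category at a time instead of holding it in RAM:
my $lines = $dbh->selectcol_arrayref(
    'SELECT line FROM records WHERE category = ?', undef, 'some_category' );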
If you are on linux, try running iostat -mx 5 while your non-threaded perl script is processing the input. It will show you where the time is being spent. If most of your time is being spent in %iowait rather than %user, then you need to do something to reduce IO in order to reduce execution time, and as the previous post said, threads will not help you if this is the case.
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    0.10    0.05    0.00   99.80

Device:  rrqm/s  wrqm/s    r/s    w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await  svctm  %util
hda        0.00    5.80   0.20   2.00    0.00    0.03    29.82     0.01   4.45   2.36   0.52
I used "iostat -mx 5" and found that %iowait is 0.60; whereas the %user is: 51.xx.
That means its not the io stuff taking time. That means, Threads can be used and can reduce the time of execution.
Happy to get to know that.
Please suggest (based on other my comments and post; how can I use threads) ?
Re: how to split huge file (16.7 million lines in; 600 output files; 132 seconds)
by BrowserUk (Patriarch) on Sep 02, 2011 at 12:23 UTC
#! perl -slw
use strict;

# Tunable from the command line (via -s): -NBUF=lines buffered per output file, -IBUF=bytes per disk read.
our $NBUF //= 5000;
our $IBUF //= 2e6;

my $start = time;
my @outFHs;     # one output filehandle per key, opened lazily
my @outBufs;    # per-key buffers of pending output lines
my $n = 0;
my( $o, $buf ) = ( 0, '' );

open DISK, '<', $ARGV[0] or die $!;

while( read( DISK, $buf, $IBUF, $o ) ) {
    # Parse the chunk line by line through an in-memory filehandle.
    open RAM, '<', \$buf;
    while( my $line = <RAM> ) {
        unless( $line =~ /\n$/ ) {
            # Incomplete last line: keep it and append the next chunk after it.
            $buf = $line;
            $o = length $buf;
            next;
        }
        ++$n;
        my $key = substr( $line, 7, 3 ) % 600;    # keying scheme for the test data
        if( push( @{ $outBufs[ $key ] }, $line ) > $NBUF ) {
            unless( defined $outFHs[ $key ] ) {
                open $outFHs[ $key ], '>', "$key.out" or die $!;
            }
            print { $outFHs[ $key ] } @{ $outBufs[ $key ] };
            @{ $outBufs[ $key ] } = ();
        }
    }
    # If the chunk ended exactly on a newline there is nothing to carry over.
    ( $o, $buf ) = ( 0, '' ) if $buf =~ /\n$/;
}

# Flush the remaining buffers, opening any output files not yet created.
for my $key ( 0 .. $#outBufs ) {
    next unless $outBufs[ $key ] and @{ $outBufs[ $key ] };
    open $outFHs[ $key ], '>', "$key.out" or die $! unless defined $outFHs[ $key ];
    print { $outFHs[ $key ] } @{ $outBufs[ $key ] };
}
close $_ for grep defined, @outFHs;
close DISK;

printf "Took %d seconds for %d records\n", time() - $start, $n;
__END__
C:\test>Ibufd.pl 1GB.csv
Took 132 seconds for 16777216 records
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: how to split huge file reading into multiple threads
by Anonymous Monk on Aug 24, 2011 at 05:26 UTC
Maybe? It depends. If your bottleneck is purely disk I/O, then no, though things like RAID might help. On the other hand, if your bottleneck is in processing the data, you can use one thread to read in records and hand them off to threads that process the data; a rough sketch of that boss/worker setup follows.
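Something along these lines, using Thread::Queue (the worker body and the batch size are placeholders; getting the matched lines back out is the other half of the problem, e.g. per-worker temporary files or the pipe technique zentara describes above):

use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue    = Thread::Queue->new;
my $NWORKERS = 4;

# Workers: drain batches of lines from the queue until they see the undef marker.
my @workers = map {
    threads->create( sub {
        while ( defined( my $batch = $queue->dequeue ) ) {
            for my $line ( @$batch ) {
                # ... match $line against the ~600 categories here ...
            }
        }
    } );
} 1 .. $NWORKERS;

# Reader: enqueue lines in batches so queue overhead doesn't dominate.
open my $in, '<', 'a.txt' or die $!;
my @batch;
while ( my $line = <$in> ) {
    push @batch, $line;
    $queue->enqueue( [ splice @batch ] ) if @batch >= 10_000;
}
$queue->enqueue( [ splice @batch ] ) if @batch;
close $in;

$queue->enqueue( undef ) for 1 .. $NWORKERS;   # one stop marker per worker
$_->join for @workers;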