in reply to How to optimize a regex on a large file read line by line ?

I'd appreciate it if someone would take this "big buffer" approach and adapt it to the test case and get timings for it. I'm stuck on this small tablet, so I can't test it myself.

http://ideone.com/LzaQI0

I don't even know how to paste it into this post, sorry
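
The gist of the approach, as a sketch only (not the exact code at the link; the chunk size is arbitrary, and the regex and file name are the ones from the test case): read a large chunk, extend it to the end of the current line, then count newlines and matches against the whole buffer at once.

  use strict; use warnings;

  my $file = 'Dictionary2GB.txt';
  open my $fh, '<', $file or die "open '$file': $!";

  my ( $numlines, $occurrences ) = ( 0, 0 );

  while ( read( $fh, my $buf, 24 * 1024 * 1024 ) ) {
      $buf .= <$fh> // '';                     # finish partial line
      $numlines += $buf =~ tr/\n//;            # newlines in this chunk
      my $n = () = $buf =~ /123456\r?$/mg;     # matches in this chunk
      $occurrences += $n;
  }

  print "Num lines  : $numlines\n";
  print "Occurrences: $occurrences\n";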


Re^2: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 18, 2016 at 05:21 UTC

    Update: Changed the chunk_size option from '1m' to '24m'. The time drops to 3.2 seconds via MCE with the FS cache purged ( sudo purge ) before running on a MacBook Pro laptop. Previously, this took 6.2 seconds with chunk_size => '1m'. The time is ~ 1 second if the file resides in the FS cache.

    Update: Added the 'm' modifier to the regex operation.

    Update: With the file not residing in the FS cache, the time is 7.8 seconds running serially and 6.2 seconds running on many cores for the ~ 2 GB plain text file. Once the file is in the FS cache, the time is 5.4 seconds serially and 0.9 seconds via MCE.

    Update: Unzipping the file meant that the file resided in the FS cache afterwards. One doesn't normally flush FS memory, but I meant to do so before running. I have already removed the zip and plain text files and did not run again. IO is fast when processing a file directly; the reason is that workers do not involve the manager process when reading.

    Anonymous Monk, the following is a parallel demonstration of the code at the ideone link. Yes, reading line by line is not necessary; that is what gives the 5x gain over the serial version. It is also several times faster than the previous parallel demonstrations.

    The parallel example below parses the ~ 2 GB plain text file in 0.9 seconds. The serial demonstration at the ideone link completes in 5.2 seconds. My laptop has 4 real cores and 4 hyper-threads. Seeing nearly 6x is really good; I did not expect that.

    use strict; use warnings;
    use MCE::Flow;
    use MCE::Shared;

    # shared counters, updated by all workers
    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow_f {
        chunk_size  => '24m',
        max_workers => 8,
        use_slurpio => 1,
    },
    sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;

        # count lines and matches in this chunk
        my $numlines   = $$chunk_ref =~ tr/\n//;
        my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

        $counter1->incrby( $numlines );
        $counter2->incrby( $occurances );

    }, "Dictionary2GB.txt";

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";

      How do you handle a chunk that ends in the middle of the pattern? I did it by completing the partial line (see code line with comment "finish partial line").

        Yes, that is likely to happen when slurping a chunk. MCE handles that automatically by reading till the end of line.

      Thanks for the timings. If possible, would you please also get a time for the grep+wc on your machine, so we can tell how both of these solutions compare to it?

Re^2: How to optimize a regex on a large file read line by line ?
by LanX (Saint) on Apr 18, 2016 at 01:59 UTC
    The code is incomplete because a match could span two chunks.

    You need to seek back by the length of the longest possible match (here 8) before reading the next chunk.

    Actually the correct number is something like min( p, m )

    with p = chunksize - pos

    and m = the length of the longest possible match
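
    A minimal sketch of that idea (untested; the chunk size and file name are just for illustration): keep an overlap of 8 bytes at the end of each chunk, only count matches that start before the overlap, and seek back so the overlap is re-read with full context in the next chunk.

      use strict; use warnings;
      use Fcntl qw(SEEK_CUR);

      my $maxlen = 8;                # longest possible match: "123456\r" plus the "\n"
      my $count  = 0;

      open my $fh, '<:raw', 'Dictionary2GB.txt' or die $!;

      while ( read( $fh, my $buf, 24 * 1024 * 1024 ) ) {
          my $last   = eof( $fh );
          my $cutoff = $last ? length( $buf ) : length( $buf ) - $maxlen;

          while ( $buf =~ /123456\r?$/mg ) {
              # a match starting inside the overlap is skipped here;
              # the next chunk sees it again with the full context after it
              $count++ if $-[0] < $cutoff;
          }
          seek( $fh, -$maxlen, SEEK_CUR ) unless $last;
      }

      print "Occurrences: $count\n";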

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      The match is always within a single line; that's the purpose of the line

      $_ .= <$fh> // '';

      It completes a partial line.

        Ahh! You are combining read with readline ...

         $_ .= <$fh> // ''; # finish partial line

        That's a good trick!

        (As long as a line doesn't become bigger than memory, but that's hardly the case here.)
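
        A tiny self-contained demo of the trick (made-up data in an in-memory filehandle): the fixed-size read can stop mid-line, and the appended readline always brings the buffer back to a line boundary.

         use strict; use warnings;

         my $data = "alpha\nbravo 123456\ncharlie\n";
         open my $fh, '<', \$data or die $!;     # in-memory filehandle

         while ( read( $fh, my $buf, 10 ) ) {    # deliberately tiny chunks
             $buf .= <$fh> // '';                # finish partial line
             print "chunk: ", join( ' | ', split /\n/, $buf ), "\n";
         }

        Each printed chunk ends on a complete line; if read() happens to stop exactly at a newline, the readline simply pulls in one extra whole line.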

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!