Re^4: Optimise file line by line parsing, substitute SPLIT

by vsespb (Chaplain)
on Jun 03, 2013 at 14:32 UTC ( [id://1036773] )


in reply to Re^3: Optimise file line by line parsing, substitute SPLIT
in thread Optimise file line by line parsing, substitute SPLIT

more quickly than you can read the file and do nothing

It does not have to be quicker, just take comparable time. 20%-30% is already significant.

Also, the idea that the whole application's run time (from start to finish) is what matters is a bit wrong.

Often the startup time (when the file is actually read) is what is significant; after startup the application is doing something useful (and can be blocked by disk/network IO, or be waiting for user action) until the system reboots.

Do you want me to paste code where split() takes more than 20% of the time, when I just read a file into memory and skip some/most of the records?
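
A minimal sketch of the kind of comparison being described: one pass that just reads and skips records with a cheap regex, and one pass that split()s every line. The file name, the field number, and the match value are assumptions for illustration, not the OP's data:

    #!/usr/bin/perl
    # Sketch only: reading+skipping records vs. reading+split() on every line.
    use strict;
    use warnings;
    use Time::HiRes qw[ time ];

    my $file = shift // 'numbers.tsv';    # assumed TAB-separated input

    # Pass 1: read the file, skip most records with a cheap regex, no split().
    my $t0 = time;
    open my $fh, '<', $file or die "$file: $!";
    my @kept;
    while ( <$fh> ) {
        push @kept, $_ if /500/;          # keep only lines mentioning the value
    }
    close $fh;
    printf "read, no split: %.3f s (%u kept)\n", time() - $t0, scalar @kept;

    # Pass 2: the same file, but split() every line and test one field.
    $t0 = time;
    open $fh, '<', $file or die "$file: $!";
    @kept = ();
    while ( <$fh> ) {
        my @fields = split /\t/;
        push @kept, $_ if $fields[6] == 500;
    }
    close $fh;
    printf "read + split:   %.3f s (%u kept)\n", time() - $t0, scalar @kept;

The two passes do not keep identical record sets (the regex is only a cheap pre-test); the point is only how much of the total time the split() pass adds.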


Re^5: Optimise file line by line parsing, substitute SPLIT
by BrowserUk (Patriarch) on Jun 03, 2013 at 14:54 UTC
    Do you want me to paste code where split() takes more {blah}

    I want you to post code -- directly comparable to the OP's -- where doing something takes longer than doing nothing.

    But, if you really want to play, show me code that filters a 2-million-line file of 11 TAB-separated fields on the value of a field whose number and filter value I supply on the command line, more quickly than:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    our $FNO //= 6;
    our $V   //= 500;

    my $start = time;
    my @filtered;
    while( <> ) {
        my @fields = split( "\t", $_ );
        $fields[ $FNO ] == $V and push @filtered, $_;
    }
    printf "Took %f seconds\n", time() - $start;
    printf "Kept %u records\n", scalar @filtered;
    __END__
    C:\test>1036737 -FNO=6 -V=500 < numbers.tsv
    Took 19.072147 seconds
    Kept 2005 records

    C:\test>1036737 -FNO=6 -V=500 < numbers.tsv
    Took 19.021369 seconds
    Kept 2005 records


      I thought your point was that the OP actually does nothing with the data (read = nothing, read+split = nothing too), and that once he is reading every word on every page, the split time will be insignificant.

      But it seems you mean that the OP's benchmark is incorrect, because he benchmarks nothing vs. split.

      Otherwise I agree that split can't really be optimized, just like I wrote above.

        But it seems you mean that the OP's benchmark is incorrect, because he benchmarks nothing vs. split.

        No. As a measure of the time taken to do the splits, his benchmark is fine.

        What is wrong is his apparent expectation that locating 26 million tab characters, copying 28 million strings, and making 28 million assignments would (or should) take less than the 8 seconds it does. 80 million fairly complex operations in 8 seconds is one every tenth of a microsecond. And that is pretty damn good.

        The only ways to reduce that amount of time are:

        • Overlap the IO and processing (see the threaded sketch after this list).

          8 - 1.3 = 6.7 seconds, assuming perfect overlap, which is pretty much impossible.

          200 * 9.3 = 1860 -v- 200 * 6.7 = 1340

          A 28% saving as a target; but achieving it would be very hard.

        • Run (some of) the 200+ processes in parallel (see the fork sketch after this list).

          Doing 2 at a time would be a 50% gain; 4 at a time, 75%.

          Much better targets, and actually pretty close to achievable; but it requires careful programming to avoid disk thrash.

        • Do less work.

          Adding a single line to my code above:

          next unless /$V/;

          Can get a 90% saving in some cases:

          C:\test>1036737 -V=500 < numbers.tsv
          Took 19.138550 seconds ## without pre-filter
          Kept 2005 records

          C:\test>1036737 -V=500 < numbers.tsv
          Took 1.755853 seconds ## with pre-filter
          Kept 2005 records

          But that saving is negated, and the result is actually worse, for less specific searches:

          C:\test>1036737 -V=5 < numbers.tsv
          Took 18.765492 seconds ## Without pre-filter
          Kept 1944 records

          C:\test>1036737 -V=5 < numbers.tsv
          Took 20.232294 seconds ## With pre-filter
          Kept 1944 records
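
        A minimal sketch of the first option -- overlapping the read with the split/filter work via a reader thread and a shared queue. The queue structure, field number, and match value here are assumptions, not code from this thread:

          #!/usr/bin/perl
          # Sketch only: overlap IO and processing with a reader thread.
          use strict;
          use warnings;
          use threads;
          use Thread::Queue;    # end() needs Thread::Queue 3.01 or later

          my $q = Thread::Queue->new;

          # Reader thread: pulls lines off STDIN as fast as the disk allows.
          my $reader = threads->create( sub {
              $q->enqueue( $_ ) while <STDIN>;
              $q->end;          # tells dequeue() there will be no more items
          } );

          # Main thread: split and filter while the reader keeps the queue fed.
          my @filtered;
          while ( defined( my $line = $q->dequeue ) ) {
              my @fields = split /\t/, $line;
              $fields[ 6 ] == 500 and push @filtered, $line;
          }
          $reader->join;
          printf "Kept %u records\n", scalar @filtered;

        In practice the per-line enqueue/dequeue overhead can easily exceed the split() cost itself, which is one reason the 28% target above is so hard to reach.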
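
        And a minimal sketch of the second option -- fanning the 200+ files out across a few forked workers. The glob pattern, worker count, and the process_file() body are placeholders:

          #!/usr/bin/perl
          # Sketch only: run the per-file work N-at-a-time with fork().
          use strict;
          use warnings;

          my $WORKERS = 4;
          my @files   = glob 'data/*.tsv';    # placeholder for the 200+ files

          # Deal the files out round-robin, one batch per worker.
          my @batches;
          push @{ $batches[ $_ % $WORKERS ] }, $files[ $_ ] for 0 .. $#files;

          my @pids;
          for my $batch ( @batches ) {
              defined( my $pid = fork() ) or die "fork failed: $!";
              if ( $pid == 0 ) {              # child: process its share, then exit
                  process_file( $_ ) for @{ $batch };
                  exit 0;
              }
              push @pids, $pid;               # parent: remember the child
          }
          waitpid $_, 0 for @pids;            # wait for every worker to finish

          sub process_file {
              my ( $file ) = @_;
              open my $fh, '<', $file or die "$file: $!";
              while ( <$fh> ) {
                  my @fields = split /\t/;
                  # ... filter and keep records, as in the code above ...
              }
              close $fh;
          }

        Keeping $WORKERS modest is the simple guard against the disk thrash mentioned above.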

