PerlMonks  

Re^5: Optimise file line by line parsing, substitute SPLIT

by BrowserUk (Patriarch)
on Jun 03, 2013 at 14:54 UTC


in reply to Re^4: Optimise file line by line parsing, substitute SPLIT
in thread Optimise file line by line parsing, substitute SPLIT

Do you want me paste code where split() taking more {blah}

I want you to post code -- directly comparable to the OP's -- where doing something takes longer than doing nothing.

But, if you really want to play, show me code that filters a file of 2 million lines x 11 TAB-separated fields on the value of a field, whose number and filter value I supply on the command line, more quickly than:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $FNO //= 6;
our $V   //= 500;

my $start = time;
my @filtered;
while( <> ) {
    my @fields = split( "\t", $_ );
    $fields[ $FNO ] == $V and push @filtered, $_;
}
printf "Took %f seconds\n", time() - $start;
printf "Kept %u records\n", scalar @filtered;
__END__
C:\test>1036737 -FNO=6 -V=500 < numbers.tsv
Took 19.072147 seconds
Kept 2005 records

C:\test>1036737 -FNO=6 -V=500 < numbers.tsv
Took 19.021369 seconds
Kept 2005 records

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^6: Optimise file line by line parsing, substitute SPLIT
by vsespb (Chaplain) on Jun 03, 2013 at 15:09 UTC

    I thought your point was that the OP is actually doing nothing with the data (read = nothing, read + split = nothing too), and that he's going to read every word on every page soon, at which point the split time will be insignificant.

    But it seems that you mean that the OP's benchmark is incorrect, because he benchmarks nothing vs split.

    Otherwise I agree that split can't really be optimized, just like I wrote above.

      But it seems that you mean that the OP's benchmark is incorrect, because he benchmarks nothing vs split.

      No. As a measure of the time taken to do the splits, his benchmark is fine.

      What is wrong is his apparent expectation that locating 26 million tab characters, copying 28 million strings, and making 28 million assignments would (or should) take less than the 8 seconds it does. 80 million fairly complex operations in 8 seconds is one every tenth of a microsecond. And that is pretty damn good.
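The arithmetic behind that estimate can be checked with a few lines. This is only a sketch using the figures quoted above; the sub name `us_per_op` is made up for illustration. The three counts sum to 82 million, which the post rounds to 80 million, so the result comes out just under a tenth of a microsecond.

```perl
use strict;
use warnings;

# Average cost per operation in microseconds, given the elapsed
# seconds and a list of operation counts.
sub us_per_op {
    my( $seconds, @counts ) = @_;
    my $total = 0;
    $total += $_ for @counts;
    return $seconds / $total * 1e6;
}

# Figures quoted in the post: 26M tabs located, 28M strings copied,
# 28M assignments, all in roughly 8 seconds.
printf "%.3f microseconds per operation\n", us_per_op( 8, 26e6, 28e6, 28e6 );
```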

      The only ways to reduce that amount of time are:

      • Overlap the IO and processing.

        8 - 1.3 = 6.7 seconds, assuming perfect overlap, which is pretty much impossible.

        200*9.3 = 1860 -v- 200 * 6.7 = 1340

        28% as a target; but achieving it would be very hard.

      • Run (some of) the 200+ processes in parallel.

        Doing 2 at a time would be a 50% gain. 4 at a time 75%.

        Much better targets, and actually pretty close to achievable; but it requires careful programming to avoid disk thrash.

      • Do less work.

        Adding a single line to my code above:

        next unless /$V/;

        can give a 90% saving in some cases:

        C:\test>1036737 -V=500 < numbers.tsv
        Took 19.138550 seconds   ## without pre-filter
        Kept 2005 records

        C:\test>1036737 -V=500 < numbers.tsv
        Took 1.755853 seconds    ## with pre-filter
        Kept 2005 records

        But that saving is negated, and the result is actually worse, for less specific searches:

        C:\test>1036737 -V=5 < numbers.tsv
        Took 18.765492 seconds   ## without pre-filter
        Kept 1944 records

        C:\test>1036737 -V=5 < numbers.tsv
        Took 20.232294 seconds   ## with pre-filter
        Kept 1944 records
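The -V=5 case is slow because a bare /$V/ matches a 5 anywhere in the line, so almost nothing is rejected and the regex is pure overhead. One possible tightening, offered as an untimed sketch and not as the code used for the runs above, is to make the pre-filter match only when the value forms a complete TAB-delimited field. The sub name `filter_lines` is hypothetical, and the sketch assumes the value appears in the file exactly as typed (500, not 500.0 or 0500).

```perl
use strict;
use warnings;

# Hypothetical helper: keep the lines whose field number $fno is
# numerically equal to $v.
sub filter_lines {
    my( $fno, $v, @lines ) = @_;

    # Pre-filter: $v must be a whole field, bounded by line start,
    # TAB, or line end. Assumes $v is written the same way in the file.
    my $pre = qr/(?:^|\t)\Q$v\E(?:\t|$)/;

    my @kept;
    for my $line ( @lines ) {
        next unless $line =~ $pre;      # cheap rejection without a split
        my @fields = split /\t/, $line;
        push @kept, $line if $fields[ $fno ] == $v;
    }
    return @kept;
}

# Tiny sample: only the lines with 5 in field 1 survive.
my @sample = ( "1\t5\t500", "5\t15\t50", "7\t5\t55" );
print "$_\n" for filter_lines( 1, 5, @sample );
```

Whether this wins anything depends on how often the value occurs as a whole field in other columns; it would need timing against the same numbers.tsv before drawing conclusions.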

