Re^3: Threads From Hell #2: How To Parse A Very Huge File

by marioroy (Prior)
on May 24, 2015 at 12:57 UTC


in reply to Re^2: Threads From Hell #2: How To Parse A Very Huge File
in thread Threads From Hell #2: How To Search A Very Huge File [SOLVED]

MCE::Grep is not the tool for this: calling a code block once per line is the reason for the overhead. The testing was done on a CentOS VM. The MCE::Loop example, however, showed a way that allows Perl to run faster.
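For illustration, here is a minimal sketch of the chunked approach (not the exact script behind the timings above; the file name, pattern, chunk size, and worker count are placeholders). With use_slurpio, MCE::Loop hands each worker a large chunk of the file as a scalar reference, so the code block runs once per chunk rather than once per line:

    use MCE::Loop;

    # Each worker receives a multi-megabyte chunk of the file as a
    # scalar reference; the block runs once per chunk, avoiding the
    # per-line sub-call overhead of the MCE::Grep version.
    MCE::Loop::init {
        max_workers => 4, chunk_size => '4m', use_slurpio => 1
    };

    my $count = 0;
    $count += $_ for mce_loop_f {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        my $found = () = $$slurp_ref =~ /pattern/mg;  # matches in this chunk
        MCE->gather($found);                          # return per-chunk tally
    } 'huge_file.txt';

    print "$count\n";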

Granted, one is unlikely to need a character count; I was mainly comparing the wc binary against the Perl script.


Re^4: Threads From Hell #2: How To Parse A Very Huge File
by BrowserUk (Patriarch) on May 24, 2015 at 14:22 UTC
    the MCE::Loop example

    I don't see any mention of MCE::Loop in either of your posts?

    The testing was done on a CentOS VM

    I see. And what hardware allows you to read 2GiB at 16Gbits/second?



      The testing was done on a late 2013 MacBook Pro (Haswell Core i7) at 2.6 GHz with 1600 MHz memory. I am running Parallels Desktop 9.0. From repeated testing, the grep/wc commands and Perl scripts were likely reading the file from the OS-level file cache.

        likely reading the file from the OS-level file cache

        Indeed.

        That's why I used a 10GB file for my testing. I've only got 8GB of RAM, so there's no way for the file to be read from cache on subsequent tests.

        In the real world where the file being searched is coming off a disk or SSD, there is no benefit to multi-tasking grep.

        Even in the extremely rare case of grepping the same file multiple times, although your numbers show a reduction in elapsed time, the CPU usage is actually higher: 2.527/2.127 = 1.19, i.e. roughly 19% more.

        If the user is (for the sake of a term) an end-user, who types the command and hits enter, the saving of a second or so is probably less than the time it took him to decide what to type and to type it; and certainly less than the time he will take to decide what to do with the information it produces.

        On the other hand, if the user is a sysadmin trying to balance the needs of many processes across a farm of servers, using that extra 19% of CPU resource is probably a bad thing.


