Speed and memory issue with large files

by firmament (Novice)
on Mar 19, 2010 at 16:34 UTC

firmament has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm using the following primitive script to take a look at a particular line in a very large (3 GB) file. It takes forever (30-40 minutes or more), though, so I'm wondering if anyone has suggestions on how to make this faster?

#!/usr/bin/perl
use Tie::File;

$infile = '/path/to/myfile.xml';
tie @array, 'Tie::File', $infile or die $!;

print $array[64366480];

Doing the equivalent in AWK took 3 minutes, but as I need to expand on this I'd prefer to stay with Perl.

On a side note, I ran out of memory when I used a simple while loop, which surprised me, since a while loop shouldn't slurp the file the way a foreach would?

Thanks in advance.

Replies are listed 'Best First'.
Re: Speed and memory issue with large files
by BrowserUk (Patriarch) on Mar 19, 2010 at 16:44 UTC

    Tie::File does not work well on huge files. The following finds and prints the 20 millionth line of a 40 million line 3GB file in 12 seconds:

    c:\test>wc -l syssort
    40000000 syssort

    c:\test>dir syssort
    19/12/2009  13:47     3,160,000,000 syssort

    perl -le"$t=time;scalar<>for 1..20e6;print scalar<>;print time()-$t" syssort
    49_992_005_J1 chr9 97768833 97768867 ATTTTCTTCAATTACATTTCCAATGCTATCCCAAA  35
    12
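    For readability, the same skip-then-print approach can be written out as a small script. This is just a sketch of the one-liner above; the file name 'syssort' and the 20 million lines to skip are taken from that example and would be adjusted as needed:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $skip = 20_000_000;                    # lines to read and discard first
    my $file = 'syssort';                     # example file name from the run above

    my $t = time;
    open my $fh, '<', $file or die "$file: $!";
    scalar <$fh> for 1 .. $skip;              # read and throw away the first $skip lines
    print scalar <$fh>;                       # print the next line
    print time() - $t, "\n";                  # elapsed seconds
    close $fh;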

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Tie::File does not work well on huge files.

      Indeed. It memorizes the byte position of the start of every line it has encountered in order to jump to a specific line quickly. This adds up, and that functionality isn't needed here (since there's no need to jump back).

      Contrary to what the documentation implies, this memory usage cannot be limited.
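      To see why it adds up: a line index is essentially one remembered byte offset per line read, something like the sketch below (not Tie::File's actual internals, just the general idea, using the path from the original post):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # One byte offset remembered per line seen -- for tens of millions of
      # lines, the index alone costs a substantial amount of memory.
      my @offset = (0);                       # line 0 starts at byte 0

      open my $fh, '<', '/path/to/myfile.xml' or die $!;
      while (<$fh>) {
          push @offset, tell $fh;             # where the next line will start
      }

      # With such an index, any line can then be fetched directly:
      #   seek $fh, $offset[64366480], 0;
      #   print scalar <$fh>;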

        You're at it again. Not only have you changed the content of this node without attribution, you've also changed the entire tone and meaning of it. You really are underhand.


      Thanks a bunch!
Re: Speed and memory issue with large files
by toolic (Bishop) on Mar 19, 2010 at 17:05 UTC
    On a side note, I ran out of memory when I used a simple while loop, which surprised me, since a while loop shouldn't slurp the file the way a foreach would?
    A while loop seems fine to me. It should not slurp all the contents into memory. This prints the 10 millionth line ($.) of a 187M line (2GB) file for me in about 3 seconds:
    while (<>) {
        if ($. == 10_000_000) {
            print;
            last;
        }
    }
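    Since the diamond operator reads from the files named on the command line (or from STDIN), the same thing can also be run as a one-liner. A sketch using the line wanted in the original question ($array[64366480] is the 64,366,481st line, as Tie::File indexes from 0):

    perl -ne 'if ($. == 64_366_481) { print; last }' /path/to/myfile.xml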
Re: Speed and memory issue with large files
by eff_i_g (Curate) on Mar 19, 2010 at 17:18 UTC
    How about tail +64366480 file | head -1?

    You mentioned staying with Perl, but this could be called from Perl.
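A minimal sketch of calling that pipeline from Perl and capturing the result, assuming a tail that accepts the -n +N form (GNU coreutils does; older tails use the bare +N form shown above):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = '/path/to/myfile.xml';
    my $n    = 64366480;                      # line number from the suggestion above

    # Shell out to tail/head and capture the single line.
    my $line = `tail -n +$n "$file" | head -n 1`;
    die "pipeline failed: $?" if $?;
    print $line;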
