in reply to Speed and memory issue with large files

Tie::File does not work well on huge files. The following finds and prints the 20 millionth line of a 40-million-line, 3 GB file in 12 seconds:

c:\test>wc -l syssort
40000000 syssort

c:\test>dir syssort
19/12/2009  13:47     3,160,000,000 syssort

c:\test>perl -le"$t=time;scalar<>for 1..20e6;print scalar<>;print time()-$t" syssort
49_992_005_J1 chr9 97768833 97768867 ATTTTCTTCAATTACATTTCCAATGCTATCCCAAA 35
12
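For anyone who prefers it spelled out, here is the same approach as a short script rather than a one-liner (a rough sketch; the file name and the 20-million count are just the values from the run above):

    use strict;
    use warnings;

    my $file = shift || 'syssort';    # the big file from the run above
    open my $fh, '<', $file or die "Cannot open '$file': $!";

    my $start = time;
    for (1 .. 20_000_000) {           # read and discard the first 20 million lines
        my $discard = <$fh>;
    }
    print scalar <$fh>;               # the next line is the one we want
    print "took ", time() - $start, " seconds\n";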

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"I'd rather go naked than blow up my ass"

Re^2: Speed and memory issue with large files
by ikegami (Patriarch) on Mar 19, 2010 at 17:21 UTC

    Tie::File does not work well on huge files.

    Indeed. It memorizes the byte position of the start of every line it has encountered in order to jump to a specific line quickly. This adds up, and that functionality isn't needed here (since there's no need to jump back).

    Contrary to what the documentation implies, this memory usage cannot be limited.
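    To make that concrete, here is a rough sketch of what a Tie::File version would look like (the file name and line number are placeholders, not taken from the thread), with the memory cost noted where it arises:

        use strict;
        use warnings;
        use Tie::File;

        my $file   = 'syssort';        # placeholder: some huge line-oriented file
        my $target = 20_000_000;       # placeholder: 1-based line number wanted

        # Tie::File presents the file as an array, but to satisfy a read of
        # $lines[$target - 1] it must scan forward and remember the byte offset
        # of every line it passes. For tens of millions of lines that offset
        # index alone is large, and (per the point above) it cannot be capped.
        tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";
        print $lines[$target - 1], "\n";   # elements come back without the newline
        untie @lines;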

      You're at it again. Not only have you changed the content of this node without attribution, you've also changed the entire tone and meaning of it. You really are underhand.


Re^2: Speed and memory issue with large files
by firmament (Novice) on Mar 19, 2010 at 16:53 UTC
    Thanks a bunch!