
Hi monks -

I'm writing a script that makes one pass through a very large (up to 3 GB) file, and needs to keep track of the first and last characters in a sliding window of constant small size (e.g. 500).

Walking the filehandle with getc(FH) (actually, two handles - one trailing the other by 500 bytes) is pretty slow. Line by line is a lot faster if I don't try to look at individual characters...but then I've got to index into the line with substr, which slows things down even more. Reading each line into a character array with split is dog slow.

The format of the file I'm hitting allows me to grab very large chunks (10-50 thousand characters each) by setting $/=">", because ">" appears at regular, distant intervals. But indexing into the string captured from the file in that way with substr is amazingly slow. Again, pushing the string to an array (where indexing is fast) with split is even worse.
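For reference, here is a minimal sketch of the chunked read described above, not a faster method. It reads one ">"-terminated record at a time and reports the first and last character of every fixed-width window, carrying the tail of each record forward so windows that span a record boundary are not lost. The names scan_windows and $callback are hypothetical, and the sample data is an in-memory string standing in for the real file. It still indexes with substr, so it shows the structure of the approach only.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback->($first, $last) for every window of
# $width characters in the stream, reading one ">"-terminated record at a time.
sub scan_windows {
    my ($fh, $width, $callback) = @_;
    local $/ = ">";              # each read returns text up to and including ">"
    my $keep = $width - 1;       # characters to carry over between records
    my $tail = '';
    while (defined(my $chunk = <$fh>)) {
        my $buf = $tail . $chunk;
        # The range is empty when the buffer is shorter than one window.
        for my $i (0 .. length($buf) - $width) {
            $callback->(substr($buf, $i, 1), substr($buf, $i + $keep, 1));
        }
        # Keep the last $width-1 characters for windows crossing the boundary.
        $tail = $keep == 0           ? ''
              : length($buf) > $keep ? substr($buf, -$keep)
              :                        $buf;
    }
    return;
}

# Usage with an in-memory filehandle (sample data is made up).
my $data = "ab>cd>e";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my @pairs;
scan_windows($fh, 3, sub { push @pairs, "$_[0]$_[1]" });
close $fh;
print "@pairs\n";   # prints "a> bc >d c> de"
```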

Any suggestions about how to take advantage of the fact that Perl is perfectly fast at reading in large hunks of the file into a string, without having to index into that string with substr?

In short, can you beat the time I get reading a huge file by using:

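  # Baseline: read the file one character at a time with getc.
  open(FH, "<", "ciona_fasta_two") or die "Cannot open ciona_fasta_two: $!";
  until (eof(FH)) {
    $ch = getc(FH);
    if ($ch eq 'x'){
      #do something;
    }
  }
  close FH;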

(It should be possible to beat it by *a lot*)
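One possible direction, offered as an unbenchmarked sketch: read the file in fixed-size blocks with read() and let index() locate the character of interest, so that Perl-level work happens only at matches and not once per byte. The helper name find_char_offsets and the 1 MB default block size are assumptions for illustration, and the sketch handles a single-character target only (a longer target could straddle a block boundary).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the byte offsets of every occurrence of the
# single character $target, reading fixed-size blocks instead of using getc.
sub find_char_offsets {
    my ($fh, $target, $block_size) = @_;
    $block_size ||= 1 << 20;     # 1 MB blocks; an assumed, untuned default
    my @offsets;
    my $base = 0;                # file offset of the start of the current block
    while (my $n = read($fh, my $buf, $block_size)) {
        my $pos = -1;
        # index() does the scanning internally; the loop body runs per match.
        while (($pos = index($buf, $target, $pos + 1)) >= 0) {
            push @offsets, $base + $pos;
        }
        $base += $n;
    }
    return \@offsets;
}

# Usage with an in-memory filehandle and a tiny block size to exercise
# the block boundaries (sample data is made up).
my $sample = "axbxxc";
open my $in, '<', \$sample or die "Cannot open in-memory handle: $!";
my $offsets = find_char_offsets($in, 'x', 2);
close $in;
print "@$offsets\n";   # prints "1 3 4"
```

For the sliding-window case, the same block buffer could be combined with a carried-over tail of window-size minus one characters, so the trailing position is read from memory instead of from a second filehandle.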

Thanks, Travis

In reply to character-by-character in a huge file by mushnik
