character-by-character in a huge file

by mushnik (Acolyte)
on Apr 08, 2004 at 17:21 UTC ( [id://343672] )

mushnik has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks -

I'm writing a script that makes one pass through a very large (up to 3 GB) file, and needs to keep track of the first and last characters in a sliding window of constant small size (e.g. 500).

Walking the filehandle with getc(FH) (actually, two handles - one trailing the other by 500 bytes) is pretty slow. Line by line is a lot faster if I don't try to look at individual characters...but then I've got to index into the line with substr, which slows things down even more. Reading each line into a character array with split is dog slow.

The format of the file I'm hitting allows me to grab very large chunks (10-50 thousand characters each) by setting $/ = ">", because ">" appears at regular, distant intervals. But indexing into the string captured from the file in that way with substr is amazingly slow. Again, pushing the string to an array (where indexing is fast) with split is even worse.

Any suggestions about how to take advantage of the fact that Perl is perfectly fast at reading in large hunks of the file into a string, without having to index into that string with substr?

In short, can you beat the time I get reading a huge file by using:

  open (FH, "<ciona_fasta_two");
  until (eof(FH)) {
    $ch = getc(FH);
    if ($ch eq 'x'){
      #do something;
    }
  }
  close FH;
(It should be possible to beat it by *a lot*)

Thanks, Travis

Replies are listed 'Best First'.
Re: character-by-character in a huge file
by Zaxo (Archbishop) on Apr 08, 2004 at 17:51 UTC

    You can read your whole window at once by setting $/ to a reference to its constant width.

    {
        open (FH, "<ciona_fasta_two") or die $!;
        local ($/, $_) = (\500);
        while (<FH>) {
            my ($first, $last) = (substr($_,0,1), substr($_,-1,1));
            # do other stuff
        }
        close FH or die $!;
    }

    After Compline,
    Zaxo

      That's not a sliding window.

        It is with seek.

        After Compline,
        Zaxo
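
        A minimal sketch of that combination, assuming the filename and 500-byte window from the original question: read fixed-length records with $/ set to a reference to the width, then seek back 499 bytes so the next read starts one byte further on. This does one read and one seek per window position, so don't expect it to be quick; it only shows that the record trick can slide.

        use strict;
        use warnings;

        my $window = 500;
        open my $fh, '<', 'ciona_fasta_two' or die $!;
        local $/ = \$window;                  # fixed-length records
        while (my $buf = <$fh>) {
            last if length($buf) < $window;   # partial record at EOF
            my ($first, $last) = (substr($buf, 0, 1), substr($buf, -1, 1));
            # ... do something with $first and $last ...
            seek $fh, 1 - $window, 1;         # back up 499 bytes: slide by one
        }
        close $fh;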

Re: character-by-character in a huge file
by kvale (Monsignor) on Apr 08, 2004 at 18:20 UTC
    Consider the informal benchmark
    use Benchmark qw|:all|;

    my @a = (1..1_000_000);
    my $b;
    my $c = 'x' x 1_000_000;
    my $d;

    cmpthese( 5, {
        ary    => sub {
            for my $i (0..1_000_000-501) {
                $b = $a[$i] - $a[$i+500];
            }
        },
        substr => sub {
            for my $i (0..1_000_000-501) {
                $d = substr($c,$i,1) . substr($c,$i+500,1);
            }
        },
        unpack => sub { my @b = unpack 'c*', $c },
        split  => sub { my @b = split //, $c }
    } );
    On my arthritic PIII machine, I get
    Benchmark: timing 5 iterations of ary, split, substr, unpack...
           ary:  8 wallclock secs ( 8.62 usr +  0.00 sys =  8.62 CPU) @  0.58/s (n=5)
         split: 26 wallclock secs (24.93 usr +  0.33 sys = 25.26 CPU) @  0.20/s (n=5)
        substr: 11 wallclock secs (11.64 usr +  0.00 sys = 11.64 CPU) @  0.43/s (n=5)
        unpack:  7 wallclock secs ( 6.14 usr +  0.03 sys =  6.17 CPU) @  0.81/s (n=5)
           s/iter  split substr    ary unpack
    split    5.05     --   -54%   -66%   -76%
    substr   2.33   117%     --   -26%   -47%
    ary      1.72   193%    35%     --   -28%
    unpack   1.23   309%    89%    40%     --
    This says that, for the ops between elements $i and $i+500 that I picked, an array rep is faster than a string rep, but the overhead of breaking a string into an array negates that advantage. The best you could do on my machine with these algorithms and ops is to move along the string at about 500_000 chars/sec.

    Another solution that has not been mentioned yet is using a regexp. From perlretut:

    while ($dna =~ /(?=A.{500}T)./g) {
        print "Got an AT match at position ", pos $dna, "\n";
    }
    Update: Fixed regexp.
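
    A hedged variant of the same lookahead idea, capturing both ends of a 500-character window in one pass (the 498 is just the window size minus the two captured end characters; the match consumes one character per iteration, so the window slides by one):

    while ($dna =~ /(.)(?=.{498}(.))/gs) {
        my ($first, $last) = ($1, $2);   # both ends of the current window
        # pos($dna) is one past $first's position here
    }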

    -Mark

Re: character-by-character in a huge file
by Itatsumaki (Friar) on Apr 08, 2004 at 20:48 UTC

    This is a fasta-formatted sequence file, right? Your best bet for doing this might be the BioPerl library. It has both a regular fasta parser and a special indexed parser for genome-sized files. The latter works really well if you need to get subsequences from a file. For instance, I use the indexed version when I want to extract all mRNA sequences from a single chromosome.

    Here is the code for using the indexed version. It processes every mRNA on every chromosome in a typical eukaryotic genome in about 30 minutes on a P4.

    # set up chromosome sequence object
    my $file  = $chromosome.'.fa';
    my $seqio = Bio::SeqIO->new(
        -format => 'largefasta',
        -file   => $file
    );

    # set up range parameters
    my $strand     = '+';
    my $seq_start  = 1_000_000;
    my $seq_end    = 2_000_000;
    my $downstream = 10_000;

    # handle sequences on both strands
    if ($strand eq '+') {
        $seq_start = $tx_start - $upstream;
        $seq_end   = $tx_start + $downstream;
        if ($seq_start < 0)       { next(); }
        if ($seq_end > $length)   { next(); }
        $seq_mRNA = $seq_chr->trunc($seq_start,$seq_end);
    }
    elsif ($strand eq '-') {
        $seq_start = $tx_end + $upstream;
        $seq_end   = $tx_end - $downstream;
        if ($seq_end < 0)         { next(); }
        if ($seq_start > $length) { next(); }
        $seq_mRNA = $seq_chr->trunc($seq_end,$seq_start)->revcom();
    }
    else {
        warn "No strand: $strand!\n";
        next();
    }

    # write seq to a file
    my $outfile = $mRNA.'.fasta';
    open(OUT, '>', $outfile);
    print OUT ">$mRNA\n", $seq_mRNA->seq();
    close(OUT);

    And there are always the usual caveats with BioPerl -- great user support group, fairly big initial learning curve, highly object-oriented library....

    -Tats
Re: character-by-character in a huge file
by ccn (Vicar) on Apr 08, 2004 at 17:49 UTC
    $/ = \10000000;

    open (FH, "<ciona_fasta_two");
    while (<FH>) {
        my $i = -1;
        while ( -1 < ($i = index($_, 'x', $i+1)) ) {
            # do something;
        }
    }
    close FH;
Re: character-by-character in a huge file
by hardburn (Abbot) on Apr 08, 2004 at 17:43 UTC

    Try writing a quick program that reads the file in whatever way your main program does. Just read it in and nothing else. Keep track of the run time. Then compare it to the runtime of the real program. I bet there isn't a great deal of difference. The reason is that simply reading 3GB of data using any method is going to be quite slow.
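
    For instance, a minimal baseline along those lines might look like this (the filename is from the original post; the 1MB block size is an arbitrary choice):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $t0 = time;
    open my $fh, '<', 'ciona_fasta_two' or die $!;
    my $buf;
    1 while read $fh, $buf, 1 << 20;   # just read 1MB blocks and discard them
    close $fh;
    printf "raw read: %.2f seconds\n", time - $t0;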

    ----
    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

      I have to disagree; that code will execute about 14 ops per character (or about 42000000000 ops total); even the overhead of perl's op dispatch loop will be huge, much less the actual code of the ops themselves. Simulating just the basic overhead with perl -we'1 for 1..14000000000' (3 ops per loop iteration) runs about 36 minutes on my system; this is substantially longer than it would take to read through 3Gb.
Re: character-by-character in a huge file
by Anomynous Monk (Scribe) on Apr 08, 2004 at 18:17 UTC
    I think you need to say more about what you need to do with the first and last characters. Can you ignore 500-char ranges spanning the > marks? Are you looking for some literal first character? If the answers are yes, something like this should go pretty quickly, using a regex to go through the buffer one character at a time:
    $/ = ">"; while (my $line = <>) { while ($line =~ m/x(?=.{498}(.))/gs) { my $from_x_500_chars = $1; } }
    If you do try different methods suggested, it would be nice if you posted a summary of results.
Re: character-by-character in a huge file
by ambrus (Abbot) on Apr 08, 2004 at 18:51 UTC

    Just an idea, I'm not sure whether you can use it.

    If you make a copy of a large string you've read, the two copies will have separate pos-itions, thus, you can traverse both with regexps without the need for substr'ing it.

    For example:

    $string1= "hello, world\n"; $string2= $string1; pos($string1)= 4; whil +e ($string1=~/(.)/gs) {$win_right= $1; $string2=~/(.)/gs; $win_left= +$1; print " "x$n++, "$win_left...$win_right\n"}

    prints:

    h...o
     e...,
      l... 
       l...w
        o...o
         ,...r
           ...l
           w...d
            o...

    This will of course have difficulties on what to do at the boundary of two blocks you've read.

    Update: spelling. Whether, whether, whether, whether, I'll have to learn this eventually.

Re: character-by-character in a huge file
by eric256 (Parson) on Apr 08, 2004 at 17:35 UTC

    I'm a bit confused by your definition there. Do you need to remember the whole window of 500 characters, or just the first and last? Have you tried keeping an array and pushing single characters onto one end and then off the other? That seems to me to be the easiest way to keep track of a window of characters. I don't think that character-by-character comparisons on a huge file will ever be very fast. If you use an array, though, you can read by line and then split it. That way you can take advantage of Perl's file IO buffering.
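
    For what it's worth, a rough sketch of that push/shift window might look like this (window size and filename assumed from the original post; as the OP notes, split itself may turn out to be the bottleneck, so treat this as illustration only):

    use strict;
    use warnings;

    my $window = 500;
    my @win;
    open my $fh, '<', 'ciona_fasta_two' or die $!;
    while (my $line = <$fh>) {
        for my $ch (split //, $line) {
            push @win, $ch;                    # new character enters on the right
            next if @win < $window;
            my ($first, $last) = ($win[0], $win[-1]);
            # ... do something with $first and $last ...
            shift @win;                        # oldest character leaves on the left
        }
    }
    close $fh;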


    ___________
    Eric Hodges
      Have you tried keeping an array and pushing single characters onto one end and then off the other? That seems to me to be the easiest way to keep track of a window of characters.

      No.

Re: character-by-character in a huge file
by bart (Canon) on Apr 09, 2004 at 12:26 UTC
    It makes no sense to read a large file a byte at a time, on a system where you use perl. 3GB is a bit much to hold in memory at once, at least it is for me, but a buffer of a few k to a few tens of k should work really well. Also, there's no need for the two handles... use the same buffer for both.

    My idea is to fill the buffer, look for the start character, and if you find it too close to the end of the block, read some more and append it to the buffer. read() even supports this operation out of the box.

    Sample code (not tested well):

    my $windowsize = 500;
    use constant BLOCKLENGTH => 4096;

    my $buffer = "";
    my $offset = 0;
    while(my $r = read FH, $buffer, BLOCKLENGTH) {
        my $i = -1;
        until(($i = index($buffer, "x", ++$i)) < 0) {
            printf "found 'x' at %d+%d\n", $offset, $i;
            if($i+$windowsize > length $buffer) {
                # get rid of what we no longer need, or we might end up
                # with a buffer holding the whole huge file:
                $offset += $i;
                $buffer = substr $buffer, $i;
                $i = 0;
                # append a new buffer (assuming BLOCKLENGTH >= $windowsize):
                $r = read FH, $buffer, BLOCKLENGTH, length $buffer;
                last if $windowsize > length $buffer;  # not long enough
            }
            # do something here...
            printf "offset for 'x' is %d, found '%s' at %d\n",
                $offset + $i,
                substr($buffer, $i + $windowsize - 1, 1),
                $offset + $i + $windowsize - 1;
        }
    } continue {
        $offset += length $buffer;
    }

      One character each time, why not? Perl already buffers the file so getc should be faster than a lot of magic with substr (or unpack).

        It depends on the frequency of the matches. I'm convinced that one call to index() on a string of 10k, with a negative result, or just a few matches, is a lot faster than 10000 calls to getc() and the same number of eq tests. It's a matter of doing the same task in C, or in Perl. C usually wins hands down.
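
        A rough, in-memory Benchmark sketch of that claim (the buffer contents here are made up, and the per-character loop stands in for the getc-plus-eq pattern rather than calling getc on a real filehandle):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $buf = ('a' x 9_999) . 'x';     # 10k chars, a single 'x' at the end
        cmpthese( -2, {
            index_once => sub {
                my $i = index $buf, 'x';   # one scan, done in C
            },
            per_char   => sub {
                for my $i (0 .. length($buf) - 1) {
                    last if substr($buf, $i, 1) eq 'x';   # one Perl-level test per char
                }
            },
        } );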
Re: character-by-character in a huge file
by BrowserUk (Patriarch) on Apr 10, 2004 at 06:38 UTC

    Please see Optimising processing for large data files for some techniques for reducing your runtimes by 96%, if your application doesn't lend itself to using Bioperl libraries.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      Thanks for your thorough consideration of my problem. More detail than I expected to receive...which is a good thing.

      A quick aside: as you noted, it's been suggested that I use bioperl to tackle my problem. I use bioperl nearly every day in various ways... when its tools fit my need. In this case, bioperl offers nothing relevant (fasta format is trivial, and there's no facility in bioperl to "read character by character, really quickly").

      Sadly, the optimizations you've suggested don't work nearly as well for me as they do for you. The first part of the benchmark code (below) deals with this problem. For simplicity, I've removed the bit about sliding window. It's something I need to deal with, but it's not really at the heart of my problem. In this test, I just check to see how long it takes to get access to every character.

      The heart of the problem is that I need to read the value of each character in the file one at a time. The "why" is directly related to the sliding window: each time the window slides over one space, I can either recount the characters of the whole window, or just account for the character left behind by the slide and the new character being added.
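
      To make that concrete, here's a sketch of that incremental update, using a hash of character counts over the current window ($buf stands in for whatever chunk has been read; all names are illustrative):

      my $window = 500;
      my $buf    = 'ACGT' x 1_000;    # stand-in for a chunk of the file
      my %count;

      # count the characters in the first window (offsets 0 .. 499)
      $count{ substr($buf, $_, 1) }++ for 0 .. $window - 1;

      # slide one position at a time, touching only the two edge characters
      for my $i (1 .. length($buf) - $window) {
          $count{ substr($buf, $i - 1,           1) }--;   # char leaving on the left
          $count{ substr($buf, $i + $window - 1, 1) }++;   # char entering on the right
          # ... %count now describes the window starting at offset $i ...
      }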

      But accessing each individual character from a file is dreadfully slower in Perl than just slurping the file off disk. This is shown with the second part of the benchmark below (running Red Hat Linux 9.0, perl 5.8.0, testing with a 2MB file):

      #!/usr/bin/perl -sl
      use strict;
      use Benchmark qw|:all|;

      my $ch;
      my $filename = "ciona_fasta_two";
      $/ = ">";

      cmpthese( 10, {
          getc => sub {
              open (FH, "<$filename");
              until (eof(FH)) {
                  $ch = getc(FH);
              }
              close FH;
          },
          slurp_substr => sub {
              open (FH, "<$filename");
              my $i = 0;
              while ( <FH> ) {
                  while ($ch = substr($_,$i++,1)){ }
              }
              close FH;
          },
          raw_slurp_substr => sub {
              open (FH, '<:raw', "$filename");
              my $i = 0;
              while ( <FH> ) {
                  while ($ch = substr($_,$i++,1)){ }
              }
              close FH;
          },
          slurp_regex => sub {
              open (FH, "<$filename");
              while ( <FH> ) {
                  while ( /(.)/g ){ }   # the char is in $1
              }
              close FH;
          },
          raw_sysread_onechar => sub {
              open (FH, '<:raw', "$filename");
              while ( sysread( FH, $ch, 1 ) ) {}
              close FH;
          },
          nonraw_sysread_onechar => sub {
              open (FH, '<:raw', "$filename");
              while ( sysread( FH, $ch, 1 ) ) {}
              close FH;
          },
          raw_sysread_buffer => sub {
              my ($read, $buf);
              open (FH, '< :raw', "$filename");
              while ( $read = sysread( FH, $buf, 100*4096 ) ) { # faster than 1*4096
                  for ( 1 .. $read ) {
                      $ch = substr( $buf, $_, 1 );
                  }
              }
              close FH;
          },
          nonraw_sysread_buffer => sub {
              my ($read, $buf);
              open (FH, "<$filename");
              while ( $read = sysread( FH, $buf, 100*4096 ) ) {
                  for ( 1 .. $read ) {
                      $ch = substr( $buf, $_, 1 );
                  }
              }
              close FH;
          },
      } );

      cmpthese( 10, {
          slurp_substr => sub {
              open (FH, "<$filename");
              my $i = 0;
              while ( <FH> ) {
                  while ($ch = substr($_,$i++,1)){ }
              }
              close FH;
          },
          slurp_simpleregex => sub {
              my $len=0;
              open (FH, "<$filename");
              while ( <FH> ) {
                  $_ =~ /(.)$/;
              }
              close FH;
          },
          slurp_length => sub {
              my $len=0;
              open (FH, "<$filename");
              while ( <FH> ) {
                  $len += length($_);
              }
              close FH;
          },
      } );

      -----> RESULTS ----->

                              s/iter raw_sysread_onechar nonraw_sysread_onechar raw_slurp_substr  getc nonraw_sysread_buffer slurp_regex raw_sysread_buffer slurp_substr
      raw_sysread_onechar       2.97                  --                    -0%              -4%  -19%                  -46%        -51%               -53%         -70%
      nonraw_sysread_onechar    2.95                  0%                     --              -3%  -19%                  -46%        -51%               -52%         -70%
      raw_slurp_substr          2.85                  4%                     4%               --  -16%                  -44%        -49%               -51%         -69%
      getc                      2.40                 24%                    23%              19%    --                  -33%        -40%               -41%         -63%
      nonraw_sysread_buffer     1.60                 86%                    85%              79%   50%                    --        -10%               -12%         -45%
      slurp_regex               1.45                105%                   104%              97%   66%                   11%          --                -3%         -39%
      raw_sysread_buffer        1.41                111%                   110%             103%   70%                   13%          3%                --          -37%
      slurp_substr             0.886                235%                   234%             222%  171%                   80%         63%               59%           --

                            Rate slurp_substr slurp_length slurp..regex
      slurp_substr        1.14/s           --         -96%         -97%
      slurp_length        31.2/s        2634%           --         -12%
      slurp_simpleregex   35.7/s        3025%          14%           --

      As you can see from the first test, adding the "raw" option doesn't help, and sysread doesn't seem to help at all; reading one character at a time with sysread is the slowest option of the lot. In general, the best performance I can get is by slurping a bunch of content with the standard $line = <FH> syntax, then indexing into the string using substr.

      But as you can see from the 2nd benchmark, the performance of this option is actually pretty terrible. I can read the entire contents of the file (or even test each line with a regex, without assigning individual characters to an accessible variable) in 1/30th the time it takes to look at each character in the file. That's much worse than what I've seen (but haven't tested here) in C.

      I still hold out hope that there's a faster option - and based on your complete treatment of the topic in your original post, hope you'll either a) see what I'm doing wrong, or b) come up with another speedup idea.

      Thanks again, Travis

        a) see what I'm doing wrong,

        The first thing you are doing wrong is that you are comparing apples and oranges. Take your 2nd benchmark.

        cmpthese( 10, {
            slurp_substr => sub {
                open (FH, "<$filename");
                my $i = 0;
                while ( <FH> ) {
                    while ($ch = substr($_,$i++,1)){ }
                }
                close FH;
            },
            slurp_simpleregex => sub {
                my $len=0;
                open (FH, "<$filename");
                while ( <FH> ) {
                    $_ =~ /(.)$/;
                }
                close FH;
            },
            slurp_length => sub {
                my $len=0;
                open (FH, "<$filename");
                while ( <FH> ) {
                    $len += length($_);
                }
                close FH;
            },
        });
        1. Slurp_substr()

          This reads the whole file, record-by-record, and then appears to set the (global) variable $ch to each character in each record.

          But you're setting the variable $i outside the main loop, incrementing it for each char in the record, and never resetting it.

          Hence, for the second and subsequent records, $i will have whatever value it had at the end of the previous record. If the first record is longer than the rest, it will do nothing for any record other than the first.

          Both the slurp_substr() and raw_slurp_substr() routines in the 1st benchmark are similarly afflicted (a corrected version is sketched after this critique).

        2. slurp_simpleregex()

          Your regex says: put the last character of each record into $1. You're simply ignoring every character except the last in each record.

        3. slurp_length()

          This is the most mysterious of all. You read each record in the file and accumulate the lengths of those records into the variable $len.

          You never access any of the characters?

        The first rule of benchmarking is that you need to ensure that you are comparing like with like.

        The second rule is that you need to make sure that what you are benchmarking is useful and relevant to your final program.

        In these tests, you are doing neither.
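
        For illustration, a corrected slurp_substr entry for that cmpthese call might read like this (resetting $i for every record, so each record's characters are actually visited; note the loop condition would still stop early on a literal '0' character, which doesn't occur in FASTA data):

        slurp_substr => sub {
            open (FH, "<$filename");
            while ( <FH> ) {
                my $i = 0;                              # reset per record
                while ($ch = substr($_, $i++, 1)) { }   # visit every char
            }
            close FH;
        },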


        That's much worse than what I've seen (but haven't tested here) in C.

        If your point is that Perl is slower than C, you're right.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: character-by-character in a huge file
by thor (Priest) on Apr 08, 2004 at 20:46 UTC
    Perhaps I'm seeing things that aren't there, but isn't fasta some sort of gene sequencing format? If so, why re-invent the wheel?

    thor
