mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm doing something like...
while ($newline = <FILEHANDLE>) {
    if ($newline = /^>/) {
        if ($i == $number) {
            # initialize a bunch of stuff and continue
        }
        else {
            $i++;
            # and then go back to whatever the last line that =~ /^>/ and go again
        }
    }
    $stuff = $newline;
    &play_with($stuff);
}

So, as I read through a huge (500 Mb) text file line by line, each time I hit a line starting with a '>', I want to initialize a few variables, then play with the text that follows. That's easy enough.

But how do I then backtrack back up to the last line starting with '>' above my current position? And, specifically, how do I do it $i times between each '>line'? I tried a redo within the while loop, but I don't think I was doing it right.

Better explanation available if needed.

Thanks.
Matt

Re: How do I backtrack while reading a file line-by-line?
by ikegami (Patriarch) on Oct 13, 2006 at 18:59 UTC
    tell to save your spot, seek to return to it.
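
    A minimal sketch of that idea, assuming the goal is to make a fixed number of passes over each '>' record before moving on. The names replay_records, $passes and $callback are hypothetical and not from the original post; the in-memory filehandle is only there to keep the demo self-contained.

```perl
use strict;
use warnings;

# Read a file line by line and make $passes passes over every record
# that starts with a '>' line, using tell to remember where the record
# began and seek to jump back to it.
# replay_records, $passes and $callback are illustrative names.
sub replay_records {
    my ($fh, $passes, $callback) = @_;
    my $start;       # byte offset of the current '>' line
    my $pass = 0;    # which pass over the current record we are on
    while (1) {
        my $pos  = tell $fh;    # offset of the line about to be read
        my $line = <$fh>;
        my $at_boundary = !defined $line || $line =~ /^>/;

        # A boundary (next '>' line or end of file) that is not the
        # line we just jumped back to ends one pass over the record.
        if ($at_boundary && defined $start && $pos != $start) {
            if (++$pass < $passes) {
                seek $fh, $start, 0 or die "seek failed: $!";
                next;
            }
        }
        last unless defined $line;

        # A '>' line at a new offset starts a new record.
        if ($line =~ /^>/ && (!defined $start || $pos != $start)) {
            ($start, $pass) = ($pos, 0);
        }
        $callback->($pass, $line);
    }
    return;
}

# Demo on an in-memory file: two passes over each record.
my $demo_data = ">chr1\nACGT\n>chr2\nTTGA\n";
open my $demo_fh, '<', \$demo_data or die "open failed: $!";
replay_records($demo_fh, 2, sub {
    my ($pass, $line) = @_;
    print "pass $pass: $line";
});
close $demo_fh;
```

    Only one offset is kept, so memory use does not depend on how long a record is. The handle must be seekable, which rules out pipes and sockets.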
      Thanks ikegami, I hadn't seen tell or seek previously, and once I read up on them, they proved to be exactly what I was looking for.

      That subroutine works beautifully now!

      Matt

Re: How do I backtrack while reading a file line-by-line?
by grep (Monsignor) on Oct 13, 2006 at 19:35 UTC
    You can also tie the file to an array (Tie::File). This has the advantage of letting you control memory usage, and it comes with a read cache.

    Then all you need is a C-style for loop, a $save scalar, and a %done hash.

    Similar to:

    my @array = qw/ foo bar baz blah bar blah baz/;
    my $save = 0;
    my %done;
    for (my $x = 0; $x <= $#array; $x++) {
        $save = $x if ($array[$x] eq 'bar');
        print "X:$x SAVE:$save $array[$x]\n";
        if ($array[$x] eq 'blah' and !defined($done{$x})) {
            $done{$x}++;
            $x = $save;
        }
    }


    grep
    One dead unjugged rabbit fish later

      That section on memory usage is very misleading. Tie::File keeps the index of every line it has encountered (i.e. every line up to the highest one read or written) in memory. In other words, if you do $tied[-1] or push @tied, ..., the index of every line in the file is loaded into memory (if it hasn't already been loaded).

      Tie::File is still a very useful module.

        from the POD:
        memory - This is an upper limit on the amount of memory that Tie::File will consume at any time while managing the file. This is used for two things: managing the read cache and managing the deferred write buffer

        I didn't find that misleading. It says to me that only chunks of the file data are loaded into memory. In fact, I assumed that it loaded a full index of the lines at instantiation.

        If the OP knows roughly how much data an average (or the largest) backtrack covers, the read cache could be tuned for memory usage versus speed. Plus you get a layer of abstraction to hide any nastiness.
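
        For reference, a small sketch of tying a file read-only with an explicit cache limit. The mode and memory options are documented Tie::File options; tie_readonly and the 1 MB figure are hypothetical choices for illustration. As noted earlier in this sub-thread, the memory limit covers the read cache and deferred write buffer, not the index of line offsets.

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use File::Temp 'tempfile';
use Tie::File;

# Tie a file to an array read-only, capping the read cache.
# tie_readonly is an illustrative helper, not part of Tie::File.
sub tie_readonly {
    my ($path, $cache_bytes) = @_;
    tie my @lines, 'Tie::File', $path,
        mode   => O_RDONLY,       # never create or modify the file
        memory => $cache_bytes    # upper limit for the cache, in bytes
        or die "Cannot tie $path: $!";
    return \@lines;
}

# Demo: write a tiny file, then read it back through the tied array.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} ">chr1\nACGT\n>chr2\nTTGA\n";
close $tmp_fh or die "close failed: $!";

my $lines = tie_readonly($tmp_path, 1_000_000);
# Records come back without their trailing newline (autochomp is on by default).
print "second line: $lines->[1]\n";
```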



        grep
        One dead unjugged rabbit fish later
Re: How do I backtrack while reading a file line-by-line?
by madbombX (Hermit) on Oct 13, 2006 at 19:37 UTC
    Every time you come to a line that starts with '>', you could push it onto an array. Then refer back to the array each time you want to access previous lines that started with '>'. I don't know how often you come across such lines, though (it could create quite a large array).

    That being said, to add to ikegami's idea, you can use tell to find out where each such line is and push that offset onto an array. Then, when you want to go back X lines, you can always seek to the right one ($lines[-1] .. $lines[-4]).
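
    One way that might look, as a rough sketch: a first pass records the offset of every '>' line, after which any of them can be revisited with seek. The name index_headers is hypothetical, and the in-memory file is only for the demo.

```perl
use strict;
use warnings;

# Return the byte offset of every line that starts with '>'.
# index_headers is an illustrative name, not from the thread.
sub index_headers {
    my ($fh) = @_;
    my @offsets;
    my $pos = tell $fh;               # offset of the line about to be read
    while (defined(my $line = <$fh>)) {
        push @offsets, $pos if $line =~ /^>/;
        $pos = tell $fh;
    }
    return @offsets;
}

# Demo: index an in-memory file, then jump back to the last '>' line.
my $idx_data = ">chr1\nACGT\n>chr2\nTTGA\n";
open my $idx_fh, '<', \$idx_data or die "open failed: $!";
my @lines = index_headers($idx_fh);
seek $idx_fh, $lines[-1], 0 or die "seek failed: $!";
my $header = <$idx_fh>;
print "last header: $header";
close $idx_fh;
```

    With fewer than 50 '>' lines in a file, the array of offsets stays tiny even though the file itself is large.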

      Keeping your own buffer also has the advantage of working on something that's not seekable (e.g. a network socket, or a pipe from another program).

      Unfortunately, I am reading in files that contain genome data, and the lines starting with '>' correspond to the start of a new chromosome. So a ~500 Mb file will contain fewer than 50 lines starting with '>', and reading everything in between them into the buffer almost defeats the purpose of the buffer itself.

      Thanks anyway tho.
      Matt

Re: How do I backtrack while reading a file line-by-line?
by BrowserUk (Patriarch) on Oct 13, 2006 at 21:41 UTC

    Sounds very much like you're trying to read a Fasta format sequence file?

    You could use Bio::SeqIO, or if that is giving you problems you might try my crude Fasta load routine. It's the last code snippet in Re^5: Memory Usage in Regex On Large Sequence. That post/thread also shows a problem with the CPAN module, along with one reason why its performance is not so good, though that might have been fixed by now.
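
    For readers without the linked post to hand, here is a generic sketch of reading a Fasta file one record at a time by setting the input record separator to "\n>". This is not the routine from the linked post and not Bio::SeqIO; read_fasta_records is a hypothetical name. It assumes each '>' starts a line and that a whole record fits in memory, which may not be comfortable for the largest chromosomes.

```perl
use strict;
use warnings;

# Call $callback->($header, $sequence) once per Fasta record.
# read_fasta_records is an illustrative name.
sub read_fasta_records {
    my ($fh, $callback) = @_;
    local $/ = "\n>";                 # each read returns one whole record
    while (defined(my $record = <$fh>)) {
        chomp $record;                # removes a trailing "\n>" separator
        $record =~ s/^>//;            # only the first record still starts with '>'
        my ($header, @seq_lines) = split /\n/, $record;
        next unless defined $header;  # skip an empty record
        $callback->($header, join '', @seq_lines);
    }
    return;
}

# Demo on an in-memory file.
my $fa_data = ">chr1 test\nACGT\nTTGA\n>chr2\nGGCC\n";
open my $fa_fh, '<', \$fa_data or die "open failed: $!";
read_fasta_records($fa_fh, sub {
    my ($header, $seq) = @_;
    print "$header: ", length($seq), " bases\n";
});
close $fa_fh;
```

    Once a record is held in a scalar, it can be scanned as many times as needed without touching the filehandle again.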


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I am indeed trying to read Fasta files.

      In fact, I'm creating a search function that takes a variable number of user-input DNA sequences (such as amino acid motifs or transcription factor binding sites) and searches a user-specified Fasta file for all hits, either in total or within $interval bases of each other. It then outputs all the hits in both .html and .fasta format, with the matches in the .html highlighted in various colors.

      So far, I haven't read up at all on modules, so I suppose that's the next step in my Perl learning curve.

      Thanks for the pointer. I'll definitely check it out.
      Matt

Re: How do I backtrack while reading a file line-by-line?
by holli (Abbot) on Oct 13, 2006 at 19:55 UTC
    see redo in perlfunc.


    holli, /regexed monk/
Re: How do I backtrack while reading a file line-by-line?
by blazar (Canon) on Oct 14, 2006 at 10:17 UTC

    Nothing to do with your question, but...

    while ($newline = <FILEHANDLE>){

    use strict; use warnings;

    and then

    while (my $newline = <FILEHANDLE>){

    if ($newline = /^>/) {

    This is most probably not what you want, since you're assigning to $newline. You want

    if ($newline =~ /^>/) {

    instead.

    $stuff = $newline; &play_with($stuff);

    Unless play_with() modifies its argument, you may want to pass $newline directly to it, without passing through an intermediate variable. But more importantly, the &-form of sub call is now obsolete and likely not to do what one may think, so unless you do know, don't!
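
    Putting those points together, a short sketch with strict and warnings on, a lexical loop variable, the =~ binding operator, and a plain sub call. The names process_lines and $play_with are hypothetical stand-ins for the OP's code.

```perl
use strict;
use warnings;

# Hand every non-header line to a callback and count the '>' lines.
# process_lines and $play_with are illustrative names.
sub process_lines {
    my ($fh, $play_with) = @_;
    my $headers = 0;
    while (my $newline = <$fh>) {
        if ($newline =~ /^>/) {    # =~ matches against $newline;
            $headers++;            # a bare = would match $_ and overwrite $newline
            next;
        }
        $play_with->($newline);    # ordinary call syntax, no leading &
    }
    return $headers;
}

# Demo on an in-memory file.
my $pl_data = ">chr1\nACGT\n>chr2\nTTGA\n";
open my $pl_fh, '<', \$pl_data or die "open failed: $!";
my $count = process_lines($pl_fh, sub { print "sequence line: $_[0]" });
close $pl_fh;
print "headers seen: $count\n";
```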

      Thanks.

      I didn't know about the & form being deprecated, so I will get rid of that.

      And &play_with($stuff) does indeed modify $stuff, so I guess I am doing the right thing there, although I can't take credit for doing it on purpose. ;)

      Thanks for the info though.
      Matt

        Oh, heck.

        When I wrote the code block in the original post, the

        if($newline = /^>/) {

        was NOT deliberate, just a typo. In my actual code, it reads:

        if($newline =~ /^>/) {