pearly has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have two files, and I need to extract some data from one file based on the other. The data file is around 1GB. Since it is too big to read line by line, I had the idea of reading it at specific positions, using the tell and seek functions. But I am having problems getting the desired output. Here is my code.
file1:

    SID834.56 AGAAGTCGTACGATCA
    SID164.26 AGTCGATCATTATATATTCGCTAG
    SID4.56 AGCTAGCGATCGATCCCCCCCCCCCCCCCC
    SID5764.12 CGATCGATC
    SID564.12 ACGAATATGATAC

file2:

    cluster number 1: (reads count:2)
    SID834.56
    SID564.12
    cluster number 2: (reads count:2)
    SID164.26
    SID5764.12
    cluster number 3: (reads count:1)
    SID4.56

code:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    open(FH1,$ARGV[0]) or die "can not open\n";
    open(FH2,$ARGV[1]) or die "can not open\n";
    my @indx;
    while(<FH1>){
        my ($id,$seq)=split("\t",$_);
        push(@indx, "$id\t".tell FH1);
    }
    while(<FH2>){
        if($_=~m/^clus/){
            my $clushead=$_;
            print "\n$clushead";
        }
        else{
            $_=~s/\t//g;$_=~s/\n//g;
            my $tes=$_;
            my @hit=grep(/$tes/,@indx);
            my $sca="@hit";
            my ($id1,$pos)=split("\t",$sca);
            print sysseek (FH1,$pos,0),"\n" or die "seek:$!";
        }
    }

desired results:

    cluster number 1: (reads count:2)
    SID834.56 AGAAGTCGTACGATCA
    SID564.12 ACGAATATGATAC
    cluster number 2: (reads count:2)
    SID164.26 AGTCGATCATTATATATTCGCTAG
    SID5764.12 CGATCGATC
    cluster number 3: (reads count:1)
    SID4.56 AGCTAGCGATCGATCCCCCCCCCCCCCCCC

results which I get now:

    cluster number 1: (reads count:2)
    27
    146
    cluster number 2: (reads count:2)
    62
    122
    cluster number 3: (reads count:1)
    101
Why is the seek function returning the position instead of fetching the content? Thanks!

Re: seeking help for seek function
by moritz (Cardinal) on Mar 25, 2010 at 07:35 UTC
    Why do you use sysseek instead of seek? All the functions beginning with sys are for unbuffered IO, whereas readline is buffered IO. The docs warn against mixing those.
    Perl 6 - links to (nearly) everything that is Perl 6.
      seek prints 1 (true); only sysseek prints the position. Can you please tell me how I can print the line from the file if I use seek?

        If you want to know the position, use tell. If you want to set the position, use seek.
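To make the tell/seek distinction concrete, here is a minimal sketch. It assumes the sample records are tab-separated, and it uses an in-memory filehandle (an illustration only, not the poster's 1GB file) so that it runs on its own.

```perl
use strict;
use warnings;

# Two sample records held in memory (assumed tab-separated).
my $data = "SID834.56\tAGAAGTCGTACGATCA\nSID164.26\tAGTCGATCATTATATATTCGCTAG\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my $first = readline($fh);     # read the first record
my $pos   = tell($fh);         # tell reports where the file pointer is now
my $ok    = seek($fh, 0, 0);   # seek moves the pointer; it returns true, not data
my $again = readline($fh);     # only a read actually fetches the line

print "offset after first line: $pos\n";
print "seek returned: $ok\n";
print "line read after seek: $again";
```

The point is that seek's return value only says whether the move succeeded; the line itself comes from the readline that follows.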

        sysseek and seek do not print anything; it is you calling print. Those functions set the file position; they do not read the file (as others have said).

        Looking at your code - forgive me if I am wrong here - you seem to be getting the file positions from the first file and using those same positions to find records in the second file. Unless each file has corresponding records of exactly the same length, that will not work. tell and sysseek give the current byte offset in the current file; that position will not (usually) apply to another file unless it has exactly the same format.
Re: seeking help for seek function
by ikegami (Patriarch) on Mar 25, 2010 at 07:11 UTC

    seek and sysseek just move the file pointer. If you want the data that follows, you need to read it.

      I used the readline function after seek, like this:
      $buffer = readline( *FH1 ); print("$buffer");
      but it still doesn't give the right sequence.
        You use tell for SID834.56 when the file pointer is here:
        SID834.56 AGAAGTCGTACGATCA [*]SID164.26 AGTCGATCATTATATATTCGCTAG
        You want to use tell for SID834.56 when the file pointer is here:
        SID834.56 [*]AGAAGTCGTACGATCA SID164.26 AGTCGATCATTATATATTCGCTAG
        That's not easy to do, but it's easy and acceptable to use tell for SID834.56 when the file pointer is here:
        [*]SID834.56 AGAAGTCGTACGATCA SID164.26 AGTCGATCATTATATATTCGCTAG

        (i.e. before you read the line)
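The advice above (call tell before reading each line) can be sketched as follows. This is only an illustration under stated assumptions: the records are taken to be tab-separated, the data lives in an in-memory filehandle rather than the real 1GB file, and `fetch_record` and `%offset_of` are hypothetical names that do not appear in the original code.

```perl
use strict;
use warnings;

# Sample records from the question, assumed tab-separated, one per line.
my $data = join '',
    "SID834.56\tAGAAGTCGTACGATCA\n",
    "SID164.26\tAGTCGATCATTATATATTCGCTAG\n",
    "SID4.56\tAGCTAGCGATCGATCCCCCCCCCCCCCCCC\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

# Build the index: note the offset *before* reading each line,
# so it points at the start of the record rather than the next one.
my %offset_of;
while (1) {
    my $pos  = tell($fh);
    my $line = readline($fh);
    last unless defined $line;
    my ($id) = split /\t/, $line;
    $offset_of{$id} = $pos;
}

# Hypothetical helper: seek to the stored offset, then read the line there.
sub fetch_record {
    my ($id) = @_;
    return unless exists $offset_of{$id};
    seek($fh, $offset_of{$id}, 0) or die "seek failed: $!";
    my $line = readline($fh);
    chomp $line;
    return $line;
}

print fetch_record('SID164.26'), "\n";
```

Using a hash keyed by ID also avoids the grep over the whole index for every lookup, and it sticks to buffered seek and readline throughout instead of mixing in sysseek.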

Re: seeking help for seek function
by BrowserUk (Patriarch) on Mar 25, 2010 at 14:02 UTC
    the data file is around 1GB. Since, its too big to read line by line,

    I think the above is your biggest mistake. It doesn't matter how big the file is; so long as the individual lines aren't >2GB, you can read the file line by line.

    I think that all your seek/tell stuff is just a distraction from your real problem. This might be a true case of the XY problem.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi, thank you very much for your advice. As you said, I tried a different method to solve the problem and succeeded; it's very fast too :) Here is how I did it:

          #!/usr/bin/perl -w
          use strict;
          use warnings;

          open(FH1,$ARGV[0]) or die "can not open\n";
          open(FH2,$ARGV[1]) or die "can not open\n";
          my @indx;
          while(<FH1>){
              my ($id,$seq)=split("\t",$_);
              push(@indx, $id,$seq);
          }
          my %hashseq=@indx;
          while(<FH2>){
              if($_=~m/^clus/){
                  my $clushead=$_;
                  print "$clushead";
              }
              else{
                  $_=~s/\t//g;$_=~s/\n//g;
                  my $tes=$_;
                  print $tes,"\t",$hashseq{"$tes"};
              }
          }

      Thank you very much once again :) (P.S.: sorry for posting twice. I forgot to log in previously.)
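For comparison, the same hash-lookup idea might be written with lexical filehandles and three-argument open. This is a sketch rather than the poster's code: `load_sequences` and `annotate_clusters` are hypothetical helper names, the 'NOT FOUND' marker is an arbitrary choice for IDs with no sequence, and in-memory strings stand in for the two files so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read "ID<TAB>sequence" lines into a hash.
sub load_sequences {
    my ($fh) = @_;
    my %seq_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $seq) = split /\t/, $line, 2;
        $seq_for{$id} = $seq if defined $seq;
    }
    return \%seq_for;
}

# Hypothetical helper: copy cluster headers through and append each ID's sequence.
sub annotate_clusters {
    my ($fh, $seq_for) = @_;
    my $out = '';
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^cluster/) {
            $out .= "$line\n";
            next;
        }
        (my $id = $line) =~ s/\s+//g;   # strip tabs and stray whitespace
        next if $id eq '';
        my $seq = exists $seq_for->{$id} ? $seq_for->{$id} : 'NOT FOUND';
        $out .= "$id\t$seq\n";
    }
    return $out;
}

# In-memory stand-ins for the two files from the question.
my $file1 = "SID834.56\tAGAAGTCGTACGATCA\nSID564.12\tACGAATATGATAC\n";
my $file2 = "cluster number 1: (reads count:2)\n\tSID834.56\n\tSID564.12\n";
open(my $seq_fh, '<', \$file1) or die "cannot open sequences: $!";
open(my $clu_fh, '<', \$file2) or die "cannot open clusters: $!";

my $report = annotate_clusters($clu_fh, load_sequences($seq_fh));
print $report;
```

One trade-off to keep in mind: this approach holds every sequence in memory, so for a 1GB data file it needs at least that much RAM. If that becomes a problem, storing byte offsets in the hash instead of the sequences is one alternative.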