Seeking in a file

Ineffectual has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

I'm going through one or more files that contain chromosome information. I process them chromosome by chromosome, so when I reach a new chromosome in a file, I need to rewind one line so that the first line of the next chromosome is not skipped. I've tried this several different ways using FileHandle and IO::File, but none of them produces the desired output.

Here's the IO::File version of the code:

use IO::File;
use IO::Seekable;
use Data::Dumper;
use strict;
use warnings;

my @files = qw( test_input.txt);

my @fileHndls;
foreach my $file (@files) {
  if ( ! -s $file ) {
           die "File $file does not exist!  Check the command line or the pedigree file for errors!\n";
   }

   my $fh = new IO::File;

   if ( $file =~ /.*\.bz2/ ) {
           if ( $fh->open("bzcat $file |")) {
                   push @fileHndls, $fh;
           } else {
                   die "Could not uncompress file $file on "
                   . "the fly!\n";
           }
   } elsif ( $file =~ /.*\.gz/ ) {
           if ( $fh->open("gunzip -c $file |")) {
                   push @fileHndls, $fh;
           } else {
                   die "Could not uncompress file $file on "
                   . "the fly!\n";
           }
   } else {
           if ( $fh->open("< $file")) {
                   push @fileHndls, $fh;
           } else {
                   die "Could not open file $file!\n";
           }
   }
}

my @chromosomes = qw( chr1 chr2 chr3 chr4 chr5 chr6
                      chr7 chr8 chr9 chr10 chr11 chr12
                      chr13 chr14 chr15 chr16 chr17
                      chr18 chr19 chr20 chr21 chr22
                      chrX chrY chrM );

foreach my $cChrom (@chromosomes) {
  print "processing chrom $cChrom\n";
  my ($cIndex) =  grep { $chromosomes[$_] eq $cChrom } 0..$#chromosomes;
  print "filehandles for $cChrom:\n";
  print Dumper \@fileHndls;
  my $fileHndlsReturn = process($cChrom, $cIndex, \@fileHndls);
  @fileHndls = @$fileHndlsReturn;
}

exit;

sub process {
  my $currChrom = shift;
  my $currIndex = shift;
  my $fileHandlesRef = shift;
  my @fileHandles = @$fileHandlesRef;
  my @newFileHandles;

  for (my $i = 0; $i <= $#fileHandles; $i++) {
   print "  processing file $i\n";
    my $fh = $fileHandles[$i];
    print Dumper $fh;
    while (1) {
      last if ( $fh->eof() );
      $_ = <$fh>;
      next if ($_ =~ /^#|^\s*$|^>locus/ );
      print "$_";
      my @fields = split;
      if ( $fields[3] ne $currChrom ) {
        my ($chrIndex) = grep { $chromosomes[$_] eq $fields[3] } 0..$#chromosomes;
        print "chrIndex $chrIndex currIndex $currIndex\n";
        if ($chrIndex > $currIndex) {
          print "skipping rest of file because greater than current chromosome $currChrom\n";
          $fh->seek(-1,1);
          $newFileHandles[$i] = $fh;
          last;
        } else {
          print "skipping chrom cuz not current.  currChrom $currChrom  line chromosome: $fields[3]\n";
          next;
        }
      }
    } # end while
  } # end foreach fh

  return \@newFileHandles;
}

I've also uploaded this at: here
A test input file is located at: here

Thanks for your help.


Re: Seeking in a file by choroba (Cardinal) on Jun 02, 2012 at 00:05 UTC
`seek` changes the position by bytes, not lines. You can remember the position where the previous line started (see `tell`) and then seek to `$position, 0`.
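A minimal sketch of the `tell`/`seek` idea, assuming a seekable handle (a plain file, not a pipe) and the chromosome name in the fourth whitespace-separated field, as in the original post. The sub name `read_chrom` and the sample records are made up for illustration, and the sketch stops at any chromosome change rather than reproducing the index comparison from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lines for one chromosome, then rewind
# to the start of the first line that belongs to a different chromosome.
sub read_chrom {
    my ($fh, $chrom) = @_;
    my @lines;
    while (1) {
        my $pos  = tell $fh;              # byte offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;        # end of file
        next if $line =~ /^#|^\s*$|^>locus/;
        my @fields = split ' ', $line;
        if ($fields[3] ne $chrom) {
            # whence 0 means "absolute offset from the start of the file"
            seek $fh, $pos, 0 or die "seek failed: $!";
            last;
        }
        push @lines, $line;
    }
    return \@lines;
}

# Made-up sample data, read through an in-memory handle.
my $data = "1 a b chr1 100\n2 a b chr1 200\n3 a b chr2 50\n";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my $chr1 = read_chrom($fh, 'chr1');
my $chr2 = read_chrom($fh, 'chr2');
print scalar(@$chr1), " line(s) for chr1, ", scalar(@$chr2), " line(s) for chr2\n";
```

The point is that `tell` is called before each read, so the saved offset is always the start of the line that may need to be re-read.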
Re^2: Seeking in a file by spazm (Monk) on Jun 02, 2012 at 00:25 UTC
Is seeking backwards in a piped stream going to work (reliably)? This seems like a basic map-reduce paradigm. It would be nice to have a method that would "just work(tm)" for presorted data like this, something like an iterator over the group of file handles that stores a one-line buffer. Hadoop::Streaming::Reducer::Input does something like this for a single filehandle, but it's terribly clumsy code (sorry 'bout that). Hadoop::Streaming::Reducer::Input source.
Re^3: Seeking in a file by choroba (Cardinal) on Jun 02, 2012 at 00:29 UTC
I fear seeking does not work in a piped stream at all. You have to remember the last line so you can return to it later.
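One way to "remember the last line" is a reader that can take a single line back. The following is only a sketch: `make_reader`, its `next` and `unread` entries, and the sample records are hypothetical names invented here. Because it never calls `seek`, the same wrapper should also work around a handle opened on `bzcat` or `gunzip -c`; the demo uses an in-memory handle to stay self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a handle in a reader with a one-line pushback buffer.
sub make_reader {
    my ($fh) = @_;
    my $pending;                          # the single pushed-back line, if any
    return {
        next => sub {
            if (defined $pending) {
                my $line = $pending;
                undef $pending;
                return $line;             # hand back the remembered line first
            }
            return scalar <$fh>;          # otherwise read from the handle
        },
        unread => sub { $pending = shift },
    };
}

# Made-up sample data; chromosome name in the fourth field.
my $data = "1 a b chr1\n2 a b chr2\n";
open my $in, '<', \$data or die "cannot open in-memory handle: $!";
my $reader = make_reader($in);

while (defined(my $line = $reader->{next}->())) {
    my @fields = split ' ', $line;
    if ($fields[3] ne 'chr1') {
        $reader->{unread}->($line);       # keep it for the next chromosome pass
        last;
    }
    print "chr1: $line";
}
print "next pass starts with: ", $reader->{next}->();
```

Keeping one reader per input file would give the "iterator with a one-line buffer" mentioned above.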
Re: Seeking in a file by jwkrahn (Abbot) on Jun 02, 2012 at 01:17 UTC
`if ( ! -s $file ) { die "File $file does not exist!  Check the command line or the pedigree file for errors!\n"; }`

Either your error message is wrong or you are using the wrong file test operator. File existence is tested with the `-e` operator. With `-s` you are testing that the file exists and has non-zero size, so an empty file is also reported as missing.

`if ( $file =~ /.*\.bz2/ ) { ... } elsif ( $file =~ /.*\.gz/ ) {`

You are not testing that the last characters in the file name are either '.bz2' or '.gz' (the file extension); you are testing that those strings occur anywhere in the file name. You need to anchor the patterns to the end of the file name:

`if ( $file =~ /\.bz2\z/ ) { ... } elsif ( $file =~ /\.gz\z/ ) {`
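Both points can be checked with a short script. This is a sketch only: `compression_of` and the file names are made up for the demonstration, and the temporary file comes from the core File::Temp module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: classify a file name by its extension only.
sub compression_of {
    my ($file) = @_;
    return 'bzip2' if $file =~ /\.bz2\z/;   # \z anchors at the very end of the string
    return 'gzip'  if $file =~ /\.gz\z/;
    return 'none';
}

print compression_of('reads.txt.gz'), "\n";   # '.gz' is the extension
print compression_of('reads.gz.txt'), "\n";   # '.gz' only appears in the middle

# A freshly created temporary file exists but is empty,
# so -e is true while -s is false.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print "exists\n"   if -e $tmp_name;
print "is empty\n" unless -s $tmp_name;
```

With the unanchored patterns from the question, the second file name would have been sent through `gunzip -c`.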

