PerlMonks  

Seeking in a file

by Ineffectual (Scribe)
on Jun 01, 2012 at 23:41 UTC ( [id://973887]=perlquestion )

Ineffectual has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

I'm going through one or more files that contain chromosome information. I process them chromosome by chromosome, so when I reach a new chromosome in a file, I need to rewind one line so that the first line of the next chromosome isn't skipped. I've tried this several different ways using FileHandle and IO::File, but none of them produces the desired output.

Here's the IO::File version of the code:
    use IO::File;
    use IO::Seekable;
    use Data::Dumper;
    use strict;
    use warnings;

    my @files = qw( test_input.txt);
    my @fileHndls;

    foreach my $file (@files) {
        if ( ! -s $file ) {
            die "File $file does not exist! Check the command line or the pedigree file for errors!\n";
        }
        my $fh = new IO::File;
        if ( $file =~ /.*\.bz2/ ) {
            if ( $fh->open("bzcat $file |")) {
                push @fileHndls, $fh;
            } else {
                die "Could not uncompress file $file on " . "the fly!\n";
            }
        } elsif ( $file =~ /.*\.gz/ ) {
            if ( $fh->open("gunzip -c $file |")) {
                push @fileHndls, $fh;
            } else {
                die "Could not uncompress file $file on " . "the fly!\n";
            }
        } else {
            if ( $fh->open("< $file")) {
                push @fileHndls, $fh;
            } else {
                die "Could not open file $file!\n";
            }
        }
    }

    my @chromosomes = qw( chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10
                          chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19
                          chr20 chr21 chr22 chrX chrY chrM );

    foreach my $cChrom (@chromosomes) {
        print "processing chrom $cChrom\n";
        my ($cIndex) = grep { $chromosomes[$_] eq $cChrom } 0..$#chromosomes;
        print "filehandles for $cChrom:\n";
        print Dumper \@fileHndls;
        my $fileHndlsReturn = process($cChrom, $cIndex, \@fileHndls);
        @fileHndls = @$fileHndlsReturn;
    }
    exit;

    sub process {
        my $currChrom = shift;
        my $currIndex = shift;
        my $fileHandlesRef = shift;
        my @fileHandles = @$fileHandlesRef;
        my @newFileHandles;
        for (my $i = 0; $i <= $#fileHandles; $i++) {
            print " processing file $i\n";
            my $fh = $fileHandles[$i];
            print Dumper $fh;
            while (1) {
                last if ( $fh->eof() );
                $_ = <$fh>;
                next if ($_ =~ /^#|^\s*$|^>locus/ );
                print "$_";
                my @fields = split;
                if ( $fields[3] ne $currChrom ) {
                    my ($chrIndex) = grep { $chromosomes[$_] eq $fields[3] } 0..$#chromosomes;
                    print "chrIndex $chrIndex currIndex $currIndex\n";
                    if ($chrIndex > $currIndex) {
                        print "skipping rest of file because greater than current chromosome $currChrom\n";
                        $fh->seek(-1,1);
                        $newFileHandles[$i] = $fh;
                        last;
                    } else {
                        print "skipping chrom cuz not current. currChrom $currChrom line chromosome: $fields[3]\n";
                        next;
                    }
                }
            } # end while
        } # end foreach fh
        return \@newFileHandles;
    }


I've also uploaded this at: here
A test input file is located at: here

Thanks for your help.

Replies are listed 'Best First'.
Re: Seeking in a file
by choroba (Cardinal) on Jun 02, 2012 at 00:05 UTC
    seek changes the position by bytes, not lines. You can remember the position where the previous line started (see tell) and then seek back to it (seek $fh, $position, 0).
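A minimal sketch of the tell/seek approach, assuming a regular (seekable) file and the chromosome name in the fourth whitespace-separated column, as in the question's code. The sample data and the name read_chrom are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: chromosome name in the fourth column.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "a 1 2 chr1\n", "b 3 4 chr1\n", "c 5 6 chr2\n", "d 7 8 chr2\n";
close $tmp or die "Could not close $path: $!";

open my $fh, '<', $path or die "Could not open $path: $!";

# Collect the lines for $chrom. On the first line of another chromosome,
# seek back to where that line started so the next call reads it again.
sub read_chrom {
    my ($in, $chrom) = @_;
    my @lines;
    while (1) {
        my $pos  = tell $in;          # byte offset of the line about to be read
        my $line = <$in>;
        last unless defined $line;    # end of file
        my @fields = split ' ', $line;
        if ($fields[3] ne $chrom) {
            seek $in, $pos, 0 or die "seek failed: $!";   # 0 = absolute offset
            last;
        }
        push @lines, $line;
    }
    return @lines;
}

my @chr1 = read_chrom($fh, 'chr1');
my @chr2 = read_chrom($fh, 'chr2');
print "chr1: ", scalar(@chr1), " lines, chr2: ", scalar(@chr2), " lines\n";
```

This relies on the handle being seekable, so it would not apply to the bzcat and gunzip pipes opened in the question.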
      Is seeking backwards in a piped stream going to work (reliably)?

      This seems like a basic map-reduce paradigm. It would be nice to have a method that would "just work(tm)" for pre-sorted data like this.

      Something like an iterator over the group of file handles that stores a one-line buffer. Hadoop::Streaming::Reducer::Input does something like this for a single filehandle, but it's terribly clumsy code (sorry 'bout that).

      Hadoop::Streaming::Reducer::Input source.

        I fear seeking does not work in a piped stream at all. You have to remember the last line so you can return to it later.
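One way to "remember the last line" is a reader with a one-line pushback buffer, which never seeks and so also works on pipes. The sketch below is only an illustration: make_reader, next_line, unread and read_chrom_buffered are hypothetical names, and the pipe is a child perl printing made-up data rather than bzcat or gunzip.

```perl
use strict;
use warnings;

# Wrap a filehandle in a reader that can hand one line back.
sub make_reader {
    my ($fh) = @_;
    my $held;                                   # line returned via unread, if any
    return {
        next_line => sub {
            if (defined $held) {
                my $line = $held;
                undef $held;
                return $line;
            }
            return scalar <$fh>;                # undef at end of input
        },
        unread => sub { $held = shift },
    };
}

# Collect the lines for $chrom; push back the first line that belongs elsewhere.
sub read_chrom_buffered {
    my ($reader, $chrom) = @_;
    my @lines;
    while (defined(my $line = $reader->{next_line}->())) {
        my @fields = split ' ', $line;
        if ($fields[3] ne $chrom) {
            $reader->{unread}->($line);
            last;
        }
        push @lines, $line;
    }
    return @lines;
}

# Hypothetical piped input: a child perl process stands in for bzcat/gunzip.
my $script = 'print "a 1 2 chr1\n", "b 3 4 chr2\n", "c 5 6 chr2\n"';
open my $pipe, '-|', $^X, '-e', $script or die "Could not start child: $!";

my $reader = make_reader($pipe);
my @chr1 = read_chrom_buffered($reader, 'chr1');
my @chr2 = read_chrom_buffered($reader, 'chr2');
close $pipe;
print "chr1: ", scalar(@chr1), " lines, chr2: ", scalar(@chr2), " lines\n";
```

Keeping one reader per input file would carry the held line from one chromosome pass to the next, which is what the seek(-1,1) call in the question was trying to achieve.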
Re: Seeking in a file
by jwkrahn (Abbot) on Jun 02, 2012 at 01:17 UTC
    if ( ! -s $file ) { die "File $file does not exist! Check the command line or the pedigree file for errors!\n"; }

    Either your error message is wrong or you are using the wrong file test operator.    File existence is tested with the -e operator.    -s is true only if the file exists and has nonzero size, so ! -s is also true for a file that exists but is empty.
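A small sketch of how the two file tests differ, using throwaway files in a temporary directory. The file names and the helper file_state are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir     = tempdir(CLEANUP => 1);
my $missing = "$dir/missing.txt";     # never created
my $empty   = "$dir/empty.txt";       # created with zero bytes
my $filled  = "$dir/filled.txt";      # created with some content

open my $e, '>', $empty or die "Could not create $empty: $!";
close $e or die "Could not close $empty: $!";

open my $f, '>', $filled or die "Could not create $filled: $!";
print {$f} "chr1\n";
close $f or die "Could not close $filled: $!";

# Distinguish "does not exist" (-e fails) from "exists but is empty" (-s fails).
sub file_state {
    my ($file) = @_;
    return 'missing' unless -e $file;
    return 'empty'   unless -s $file;
    return 'ok';
}

for my $file ($missing, $empty, $filled) {
    printf "%-8s %s\n", file_state($file), $file;
}
```

With separate checks, the error message can say which of the two problems actually occurred.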



    if ( $file =~ /.*\.bz2/ ) { ... } elsif ( $file =~ /.*\.gz/ ) {

    You are not testing that the file name ends with '.bz2' or '.gz' (the file extension); you are testing that those strings appear anywhere in the file name.    You need to anchor the patterns to the end of the file name:

    if ( $file =~ /\.bz2\z/ ) { ... } elsif ( $file =~ /\.gz\z/ ) {
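To see the difference, the following sketch runs both sets of patterns over a few made-up file names. The helpers kind_unanchored and kind_anchored are hypothetical and only wrap the patterns quoted above.

```perl
use strict;
use warnings;

# Patterns as written in the question: match '.bz2' / '.gz' anywhere in the name.
sub kind_unanchored {
    my ($file) = @_;
    return 'bz2' if $file =~ /.*\.bz2/;
    return 'gz'  if $file =~ /.*\.gz/;
    return 'plain';
}

# Patterns anchored with \z: match only at the very end of the name.
sub kind_anchored {
    my ($file) = @_;
    return 'bz2' if $file =~ /\.bz2\z/;
    return 'gz'  if $file =~ /\.gz\z/;
    return 'plain';
}

# Hypothetical file names for illustration.
for my $file (qw( reads.gz reads.gz.txt calls.bz2.bak notes.txt )) {
    printf "%-14s unanchored=%-5s anchored=%s\n",
        $file, kind_unanchored($file), kind_anchored($file);
}
```

Names such as reads.gz.txt are classified as compressed by the unanchored patterns but as plain text by the anchored ones.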
