Dear monks,
I work in the field of bioinformatics, where Perl is the "official" language and many people with little or moderate programming knowledge have to deal with it. This "lack of programming knowledge" in some of us is a strong extra motivation for trying to make simple things simple. One of our "daily" tasks is to write small scripts that deal with the data formats that are standard in this field, and any improvement in the handling of these formats would save a lot of time when coding this kind of script.
For this reason I was wondering about the possibility of using Tie::File to manage files containing this kind of "complex" data record.
For example, suppose that we have a file containing this data (the fields are tab-separated in the real file):
    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon
    seq_2 1 80 gene
    seq_2 1 80 exon
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon
Logically, these data can be split into "seq_1 features" (lines 1-4), "seq_2 features" (lines 5-6) and "seq_3 features" (lines 7-9). To parse this, wouldn't it be nice to write something like this?
    use Tie::File;
    tie my @data, 'Tie::File', "datafile";
    print "Rec$_\n$data[$_]\n\n" for (0..$#data);
Getting the following output:
    Rec0
    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec1
    seq_2 1 80 gene
    seq_2 1 80 exon

    Rec2
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon
This kind of parsing is not possible with Tie::File, because in Tie::File you have to specify a record separator ('recsep') to read records from the file, and I can't imagine a single record separator that would do this job.
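To make the limitation concrete, here is a minimal sketch of how the stock module behaves. The temporary file and its three lines are made up for illustration; `recsep` and the default autochomp behaviour are the documented Tie::File features. Because `recsep` is a fixed string, every occurrence of it closes a record, so the best you can get is one record per line:

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical sample file: three tab-separated lines in the style of the post.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "seq_1\t1\t33\tgene\n";
print {$tmp_fh} "seq_1\t1\t20\texon\n";
print {$tmp_fh} "seq_2\t1\t80\tgene\n";
close $tmp_fh;

# recsep is a literal string, so each "\n" ends a record.
tie my @lines, 'Tie::File', $tmp_name, recsep => "\n"
    or die "Cannot tie $tmp_name: $!";

my $n_records = scalar @lines;   # one record per line, not per sequence
my $first     = $lines[0];       # trailing recsep is removed by default
print "$n_records records, first is: $first\n";

untie @lines;
```

There is no value of `recsep` that means "the first field changed", which is why the grouping has to happen somewhere else.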
Looking at the Tie::File source code you can see that the process of obtaining new records from the file is as follows:
    sub _read_record {
      my $self = shift;
      my $rec;
      { local $/ = $self->{recsep};
        my $fh = $self->{fh};
        $rec = <$fh>;
      }
      ## Rest of the subroutine...
    }
But if you modify the code to delegate the actual reading to another subroutine...:
    sub get_next_rec {
      my $self = shift;
      local $/ = $self->{recsep};
      my $fh = $self->{fh};
      my $rec = <$fh>;
      return $rec;
    }

    sub _read_record {
      my $self = shift;
      my $rec = $self->get_next_rec();
      ## Rest of the subroutine...
    }
...you can provide any other way of reading the file just by "overriding" the "sub get_next_rec" in a subclass. For example:
    package Tie::File::GFF;

    use strict;
    use warnings;
    use base "Tie::File";

    sub get_next_rec {
      ## Any way of reading records of a file, for example:
      ## group consecutive lines that share the same first field
      my $self = shift;
      my $fh = $self->{fh};
      return undef if (eof $fh);
      my ($last_seen,$last_pos,$rec);
      while (<$fh>){
        my @f = split /\t/;
        $last_seen = $f[0] if (! defined $last_seen);
        if ($f[0] eq $last_seen){
          $rec.=$_;
          ## Remember the offset just after the last line of this record
          $last_pos = int (tell $fh);
          return $rec if (eof $fh);
          next;
        } else {
          ## First line of the next group: rewind so the next call sees it
          seek $fh, $last_pos, 0;
          return $rec;
        }
      }
    }

    1;
Now you can use this new package:
    use Tie::File::GFF;
    tie my @data, 'Tie::File::GFF', "datafile";
    print "Rec$_\n$data[$_]\n\n" for (0..$#data);
Getting the expected output:
    Rec0
    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec1
    seq_2 1 80 gene
    seq_2 1 80 exon

    Rec2
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon
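If you want to try the grouping logic on its own, without patching Tie::File, the following is a minimal sketch of the same tell/seek "push back" idea. The helper name `read_group` and the in-memory sample are assumptions made for illustration only; nothing here is part of Tie::File:

```perl
use strict;
use warnings;

# Hypothetical helper: return the next block of consecutive lines that
# share the same first tab-separated field, or undef at end of file.
sub read_group {
    my ($fh) = @_;
    my ($key, $rec);
    while (1) {
        my $pos  = tell $fh;           # offset before the next line
        my $line = <$fh>;
        last unless defined $line;     # end of file
        my ($id) = split /\t/, $line;
        $key = $id unless defined $key;
        if ($id ne $key) {
            seek $fh, $pos, 0;         # push the line back for the next call
            last;
        }
        $rec .= $line;
    }
    return $rec;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = "seq_1\t1\t33\tgene\n"
           . "seq_1\t1\t20\texon\n"
           . "seq_2\t1\t80\tgene\n"
           . "seq_2\t1\t80\texon\n"
           . "seq_3\t1\t55\tgene\n";

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @groups;
while (defined(my $rec = read_group($fh))) {
    push @groups, $rec;
}
close $fh;

printf "%d groups\n", scalar @groups;
```

Recording the offset before each read, rather than after the last accepted line, keeps the rewind position valid even for the first line of a group.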
Since Tie::File is optimized for speed, you may worry that the extra function call could slow down the default behavior of the entire module. But this can be solved by checking whether the object belongs to Tie::File itself or to a subclass (such as 'Tie::File::other_package'):
    sub get_next_rec {
      my $self = shift;
      local $/ = $self->{recsep};
      my $fh = $self->{fh};
      my $rec = <$fh>;
      return $rec;
    }

    sub _read_record {
      my $self = shift;
      my $rec;
      if (ref($self) ne __PACKAGE__){
        ## Object of a subclass: use its (possibly overridden) reader
        $rec = $self->get_next_rec();
      } else {
        ## Plain Tie::File object: read inline, no extra call
        local $/ = $self->{recsep};
        my $fh = $self->{fh};
        $rec = <$fh>;
      }
      ## Rest of the sub
    }
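The class check can be tried in isolation. This is only a toy sketch: the packages `Reader::Base` and `Reader::Grouped` are hypothetical stand-ins for Tie::File and a subclass, and the returned strings just mark which path was taken:

```perl
use strict;
use warnings;

package Reader::Base;      # hypothetical stand-in for Tie::File

sub new { my ($class) = @_; return bless {}, $class }

sub get_next_rec { return 'base hook' }

sub read_record {
    my $self = shift;
    # Fast path: objects of the base class itself skip the method call.
    return 'inlined' if ref($self) eq __PACKAGE__;
    # Anything blessed into another class goes through the overridable hook.
    return $self->get_next_rec;
}

package Reader::Grouped;   # hypothetical subclass
our @ISA = ('Reader::Base');

sub get_next_rec { return 'subclass hook' }

package main;

my $plain   = Reader::Base->new->read_record;
my $grouped = Reader::Grouped->new->read_record;
print "$plain / $grouped\n";   # prints "inlined / subclass hook"
```

One consequence of this design is that a subclass that does not override the hook still pays for the method call, and the base class now contains two copies of the reading code that have to be kept in sync.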
I don't know whether you would consider this technique "good practice" or a "code smell". Any comments would be highly appreciated.
Thanks for your attention
citromatik
Replies are listed 'Best First'.

Re: RFC:Hacking Tie::File to read complex data
by Jenda (Abbot) on Jun 15, 2007 at 10:46 UTC

Re: RFC:Hacking Tie::File to read complex data
by ff (Hermit) on Jun 14, 2007 at 23:31 UTC
by citromatik (Curate) on Jun 15, 2007 at 10:29 UTC
by ff (Hermit) on Jun 15, 2007 at 12:33 UTC
by citromatik (Curate) on Jun 15, 2007 at 13:16 UTC

Re: RFC:Hacking Tie::File to read complex data
by rhesa (Vicar) on Jun 15, 2007 at 12:43 UTC
by citromatik (Curate) on Jun 15, 2007 at 13:55 UTC
by rhesa (Vicar) on Jun 15, 2007 at 15:39 UTC
by citromatik (Curate) on Jun 15, 2007 at 16:07 UTC