Dear monks,

I work in the field of bioinformatics, where Perl is the "official" language and many people with little or moderate programming knowledge have to deal with it. This "lack of programming knowledge" in some of us is a strong extra motivation for trying to make simple things simple. One of our "daily" tasks is to write small scripts that deal with the data formats that are standard in this field, and any improvement in the handling of these formats would save a lot of time when coding this kind of script.

For this reason I was wondering about the possibility of using Tie::File to manage files containing this kind of "complex" data record.

For example, suppose that we have a file containing this data:

    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon
    seq_2 1 80 gene
    seq_2 1 80 exon
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon

Logically, these data can be split into "seq_1 features" (lines 1-4), "seq_2 features" (lines 5-6) and "seq_3 features" (lines 7-9). To parse this, wouldn't it be nice to write something like the following?

    use Tie::File;

    tie my @data, 'Tie::File', "datafile";
    print "Rec", $_ + 1, "\n$data[$_]\n\n" for (0..$#data);

Getting the following output:

    Rec1
    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec2
    seq_2 1 80 gene
    seq_2 1 80 exon

    Rec3
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon

This kind of parsing is not possible with Tie::File, because in Tie::File you have to specify a record separator ('recsep') to read records from the file, and I can't imagine a single record separator that would do this job.
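To make the limitation concrete, here is a minimal sketch, assuming the sample data above has been written to a temporary file (the temporary file and the variable names are only for illustration). With the default record separator, Tie::File exposes one line per array element, so the nine lines come back as nine records rather than three groups.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write the sample data to a temporary file (illustrative only).
my ($out, $filename) = tempfile(UNLINK => 1);
print {$out} <<'END';
seq_1 1 33 gene
seq_1 1 20 exon
seq_1 21 27 exon
seq_1 28 33 exon
seq_2 1 80 gene
seq_2 1 80 exon
seq_3 1 55 gene
seq_3 1 30 exon
seq_3 31 50 exon
END
close $out or die "Cannot close $filename: $!";

# The default recsep is the line terminator, so each record is one line.
tie my @data, 'Tie::File', $filename
    or die "Cannot tie $filename: $!";

my $count = scalar @data;   # number of records Tie::File sees
my $first = $data[0];       # first record, with the separator removed

print "Records: $count\n";
print "First:   $first\n";

untie @data;
```

No fixed string separates the "seq_1" block from the "seq_2" block, which is why changing 'recsep' alone does not help here.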

Looking at the Tie::File source code you can see that the process of obtaining new records from the file is as follows:

    sub _read_record {
        my $self = shift;
        my $rec;
        {
            local $/ = $self->{recsep};
            my $fh = $self->{fh};
            $rec = <$fh>;
        }
        ## Rest of the subroutine...
    }

But if you modify the code to delegate the actual reading to another subroutine...:

    sub get_next_rec {
        my $self = shift;
        local $/ = $self->{recsep};
        my $fh  = $self->{fh};
        my $rec = <$fh>;
        return $rec;
    }

    sub _read_record {
        my $self = shift;
        my $rec  = $self->get_next_rec();
        ## Rest of the subroutine...
    }

...you can provide any other way of reading the file just by overriding get_next_rec in a subclass. For example:

    package Tie::File::GFF;

    use strict;
    use warnings;
    use base "Tie::File";

    sub get_next_rec {
        ## Any way of reading records of a file, for example:
        ## return consecutive lines that share the same first field
        my $self = shift;
        my $fh   = $self->{fh};
        return undef if (eof $fh);

        my ($last_seen, $last_pos, $rec);
        ## A lexical $line avoids clobbering the caller's $_
        while (defined(my $line = <$fh>)) {
            my @f = split /\t/, $line;
            $last_seen = $f[0] if (!defined $last_seen);
            if ($f[0] eq $last_seen) {
                $rec .= $line;
                $last_pos = tell $fh;
                return $rec if (eof $fh);
                next;
            }
            else {
                ## First line of the next group: rewind so it is read again
                seek $fh, $last_pos, 0;
                return $rec;
            }
        }
    }

    1;

Now you can use this new package:

    use Tie::File::GFF;

    tie my @data, 'Tie::File::GFF', "datafile";
    print "Rec", $_ + 1, "\n$data[$_]\n\n" for (0..$#data);

Getting the expected output:

    Rec1
    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec2
    seq_2 1 80 gene
    seq_2 1 80 exon

    Rec3
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon
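The read-ahead-and-rewind idea behind get_next_rec can also be tried without touching Tie::File at all. The following is only a sketch under some assumptions: next_group is a hypothetical helper, the data is read from an in-memory filehandle, and fields are split on whitespace rather than tabs.

```perl
use strict;
use warnings;

# Return the next block of consecutive lines sharing the same first field,
# or undef at end of file. next_group is a hypothetical helper name.
sub next_group {
    my ($fh) = @_;
    my ($key, $rec);
    my $pos = tell $fh;                # start of the line about to be read
    while (defined(my $line = <$fh>)) {
        my ($id) = split ' ', $line;   # whitespace split (an assumption)
        $key = $id unless defined $key;
        if ($id ne $key) {
            seek $fh, $pos, 0;         # un-read the first line of the next group
            last;
        }
        $rec .= $line;
        $pos = tell $fh;
    }
    return $rec;
}

my $sample = <<'END';
seq_1 1 33 gene
seq_1 1 20 exon
seq_1 21 27 exon
seq_1 28 33 exon
seq_2 1 80 gene
seq_2 1 80 exon
seq_3 1 55 gene
seq_3 1 30 exon
seq_3 31 50 exon
END

# Read the sample through an in-memory filehandle.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @groups;
while (defined(my $group = next_group($fh))) {
    push @groups, $group;
}
close $fh;

printf "Rec%d\n%s\n", $_ + 1, $groups[$_] for 0 .. $#groups;
```

Remembering the position before each read, instead of after it, keeps the rewind correct even when a group consists of a single line.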

Since Tie::File is optimized for speed, you may worry that the extra method call could slow down the default behavior of the whole module. This can be avoided by checking whether the object's class is Tie::File itself or a subclass (such as 'Tie::File::other_package'), and calling get_next_rec only for subclasses:

    sub get_next_rec {
        my $self = shift;
        local $/ = $self->{recsep};
        my $fh  = $self->{fh};
        my $rec = <$fh>;
        return $rec;
    }

    sub _read_record {
        my $self = shift;
        my $rec;
        if (ref($self) ne __PACKAGE__) {
            ## Subclass: use its (possibly overridden) reader
            $rec = $self->get_next_rec();
        }
        else {
            ## Plain Tie::File: read inline, no extra method call
            local $/ = $self->{recsep};
            my $fh = $self->{fh};
            $rec = <$fh>;
        }
        ## Rest of the sub
    }
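The class test can be illustrated in isolation. The sketch below uses two hypothetical packages, My::Reader and My::Reader::Grouped (neither is part of Tie::File), and relies on ref($self) to obtain the object's class, which gives the same answer as matching the stringified reference against a regex.

```perl
use strict;
use warnings;

package My::Reader;              # hypothetical base class

sub new { my ($class) = @_; return bless {}, $class }

# Take the inline fast path when the object is a plain My::Reader;
# dispatch to the overridable method only for subclasses.
sub read_record {
    my ($self) = @_;
    return ref($self) eq __PACKAGE__
        ? 'inline read'
        : $self->get_next_rec;
}

sub get_next_rec { return 'default read' }

package My::Reader::Grouped;     # hypothetical subclass

our @ISA = ('My::Reader');

sub get_next_rec { return 'grouped read' }

package main;

my $base_result = My::Reader->new->read_record;
my $sub_result  = My::Reader::Grouped->new->read_record;

print "$base_result\n$sub_result\n";
```

One consequence of this design is that a subclass which does not override get_next_rec still pays for the extra method call, since only the class name is tested.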

I don't know whether you would consider this technique good practice or a code smell. Any comments would be highly appreciated.

Thanks for your attention

citromatik


In reply to RFC:Hacking Tie::File to read complex data by citromatik
