Dear monks,

I work in the field of bioinformatics, where Perl is the "official" language and many people with little or moderate programming knowledge have to deal with it. This lack of programming experience is, for some of us, a strong extra motivation for trying to make simple things simple. One of our daily tasks is to write small scripts that deal with data formats that are standard in this field, and any improvement in the handling of these formats would save a lot of time in coding these kinds of scripts.

For this reason I was wondering about the possibility of using Tie::File to manage files containing this kind of "complex" records of data.

For example, suppose that we have a file containing this data:

    seq_1  1 33 gene
    seq_1  1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon
    seq_2  1 80 gene
    seq_2  1 80 exon
    seq_3  1 55 gene
    seq_3  1 30 exon
    seq_3 31 50 exon

Logically, these data split into "seq_1 features" (lines 1-4), "seq_2 features" (lines 5-6) and "seq_3 features" (lines 7-9). In order to parse this, wouldn't it be nice to write something like this...?:

    use Tie::File;

    tie my @data, 'Tie::File', "datafile";
    print "Rec", $_ + 1, "\n$data[$_]\n\n" for (0..$#data);

Getting the following output:

    Rec1
    seq_1  1 33 gene
    seq_1  1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec2
    seq_2  1 80 gene
    seq_2  1 80 exon

    Rec3
    seq_3  1 55 gene
    seq_3  1 30 exon
    seq_3 31 50 exon

This kind of parsing is not possible with Tie::File as it stands, because Tie::File requires a record separator ('recsep') to split the file into records, and no single separator string can do this job: the "separator" here is a change in the content of the first field, not a fixed string.
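To make the desired record semantics concrete, here is a minimal sketch of the grouping logic on its own, independent of Tie::File. The function name and the in-memory filehandle are my own choices for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group consecutive lines that share the same first field (the sequence id)
# into one logical record. This is the behaviour that no single $/ value can
# express, because records end when the content changes, not at a fixed string.
sub read_grouped_records {
    my ($fh) = @_;
    my ( @records, $current_id, $buffer );
    while ( my $line = <$fh> ) {
        my ($id) = split /\s+/, $line;
        if ( defined $current_id && $id ne $current_id ) {
            push @records, $buffer;    # sequence id changed: close the record
            $buffer = '';
        }
        $current_id = $id;
        $buffer .= $line;
    }
    push @records, $buffer if defined $buffer && length $buffer;
    return @records;
}

# Demo on in-memory data; open a real file the same way with open($fh, '<', $file).
my $data = "seq_1 1 33 gene\nseq_1 1 20 exon\nseq_2 1 80 gene\n";
open my $fh, '<', \$data or die $!;
my @recs = read_grouped_records($fh);
print scalar(@recs), " records\n";    # prints "2 records"
```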

Looking at the Tie::File source code you can see that the process of obtaining new records from the file is as follows:

    sub _read_record {
        my $self = shift;
        my $rec;
        {
            local $/ = $self->{recsep};
            my $fh = $self->{fh};
            $rec = <$fh>;
        }
        ## Rest of the subroutine...
    }

But if you modify the code to delegate the actual reading to another subroutine...:

    sub get_next_rec {
        my $self = shift;
        local $/ = $self->{recsep};
        my $fh  = $self->{fh};
        my $rec = <$fh>;
        return $rec;
    }

    sub _read_record {
        my $self = shift;
        my $rec  = $self->get_next_rec();
        ## Rest of the subroutine...
    }

...you can provide any other way of reading the file simply by overriding get_next_rec(). For example:

    package Tie::File::GFF;

    use strict;
    use warnings;
    use base "Tie::File";

    sub get_next_rec {
        ## Any way of reading records from a file, for example:
        my $self = shift;
        my $fh   = $self->{fh};
        return undef if eof $fh;
        my ( $last_seen, $last_pos, $rec );
        while (<$fh>) {
            my @f = split /\t/;
            $last_seen = $f[0] if !defined $last_seen;
            if ( $f[0] eq $last_seen ) {
                $rec .= $_;
                $last_pos = int( tell $fh );
                return $rec if eof $fh;
                next;
            }
            else {
                seek $fh, $last_pos, 0;
                return $rec;
            }
        }
    }

    1;

Now you can use this new package:

    use Tie::File::GFF;

    tie my @data, 'Tie::File::GFF', "datafile";
    print "Rec", $_ + 1, "\n$data[$_]\n\n" for (0..$#data);

Getting the expected output:

    Rec1
    seq_1  1 33 gene
    seq_1  1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec2
    seq_2  1 80 gene
    seq_2  1 80 exon

    Rec3
    seq_3  1 55 gene
    seq_3  1 30 exon
    seq_3 31 50 exon

Since Tie::File is optimized for speed, you may worry that the extra function call could slow down the default behavior of the entire module. But this can be avoided by checking whether the object belongs to the current package or to a subclass (such as 'Tie::File::other_package'):

    sub get_next_rec {
        my $self = shift;
        local $/ = $self->{recsep};
        my $fh  = $self->{fh};
        my $rec = <$fh>;
        return $rec;
    }

    sub _read_record {
        my $self = shift;
        $self =~ /^(.+)=/;    # extract the class name from the stringified object
        my $_caller_pack = $1;
        my $rec;
        if ( $_caller_pack ne __PACKAGE__ ) {
            $rec = $self->get_next_rec();
        }
        else {
            local $/ = $self->{recsep};
            my $fh = $self->{fh};
            $rec = <$fh>;
        }
        ## Rest of the sub
    }

I don't know whether you consider this technique "good practice" or a "code smell". Any comments would be highly appreciated.

Thanks for your attention

citromatik

Re: RFC:Hacking Tie::File to read complex data
by Jenda (Abbot) on Jun 15, 2007 at 10:46 UTC
Re: RFC:Hacking Tie::File to read complex data
by ff (Hermit) on Jun 14, 2007 at 23:31 UTC
    Well, if "any" comment is okay.... :-)

    I hear that the need is for simplification, and the following strays from that need by going "native" and abandoning the interface concept. But the need presented, and the output that seems acceptable, create an irresistible urge to write some throw-away code that can be adapted to the specific situation, ignoring the idea of Tie::File and its optimizations/ease of use...

    How about restructuring the output just a bit? That is, is the term "Rec1" as important as the fact that you are showing a bunch of data about "seq_1"? And if so, wouldn't it be nice if "seq_1" were your header and you didn't have to print "seq_1" on each line? You could go on to sort the individual 'seq' records if that makes a difference.

    seq_1:
       1  33   gene
       1  20   exon
      21  27   exon
      28  33   exon

    seq_2:
       1  80   gene
       1  80   exon

    seq_3:
       1  55   gene
       1  30   exon
      31  50   exon
    via the following snippet. Note that to read such data from a file rather than from __DATA__, just put the proper filename in $in_file, uncomment the two $INFILE lines, and comment out the while DATA line. (Untested.)
    #!/usr/bin/perl -w
    use strict;

    my %seq_info;
    my $in_file = 'datafile.txt';

    #open my $INFILE, '<', $in_file or die "Could not open '$in_file': $!\n";
    #while ( <$INFILE> )
    while ( <DATA> )
    {
        my @record = split( /\s+/, $_ );
        push @{ $seq_info{ $record[0] } }, [ @record[1..3] ];
    }

    foreach ( sort keys %seq_info ) {
        print_seq_chunk( $_, $seq_info{$_} );
        print "\n";
    }

    sub print_seq_chunk {
        my ( $seq_id, $seq_info_ar ) = @_;
        print $seq_id, ":\n";
        printf "  %3d %3d %6.6s\n", @$_ foreach @$seq_info_ar;
    }

    __DATA__
    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon
    seq_2 1 80 gene
    seq_2 1 80 exon
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon

      Hi ff. First of all, thanks a lot for the feedback! (It is not easy to read and dig deep into such long posts.) :-)

      In my root post I gave a toy example of input and output to show the concept of the interface solution. Clearly, there is no need to hack a complex module to parse the example data (in the same way that you don't need Tie::File just to read data from a file and echo it to STDOUT).

      But let's try another, slightly more interesting example where the tied interface gives a very simple solution: suppose I want to get a random record. With the interface solution you can do it with a simple Perl one-liner!:

      perl -e 'use Tie::File::GFF; tie my @arr, "Tie::File::GFF", "infile"; print $arr[int rand @arr];'

      Try to do that "natively" and let's compare the number of lines needed!

      Thanks again for your feedback!!

      Cheers

      citromatik

        I must admit, the lazy side of me did not want to do the mental hacking that your module change suggested when the required solution seemed so simple in the first place. I can imagine that you have some rather large datasets that, for performance reasons, you would rather access via iteration (Tie) than by sucking everything into memory. That gives a little better justification for fooling around with Tie.

        Where a module encapsulates several steps, and your inexperienced user knows exactly what to expect from said module, by all means use it. The trick comes when you change the rules (i.e. change the module) that the inexperienced user knows. By adapting the module for the special formats of the bioinformatics world, aren't you requiring an extra level of understanding? Whereas by being self-sufficient and learning the basics of Perl, "the 'official' language of bioinformatics", isn't the inexperienced user better positioned to handle whatever data processing need arises? Or have the minimal foundation necessary to glue in an appropriate Bio module from CPAN?

        For getting a random line, while it may be wasteful of computer resources, it's certainly straightforward to simply do:

        #open INFILE, '<', 'outfile.txt' or die "Could not open 'outfile.txt': $!\n";
        #my @seq_info = <INFILE>;
        #close INFILE;
        my @seq_info = <DATA>;
        print $seq_info[int rand ($#seq_info)];

        __DATA__
        seq_1 1 33 gene
        seq_1 1 20 exon
        seq_1 21 27 exon
        seq_1 28 33 exon
        seq_2 1 80 gene
        seq_2 1 80 exon
        seq_3 1 55 gene
        seq_3 1 30 exon
        seq_3 31 50 exon
Re: RFC:Hacking Tie::File to read complex data
by rhesa (Vicar) on Jun 15, 2007 at 12:43 UTC

    Others have already commented on your example data, so I'll limit myself to some thoughts on your "good practice" question.

    Your idea to break up a method into several smaller ones, each with a distinct task, is good practice. It encourages subclassing by making it easier to override very specific pieces of behavior. Now, whether you'd be able to get this particular refactoring accepted into Tie::File is another problem altogether, but the idea is sound :-)

    So you're definitely on the right track. It's a shame you then cave in to "premature optimization". Thinking one additional method call would ruin performance is, IMHO, misguided. I don't have a benchmark handy, but I wouldn't be surprised if Perl's built-in method dispatching would be just as fast as your caller package check (what with the regular expression and all). Besides, the overhead of one method call will likely be swamped by the IO calls (not to mention the tie interface itself).
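    rhesa's point about method-call overhead can be checked empirically. Here is a sketch of such a benchmark using the core Benchmark module; the `Reader` class and its method names are mine, chosen only to isolate the one extra method call, and the actual numbers will vary by machine:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

{
    # A minimal reader class: same readline, with and without delegation.
    package Reader;
    sub new { my ( $class, $fh ) = @_; return bless { fh => $fh }, $class }

    # Delegated version: one extra method call per record, as in the
    # proposed get_next_rec() refactoring.
    sub get_next_rec { my $self = shift; my $fh = $self->{fh}; return <$fh> }
    sub read_delegated { my $self = shift; return $self->get_next_rec }

    # "Inlined" version: readline directly in the caller.
    sub read_inline { my $self = shift; my $fh = $self->{fh}; return <$fh> }
}

my $data = "seq_1 1 33 gene\n" x 1_000;

cmpthese( 100, {
    inline => sub {
        open my $fh, '<', \$data or die $!;
        my $r = Reader->new($fh);
        1 while defined $r->read_inline;
    },
    delegated => sub {
        open my $fh, '<', \$data or die $!;
        my $r = Reader->new($fh);
        1 while defined $r->read_delegated;
    },
} );
```

    In my understanding this difference is usually dwarfed by the I/O and the tie interface itself, which is the thrust of rhesa's argument, but measuring on your own data settles it either way.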

    The way you think to solve the issue has its problems too:

    1. You break inheritance with the check on the object's class name: with your code, it's now impossible to subclass that particular method and benefit from its features.
    2. You now have code duplication: the code in the else branch belongs in the superclass, not here.

    Side note: a better way to get the object's class name would be my $pkg = ref $self;. But you should almost never do that: it's usually much better to verify what an object can do than what class it is.
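    To illustrate rhesa's side note, here is a small sketch contrasting a class-name check with a capability check via UNIVERSAL::can(). The mock class is hypothetical, used only to have something to call:

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for a class providing get_next_rec().
    package Tie::File::GFF::Mock;
    sub new { return bless {}, shift }
    sub get_next_rec { return "seq_1 1 33 gene\n" }
}

my $obj = Tie::File::GFF::Mock->new;

# Class check: brittle, because it fails for any subclass.
print "exact class match\n" if ref($obj) eq 'Tie::File::GFF::Mock';

# Capability check: works for this class, any subclass, or any other
# class that happens to provide the method.
if ( my $method = $obj->can('get_next_rec') ) {
    print $obj->$method;    # prints "seq_1 1 33 gene"
}
```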

      Hi rhesa, thanks for your comments!

      "Thinking one additional method call would ruin performance is, IMHO, misguided"

      Yes, I agreed with that... until I found some inlined methods in the Tie::File source code with comments like "inlining read_record() would make this loop five times faster".

      I've already run a benchmark and found that my inherited version is in fact noticeably faster than the original Tie::File (I suppose because lines are grouped into records, so the indexing etc. is faster). The benchmark code and results are shown below.

      "You break inheritance with the check on the object's class name: with your code, it's now impossible to subclass that particular method and benefit from its features"

      Sorry, I don't understand this point

      Thanks!

      citromatik

      Benchmark


        I found some inline methods in Tie::File source code with comments like "inlining read_record() would make this loop five times faster"

        I noticed those too. At first I thought: "It's telling that Dominus didn't actually do the inlining", and I assumed that he had good reasons for that [1]. And I imagine you are glad too that he didn't do it, or you would have had to override _fill_offsets() as well, copying most of the code. On the other hand, the last update to Tie::File was in 2003, so maybe he just didn't get around to it, and lost interest.

        Sorry, I don't understand this point [about subclassing. rr]
        I'd like to retract that point. I misread your code, and thought you had if( $_caller_pack eq __PACKAGE__ ). You use ne there, which inlines the get_next_rec only for that particular class, so that's perfectly reasonable. Had it been eq then subclasses would have gotten the inline version, and would have been unable to override get_next_rec(). I apologise for the confusion.

        Your benchmark looks impressive, but I can't tell if it's because of your special record reading code, or because of your inlining. Is it really just because of the method call overhead?

        Note 1: one reason being that _read_record() gets called in several places, so inlining it in that one spot would mean code duplication, which is always a maintenance problem.