Dear monks,

I work in the field of bioinformatics, where Perl is the "official" language and many people with little or moderate programming knowledge have to deal with it. This lack of programming experience is, for some of us, a strong extra motivation for trying to make simple things simple. One of our daily tasks is to write small scripts that deal with data formats that are standard in this field, and any improvement in the handling of these formats would save a lot of time in coding these kinds of scripts.

For this reason I was wondering about the possibility of using Tie::File to manage files containing this kind of "complex" records of data.

For example, suppose that we have a file containing this data:

    seq_1  1 33 gene
    seq_1  1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon
    seq_2  1 80 gene
    seq_2  1 80 exon
    seq_3  1 55 gene
    seq_3  1 30 exon
    seq_3 31 50 exon

Logically, these data split into "seq_1 features" (lines 1-4), "seq_2 features" (lines 5-6) and "seq_3 features" (lines 7-9). In order to parse this, wouldn't it be nice to write something like this...?:

    use Tie::File;

    tie my @data, 'Tie::File', "datafile";
    print "Rec", $_ + 1, "\n$data[$_]\n\n" for (0..$#data);

Getting the following output:

    Rec1
    seq_1  1 33 gene
    seq_1  1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec2
    seq_2  1 80 gene
    seq_2  1 80 exon

    Rec3
    seq_3  1 55 gene
    seq_3  1 30 exon
    seq_3 31 50 exon

This kind of parsing is not possible with Tie::File as it stands, because Tie::File requires a record separator ('recsep') to split the file into records, and no single separator string can do this job: the "separator" here is a change in the content of the first field, not a fixed string.
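To make the desired record semantics concrete, here is a minimal sketch of the grouping logic on its own, independent of Tie::File. The function name and the in-memory filehandle are my own choices for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group consecutive lines that share the same first field (the sequence id)
# into one logical record. This is the behaviour that no single $/ value can
# express, because records end when the content changes, not at a fixed string.
sub read_grouped_records {
    my ($fh) = @_;
    my ( @records, $current_id, $buffer );
    while ( my $line = <$fh> ) {
        my ($id) = split /\s+/, $line;
        if ( defined $current_id && $id ne $current_id ) {
            push @records, $buffer;    # sequence id changed: close the record
            $buffer = '';
        }
        $current_id = $id;
        $buffer .= $line;
    }
    push @records, $buffer if defined $buffer && length $buffer;
    return @records;
}

# Demo on in-memory data; open a real file the same way with open($fh, '<', $file).
my $data = "seq_1 1 33 gene\nseq_1 1 20 exon\nseq_2 1 80 gene\n";
open my $fh, '<', \$data or die $!;
my @recs = read_grouped_records($fh);
print scalar(@recs), " records\n";    # prints "2 records"
```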

Looking at the Tie::File source code you can see that the process of obtaining new records from the file is as follows:

    sub _read_record {
        my $self = shift;
        my $rec;
        {
            local $/ = $self->{recsep};
            my $fh = $self->{fh};
            $rec = <$fh>;
        }
        ## Rest of the subroutine...
    }

But if you modify the code to delegate the actual reading to another subroutine...:

    sub get_next_rec {
        my $self = shift;
        local $/ = $self->{recsep};
        my $fh  = $self->{fh};
        my $rec = <$fh>;
        return $rec;
    }

    sub _read_record {
        my $self = shift;
        my $rec  = $self->get_next_rec();
        ## Rest of the subroutine...
    }

...you can provide any other way of reading the file simply by overriding get_next_rec(). For example:

    package Tie::File::GFF;

    use strict;
    use warnings;
    use base "Tie::File";

    sub get_next_rec {
        ## Any way of reading records from a file, for example:
        my $self = shift;
        my $fh   = $self->{fh};
        return undef if eof $fh;
        my ( $last_seen, $last_pos, $rec );
        while (<$fh>) {
            my @f = split /\t/;
            $last_seen = $f[0] if !defined $last_seen;
            if ( $f[0] eq $last_seen ) {
                $rec .= $_;
                $last_pos = int( tell $fh );
                return $rec if eof $fh;
                next;
            }
            else {
                seek $fh, $last_pos, 0;
                return $rec;
            }
        }
    }

    1;

Now you can use this new package:

    use Tie::File::GFF;

    tie my @data, 'Tie::File::GFF', "datafile";
    print "Rec", $_ + 1, "\n$data[$_]\n\n" for (0..$#data);

Getting the expected output:

    Rec1
    seq_1  1 33 gene
    seq_1  1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon

    Rec2
    seq_2  1 80 gene
    seq_2  1 80 exon

    Rec3
    seq_3  1 55 gene
    seq_3  1 30 exon
    seq_3 31 50 exon

Since Tie::File is optimized for speed, you may worry that the extra function call could slow down the default behavior of the entire module. But this can be avoided by checking whether the object belongs to the current package or to a subclass (such as 'Tie::File::other_package'):

    sub get_next_rec {
        my $self = shift;
        local $/ = $self->{recsep};
        my $fh  = $self->{fh};
        my $rec = <$fh>;
        return $rec;
    }

    sub _read_record {
        my $self = shift;
        $self =~ /^(.+)=/;    # extract the class name from the stringified object
        my $_caller_pack = $1;
        my $rec;
        if ( $_caller_pack ne __PACKAGE__ ) {
            $rec = $self->get_next_rec();
        }
        else {
            local $/ = $self->{recsep};
            my $fh = $self->{fh};
            $rec = <$fh>;
        }
        ## Rest of the sub
    }

I don't know whether you consider this technique "good practice" or a "code smell". Any comments would be highly appreciated.

Thanks for your attention

citromatik

Re: RFC:Hacking Tie::File to read complex data
by Jenda (Abbot) on Jun 15, 2007 at 10:46 UTC
Re: RFC:Hacking Tie::File to read complex data
by ff (Hermit) on Jun 14, 2007 at 23:31 UTC
    Well, if "any" comment is okay.... :-)

    I hear that the need is for simplification, and the following strays from that need by going "native" and abandoning the interface concept. But the need presented, and the output that seems acceptable, create an irresistible urge to write some throw-away code that can be adapted to the specific situation, ignoring the idea of Tie::File and its optimizations/ease of use...

    How about restructuring the output just a bit? That is, is the term "Rec1" as important as the fact that you are showing a bunch of data about "seq_1"? And if so, wouldn't it be nice if "seq_1" were your header and you didn't have to print "seq_1" on each line? You could go on to sort the individual 'seq' records if that makes a difference.

    seq_1:
       1  33   gene
       1  20   exon
      21  27   exon
      28  33   exon

    seq_2:
       1  80   gene
       1  80   exon

    seq_3:
       1  55   gene
       1  30   exon
      31  50   exon
    via the following snippet. Note that to read such data from a file rather than from __DATA__, just put the proper filename in $in_file, uncomment the two $INFILE lines, and comment out the while DATA line. (Untested.)
    #!/usr/bin/perl -w
    use strict;

    my %seq_info;
    my $in_file = 'datafile.txt';

    #open my $INFILE, '<', $in_file or die "Could not open '$in_file': $!\n";
    #while ( <$INFILE> )
    while ( <DATA> )
    {
        my @record = split( /\s+/, $_ );
        push @{ $seq_info{ $record[0] } }, [ @record[1..3] ];
    }

    foreach ( sort keys %seq_info ) {
        print_seq_chunk( $_, $seq_info{$_} );
        print "\n";
    }

    sub print_seq_chunk {
        my ( $seq_id, $seq_info_ar ) = @_;
        print $seq_id, ":\n";
        printf "  %3d %3d %6.6s\n", @$_ foreach @$seq_info_ar;
    }

    __DATA__
    seq_1 1 33 gene
    seq_1 1 20 exon
    seq_1 21 27 exon
    seq_1 28 33 exon
    seq_2 1 80 gene
    seq_2 1 80 exon
    seq_3 1 55 gene
    seq_3 1 30 exon
    seq_3 31 50 exon

      Hi ff. First of all, thanks a lot for the feedback! (It is not easy to read and dig deep into such long posts.) :-)

      In my root post I gave a toy example of input and output to show the concept of the interface solution. Clearly, there is no need to hack a complex module to parse the example data (in the same way that you don't need Tie::File just to read data from a file and echo it to STDOUT).

      But let's try another, slightly more interesting example where the tied interface gives a very simple solution: suppose I want to get a random record. With the interface solution you can do it with a simple Perl one-liner!:

      perl -e 'use Tie::File::GFF; tie my @arr, "Tie::File::GFF", "infile"; print $arr[int rand @arr];'

      Try to do that "natively" and let's compare the number of lines needed!

      Thanks again for your feedback!!

      Cheers

      citromatik

        I must admit, the lazy side of me did not want to do the mental hacking that your module change suggested when the required solution seemed so simple in the first place. I can imagine that you have some rather large datasets that, for performance reasons, you would rather access via iteration (Tie) than by sucking everything into memory. That gives a little better justification for fooling around with Tie.

        Where a module encapsulates several steps, and your inexperienced user knows exactly what to expect from said module, by all means use it. The trick comes when you change the rules (i.e. change the module) that the inexperienced user knows. By adapting the module for the special formats of the bioinformatics world, aren't you requiring an extra level of understanding? Whereas by being self-sufficient and learning the basics of Perl, "the 'official' language of bioinformatics", isn't the inexperienced user better positioned to handle whatever data processing need arises? Or have the minimal foundation necessary to glue in an appropriate Bio module from CPAN?

        For getting a random line, while it may be wasteful of computer resources, it's certainly straightforward to simply do:

        #open INFILE, '<', 'outfile.txt' or die "Could not open 'outfile.txt': $!\n";
        #my @seq_info = <INFILE>;
        #close INFILE;
        my @seq_info = <DATA>;
        print $seq_info[int rand ($#seq_info)];

        __DATA__
        seq_1 1 33 gene
        seq_1 1 20 exon
        seq_1 21 27 exon
        seq_1 28 33 exon
        seq_2 1 80 gene
        seq_2 1 80 exon
        seq_3 1 55 gene
        seq_3 1 30 exon
        seq_3 31 50 exon
Re: RFC:Hacking Tie::File to read complex data
by rhesa (Vicar) on Jun 15, 2007 at 12:43 UTC

    Others have already commented on your example data, so I'll limit myself to some thoughts on your "good practice" question.

    Your idea to break up a method into several smaller ones, each with a distinct task, is good practice. It encourages subclassing by making it easier to override very specific pieces of behavior. Now, whether you'd be able to get this particular refactoring accepted into Tie::File is another problem altogether, but the idea is sound :-)

    So you're definitely on the right track. It's a shame you then cave in to "premature optimization". Thinking one additional method call would ruin performance is, IMHO, misguided. I don't have a benchmark handy, but I wouldn't be surprised if Perl's built-in method dispatching would be just as fast as your caller package check (what with the regular expression and all). Besides, the overhead of one method call will likely be swamped by the IO calls (not to mention the tie interface itself).
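    rhesa's point about method-call overhead can be checked empirically. Here is a sketch of such a benchmark using the core Benchmark module; the `Reader` class and its method names are mine, chosen only to isolate the one extra method call, and the actual numbers will vary by machine:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

{
    # A minimal reader class: same readline, with and without delegation.
    package Reader;
    sub new { my ( $class, $fh ) = @_; return bless { fh => $fh }, $class }

    # Delegated version: one extra method call per record, as in the
    # proposed get_next_rec() refactoring.
    sub get_next_rec { my $self = shift; my $fh = $self->{fh}; return <$fh> }
    sub read_delegated { my $self = shift; return $self->get_next_rec }

    # "Inlined" version: readline directly in the caller.
    sub read_inline { my $self = shift; my $fh = $self->{fh}; return <$fh> }
}

my $data = "seq_1 1 33 gene\n" x 1_000;

cmpthese( 100, {
    inline => sub {
        open my $fh, '<', \$data or die $!;
        my $r = Reader->new($fh);
        1 while defined $r->read_inline;
    },
    delegated => sub {
        open my $fh, '<', \$data or die $!;
        my $r = Reader->new($fh);
        1 while defined $r->read_delegated;
    },
} );
```

    In my understanding this difference is usually dwarfed by the I/O and the tie interface itself, which is the thrust of rhesa's argument, but measuring on your own data settles it either way.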

    The way you think to solve the issue has its problems too:

    1. You break inheritance with the check on the object's class name: with your code, it's now impossible to subclass that particular method and benefit from its features.
    2. You now have code duplication: the code in the else branch belongs in the superclass, not here.

    Side note: a better way to get the object's class name would be my $pkg = ref $self;. But you should almost never do that: it's usually much better to verify what an object can do than what class it is.
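    To illustrate rhesa's side note, here is a small sketch contrasting a class-name check with a capability check via UNIVERSAL::can(). The mock class is hypothetical, used only to have something to call:

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in for a class providing get_next_rec().
    package Tie::File::GFF::Mock;
    sub new { return bless {}, shift }
    sub get_next_rec { return "seq_1 1 33 gene\n" }
}

my $obj = Tie::File::GFF::Mock->new;

# Class check: brittle, because it fails for any subclass.
print "exact class match\n" if ref($obj) eq 'Tie::File::GFF::Mock';

# Capability check: works for this class, any subclass, or any other
# class that happens to provide the method.
if ( my $method = $obj->can('get_next_rec') ) {
    print $obj->$method;    # prints "seq_1 1 33 gene"
}
```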

      Hi rhesa, thanks for your comments!

      "Thinking one additional method call would ruin performance is, IMHO, misguided"

      Yes, I agreed with that... until I found some inlined methods in the Tie::File source code with comments like "inlining read_record() would make this loop five times faster".

      I've already run a benchmark and found that my inherited version is in fact noticeably faster than the original Tie::File (I suppose because lines are grouped into records, so the indexing etc. is faster). The benchmark code and results are shown below.

      "You break inheritance with the check on the object's class name: with your code, it's now impossible to subclass that particular method and benefit from its features"

      Sorry, I don't understand this point

      Thanks!

      citromatik

      Benchmark


        I found some inline methods in Tie::File source code with comments like "inlining read_record() would make this loop five times faster"

        I noticed those too. At first I thought: "It's telling that Dominus didn't actually do the inlining", and I assumed that he had good reasons for that [1]. And I imagine you are glad too that he didn't do it, or you would have had to override _fill_offsets() as well, copying most of the code. On the other hand, the last update to Tie::File was in 2003, so maybe he just didn't get around to it, and lost interest.

        Sorry, I don't understand this point [about subclassing. rr]
        I'd like to retract that point. I misread your code, and thought you had if( $_caller_pack eq __PACKAGE__ ). You use ne there, which inlines the get_next_rec only for that particular class, so that's perfectly reasonable. Had it been eq then subclasses would have gotten the inline version, and would have been unable to override get_next_rec(). I apologise for the confusion.

        Your benchmark looks impressive, but I can't tell if it's because of your special record reading code, or because of your inlining. Is it really just because of the method call overhead?

        Note 1: one reason being that _read_record() gets called in several places, so inlining it in that one spot would mean code duplication, which is always a maintenance problem.