Dear monks,

I work in the field of bioinformatics, where Perl is the "official" language and many people with little or moderate programming knowledge have to deal with it. This "lack of programming knowledge" in some of us is a strong extra motivation for trying to make simple things simple. One of our "daily" tasks is to write small scripts that deal with the data formats that are standard in this field, and any improvement in the handling of these formats would save a lot of time when coding this kind of script.

For this reason I was wondering about the possibility of using Tie::File to manage files containing these "complex" data records.

For example, suppose that we have a file containing this data:

seq_1    1    33    gene
seq_1    1    20    exon
seq_1    21    27    exon
seq_1    28    33    exon
seq_2    1    80    gene
seq_2    1    80    exon
seq_3    1    55    gene
seq_3    1    30    exon
seq_3    31    50    exon

Logically, these data can be split into "seq_1 features" (lines 1-4), "seq_2 features" (lines 5-6) and "seq_3 features" (lines 7-9). To parse this, wouldn't it be nice to write something like this?

use Tie::File;

tie my @data, 'Tie::File', "datafile";
print "Rec", $_ + 1, "\n$data[$_]\n\n" for 0 .. $#data;

Getting the following output:

Rec1
seq_1    1    33    gene
seq_1    1    20    exon
seq_1    21    27    exon
seq_1    28    33    exon

Rec2
seq_2    1    80    gene
seq_2    1    80    exon

Rec3
seq_3    1    55    gene
seq_3    1    30    exon
seq_3    31    50    exon

This kind of parsing is not possible with Tie::File, because in Tie::File you have to specify a record separator ('recsep') to read records from the file, and I can't imagine a single record separator that would do this job.
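
To make the target concrete, here is a minimal sketch of the grouping I have in mind, written without Tie::File at all. The helper name group_by_first_field is only for illustration, the sample data is a shortened copy of the file above, and the sketch splits on whitespace rather than tabs. It reads the input sequentially into memory, so it gives up the random access that makes Tie::File attractive.

```perl
use strict;
use warnings;

# Hypothetical helper: group consecutive lines that share the same first field.
sub group_by_first_field {
    my ($fh) = @_;
    my (@records, $current_id);
    while (defined(my $line = <$fh>)) {
        my ($id) = split ' ', $line;
        next unless defined $id;    # skip blank lines
        if (!defined $current_id || $id ne $current_id) {
            push @records, '';      # first field changed: start a new record
            $current_id = $id;
        }
        $records[-1] .= $line;
    }
    return @records;
}

my $sample = <<'END';
seq_1 1 33 gene
seq_1 1 20 exon
seq_2 1 80 gene
seq_2 1 80 exon
seq_3 1 55 gene
END

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @records = group_by_first_field($fh);
print "Rec", $_ + 1, "\n$records[$_]\n" for 0 .. $#records;
```

No single literal separator string gives these boundaries, because a record ends wherever the first field changes.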

Looking at the Tie::File source code you can see that the process of obtaining new records from the file is as follows:

sub _read_record {
  my $self = shift;
  my $rec;
  { local $/ = $self->{recsep};
    my $fh = $self->{fh};
    $rec = <$fh>;
  }

## Rest of the subroutine...

}

But if you modify the code to delegate the actual reading to another subroutine...:

sub get_next_rec
  {
    my $self = shift;
    local $/ = $self->{recsep};
    my $fh = $self->{fh};
    my $rec = <$fh>;
    return $rec;
  }

sub _read_record {
  my $self = shift;

  my $rec = $self->get_next_rec();

## Rest of the subroutine...

}

...you can provide any other way of reading the file just by overriding get_next_rec in a subclass. For example:

package Tie::File::GFF;

use strict;
use warnings;
use base "Tie::File";

sub get_next_rec
  {

  ## Any way of reading records of a file, for example:

    my $self = shift;
    my $fh = $self->{fh};
    return if eof $fh;

    # Collect consecutive lines that share the same first (tab-separated) field.
    my ($last_seen, $rec);
    my $last_pos = tell $fh;
    while (defined(my $line = <$fh>)) {
      my ($id) = split /\t/, $line;
      $last_seen = $id unless defined $last_seen;
      if ($id ne $last_seen) {
        # This line starts the next record: rewind so the next call rereads it.
        seek $fh, $last_pos, 0;
        last;
      }
      $rec .= $line;
      $last_pos = tell $fh;
    }
    return $rec;
  }

1;
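
The reader above depends on tell and seek to "unread" the first line of the next record. That idiom can be isolated as a small sketch; the helper name peek_line is hypothetical and not part of Tie::File, and an in-memory filehandle stands in for the data file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the next line without consuming it.
sub peek_line {
    my ($fh) = @_;
    my $pos  = tell $fh;                          # remember the current offset
    my $line = <$fh>;                             # read ahead one line
    seek $fh, $pos, 0 or die "Cannot seek: $!";   # rewind to the remembered offset
    return $line;
}

my $text = "seq_1\t1\t33\tgene\nseq_2\t1\t80\tgene\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $peeked = peek_line($fh);   # look at the first line
my $read   = <$fh>;            # the same line is still there to be read
my $next   = <$fh>;            # reading then continues normally
print "peeked and read the same line\n" if $peeked eq $read;
```

This only works on seekable filehandles, so it would not help when reading from a pipe or a socket.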

Now you can use this new package:

use Tie::File::GFF;

tie my @data, 'Tie::File::GFF', "datafile";
print "Rec", $_ + 1, "\n$data[$_]\n\n" for 0 .. $#data;

Getting the expected output:

Rec1
seq_1    1    33    gene
seq_1    1    20    exon
seq_1    21    27    exon
seq_1    28    33    exon

Rec2
seq_2    1    80    gene
seq_2    1    80    exon

Rec3
seq_3    1    55    gene
seq_3    1    30    exon
seq_3    31    50    exon

Since Tie::File is optimized for speed, you may worry that the extra method call could slow down the default behavior of the whole module. This can be avoided by checking whether the object belongs to Tie::File itself or to a subclass (some 'Tie::File::other_package'), and calling get_next_rec only in the latter case:

sub get_next_rec
  {
    my $self = shift;
    local $/ = $self->{recsep};
    my $fh = $self->{fh};
    my $rec = <$fh>;
    return $rec;
  }

sub _read_record {
  my $self = shift;

  my $rec;
  if (ref($self) ne __PACKAGE__) {
    # Object of a subclass: let it supply its own reader.
    $rec = $self->get_next_rec();
  } else {
    # Plain Tie::File object: read inline and skip the extra method call.
    local $/ = $self->{recsep};
    my $fh = $self->{fh};
    $rec = <$fh>;
  }

## Rest of the sub

}
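
A possible variant of this check, sketched here with toy classes rather than the real Tie::File internals, asks whether get_next_rec is actually overridden instead of comparing package names. Toy::Reader and Toy::Reader::Pairs are hypothetical names invented for this example; the only assumption carried over from Tie::File is an object holding fh and recsep.

```perl
use strict;
use warnings;

package Toy::Reader;    # hypothetical stand-in for Tie::File

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, recsep => "\n" }, $class;
}

sub get_next_rec {
    my $self = shift;
    local $/ = $self->{recsep};
    my $fh = $self->{fh};
    return scalar <$fh>;
}

sub _read_record {
    my $self = shift;
    # Call the method only if a subclass really overrides get_next_rec.
    if ($self->can('get_next_rec') != \&Toy::Reader::get_next_rec) {
        return $self->get_next_rec();
    }
    # Default path: read inline, no extra method call.
    local $/ = $self->{recsep};
    my $fh = $self->{fh};
    return scalar <$fh>;
}

package Toy::Reader::Pairs;    # hypothetical subclass: two lines per record
our @ISA = ('Toy::Reader');

sub get_next_rec {
    my $self = shift;
    my $fh = $self->{fh};
    return if eof $fh;
    my $rec    = <$fh>;
    my $second = <$fh>;
    $rec .= $second if defined $second;
    return $rec;
}

package main;

# Read every record from $text through the given reader class.
sub read_all {
    my ($class, $text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my $reader = $class->new($fh);
    my @recs;
    while (defined(my $rec = $reader->_read_record)) { push @recs, $rec }
    return @recs;
}

my $text  = "a\nb\nc\nd\n";
my @plain = read_all('Toy::Reader', $text);
my @pairs = read_all('Toy::Reader::Pairs', $text);
printf "%d plain records, %d paired records\n", scalar @plain, scalar @pairs;
```

With this variant, a subclass that inherits get_next_rec unchanged still takes the inline path. The can lookup itself has a cost, though, so whether it beats the plain method call would need benchmarking.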

I don't know whether you would consider this technique "good practice" or a "code smell". Any comments would be highly appreciated.

Thanks for your attention

citromatik

In reply to RFC:Hacking Tie::File to read complex data by citromatik
