Re: RFC:Hacking Tie::File to read complex data

Well, if "any" comment is okay.... :-)

I hear that the need is for simplification, and the following strays from that need by going "native" and abandoning the interface concept. But the need as presented, and the output that seems acceptable, create an irresistible urge to write some throw-away code that can be adapted for the specific situation, ignoring the idea of Tie::File and its optimizations and ease of use...

How about restructuring the output just a bit? That is, is the term "Rec1" as important as the fact that you are showing a bunch of data about "seq_1"? And if so, wouldn't it be nice if "seq_1" were your header and you didn't have to print "seq_1" on each line? You could go on to sort the individual 'seq' records if that makes a difference.

seq_1:
    1    33     gene
    1    20     exon
   21    27     exon
   28    33     exon

seq_2:
    1    80     gene
    1    80     exon

seq_3:
    1    55     gene
    1    30     exon
   31    50     exon

via the following snippet. Note that to access a file of such data rather than use __DATA__, you just need to put the proper filename in for $in_file and then uncomment the two $INFILE lines and comment out the while DATA line. (Untested.)

#!/usr/bin/perl -w
use strict;

my %seq_info;
my $in_file = 'datafile.txt';
#open my $INFILE, '<', $in_file or die "Could not open '$in_file':  $!\n";

#while ( <$INFILE> ) 
while ( <DATA> ) 
{
    my @record = split( /\s+/, $_ );
    push @{ $seq_info{ $record[0] }  }, [ @record[1..3] ];
}

foreach ( sort keys %seq_info ) {
    print_seq_chunk( $_, $seq_info{$_} );
    print "\n";
}

sub print_seq_chunk {
    my ( 
        $seq_id,
        $seq_info_ar
            ) = @_;

    print $seq_id, ":\n";
    printf "  %3d   %3d   %6.6s\n", @$_
        foreach @$seq_info_ar;
}


__DATA__
seq_1    1    33    gene
seq_1    1    20    exon
seq_1    21    27    exon
seq_1    28    33    exon
seq_2    1    80    gene
seq_2    1    80    exon
seq_3    1    55    gene
seq_3    1    30    exon
seq_3    31    50    exon
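The "sort the individual 'seq' records" idea mentioned earlier could be sketched roughly as follows. This is only a sketch: sort_records is a made-up helper name, and the ordering (start ascending, then end descending, so that an enclosing gene is listed before its exons) is my assumption about what a useful sort would be here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the snippet above): returns a new array
# ref with one sequence's records ordered by start coordinate ascending,
# then by end coordinate descending. Each record is [ start, end, type ].
# The tie-break rule is an assumption; adjust it to taste.
sub sort_records {
    my ($records_ar) = @_;
    return [
        sort { $a->[0] <=> $b->[0] or $b->[1] <=> $a->[1] } @$records_ar
    ];
}

# Deliberately shuffled sample records for one sequence.
my %seq_info = (
    seq_1 => [ [ 21, 27, 'exon' ], [ 1, 20, 'exon' ], [ 1, 33, 'gene' ] ],
);

foreach my $seq_id ( sort keys %seq_info ) {
    print $seq_id, ":\n";
    printf "  %3d   %3d   %6.6s\n", @$_
        foreach @{ sort_records( $seq_info{$seq_id} ) };
}
```

In the original snippet, print_seq_chunk could call such a helper on $seq_info_ar before its printf loop.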


Re^2: RFC:Hacking Tie::File to read complex data by citromatik (Curate) on Jun 15, 2007 at 10:29 UTC
Hi ff.

First of all, thanks a lot for the feedback! (It is not easy to read and dig into such long posts.) :-)

In my root post I gave a dummy example of input and output to show the concept of the interface solution. Clearly, there is no need to hack a complex module to parse the example data (in the same way that you don't need Tie::File to simply read data from a file and output it to STDOUT).

But let's try another, slightly more interesting example where the tied interface gives a very simple solution: suppose that I want to get a random record. With the interface solution you can do it even with a simple Perl one-liner:

perl -e 'use Tie::File::GFF; tie my @arr, "Tie::File::GFF", "infile"; print $arr[rand @arr];'

Try to do it "native" and let's compare the number of lines needed!

Thanks again for your feedback!!

Cheers

citromatik
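For comparison, here is a rough sketch of the same "random record" idea using the stock Tie::File module (one line per record) rather than the proposed Tie::File::GFF. The helper name random_record and the temporary sample file are assumptions made only so the example runs on its own. Note that Tie::File has to scan the file to learn how many records it holds.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use File::Temp qw(tempfile);
use Tie::File;

# Hypothetical helper: returns one randomly chosen line of $path,
# or undef if the file is empty. Tie::File strips the line ending
# by default (autochomp).
sub random_record {
    my ($path) = @_;
    tie( my @lines, 'Tie::File', $path, mode => O_RDONLY )
        or die "Could not tie '$path': $!\n";
    # "rand @lines" picks from 0 up to (but not including) the record
    # count, so every record, including the last, can be selected.
    my $record = @lines ? $lines[ rand @lines ] : undef;
    untie @lines;
    return $record;
}

# Demo on a throw-away file holding a few of the sample records.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} "seq_1 1 33 gene\nseq_1 1 20 exon\nseq_2 1 80 gene\n";
close $fh;

print random_record($path), "\n";
```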
Re^3: RFC:Hacking Tie::File to read complex data by ff (Hermit) on Jun 15, 2007 at 12:33 UTC
I must admit, the lazy side of me did not want to do the mental hacking that your module change suggested when the required solution seemed so simple in the first place. I can imagine that you have some rather large datasets that, for performance reasons, you would rather access via iteration (Tie) than by sucking everything into memory. That gives a little better justification for fooling around with Tie.

Where a module encapsulates several steps, and your inexperienced user knows exactly what to expect from said module, by all means use it. The trick comes when you change the rules (i.e. change the module) that the inexperienced user knows. By adapting the module for the special formats of the bioinformatics world, aren't you requiring an extra level of understanding? Whereas by being self-sufficient and learning the basics of Perl, "the 'official' language of bioinformatics", isn't the inexperienced user better positioned to handle whatever data processing need arises? Or to have the minimal foundation necessary to glue in an appropriate Bio module from CPAN?

For getting a random line, while it may be wasteful of computer resources, it's certainly straightforward to simply do:

#open INFILE, '<', 'outfile.txt' or die "Could not open 'outfile.txt': $!\n";
#my @seq_info = <INFILE>;
#close INFILE;
my @seq_info = <DATA>;
print $seq_info[ rand @seq_info ];

__DATA__
seq_1    1    33    gene
seq_1    1    20    exon
seq_1    21    27    exon
seq_1    28    33    exon
seq_2    1    80    gene
seq_2    1    80    exon
seq_3    1    55    gene
seq_3    1    30    exon
seq_3    31    50    exon
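If reading the whole file into memory is the worry, a single pass that keeps only the current pick would also do. The sketch below uses the usual "replace the pick with probability 1/N at line N" trick (reservoir sampling with a reservoir of one); random_line is a made-up name, and the in-memory filehandle merely stands in for a real data file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads $fh once and returns one line chosen uniformly
# at random, holding only a single line in memory. Returns undef on empty
# input. The Nth line read replaces the current pick with probability 1/N.
sub random_line {
    my ($fh) = @_;
    my ( $pick, $count ) = ( undef, 0 );
    while ( my $line = <$fh> ) {
        $count++;
        $pick = $line if rand($count) < 1;
    }
    return $pick;
}

# Demo: an in-memory filehandle stands in for the real input file.
my $sample = "seq_1 1 33 gene\nseq_1 1 20 exon\nseq_2 1 80 gene\n";
open my $in, '<', \$sample or die "Could not open in-memory file: $!\n";
print random_line($in);
close $in;
```

This trades the slurp for one sequential read, at the cost of calling rand once per line.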
Re^4: RFC:Hacking Tie::File to read complex data by citromatik (Curate) on Jun 15, 2007 at 13:16 UTC
<quote>By adapting the module for the special formats of the bioinformatics world, aren't you requiring an extra level of understanding?</quote>

I don't think so. Using an array to interface with a file by records is something that I find very easy to understand and to work with.

<quote>Whereas by being self-sufficient and learning the basics of Perl, "the 'official' language of bioinformatics", isn't the inexperienced user better positioned to handle whatever data processing need arises?</quote>

Sure, I totally agree, but you can (for example) use an object-oriented module without knowing anything about object orientation. This doesn't mean that you won't do your work better if you know the basics of object orientation. What I mean is that I find this interface very simple to use (like the Tie::File module itself), but that doesn't mean I shouldn't learn other ways of doing it.

BTW, I find that many people working in bioinformatics only want to learn enough Perl to make things work (my boss, for example :) ).

Thanks for your comments!

citromatik