http://qs1969.pair.com?node_id=762080

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have two files. A 'sequence' file, which looks like this.
>ONE
IRLA
>TWO
REFT
>THREE
HTED
and a set of corresponding 'co-ordinate' files. The name of the co-ordinate file that corresponds to a particular sequence in the sequence file is given on the '>' line directly above that sequence. Each of these files is made up of numbered entries. Some co-ordinate files start with entry number 1, but not all. The one below (the co-ordinate file named 'ONE'), for example, starts with an entry numbered 12.
12 14.620 35.834 -16.759 1.00 11.04
13 15.922 36.983 -19.044 1.00 11.22
14 14.326 37.148 -21.240 1.00 11.40
15 11.528 38.248 -23.343 1.00 12.44
This corresponds to the first sequence in the sequence file
IRLA
I need to assign each letter in that sequence to the corresponding entry in the corresponding co-ordinate file. You read the sequence from left to right, with each letter mapped to a numbered entry in the co-ordinate file. So the first letter in the sequence, 'I', has to be mapped to the first entry in the co-ordinate file, which in this case is entry number 12; the second letter, 'R', corresponds to the second entry, which in this case is entry 13. The resulting text file should look like this
I 12
R 13
L 14
A 15
I'm not sure what the best way of going about writing a Perl script for this is. Any advice/hints much appreciated. The real examples are obviously far larger than the test case presented here.

Replies are listed 'Best First'.
Re: mapping data between files
by MidLifeXis (Monsignor) on May 05, 2009 at 21:33 UTC

    It looks like you have a few steps to do to solve this.

    • Identify the line from the sequence file, and split it into its component pieces
    • Read the same number of lines from the co-ordinate file and assign one to each of the component pieces
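
    A rough sketch of those two steps, assuming the co-ordinate lines have already been read into an array and that the entry number is always the first whitespace-separated field (pair_letters is a hypothetical name, not something from a module):

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: pair each letter of a sequence with the entry
    # number found in the first column of the matching co-ordinate line.
    sub pair_letters {
        my ($sequence, @coord_lines) = @_;
        my @letters = split //, $sequence;    # one element per letter
        my @pairs;
        for my $i (0 .. $#letters) {
            # split ' ' ignores leading whitespace; field 0 is the entry number
            my ($number) = split ' ', $coord_lines[$i];
            push @pairs, "$letters[$i] $number";
        }
        return @pairs;
    }

    # The test data from the question
    my @coord_lines = (
        '12 14.620 35.834 -16.759 1.00 11.04',
        '13 15.922 36.983 -19.044 1.00 11.22',
        '14 14.326 37.148 -21.240 1.00 11.40',
        '15 11.528 38.248 -23.343 1.00 12.44',
    );
    print "$_\n" for pair_letters('IRLA', @coord_lines);
    ```

    This assumes there are at least as many co-ordinate lines as letters; it does no checking of its own.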

    The problem seems to be ill defined. For example, you refer to "entry number 1 but not all". Do you mean "entry number ONE but not all"? What happens in the case of "not all"? Is there a problem domain into which this problem fits? If so, there may be a module that reads these types of files already.

    Update: This could also be afternoon fog. Is the information on the lines like >ONE indicating the name of a file to read?

    --MidLifeXis

    The tomes, scrolls etc are dusty because they reside in a dusty old house, not because they're unused. --hangon in this post

      Thanks for your quick response! >ONE etc does indeed indicate the name of a file to read

      For example - the 'ONE' co-ordinate file corresponds to the sequence IRLA in the sequence file

      So ... in the above test case you have one sequence file but 3 corresponding co-ordinate files
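
      To make that correspondence concrete, here is a minimal sketch of reading the sequence file into name/sequence pairs. It assumes every header line starts with '>' followed by the file name, and that a sequence may run over several lines; read_sequences is a made-up name, and the in-memory filehandle only stands in for the real file.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: read a sequence file from an open filehandle and
      # return a list of [name, sequence] pairs in file order.
      sub read_sequences {
          my ($fh) = @_;
          my (@records, $current);
          while (my $line = <$fh>) {
              chomp $line;
              next unless length $line;
              if ($line =~ /^>(\S+)/) {
                  $current = [$1, ''];       # header line: start a new record
                  push @records, $current;
              }
              elsif ($current) {
                  $current->[1] .= $line;    # sequence line: append to the record
              }
          }
          return @records;
      }

      # In-memory stand-in for the real sequence file
      my $data = ">ONE\nIRLA\n>TWO\nREFT\n>THREE\nHTED\n";
      open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
      for my $record (read_sequences($fh)) {
          my ($name, $sequence) = @$record;
          print "co-ordinate file '$name' goes with sequence '$sequence'\n";
      }
      close $fh;
      ```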

Re: mapping data between files
by citromatik (Curate) on May 06, 2009 at 08:04 UTC

    If I understood correctly, you want something like:

    • Read and parse the sequences file
    • For each sequence, open a file that is named after the sequence header
    • From that file, get the first number of each line
    • Associate each letter in the sequence with those numbers
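    use strict;
    use warnings;

    my ($seqfile) = @ARGV;    # $seqfile is the file with the sequences
    open my $seqfh, "<", $seqfile or die "Cannot open $seqfile: $!";

    {
        local $/ = "\n>";     # read one header-plus-sequence record at a time
        while (my $nextseq = <$seqfh>) {
            chomp $nextseq;                           # strip the trailing "\n>" separator
            substr($nextseq, 0, 1, "") if $. == 1;    # strip the leading ">" of the first record
            my ($name, @seq) = split /\n/, $nextseq;
            my @aas    = split //, join "", @seq;     # one element per letter
            my @coords = getCoords($name);
            print "$aas[$_] $coords[$_]\n" for 0 .. $#aas;
        }
    }
    close $seqfh;

    # Return the first field (the entry number) of every line in a co-ordinate file
    sub getCoords {
        my ($fname) = @_;
        local $/ = "\n";
        my @ns;
        open my $fh, "<", $fname or die "Cannot open $fname: $!";
        while (my $line = <$fh>) {
            chomp $line;
            next unless $line =~ /\S/;          # skip blank lines
            push @ns, (split ' ', $line)[0];    # split ' ' skips leading whitespace
        }
        close $fh;
        return @ns;
    }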
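
    One thing the script above does not check is whether the number of letters matches the number of entries. If the co-ordinate file has fewer entries than the sequence has letters, you only get an 'uninitialized value' warning, and extra entries are silently ignored. A minimal sketch of a stricter pairing step, assuming you would rather stop with an error (pair_checked is a hypothetical helper, not part of the script above):

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: pair letters with entry numbers, but refuse to
    # continue when the two lists differ in length.
    sub pair_checked {
        my ($name, $letters, $numbers) = @_;
        if (@$letters != @$numbers) {
            die sprintf "%s: %d letters but %d co-ordinate entries\n",
                $name, scalar @$letters, scalar @$numbers;
        }
        return map { [ $letters->[$_], $numbers->[$_] ] } 0 .. $#$letters;
    }

    my @pairs = pair_checked('ONE', [qw(I R L A)], [12 .. 15]);
    print "$_->[0] $_->[1]\n" for @pairs;

    # A mismatch is reported instead of being printed as undefined values
    my $ok = eval { pair_checked('TWO', [qw(R E F T)], [1 .. 3]); 1 };
    print "error: $@" unless $ok;
    ```

    Whether a mismatch should be fatal or just logged depends on how clean the real data is.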

    citromatik