Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have a file containing lines such as the following:
Q3KIL4_PSEPF ONE 134 380 1 252 216.3 6.3e-64
Q3M236_ANAVT TWO 107 563 1 468 203.2 5.3e-60
Q3M236_ANAVT THREE 250 494 1 277 219.1 8.6e-65
Q3M5F5_ANAVT FOUR 296 608 1 355 166.2 7.4e-49
Q3M5F5_ANAVT FIVE 299 584 1 304 188.2 1.7e-55
Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28
Q3MAD2_ANAVT SEVEN 107 508 1 468 350.1 3.3e-104
Q3MAD2_ANAVT EIGHT 230 457 1 277 201.1 2.3e-59
Q3MBT3_ANAVT NINE 203 606 1 468 102.5 1.1e-29
Q3MBT3_ANAVT TEN 326 559 1 277 221.6 1.6e-65
Q3MBT3_ANAVT ELEVEN 134 333 1 234 -334.1 2.7e-44
Q3MD63_ANAVT TWELVE 173 491 1 355 248.5 1.2e-73

I will use the name ID for the first column:
Q3KIL4_PSEPF
Q3M236_ANAVT
Q3M236_ANAVT
Q3M5F5_ANAVT
Q3M5F5_ANAVT
Q3M7Z1_ANAVT
Q3MAD2_ANAVT
Q3MAD2_ANAVT
Q3MBT3_ANAVT
Q3MBT3_ANAVT
Q3MBT3_ANAVT
Q3MD63_ANAVT

and the name EVALUE for the last column:
6.30E-064
5.30E-060
8.60E-065
7.40E-049
1.70E-055
1.20E-028
3.30E-104
2.30E-059
1.10E-029
1.60E-065
2.70E-044
1.20E-073
So we are interested mainly in the pair ID->EVALUE.
All the data are separated with tabs.
I want to do 2 things:
1) Create a file in which every line begins with a different id, i.e. combine the lines that begin with the same id, and get:
Q3KIL4_PSEPF ONE 134 380 1 252 216.3 6.3e-64
Q3M236_ANAVT TWO 107 563 1 468 203.2 5.3e-60 THREE 250 494 1 277 219.1 8.6e-65
Q3M5F5_ANAVT FOUR 296 608 1 355 166.2 7.4e-49 FIVE 299 584 1 304 188.2 1.7e-55
Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28
Q3MAD2_ANAVT SEVEN 107 508 1 468 350.1 3.3e-104 EIGHT 230 457 1 277 201.1 2.3e-59
Q3MBT3_ANAVT NINE 203 606 1 468 102.5 1.1e-29 TEN 326 559 1 277 221.6 1.6e-65 ELEVEN 134 333 1 234 -334.1 2.7e-44
Q3MD63_ANAVT TWELVE 173 491 1 355 248.5 1.2e-73

2) Create a file that holds only one line per id and, if more than one line begins with the same id, keeps the line with the smallest evalue, i.e. get the file:
Q3KIL4_PSEPF ONE 134 380 1 252 216.3 6.3e-64
Q3M236_ANAVT THREE 250 494 1 277 219.1 8.6e-65
Q3M5F5_ANAVT FIVE 299 584 1 304 188.2 1.7e-55
Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28
Q3MAD2_ANAVT SEVEN 107 508 1 468 350.1 3.3e-104
Q3MBT3_ANAVT TEN 326 559 1 277 221.6 1.6e-65
Q3MD63_ANAVT TWELVE 173 491 1 355 248.5 1.2e-73

Please give me some help on how to begin and what to use; I am a newbie in Perl...
Thank you all in advance!

Replies are listed 'Best First'.
Re: how to combine?
by FunkyMonk (Bishop) on Oct 14, 2007 at 12:50 UTC
    my $EDITOR doesn't do tabs, but I think this will be ok:
    my ( %merged, %evalues, %lowest );

    while ( <DATA> ) {
        chomp;
        my ( $evalue ) = m/\s(\S+)$/;          # last column is the evalue
        my ( $k, $v )  = split /\s+/, $_, 2;   # id, then the rest of the line

        # Task 1: append this row's fields to whatever the id already has
        $merged{$k} = $merged{$k} ? "$merged{$k}\t$v" : $v;

        # Task 2: remember the whole line with the smallest evalue per id
        if ( ! exists $evalues{$k} || $evalue < $evalues{$k} ) {
            $evalues{$k} = $evalue;
            $lowest{$k}  = $_;
        }
    }

    open my $f1, ">", "file1~" or die $!;
    print $f1 "$_\t$merged{$_}\n" for sort keys %merged;

    open my $f2, ">", "file2~" or die $!;
    print $f2 "$lowest{$_}\n" for sort keys %lowest;

    __DATA__
    Q3KIL4_PSEPF ONE 134 380 1 252 216.3 6.3e-64
    Q3M236_ANAVT TWO 107 563 1 468 203.2 5.3e-60
    Q3M236_ANAVT THREE 250 494 1 277 219.1 8.6e-65
    Q3M5F5_ANAVT FOUR 296 608 1 355 166.2 7.4e-49
    Q3M5F5_ANAVT FIVE 299 584 1 304 188.2 1.7e-55
    Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28
    Q3MAD2_ANAVT SEVEN 107 508 1 468 350.1 3.3e-104
    Q3MAD2_ANAVT EIGHT 230 457 1 277 201.1 2.3e-59
    Q3MBT3_ANAVT NINE 203 606 1 468 102.5 1.1e-29
    Q3MBT3_ANAVT TEN 326 559 1 277 221.6 1.6e-65
    Q3MBT3_ANAVT ELEVEN 134 333 1 234 -334.1 2.7e-44
    Q3MD63_ANAVT TWELVE 173 491 1 355 248.5 1.2e-73

    I'll leave it up to you to read from a file instead of __DATA__
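
    For reference, reading from a real file mostly means replacing <DATA> with a lexical filehandle. The following is only a minimal sketch of the "lowest evalue" half of the task: the sub name lowest_by_id and the command-line usage are illustrative assumptions, not part of the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a hash ref mapping each id to its line with the smallest evalue.
# lowest_by_id and its $path argument are hypothetical names for this sketch.
sub lowest_by_id {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";

    my ( %evalues, %lowest );
    while ( my $line = <$in> ) {
        chomp $line;
        next unless $line =~ /\S/;               # skip blank lines
        my ($id)     = split ' ', $line;         # first column
        my ($evalue) = $line =~ /(\S+)\s*$/;     # last column
        if ( !exists $evalues{$id} || $evalue < $evalues{$id} ) {
            $evalues{$id} = $evalue;
            $lowest{$id}  = $line;
        }
    }
    close $in;
    return \%lowest;
}

# Assumed usage: perl lowest.pl input_file
if (@ARGV) {
    my $lowest = lowest_by_id( $ARGV[0] );
    print "$lowest->{$_}\n" for sort keys %$lowest;
}
```

    The three-argument open with a lexical handle keeps the file name from being interpreted as a mode, and the die message reports which file failed.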

Re: how to combine?
by graff (Chancellor) on Oct 14, 2007 at 15:23 UTC
    I can't really improve on FunkyMonk's solution, but it might be worthwhile to show how a data structure (hash-of-hashes-of-hashes, aka HoHoH) could be used. The following assumes that the data could include rows that are complete duplicates, and/or rows having the same key and the same "evalue", but different values in the middle columns.

    The output will exclude duplicate data (that's what hashes are for); in the case of distinct rows having the same key and evalue, all rows will be included in the "combined" file, and the one with the (ascii-betically) lowest value in column two will be printed to the "lowest" file (you might want to modify that, by controlling how the sort is done in the innermost "for" loop).

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data;

    while ( <DATA> ) {
        my ( $key, $fields, $evalue ) = ( /^(\S+)\s+(.*?\s(\S+))\s*$/ );
        $data{$key}{$evalue}{$fields} = undef;
    }

    open my $f1, ">", "combined.data" or die $!;
    open my $f2, ">", "lowest.data" or die $!;

    for my $key ( sort keys %data ) {
        my $combined = "";
        my $printed_lowest = 0;
        for my $evalue ( sort {$a<=>$b} keys %{$data{$key}} ) {
            for ( sort keys %{$data{$key}{$evalue}} ) {
                $combined .= "\t$_";
                print $f2 "$key\t$_\n" unless ($printed_lowest++);
            }
        }
        print $f1 "$key$combined\n";
    }

    __DATA__
    Q3KIL4_PSEPF ONE 134 380 1 252 216.3 6.3e-64
    Q3M236_ANAVT TWO 107 563 1 468 203.2 5.3e-60
    Q3M236_ANAVT THREE 250 494 1 277 219.1 8.6e-65
    Q3M5F5_ANAVT FOUR 296 608 1 355 166.2 7.4e-49
    Q3M5F5_ANAVT FIVE 299 584 1 304 188.2 1.7e-55
    Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28
    Q3MAD2_ANAVT SEVEN 107 508 1 468 350.1 3.3e-104
    Q3MAD2_ANAVT EIGHT 230 457 1 277 201.1 2.3e-59
    Q3MBT3_ANAVT NINE 203 606 1 468 102.5 1.1e-29
    Q3MBT3_ANAVT TEN 326 559 1 277 221.6 1.6e-65
    Q3MBT3_ANAVT ELEVEN 134 333 1 234 -334.1 2.7e-44
    Q3MD63_ANAVT TWELVE 173 491 1 355 248.5 1.2e-73

Re: how to combine?
by GrandFather (Saint) on Oct 14, 2007 at 20:17 UTC

    You have a couple of answers, but some documentation references may help for the future. Files that contain tab-separated data are best processed using one of the modules that understand CSV, such as Text::CSV or Text::xSV.
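
    As a rough illustration of the module route, here is a minimal sketch that reads tab-separated rows with Text::CSV (a CPAN module, so it has to be installed first). The sub name read_tsv and the in-memory sample are assumptions made for this example; a real script would open the data file instead.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Parse tab-separated text from a filehandle into a list of row array refs.
# read_tsv is a hypothetical helper name for this sketch.
sub read_tsv {
    my ($fh) = @_;
    my $csv = Text::CSV->new( { sep_char => "\t", binary => 1 } )
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
    my @rows;
    while ( my $row = $csv->getline($fh) ) {
        push @rows, $row;
    }
    return \@rows;
}

# Demo on an in-memory filehandle holding two of the sample rows.
my $sample = "Q3M236_ANAVT\tTWO\t107\t563\t1\t468\t203.2\t5.3e-60\n"
           . "Q3M236_ANAVT\tTHREE\t250\t494\t1\t277\t219.1\t8.6e-65\n";
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $rows = read_tsv($fh);

# Print the ID -> EVALUE pair of every row (first and last column).
print "$_->[0] => $_->[-1]\n" for @$rows;
```

    For data this regular a plain split does the job too; the module earns its keep once fields can be quoted or contain embedded separators.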

    Then there is the question of what is a suitable data structure. If "unique element" pops into your head in relation to some aspect of the data, then you should immediately think "hash". If "sort this stuff by that key" is a factor then think 'Schwartzian transform' (see replies to What is "Schwarzian Transform" (aka Schwartzian)).
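
    To make the Schwartzian transform concrete, here is a minimal sketch that sorts whole lines by their last field (the evalue). It assumes whitespace-separated columns as in the sample data; the sub name sort_by_evalue is made up for illustration.

```perl
use strict;
use warnings;

# Schwartzian transform: decorate each line with its sort key,
# sort on the key, then strip the decoration again.
# sort_by_evalue is a hypothetical name for this sketch.
sub sort_by_evalue {
    my @lines = @_;
    return map  { $_->[1] }                        # undecorate: keep the line
           sort { $a->[0] <=> $b->[0] }            # numeric sort on the evalue
           map  { [ ( split ' ', $_ )[-1], $_ ] }  # decorate: [ evalue, line ]
           @lines;
}

my @sorted = sort_by_evalue(
    "Q3MBT3_ANAVT NINE 203 606 1 468 102.5 1.1e-29",
    "Q3MBT3_ANAVT TEN 326 559 1 277 221.6 1.6e-65",
    "Q3MBT3_ANAVT ELEVEN 134 333 1 234 -334.1 2.7e-44",
);
print "$_\n" for @sorted;    # TEN first (smallest evalue), NINE last
```

    The point of the decoration step is that the key is extracted once per line rather than once per comparison.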

    And we all need a good LOL from time to time, but in Perl that's something quite different from what you might expect - see perllol. Note that LoL is closely related to HoH, HoL, LoH and those are all top of the heap for a pile of other interesting Perl data structures. Master LoL and the others should just drop out of the heap as you need them.
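
    As one small instance of these structures, the sketch below builds a hash of lists of lists: each id maps to a list of rows, and each row is a list of fields. The sub name group_by_id is an assumption for the example, and columns are taken to be whitespace-separated as in the sample data.

```perl
use strict;
use warnings;

# Group rows under their id. Each hash value is a list of array refs,
# one per input line, holding that line's remaining fields.
# group_by_id is a hypothetical name for this sketch.
sub group_by_id {
    my %by_id;
    for my $line (@_) {
        my ( $id, @fields ) = split ' ', $line;
        push @{ $by_id{$id} }, \@fields;    # autovivifies the inner list
    }
    return \%by_id;
}

my $groups = group_by_id(
    "Q3M5F5_ANAVT FOUR 296 608 1 355 166.2 7.4e-49",
    "Q3M5F5_ANAVT FIVE 299 584 1 304 188.2 1.7e-55",
    "Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28",
);

# List the second-column names stored under each id.
for my $id ( sort keys %$groups ) {
    my @names = map { $_->[0] } @{ $groups->{$id} };
    print "$id: @names\n";
}
```

    Individual cells are then reachable with chained subscripts, for example $groups->{Q3M5F5_ANAVT}[1][-1] for the evalue of the second Q3M5F5_ANAVT row.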


    Perl is environmentally friendly - it saves trees
Re: how to combine?
by Cop (Initiate) on Oct 14, 2007 at 15:21 UTC

    This is what Perl does best: parse a text file, understand it, and process it.

    So show some effort and let someone here help you debug. At this point, you have shown absolutely no effort.