how to combine?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have a file containing lines such as the following:

Q3KIL4_PSEPF    ONE    134    380    1    252    216.3    6.3e-64
Q3M236_ANAVT    TWO    107    563    1    468    203.2    5.3e-60
Q3M236_ANAVT    THREE    250    494    1    277    219.1    8.6e-65
Q3M5F5_ANAVT    FOUR    296    608    1    355    166.2    7.4e-49
Q3M5F5_ANAVT    FIVE    299    584    1    304    188.2    1.7e-55
Q3M7Z1_ANAVT    SIX    51    181    1    140    99.0    1.2e-28
Q3MAD2_ANAVT    SEVEN    107    508    1    468    350.1    3.3e-104
Q3MAD2_ANAVT    EIGHT    230    457    1    277    201.1    2.3e-59
Q3MBT3_ANAVT    NINE    203    606    1    468    102.5    1.1e-29
Q3MBT3_ANAVT    TEN    326    559    1    277    221.6    1.6e-65
Q3MBT3_ANAVT    ELEVEN    134    333    1    234    -334.1    2.7e-44
Q3MD63_ANAVT    TWELVE    173    491    1    355    248.5    1.2e-73

I will call the first column the ID:

Q3KIL4_PSEPF
Q3M236_ANAVT
Q3M236_ANAVT
Q3M5F5_ANAVT
Q3M5F5_ANAVT
Q3M7Z1_ANAVT
Q3MAD2_ANAVT
Q3MAD2_ANAVT
Q3MBT3_ANAVT
Q3MBT3_ANAVT
Q3MBT3_ANAVT
Q3MD63_ANAVT

and I will call the last column the EVALUE:

6.30E-064
5.30E-060
8.60E-065
7.40E-049
1.70E-055
1.20E-028
3.30E-104
2.30E-059
1.10E-029
1.60E-065
2.70E-044
1.20E-073

So we are interested mainly in the pair ID->EVALUE.
All the data are separated with tabs.
I want to do 2 things:
1) Create a file in which every line begins with a different id, i.e. combine the lines that begin with the same id, and get:

Q3KIL4_PSEPF    ONE    134    380    1    252    216.3    6.3e-64
Q3M236_ANAVT    TWO    107    563    1    468    203.2    5.3e-60    THREE    250    494    1    277    219.1    8.6e-65
Q3M5F5_ANAVT    FOUR    296    608    1    355    166.2    7.4e-49    FIVE    299    584    1    304    188.2    1.7e-55
Q3M7Z1_ANAVT    SIX    51    181    1    140    99.0    1.2e-28
Q3MAD2_ANAVT    SEVEN    107    508    1    468    350.1    3.3e-104    EIGHT    230    457    1    277    201.1    2.3e-59
Q3MBT3_ANAVT    NINE    203    606    1    468    102.5    1.1e-29    TEN    326    559    1    277    221.6    1.6e-65    ELEVEN    134    333    1    234    -334.1    2.7e-44
Q3MD63_ANAVT    TWELVE    173    491    1    355    248.5    1.2e-73

2) Create a file that holds only one line per id: if more than one line begins with the same id, keep the line with the smallest evalue, i.e. get the file:

Q3KIL4_PSEPF    ONE    134    380    1    252    216.3    6.3e-64
Q3M236_ANAVT    THREE    250    494    1    277    219.1    8.6e-65
Q3M5F5_ANAVT    FIVE    299    584    1    304    188.2    1.7e-55
Q3M7Z1_ANAVT    SIX    51    181    1    140    99.0    1.2e-28
Q3MAD2_ANAVT    SEVEN    107    508    1    468    350.1    3.3e-104
Q3MBT3_ANAVT    TEN    326    559    1    277    221.6    1.6e-65
Q3MD63_ANAVT    TWELVE    173    491    1    355    248.5    1.2e-73

Please give me some help on how to begin and what to use; I am a newbie in Perl...
Thank you all in advance!

Re: how to combine? by FunkyMonk (Bishop) on Oct 14, 2007 at 12:50 UTC
my $EDITOR doesn't do tabs, but I think this will be ok:

    my ( %merged, %evalues, %lowest );

    while ( <DATA> ) {
        chomp;
        my ( $evalue ) = m/\s(\S+)$/;          # last field is the evalue
        my ( $k, $v )  = split /\s+/, $_, 2;   # id, then the rest of the line

        # append this row's fields to those already seen for the same id
        $merged{$k} = $merged{$k} ? "$merged{$k}\t$v" : $v;

        # remember the row with the numerically smallest evalue per id
        if ( ! exists $evalues{$k} || $evalue < $evalues{$k} ) {
            $evalues{$k} = $evalue;
            $lowest{$k}  = $_;
        }
    }

    open my $f1, ">", "file1~" or die $!;
    print $f1 "$_\t$merged{$_}\n" for sort keys %merged;

    open my $f2, ">", "file2~" or die $!;
    print $f2 "$lowest{$_}\n" for sort keys %lowest;

    __DATA__
    Q3KIL4_PSEPF ONE 134 380 1 252 216.3 6.3e-64
    Q3M236_ANAVT TWO 107 563 1 468 203.2 5.3e-60
    Q3M236_ANAVT THREE 250 494 1 277 219.1 8.6e-65
    Q3M5F5_ANAVT FOUR 296 608 1 355 166.2 7.4e-49
    Q3M5F5_ANAVT FIVE 299 584 1 304 188.2 1.7e-55
    Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28
    Q3MAD2_ANAVT SEVEN 107 508 1 468 350.1 3.3e-104
    Q3MAD2_ANAVT EIGHT 230 457 1 277 201.1 2.3e-59
    Q3MBT3_ANAVT NINE 203 606 1 468 102.5 1.1e-29
    Q3MBT3_ANAVT TEN 326 559 1 277 221.6 1.6e-65
    Q3MBT3_ANAVT ELEVEN 134 333 1 234 -334.1 2.7e-44
    Q3MD63_ANAVT TWELVE 173 491 1 355 248.5 1.2e-73

I'll leave it up to you to read from a file instead of `__DATA__`.
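For the step left as an exercise, here is a minimal sketch of the same approach reading from a named file rather than `__DATA__`. It assumes the input really is tab-separated with the evalue in the last column, as the question states. The sub name `summarize` and the output names `merged.tsv` and `lowest.tsv` are made up for illustration and are not part of the reply above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read tab-separated rows from an open filehandle. Returns two hash refs
# keyed by id: the merged fields, and the full row with the lowest evalue.
# (summarize is a hypothetical name, not taken from the thread.)
sub summarize {
    my ($in) = @_;
    my ( %merged, %best_evalue, %lowest );

    while ( my $line = <$in> ) {
        chomp $line;
        my ( $id, @fields ) = split /\t/, $line;
        next unless defined $id && @fields;    # skip blank or id-only lines

        my $evalue = $fields[-1];              # assumed: evalue is the last column
        my $rest   = join "\t", @fields;

        # append this row's fields to those already seen for the same id
        $merged{$id} = exists $merged{$id} ? "$merged{$id}\t$rest" : $rest;

        # keep the row whose evalue is numerically smallest
        if ( !exists $best_evalue{$id} || $evalue < $best_evalue{$id} ) {
            $best_evalue{$id} = $evalue;
            $lowest{$id}      = $line;
        }
    }
    return ( \%merged, \%lowest );
}

# Runs only when a file name is given on the command line.
if ( my $file = shift @ARGV ) {
    open my $in, '<', $file or die "Cannot open $file: $!";
    my ( $merged, $lowest ) = summarize($in);
    close $in;

    # Output file names are placeholders; change them to suit.
    open my $f1, '>', 'merged.tsv' or die "Cannot write merged.tsv: $!";
    print $f1 "$_\t$merged->{$_}\n" for sort keys %$merged;
    close $f1;

    open my $f2, '>', 'lowest.tsv' or die "Cannot write lowest.tsv: $!";
    print $f2 "$lowest->{$_}\n" for sort keys %$lowest;
    close $f2;
}
```

Saved under a name of your choosing, it would be run as `perl yourscript.pl yourdata.tsv`. Splitting on a literal tab instead of `\s+` keeps any field that happens to contain a space intact.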
Re: how to combine? by graff (Chancellor) on Oct 14, 2007 at 15:23 UTC
I can't really improve on FunkyMonk's solution, but it might be worthwhile to show how a data structure (hash-of-hashes-of-hashes, aka HoHoH) could be used.

The following assumes that the data could include rows that are complete duplicates, and/or rows having the same key and the same "evalue", but different values in the middle columns. The output will exclude duplicate data (that's what hashes are for); in the case of distinct rows having the same key and evalue, all rows will be included in the "combined" file, and the one with the (ascii-betically) lowest value in column two will be printed to the "lowest" file (you might want to modify that, by controlling how the sort is done in the innermost "for" loop).

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data;
    while ( <DATA> ) {
        # $key = id, $fields = rest of the row (evalue included), $evalue = last field
        my ( $key, $fields, $evalue ) = ( /^(\S+)\s+(.*?\s(\S+))\s*$/ );
        $data{$key}{$evalue}{$fields} = undef;
    }

    open my $f1, ">", "combined.data" or die $!;
    open my $f2, ">", "lowest.data" or die $!;

    for my $key ( sort keys %data ) {
        my $combined = "";
        my $printed_lowest = 0;
        # evalues in ascending numeric order, so the first row printed is the lowest
        for my $evalue ( sort {$a<=>$b} keys %{$data{$key}} ) {
            for ( sort keys %{$data{$key}{$evalue}} ) {
                $combined .= "\t$_";
                print $f2 "$key\t$_\n" unless ($printed_lowest++);
            }
        }
        print $f1 "$key$combined\n";
    }

    __DATA__
    Q3KIL4_PSEPF ONE 134 380 1 252 216.3 6.3e-64
    Q3M236_ANAVT TWO 107 563 1 468 203.2 5.3e-60
    Q3M236_ANAVT THREE 250 494 1 277 219.1 8.6e-65
    Q3M5F5_ANAVT FOUR 296 608 1 355 166.2 7.4e-49
    Q3M5F5_ANAVT FIVE 299 584 1 304 188.2 1.7e-55
    Q3M7Z1_ANAVT SIX 51 181 1 140 99.0 1.2e-28
    Q3MAD2_ANAVT SEVEN 107 508 1 468 350.1 3.3e-104
    Q3MAD2_ANAVT EIGHT 230 457 1 277 201.1 2.3e-59
    Q3MBT3_ANAVT NINE 203 606 1 468 102.5 1.1e-29
    Q3MBT3_ANAVT TEN 326 559 1 277 221.6 1.6e-65
    Q3MBT3_ANAVT ELEVEN 134 333 1 234 -334.1 2.7e-44
    Q3MD63_ANAVT TWELVE 173 491 1 355 248.5 1.2e-73
Re: how to combine? by GrandFather (Saint) on Oct 14, 2007 at 20:17 UTC
You have a couple of answers, but some documentation references may help for the future. Files that contain tab separated data are best processed using one of the modules that understand CSV such as Text::CSV or Text::xSV.

Then there is the question of what is a suitable data structure. If "unique element" pops into your head in relation to some aspect of the data, then you should immediately think "hash". If "sort this stuff by that key" is a factor then think 'Schwartzian transform' (see replies to What is "Schwarzian Transform" (aka Schwartzian)).

And we all need a good LOL from time to time, but in Perl that's something quite different than you might expect - see perllol. Note that LoL is closely related to HoH, HoL, LoH and those are all top of the heap for a pile of other interesting Perl data structures. Master LoL and the others should just drop out of the heap as you need them.

Perl is environmentally friendly - it saves trees
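To make the "sort this stuff by that key" advice concrete, here is a small sketch of a Schwartzian transform applied to rows like those in the question. It assumes tab-separated rows with the evalue in the last column. The sub name `sort_by_evalue` is invented for this example, and the three sample rows are copied from the question's data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sort tab-separated rows by their last column, smallest value first.
# (sort_by_evalue is a hypothetical name used only in this sketch.)
sub sort_by_evalue {
    my @rows = @_;
    return
        map  { $_->[1] }                       # 3. throw the key away, keep the row
        sort { $a->[0] <=> $b->[0] }           # 2. compare the precomputed keys
        map  { [ ( split /\t/ )[-1], $_ ] }    # 1. pair each row with its sort key
        @rows;
}

# Three rows for the same id, taken from the question.
my @rows = (
    "Q3MBT3_ANAVT\tNINE\t203\t606\t1\t468\t102.5\t1.1e-29",
    "Q3MBT3_ANAVT\tTEN\t326\t559\t1\t277\t221.6\t1.6e-65",
    "Q3MBT3_ANAVT\tELEVEN\t134\t333\t1\t234\t-334.1\t2.7e-44",
);

print "$_\n" for sort_by_evalue(@rows);
```

The TEN row comes out first because 1.6e-65 is the smallest of the three evalues. The point of the transform is that each row is split once, before sorting, rather than on every comparison.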
Re: how to combine? by Cop (Initiate) on Oct 14, 2007 at 15:21 UTC
This is what Perl does best: parse a text file, understand it, and process it. So show some effort and let someone here help you debug. At this point, you have shown absolutely no effort.