garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I would like to combine the information from 3 tab-delimited files. Some of the entries are duplicated, so I would like to select one of these from each file based on specific criteria, e.g.:

File1:

>3841 29 58.786702607127 GAGTAGTTCATAATAAAGAGGAGGCTGGT
>3841 28 58.3143841442903 AGTAGTTCATAATAAAGAGGAGGCTGGT
>486 26 59.8238809443041 CATTTTTCCTGAGCGTTTTCCTGAGC
>486 25 59.1450485783588 ATTTTTCCTGAGCGTTTTCCTGAGC
>486 24 58.9556227674582 TTTTTCCTGAGCGTTTTCCTGAGC
>486 23 58.6081353592492 TTTTCCTGAGCGTTTTCCTGAGC
>486 22 58.2296488444454 TTTCCTGAGCGTTTTCCTGAGC
>136 25 Tm=59.8064145079347 CGAACAGAGGTCTCATGGAGAAACG

File2:

>3841 29 58.5812405420724 GAGTAGTTCATAATAAAGAGGAGGCTGGA
>3841 28 58.0989000498791 AGTAGTTCATAATAAAGAGGAGGCTGGA
>486 26 58.9706961902307 CATTTTTCCTGAGCGTTTTCCTGAGT
>486 25 58.2353328662615 ATTTTTCCTGAGCGTTTTCCTGAGT
>486 24 58.0079403206259 TTTTTCCTGAGCGTTTTCCTGAGT
>136 25 59.2231929175504 CGAACAGAGGTCTCATGGAGAAACA
>253 36 59.6147860412319 CAGAGATGATTTGTGCATTATAATTGTAATTTGGGT

File3:

>3841 26 58.289789463114 CCAGGTTATTTATTTCAGCGGGAACT
>486 23 58.6732344878087 GCAAATGGCTCTAAGGATCAGCC
>294 21 58.8403250231655 GTCGGAGCTCTCTCAGAACCC
>253 25 58.3051993710611 CACTCGAGTTGCAGTTATGTTCCTC
>287 21 59.5292339759331 TCCTTAGCCAGACGAACACGC
>544 21 59.5408471700017 TACAGCAGGTCAACCCGTTCG
>856 19 58.7421506440351 GGTGAGGATGTCGCCCTCA

So the script would search through File 3 line by line and compare column 1 (the >number) with the entries in column 1 of files 1 and 2. Where files 1 and 2 have duplicate entries for a >number, a single entry would be selected from each file based on how close its column 3 value is to the File 3 value.

For example, the first entry of File 3 is >3841, with a value of 58.289789463114.

File 1 has two entries for >3841; the second has the column 3 value closest to the File 3 entry (58.3143841442903), so this one would be selected.

File 2 also has two entries for >3841, and again the second entry would be selected, as its column 3 value (58.0989000498791) is closest to the column 3 value in File 3.

The script would then print out a tab-delimited file in the following format:

>3841 AGTAGTTCATAATAAAGAGGAGGCTGGT AGTAGTTCATAATAAAGAGGAGGCTGGA CCAGGTTATTTATTTCAGCGGGAACT
>486 TTTTCCTGAGCGTTTTCCTGAGC CATTTTTCCTGAGCGTTTTCCTGAGT GCAAATGGCTCTAAGGATCAGCC

Any help from the monks on this would be appreciated.

Re: Combining 3 files
by Sewi (Friar) on Jun 23, 2011 at 08:47 UTC

    First of all, please post your code together with your problem. The Monks usually help fix problems; they don't write your scripts for you.

    Depending on the amount of data, you may use a hash to preload the data:

    for my $nr (1..2) {
        for my $line (read_file('file'.$nr)) {
            my @cols = split(/\t/, $line);
            push @{ $data[$nr - 1]->{ shift(@cols) } }, \@cols;
        }
    }

    This code reads both file1 and file2 (outer for loop) line by line (inner for loop), splits each line into columns and stores the data in a tree keyed by file number and contig key (the first column). It uses File::Slurp's read_file, and I suggest you look at the resulting tree with Data::Dumper.
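    For illustration, here is what that tree looks like if you feed it just the two >3841 rows from file1 (an in-memory list stands in for read_file here, so the snippet is self-contained):

```perl
use strict;
use warnings;
use Data::Dumper;

# Stand-in for read_file(): two tab-delimited lines as they appear in file1.
my @lines = (
    ">3841\t29\t58.786702607127\tGAGTAGTTCATAATAAAGAGGAGGCTGGT\n",
    ">3841\t28\t58.3143841442903\tAGTAGTTCATAATAAAGAGGAGGCTGGT\n",
);

my @data;
for my $line (@lines) {
    chomp $line;
    my @cols = split(/\t/, $line);
    # The first column (">3841") becomes the hash key; the remaining
    # columns [length, Tm, sequence] are pushed as one array ref per row.
    push @{ $data[0]->{ shift(@cols) } }, \@cols;
}

# Dumps a hash of arrays-of-arrays:
#   { '>3841' => [ ['29','58.78...','GAGT...'], ['28','58.31...','AGTA...'] ] }
print Dumper($data[0]);
```

    Note the chomp: without it, the last column of every row silently keeps its trailing newline, which will bite you later when comparing keys.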

    Next, compare it to your third file. Assuming you have already read and split a file3 line, using a while (<$fh>) loop or read_file:

    # expecting file3 line in @col
    my @results = ($col[0], $col[3]);
    for my $dataset (@data) {
        push @results, (sort {
            my $diff_a = $col[2] - $a->[1];
            $diff_a *= -1 if $diff_a < 0;
            my $diff_b = $col[2] - $b->[1];
            $diff_b *= -1 if $diff_b < 0;
            $diff_a <=> $diff_b;
        } @{ $dataset->{ $col[0] } })[0]->[2];
    }

    This block sorts the preloaded data sets by their difference from the comparison value of the current row and pushes the winning sequence onto the @results list, which has been preloaded with the key and the file3 sequence.
    You can then easily print out the @results data tab-delimited using join().
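    Here is the same "closest value wins" idea as a self-contained sketch for a single key, with the data hard-coded from the >3841 example above (abs() does the job of the two *= -1 lines):

```perl
use strict;
use warnings;

# Candidate rows for >3841 from file1: [length, Tm, sequence]
my @candidates = (
    [ 29, 58.786702607127,  'GAGTAGTTCATAATAAAGAGGAGGCTGGT' ],
    [ 28, 58.3143841442903, 'AGTAGTTCATAATAAAGAGGAGGCTGGT'  ],
);
my $target_tm = 58.289789463114;   # column 3 of the file3 row

# Sort candidates by absolute distance from the target; the first
# element after sorting is the closest match.
my ($best) = sort {
    abs($target_tm - $a->[1]) <=> abs($target_tm - $b->[1])
} @candidates;

print join("\t", '>3841', $best->[2]), "\n";
# prints: >3841  AGTAGTTCATAATAAAGAGGAGGCTGGT
```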

    This is not a complete script, but the code samples should give you an idea of how to handle your data; merging it is now easy.

    If you have too much data to reasonably load into memory, think about using a database (maybe SQLite) to handle the problem; it might be better than a pure Perl solution.
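    For example, with DBI and DBD::SQLite the "closest value" pick can even be pushed into SQL (the table and column names here are made up for the sketch; an in-memory database stands in for a real file):

```perl
use strict;
use warnings;
use DBI;   # requires DBD::SQLite to be installed

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE primers (file INT, id TEXT, len INT, tm REAL, seq TEXT)');

# Load the two >3841 rows from file1.
my $ins = $dbh->prepare('INSERT INTO primers VALUES (?,?,?,?,?)');
$ins->execute(1, '>3841', 29, 58.786702607127,  'GAGTAGTTCATAATAAAGAGGAGGCTGGA' =~ tr/A/T/r ? 'GAGTAGTTCATAATAAAGAGGAGGCTGGT' : 'GAGTAGTTCATAATAAAGAGGAGGCTGGT');
$ins->execute(1, '>3841', 28, 58.3143841442903, 'AGTAGTTCATAATAAAGAGGAGGCTGGT');

# Let SQL pick the row whose tm is closest to the file3 value.
my ($seq) = $dbh->selectrow_array(
    'SELECT seq FROM primers WHERE file = ? AND id = ?
     ORDER BY ABS(tm - ?) LIMIT 1',
    undef, 1, '>3841', 58.289789463114,
);
print "$seq\n";   # AGTAGTTCATAATAAAGAGGAGGCTGGT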

      Thanks for your help, it's most appreciated. I understand the first block of code, but I'm not sure I fully understand the second block in the while (<$fh3>) loop. Could you explain it in more detail?

      I have been playing around with the script and Data::Dumper and I get an error message:

      Use of uninitialized value in hash element at combine_primer_lists.pl line 41, <$fh3> line 1.

      Can't use an undefined value as an ARRAY reference at combine_primer_lists.pl line 46, <$fh3> line 1.

      #!/usr/bin/perl
      #22/06/2011
      # Usage: perl combine_primer_lists.pl
      use strict;
      use warnings;
      use File::Slurp;
      use Data::Dumper;

      my @data;
      my @col;
      my %dataset;

      #open my $fh1, "<Primer-For1" or die $!;
      #open my $fh2, "<Primer-For2" or die $!;
      open (my $fh3, '<', "Primer-Rev1") or die $!;
      open my $outfh, '>', "outputfile.txt" or die $!;

      for my $nr (1..2) {
          for my $line (read_file('Primer-For'.$nr)) {
              my @col = split(/\t/, $line);
              push @{ $data[$nr - 1]->{ shift(@col) } }, \@col;
              #print Dumper(\$nr);
              #print Dumper(\@col);
          }
      }

      while (<$fh3>) {
          #print Dumper(\$fh3);
          # expecting file3 line in @col
          my @results = ($col[0], $col[3]);
          for my $dataset (@data) {
              #print Dumper(\@data)
              push @results, (sort {
                  #print Dumper(\@results)
                  my $diff_a = $col[2] - $a->[1];
                  $diff_a *= -1 if $diff_a < 0;
                  my $diff_b = $col[2] - $b->[1];
                  $diff_b *= -1 if $diff_b < 0;
                  $diff_a <=> $diff_b;
              } @{ $dataset->{ $col[0] } })[0]->[2];
          }
      }

      Hmmm... still struggling with this one and getting an error:

      Can't use an undefined value as an ARRAY reference at combine_primer_lists.pl line 42, <INFILE> line 1.

      relating to this line:

          } @{$dataset->{$col[0]}})[0]->[2];

      Could you explain what this line is doing?

      #!/usr/bin/perl
      #22/06/2011
      use strict;
      use warnings;
      use File::Slurp;
      use Data::Dumper;

      my @data;
      my @col;
      my @dataset;
      my $a;
      my $b;
      my @fields;
      my %out;

      open INFILE, "<Primer-Rev1" or die $!;
      open my $outfh, '>', "outputfile.txt" or die $!;

      for my $nr (1..2) {
          for my $line (read_file('Primer-For'.$nr)) {
              my @col = split(/\t/, $line);
              push @{ $data[$nr - 1]->{ shift(@col) } }, \@col;
          }
      }

      while (<INFILE>) {
          @col = split(/\t+/, $_);
          chomp(@col);
          my ($header, $length, $tm, $sequence) = @col[0..3];
          # expecting file3 line in @col
          my @results = ($col[0], $col[3]);
          for my $dataset (@data) {
              push @results, (sort {
                  my $diff_a = $col[2] - $a->[1];
                  $diff_a *= -1 if $diff_a < 0;
                  my $diff_b = $col[2] - $b->[1];
                  $diff_b *= -1 if $diff_b < 0;
                  $diff_a <=> $diff_b;
              } @{ $dataset->{ $col[0] } })[0]->[2];
          }
      }

        That is easy, the line is confusing :)

        It doesn't help that it also dereferences the hashref $dataset in the middle of the expression.

        Run through perltidy:

        my @results = ( $col[0], $col[3] );
        for my $dataset (@data) {
            push @results, (
                sort {
                    my $diff_a = $col[2] - $a->[1];
                    $diff_a *= -1 if $diff_a < 0;
                    my $diff_b = $col[2] - $b->[1];
                    $diff_b *= -1 if $diff_b < 0;
                    $diff_a <=> $diff_b;
                } @{ $dataset->{ $col[0] } }
            )[0]->[2];
        }
        is like
        my @results = ( $col[0], $col[3] );
        for my $dataset (@data) {
            my ($first) = sort {
                my $diff_a = $col[2] - $a->[1];
                $diff_a *= -1 if $diff_a < 0;
                my $diff_b = $col[2] - $b->[1];
                $diff_b *= -1 if $diff_b < 0;
                $diff_a <=> $diff_b;
            } @{ $dataset->{ $col[0] } };
            push @results, $first->[2];
        }
        is like
        my @results = ( $col[0], $col[3] );
        for my $dataset (@data) {
            my @beef = @{ $dataset->{ $col[0] } };
            @beef = sort {
                my $diff_a = $col[2] - $a->[1];
                $diff_a *= -1 if $diff_a < 0;
                my $diff_b = $col[2] - $b->[1];
                $diff_b *= -1 if $diff_b < 0;
                $diff_a <=> $diff_b;
            } @beef;
            push @results, $beef[0]->[2];
        }

        $a and $b are globals; they're how sort works.
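        A tiny example, unrelated to the primer data:

```perl
use strict;
use warnings;

# sort aliases each pair of elements being compared to the package
# globals $a and $b; the block returns a negative/zero/positive value
# (here via the numeric comparison operator <=>) to order them.
my @nums   = (10, 2, 33);
my @sorted = sort { $a <=> $b } @nums;
print "@sorted\n";   # 2 10 33
```

        This is also why declaring my $a; my $b; (as in the script above) is a bad idea: the lexicals shadow the globals that sort actually sets.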

        Is it still confusing?

Re: Combining 3 files
by Anonymous Monk on Jun 23, 2011 at 09:12 UTC