garyboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I would like to combine the information from 3 tab-delimited files. Some of the entries are duplicated, so I would like to select one of these from each file based on specific criteria, e.g.:

File1:

>3841 29 58.786702607127 GAGTAGTTCATAATAAAGAGGAGGCTGGT
>3841 28 58.3143841442903 AGTAGTTCATAATAAAGAGGAGGCTGGT
>486 26 59.8238809443041 CATTTTTCCTGAGCGTTTTCCTGAGC
>486 25 59.1450485783588 ATTTTTCCTGAGCGTTTTCCTGAGC
>486 24 58.9556227674582 TTTTTCCTGAGCGTTTTCCTGAGC
>486 23 58.6081353592492 TTTTCCTGAGCGTTTTCCTGAGC
>486 22 58.2296488444454 TTTCCTGAGCGTTTTCCTGAGC
>136 25 Tm=59.8064145079347 CGAACAGAGGTCTCATGGAGAAACG

File2:

>3841 29 58.5812405420724 GAGTAGTTCATAATAAAGAGGAGGCTGGA
>3841 28 58.0989000498791 AGTAGTTCATAATAAAGAGGAGGCTGGA
>486 26 58.9706961902307 CATTTTTCCTGAGCGTTTTCCTGAGT
>486 25 58.2353328662615 ATTTTTCCTGAGCGTTTTCCTGAGT
>486 24 58.0079403206259 TTTTTCCTGAGCGTTTTCCTGAGT
>136 25 59.2231929175504 CGAACAGAGGTCTCATGGAGAAACA
>253 36 59.6147860412319 CAGAGATGATTTGTGCATTATAATTGTAATTTGGGT

File3:

>3841 26 58.289789463114 CCAGGTTATTTATTTCAGCGGGAACT
>486 23 58.6732344878087 GCAAATGGCTCTAAGGATCAGCC
>294 21 58.8403250231655 GTCGGAGCTCTCTCAGAACCC
>253 25 58.3051993710611 CACTCGAGTTGCAGTTATGTTCCTC
>287 21 59.5292339759331 TCCTTAGCCAGACGAACACGC
>544 21 59.5408471700017 TACAGCAGGTCAACCCGTTCG
>856 19 58.7421506440351 GGTGAGGATGTCGCCCTCA

So the script would search through File 3 line by line and compare column 1 (the >number) with the entries in column 1 of files 1 and 2. Where files 1 and 2 have duplicate entries for a >number, a single entry would be selected from each file based on how close its column 3 value is to the File 3 value.

For example, the first entry of File 3 is >3841, with a value of 58.289789463114.

File 1 has two entries for >3841; the second has the column 3 value closest to the File 3 entry (58.3143841442903), so this one would be selected.

File 2 also has two entries for >3841, and again the second entry would be selected, as its column 3 value (58.0989000498791) is closest to the column 3 value in File 3.

The script would then print out a tab-delimited file in the following format:

>3841 AGTAGTTCATAATAAAGAGGAGGCTGGT AGTAGTTCATAATAAAGAGGAGGCTGGA CCAGGTTATTTATTTCAGCGGGAACT
>486 TTTTCCTGAGCGTTTTCCTGAGC CATTTTTCCTGAGCGTTTTCCTGAGT GCAAATGGCTCTAAGGATCAGCC

Any help from the monks on this would be appreciated.

Re: Combining 3 files
by Sewi (Friar) on Jun 23, 2011 at 08:47 UTC

    First of all, please post your code together with your problem. The Monks usually help fix problems; they don't write your scripts for you.

    Depending on the amount of data, you may use a hash to preload the data:

    for my $nr (1..2) {
        for my $line (read_file('file'.$nr)) {
            my @cols = split(/\t/, $line);
            push @{ $data[$nr - 1]->{ shift(@cols) } }, \@cols;
        }
    }

    This code reads both file1 and file2 (outer for loop) line by line (inner for loop), splits each line into columns and stores the data in a tree keyed by file number and contig key (the first column). It uses File::Slurp's read_file, and I suggest you look at the resulting tree with Data::Dumper.
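    For illustration, here is what that tree looks like if you feed it just the two >3841 rows from file1 (an in-memory list stands in for read_file here, so the snippet is self-contained):

```perl
use strict;
use warnings;
use Data::Dumper;

# Stand-in for read_file(): two tab-delimited lines as they appear in file1.
my @lines = (
    ">3841\t29\t58.786702607127\tGAGTAGTTCATAATAAAGAGGAGGCTGGT\n",
    ">3841\t28\t58.3143841442903\tAGTAGTTCATAATAAAGAGGAGGCTGGT\n",
);

my @data;
for my $line (@lines) {
    chomp $line;
    my @cols = split(/\t/, $line);
    # The first column (">3841") becomes the hash key; the remaining
    # columns [length, Tm, sequence] are pushed as one array ref per row.
    push @{ $data[0]->{ shift(@cols) } }, \@cols;
}

# Dumps a hash of arrays-of-arrays:
#   { '>3841' => [ ['29','58.78...','GAGT...'], ['28','58.31...','AGTA...'] ] }
print Dumper($data[0]);
```

    Note the chomp: without it, the last column of every row silently keeps its trailing newline, which will bite you later when comparing keys.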

    Next, compare it to your third file. Assuming you have already read and split a file3 line, using a while (<$fh>) loop or read_file:

    # expecting file3 line in @col
    my @results = ($col[0], $col[3]);
    for my $dataset (@data) {
        push @results, (sort {
            my $diff_a = $col[2] - $a->[1];
            $diff_a *= -1 if $diff_a < 0;
            my $diff_b = $col[2] - $b->[1];
            $diff_b *= -1 if $diff_b < 0;
            $diff_a <=> $diff_b;
        } @{ $dataset->{ $col[0] } })[0]->[2];
    }

    This block sorts the preloaded data sets by their difference from the comparison value of the current row and pushes the winning sequence onto the @results list, which has been preloaded with the key and the file3 sequence.
    You can then easily print out the @results data tab-delimited using join().
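    Here is the same "closest value wins" idea as a self-contained sketch for a single key, with the data hard-coded from the >3841 example above (abs() does the job of the two *= -1 lines):

```perl
use strict;
use warnings;

# Candidate rows for >3841 from file1: [length, Tm, sequence]
my @candidates = (
    [ 29, 58.786702607127,  'GAGTAGTTCATAATAAAGAGGAGGCTGGT' ],
    [ 28, 58.3143841442903, 'AGTAGTTCATAATAAAGAGGAGGCTGGT'  ],
);
my $target_tm = 58.289789463114;   # column 3 of the file3 row

# Sort candidates by absolute distance from the target; the first
# element after sorting is the closest match.
my ($best) = sort {
    abs($target_tm - $a->[1]) <=> abs($target_tm - $b->[1])
} @candidates;

print join("\t", '>3841', $best->[2]), "\n";
# prints: >3841  AGTAGTTCATAATAAAGAGGAGGCTGGT
```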

    This is not a complete script, but the code samples should give you an idea of how to handle your data; merging it is now easy.

    If you have too much data to reasonably load into memory, think about using a database (maybe SQLite) to handle the problem; it might be better than a pure Perl solution.
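    For example, with DBI and DBD::SQLite the "closest value" pick can even be pushed into SQL (the table and column names here are made up for the sketch; an in-memory database stands in for a real file):

```perl
use strict;
use warnings;
use DBI;   # requires DBD::SQLite to be installed

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE primers (file INT, id TEXT, len INT, tm REAL, seq TEXT)');

# Load the two >3841 rows from file1.
my $ins = $dbh->prepare('INSERT INTO primers VALUES (?,?,?,?,?)');
$ins->execute(1, '>3841', 29, 58.786702607127,  'GAGTAGTTCATAATAAAGAGGAGGCTGGA' =~ tr/A/T/r ? 'GAGTAGTTCATAATAAAGAGGAGGCTGGT' : 'GAGTAGTTCATAATAAAGAGGAGGCTGGT');
$ins->execute(1, '>3841', 28, 58.3143841442903, 'AGTAGTTCATAATAAAGAGGAGGCTGGT');

# Let SQL pick the row whose tm is closest to the file3 value.
my ($seq) = $dbh->selectrow_array(
    'SELECT seq FROM primers WHERE file = ? AND id = ?
     ORDER BY ABS(tm - ?) LIMIT 1',
    undef, 1, '>3841', 58.289789463114,
);
print "$seq\n";   # AGTAGTTCATAATAAAGAGGAGGCTGGT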

      Thanks for your help, it's most appreciated. I understand the first block of code, but I'm not sure I fully understand the second block in the while (<$fh3>) loop. Could you explain it in more detail?

      I have been playing around with the script and Data::Dumper and I get an error message:

      Use of uninitialized value in hash element at combine_primer_lists.pl line 41, <$fh3> line 1.

      Can't use an undefined value as an ARRAY reference at combine_primer_lists.pl line 46, <$fh3> line 1.

      #!/usr/bin/perl
      #22/06/2011
      # Usage: perl combine_primer_lists.pl
      use strict;
      use warnings;
      use File::Slurp;
      use Data::Dumper;

      my @data;
      my @col;
      my %dataset;

      #open my $fh1, "<Primer-For1" or die $!;
      #open my $fh2, "<Primer-For2" or die $!;
      open (my $fh3, '<', "Primer-Rev1") or die $!;
      open my $outfh, '>', "outputfile.txt" or die $!;

      for my $nr (1..2) {
          for my $line (read_file('Primer-For'.$nr)) {
              my @col = split(/\t/, $line);
              push @{ $data[$nr - 1]->{ shift(@col) } }, \@col;
              #print Dumper(\$nr);
              #print Dumper(\@col);
          }
      }

      while (<$fh3>) {
          #print Dumper(\$fh3);
          # expecting file3 line in @col
          my @results = ($col[0], $col[3]);
          for my $dataset (@data) {
              #print Dumper(\@data)
              push @results, (sort {
                  #print Dumper(\@results)
                  my $diff_a = $col[2] - $a->[1];
                  $diff_a *= -1 if $diff_a < 0;
                  my $diff_b = $col[2] - $b->[1];
                  $diff_b *= -1 if $diff_b < 0;
                  $diff_a <=> $diff_b;
              } @{ $dataset->{ $col[0] } })[0]->[2];
          }
      }

      Hmmm... still struggling with this one and getting an error:

      Can't use an undefined value as an ARRAY reference at combine_primer_lists.pl line 42, <INFILE> line 1.

      relating to this line:

          } @{$dataset->{$col[0]}})[0]->[2];

      Could you explain what this line is doing?

      #!/usr/bin/perl
      #22/06/2011
      use strict;
      use warnings;
      use File::Slurp;
      use Data::Dumper;

      my @data;
      my @col;
      my @dataset;
      my $a;
      my $b;
      my @fields;
      my %out;

      open INFILE, "<Primer-Rev1" or die $!;
      open my $outfh, '>', "outputfile.txt" or die $!;

      for my $nr (1..2) {
          for my $line (read_file('Primer-For'.$nr)) {
              my @col = split(/\t/, $line);
              push @{ $data[$nr - 1]->{ shift(@col) } }, \@col;
          }
      }

      while (<INFILE>) {
          @col = split(/\t+/, $_);
          chomp(@col);
          my ($header, $length, $tm, $sequence) = @col[0..3];
          # expecting file3 line in @col
          my @results = ($col[0], $col[3]);
          for my $dataset (@data) {
              push @results, (sort {
                  my $diff_a = $col[2] - $a->[1];
                  $diff_a *= -1 if $diff_a < 0;
                  my $diff_b = $col[2] - $b->[1];
                  $diff_b *= -1 if $diff_b < 0;
                  $diff_a <=> $diff_b;
              } @{ $dataset->{ $col[0] } })[0]->[2];
          }
      }

        That is easy, the line is confusing :)

        It doesn't help that it also dereferences the hashref $dataset in the middle of the expression.

        Run through perltidy:

        my @results = ( $col[0], $col[3] );
        for my $dataset (@data) {
            push @results, (
                sort {
                    my $diff_a = $col[2] - $a->[1];
                    $diff_a *= -1 if $diff_a < 0;
                    my $diff_b = $col[2] - $b->[1];
                    $diff_b *= -1 if $diff_b < 0;
                    $diff_a <=> $diff_b;
                } @{ $dataset->{ $col[0] } }
            )[0]->[2];
        }
        is like
        my @results = ( $col[0], $col[3] );
        for my $dataset (@data) {
            my ($first) = sort {
                my $diff_a = $col[2] - $a->[1];
                $diff_a *= -1 if $diff_a < 0;
                my $diff_b = $col[2] - $b->[1];
                $diff_b *= -1 if $diff_b < 0;
                $diff_a <=> $diff_b;
            } @{ $dataset->{ $col[0] } };
            push @results, $first->[2];
        }
        is like
        my @results = ( $col[0], $col[3] );
        for my $dataset (@data) {
            my @beef = @{ $dataset->{ $col[0] } };
            @beef = sort {
                my $diff_a = $col[2] - $a->[1];
                $diff_a *= -1 if $diff_a < 0;
                my $diff_b = $col[2] - $b->[1];
                $diff_b *= -1 if $diff_b < 0;
                $diff_a <=> $diff_b;
            } @beef;
            push @results, $beef[0]->[2];
        }

        $a and $b are globals; they're how sort works.
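        A tiny example, unrelated to the primer data:

```perl
use strict;
use warnings;

# sort aliases each pair of elements being compared to the package
# globals $a and $b; the block returns a negative/zero/positive value
# (here via the numeric comparison operator <=>) to order them.
my @nums   = (10, 2, 33);
my @sorted = sort { $a <=> $b } @nums;
print "@sorted\n";   # 2 10 33
```

        This is also why declaring my $a; my $b; (as in the script above) is a bad idea: the lexicals shadow the globals that sort actually sets.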

        Is it still confusing?

Re: Combining 3 files
by Anonymous Monk on Jun 23, 2011 at 09:12 UTC