in reply to Combining 3 files

First of all, please post your code together with your problem. The Monks usually help fix problems; they don't write your scripts for you.

Depending on the amount of data, you may use a hash to preload the data:

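use File::Slurp;

my @data;
for my $nr (1..2) {
    for my $line (read_file('file'.$nr)) {
        chomp $line;    # drop the newline so the last column stays clean
        my @cols = split(/\t/,$line);
        # first column is the key, the remaining columns form one row
        push @{$data[$nr - 1]->{shift(@cols)}},\@cols;
    }
}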

This code reads both file1 and file2 (outer for loop) line by line (inner for loop), splits the lines into columns and stores the data in a tree referenced by file number and contig*-key (first column). It uses read_file from File::Slurp, and I suggest that you inspect the resulting tree with Data::Dumper.
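To make the shape of that tree concrete, here is a minimal sketch that builds it from in-memory lines instead of files and dumps it. The sample rows and the column layout (key, length, tm, sequence) are assumptions for illustration only; your real files may differ.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample rows standing in for file1 and file2.
# Assumed layout: key <TAB> length <TAB> tm <TAB> sequence
my @files = (
    [ "contig1\t20\t58.1\tACGT\n", "contig1\t22\t60.4\tTTGA\n" ],
    [ "contig1\t21\t59.0\tGGCC\n" ],
);

my @data;
for my $nr ( 0 .. $#files ) {
    for my $line ( @{ $files[$nr] } ) {
        chomp( my $copy = $line );      # work on a copy, strip the newline
        my @cols = split /\t/, $copy;
        my $key  = shift @cols;         # first column becomes the hash key
        push @{ $data[$nr]{$key} }, \@cols;
    }
}

# One hash per file; each key maps to a list of rows (array refs).
$Data::Dumper::Indent   = 1;
$Data::Dumper::Sortkeys = 1;
print Dumper( \@data );
```

Each element of @data is a hash reference for one file, and each key holds a list of the remaining columns for every line that shared that key.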

Next, compare it to your third file. This assumes you have already read each line of that file and split it into @col, using either a while (<$fh>) loop or read_file:

# expecting file3 line in @col
my @results = ($col[0],$col[3]);
for my $dataset (@data) {
    push @results,(sort {
        my $diff_a = $col[2] - $a->[1];
        $diff_a *= -1 if $diff_a < 0;
        my $diff_b = $col[2] - $b->[1];
        $diff_b *= -1 if $diff_b < 0;
        $diff_a <=> $diff_b;
    } @{$dataset->{$col[0]}})[0]->[2];
}

This block sorts each preloaded data set by the absolute difference between its rows and the comparison value of the current file3 row, then adds the alpha-key of the closest row to the @results list, which has been preloaded with the key and the file3 alpha string.
You could easily print out the @results data tab-delimited using join().
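The same selection can be written with a small helper, which may be easier to read and also copes with a key that is missing from one of the files. This is only a sketch: closest_row is a hypothetical helper name, and the sample data and column layout (key, length, tm, sequence) are assumed for illustration.

```perl
use strict;
use warnings;

# Hypothetical preloaded data, shaped like the @data built by the preload
# loop: one hash per file, key => list of [length, tm, sequence] rows.
my @data = (
    { contig1 => [ [ 20, 58.1, 'ACGT' ], [ 22, 60.4, 'TTGA' ] ] },
    { contig1 => [ [ 21, 59.0, 'GGCC' ] ] },
);

# Return the row whose value at index 1 is closest to $target,
# or undef when there are no rows for this key.
sub closest_row {
    my ( $target, $rows ) = @_;
    return unless $rows && @$rows;
    # sort sets the package globals $a and $b for every comparison,
    # so they must not be declared with "my".
    my ($best) =
      sort { abs( $target - $a->[1] ) <=> abs( $target - $b->[1] ) } @$rows;
    return $best;
}

# Hypothetical file3 line: key, length, tm, sequence.
my @col = split /\t/, "contig1\t19\t60.0\tCCAA";

my @results = ( $col[0], $col[3] );
for my $dataset (@data) {
    my $best = closest_row( $col[2], $dataset->{ $col[0] } );
    # Use an empty field when this file has no entry for the key.
    push @results, defined $best ? $best->[2] : '';
}

my $out = join( "\t", @results );
print "$out\n";
```

Checking for a missing key before dereferencing avoids the "Can't use an undefined value as an ARRAY reference" error that strict raises when @{ ... } is applied to an undefined hash entry.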

This is not a complete script, but the code samples should give you an idea of how to handle your data. Merging the results should now be easy.

If you have too much data to reasonably load into memory, think about using a database (maybe SQLite) to handle the problem; it might work better than a pure Perl solution.

Re^2: Combining 3 files
by garyboyd (Acolyte) on Jun 23, 2011 at 14:52 UTC

    Thanks for your help, it's most appreciated. I understand the first block of code, but I'm not sure I fully understand the second block in the while (<$fh3>) loop. Could you explain in more detail?

    I have been playing around with the script and Data::Dumper and I get an error message:

    Use of uninitialized value in hash element at combine_primer_lists.pl line 41, <$fh3> line 1.

    Can't use an undefined value as an ARRAY reference at combine_primer_lists.pl line 46, <$fh3> line 1.

    #!/usr/bin/perl
    #22/06/2011
    # Usage: perl combine_primer_lists.pl
    use strict;
    use warnings;
    use File::Slurp;
    use Data::Dumper;

    my @data;
    my @col;
    my %dataset;

    #open my $fh1, "<Primer-For1" or die $!;
    #open my $fh2, "<Primer-For2" or die $!;
    open (my $fh3, '<', "Primer-Rev1") or die $!;
    open my $outfh, '>', "outputfile.txt" or die $!;

    for my $nr (1..2) {
        for my $line (read_file('Primer-For'.$nr)) {
            my @col = split(/\t/,$line);
            push @{$data[$nr - 1]->{shift(@col)}},\@col;
            #print Dumper(\$nr);
            #print Dumper (\@col);
        }
    }

    while (<$fh3>){
        #print Dumper (\$fh3);
        # expecting file3 line in @col
        my @results = ($col[0],$col[3]);
        for my $dataset (@data) {
            #print Dumper (\@data)
            push @results,(sort {
                #print Dumper (\@results)
                my $diff_a = $col[2] - $a->[1];
                $diff_a *= -1 if $diff_a < 0;
                my $diff_b = $col[2] - $b->[1];
                $diff_b *= -1 if $diff_b < 0;
                $diff_a <=> $diff_b;
            } @{$dataset->{$col[0]}})[0]->[2];
        }
    }
Re^2: Combining 3 files
by garyboyd (Acolyte) on Jun 24, 2011 at 08:22 UTC

    Hmmm... still struggling with this one and getting an error:

    Can't use an undefined value as an ARRAY reference at combine_primer_lists.pl line 42, <INFILE> line 1.

    relating to this line:

        } @{$dataset->{$col[0]}})[0]->[2];

    Could you explain what this line is doing?

    #!/usr/bin/perl
    #22/06/2011
    use strict;
    use warnings;
    use File::Slurp;
    use Data::Dumper;

    my @data;
    my @col;
    my @dataset;
    my $a;
    my $b;
    my @fields;
    my %out;

    open INFILE, "<Primer-Rev1" or die $!;
    open my $outfh, '>', "outputfile.txt" or die $!;

    for my $nr (1..2) {
        for my $line (read_file('Primer-For'.$nr)) {
            my @col = split(/\t/,$line);
            push @{$data[$nr - 1]->{shift(@col)}},\@col;
        }
    }

    while (<INFILE>){
        @col = split(/\t+/, $_);
        chomp (@col);
        my ($header, $length, $tm, $sequence) = @col[0..3];
        # expecting file3 line in @col
        my @results = ($col[0],$col[3]);
        for my $dataset (@data) {
            push @results,(sort {
                my $diff_a = $col[2] - $a->[1];
                $diff_a *= -1 if $diff_a < 0;
                my $diff_b = $col[2] - $b->[1];
                $diff_b *= -1 if $diff_b < 0;
                $diff_a <=> $diff_b;
            } @{$dataset->{$col[0]}})[0]->[2];
        }
    }

      That is easy, the line is confusing :)

      It is also confusing $dataset for a hashref

      perltidy

      my @results = ( $col[0], $col[3] );
      for my $dataset (@data) {
          push @results, (
              sort {
                  my $diff_a = $col[2] - $a->[1];
                  $diff_a *= -1 if $diff_a < 0;
                  my $diff_b = $col[2] - $b->[1];
                  $diff_b *= -1 if $diff_b < 0;
                  $diff_a <=> $diff_b;
              } @{ $dataset->{ $col[0] } }
          )[0]->[2];
      }

      is like

      my @results = ( $col[0], $col[3] );
      for my $dataset (@data) {
          my ( $first ) = sort {
              my $diff_a = $col[2] - $a->[1];
              $diff_a *= -1 if $diff_a < 0;
              my $diff_b = $col[2] - $b->[1];
              $diff_b *= -1 if $diff_b < 0;
              $diff_a <=> $diff_b;
          } @{ $dataset->{ $col[0] } };
          push @results, $first->[2];
      }

      is like

      my @results = ( $col[0], $col[3] );
      for my $dataset (@data) {
          my @beef = @{ $dataset->{ $col[0] } };
          @beef = sort {
              my $diff_a = $col[2] - $a->[1];
              $diff_a *= -1 if $diff_a < 0;
              my $diff_b = $col[2] - $b->[1];
              $diff_b *= -1 if $diff_b < 0;
              $diff_a <=> $diff_b;
          } @beef;
          push @results, $beef[0]->[2];
      }

      $a and $b are package globals that sort sets for each comparison; they're how sort works, so don't declare them with my as your script does.

      Is it still confusing?

        Hi anonymous monk, yes I'm still finding this confusing. You mentioned it is confusing $dataset for a hashref. I've been playing around with this for a while and just keep getting syntax errors.