mulder4786 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have two files. File 1 has 3 fields and many rows, and looks like:
1 19002930 0.74
1 19002931 -0.12
And file 2 has 10,002 fields and many rows and looks like:
1 19002930 0.84 0.12 0.94 ...
1 19002931 0 -.20 .12 ...
I would like to compare each field of file 2, starting with field 3, to field 3 of file 1. If fields 1 and 2 match between the two files, and the value in file 2 is greater than or equal to field 3 from file 1, I would like to push that value onto an array. The output would look like:
1 19002930 0.84 0.94 ...
1 19002931 0 .12
I have an R script that does this, but R takes way too long. Would someone be able to help me out?

Replies are listed 'Best First'.
Re: extract values from file if greater than value
by Cristoforo (Curate) on Jun 15, 2016 at 18:18 UTC
    Not knowing whether the fields may repeat in either file, as Anonymous Monk asked, I can only guess at a solution. The best approach is probably a lookup hash (as in my solution below), provided the first 2 fields in file 1 don't repeat.
    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh1, '<', \<<EOF or die $!;
    1 19002930 0.74
    1 19002931 -0.12
    EOF

    my %data;
    while (<$fh1>) {
        next if /CHR/;
        my ($chr, $bp, $min) = split;
        $data{$chr,$bp} = $min;
    }

    open my $fh2, '<', \<<EOF or die $!;
    1 19002930 0.84 0.12 0.94
    1 19002931 0 -.20 .12
    EOF

    while (<$fh2>) {
        my ($chr, $bp, @rest) = split;
        if (defined(my $min = $data{$chr,$bp})) {
            print join(" ", $chr, $bp, grep $_ >= $min, @rest), "\n";
        }
    }
    This only prints the values - not sure how you want to store in an array (as you mentioned).
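    If the values need to be kept rather than printed, one option is to collect them in a hash of arrays keyed on the two ID fields. This is only a sketch, assuming the same whitespace-separated layout as the sample data; filter_rows and the in-memory sample strings are illustrative names, not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: takes two open filehandles and returns a reference
# to a hash of arrays, keyed on "chr bp", holding the file 2 values that
# are >= the file 1 minimum for that ID.
sub filter_rows {
    my ($fh1, $fh2) = @_;

    my %min;
    while (my $line = <$fh1>) {
        my ($chr, $bp, $min) = split ' ', $line;
        next unless defined $min;          # skip blank or short lines
        $min{"$chr $bp"} = $min;
    }

    my %kept;
    while (my $line = <$fh2>) {
        my ($chr, $bp, @rest) = split ' ', $line;
        next unless defined $bp;           # skip blank or short lines
        my $min = $min{"$chr $bp"};
        next unless defined $min;          # ID not present in file 1
        $kept{"$chr $bp"} = [ grep { $_ >= $min } @rest ];
    }
    return \%kept;
}

# Sample data from the thread, opened as in-memory files.
my $file1 = "1 19002930 0.74\n1 19002931 -0.12\n";
my $file2 = "1 19002930 0.84 0.12 0.94\n1 19002931 0 -.20 .12\n";
open my $fh1, '<', \$file1 or die $!;
open my $fh2, '<', \$file2 or die $!;

my $kept = filter_rows($fh1, $fh2);
for my $id (sort keys %$kept) {
    print "$id @{ $kept->{$id} }\n";
}
```

    Each array can then be used directly, for example scalar @{ $kept->{'1 19002930'} } gives the number of values that passed the threshold for that ID.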

    Update: Added the 'defined' operator to the if statement so a 'min' value of '0' will be accepted. Without testing for defined, a '0' value would cause the if statement to be wrongly false.
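    The difference the 'defined' test makes can be seen in isolation. A small sketch; the %min hash and the two helper names are made up for illustration:

```perl
use strict;
use warnings;

# Illustrative lookup table: one ID has a minimum of 0, a third ID is absent.
my %min = ('1 19002931' => 0, '1 19002930' => 0.74);

# Truthiness test: treats a stored 0 the same as a missing key.
sub found_by_truth   { my ($id) = @_; return $min{$id} ? 1 : 0 }

# defined test: only a missing key (or an undef value) fails.
sub found_by_defined { my ($id) = @_; return defined $min{$id} ? 1 : 0 }

for my $id ('1 19002930', '1 19002931', '1 19002932') {
    printf "%-12s truth=%d defined=%d\n",
        $id, found_by_truth($id), found_by_defined($id);
}
```

    For the ID whose minimum is 0, only the defined test reports a match; for the ID that is not in the table, both tests fail.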

    Also, as in Marshall's solution, the temporary files I created were just for this example. You would need to open your files in the normal way: open my $fh1, '<', 'yourfilename' or die $!;

    Update 2: Noting Marshall's comment that separating the parts of the key with a space, $data{"$chr $bp"}, is safer than $data{"$chr$bp"}: I used the seldom-used idiom $data{$chr,$bp}, where a comma-separated list of terms used as a hash key is joined together by $SUBSCRIPT_SEPARATOR ($;).
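    To make the idiom concrete, here is a small sketch of what the comma subscript does behind the scenes. The variable names $key, $sep and @parts are only for illustration:

```perl
use strict;
use warnings;

my %data;
my ($chr, $bp) = (1, 19002930);

# A comma-separated subscript is joined with $; (default "\034").
$data{$chr,$bp} = 0.74;

# The same element can be reached by building the key explicitly.
my $key = join($;, $chr, $bp);
print "same element\n" if exists $data{$key};

# The parts can be recovered by splitting on the separator.
my $sep     = $;;                 # copy of $; for use in the regex
my ($first) = keys %data;
my @parts   = split /\Q$sep\E/, $first;
print "parts: @parts\n";
```

    Because the default separator is a control character that is unlikely to appear in text data, keys built this way do not run together the way plain concatenation can.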

    Also, I'm wondering what the purpose of next if /CHR/; is in mulder4786's code. It is hard to tell without a better data sample for the file he is reading.

      I didn't see your post++ before posting my own revised solution according to the changing requirement specs! I encourage posters to be as clear as possible on the requirements - that makes a big difference! If the code doesn't work and the requirements don't either, then that is a mess!

      A very small nit about $data{"$chr$bp"} = $min; : I added a space between the values, "$chr $bp", to prevent possible collisions between two different pairs.
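      A quick sketch of the kind of collision meant here. The second pair is a made-up value chosen so that the concatenated keys coincide; it does not come from the poster's data:

```perl
use strict;
use warnings;

# Two different (chr, bp) pairs, chosen to collide when concatenated.
my @pairs = ([1, 19002930], [11, 9002930]);

my (%glued, %spaced);
for my $pair (@pairs) {
    my ($chr, $bp) = @$pair;
    $glued{"$chr$bp"}   = 1;   # "119002930" for both pairs
    $spaced{"$chr $bp"} = 1;   # "1 19002930" and "11 9002930"
}

printf "glued keys: %d, spaced keys: %d\n",
    scalar(keys %glued), scalar(keys %spaced);
```

      With concatenation the second pair silently overwrites the first, so only one key survives; with the space both pairs keep their own entry.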

      Update: Cristoforo is right about this seldom used hash key idiom with the commas for hash keys. I also wondered about /CHR/.

Re: extract values from file if greater than value
by mulder4786 (Novice) on Jun 15, 2016 at 16:22 UTC
    I started to write this script, but am stuck:
    #! perl -w
    use strict;
    use warnings;

    my @fst;
    open (my $fst_in, "<", "file1") or die $!;
    while (<$fst_in>){
        my ( $chr, $bp, $fst ) = split;
        next if m/CHR/;
        push @fst, [$chr, $bp, $fst];
    }
    close $fst_in;

    my$F=shift@ARGV;
    open IN, "$F";
    while (<IN>){
        my@L=split;
        if($L[0]=~ /^$fst->[0]$/ and $L[1]=~ /^$fst->[1]$/ ){
            if( # compare every field starting at field 3
Re: extract values from file if greater than value
by Anonymous Monk on Jun 15, 2016 at 17:14 UTC
    It sounds like the first two fields form a kind of ID. Can you count on the IDs in the two files being in the same order? Can the same ID appear more than once in either file?
      Yes, they do form a unique ID, but no, I cannot count on them to be in the same order. In fact, the second file will be a subset of the first file in terms of IDs.
        So I see the requirements are becoming more refined. Ok, now...a re-write of my previous post...
        #!/usr/bin/perl
        use warnings;
        use strict;

        # this uses a "trick" to open an in memory file
        # like a file on the disk for testing purposes

        my $file1 =<<END;
        1 19002930 0.74
        1 19002931 -0.12
        END

        my $file2 =<<END;
        1 19002930 0.84 0.12 0.94
        1 19002931 0 -.20 .12
        END

        open (my $fh1, '<', \$file1) or die "$!";
        open (my $fh2, '<', \$file2) or die "$!";

        my %hash;

        #generate hash table from file 1
        while (my $line=<$fh1>)
        {
            my ($Col1, $Col2, $Col3) = (split ' ', $line);
            $hash{"$Col1 $Col2"} = $Col3;
        }

        #if col1,2 from file 2 match, then output
        #col1,2 and all fields >= the col3 field from file1
        while (my $line=<$fh2>)
        {
            my ($Col1, $Col2, @file2rest) = split ' ', $line;
            if (defined $hash{"$Col1 $Col2"})
            {
                print "$Col1 $Col2 ";
                print join" ", grep{ $_>=$hash{"$Col1 $Col2"}}@file2rest;
                print "\n";
            }
        }
        __END__
        prints:
        1 19002930 0.84 0.94
        1 19002931 0 .12
Re: extract values from file if greater than value
by Marshall (Canon) on Jun 15, 2016 at 17:57 UTC
    I don't understand your requirements statement. You have file1 with 3 fields and file2 with 10K+ fields. The output winds up with, I guess, either 10K fields or 4 fields. Can you explain the requirements again?
      The output can be anywhere from 3 to 10k fields, depending on how many of the columns for that row meet the requirements of being greater than or equal to the third column of file 1
        Ok here is something to consider that produces your output to the best of my understanding at the moment:
        #!/usr/bin/perl
        use warnings;
        use strict;

        # this uses a "trick" to open an in memory file
        # like a file on the disk for testing purposes

        my $file1 =<<END;
        1 19002930 0.74
        1 19002931 -0.12
        END

        my $file2 =<<END;
        1 19002930 0.84 0.12 0.94
        1 19002931 0 -.20 .12
        END

        open (my $fh1, '<', \$file1) or die "$!";
        open (my $fh2, '<', \$file2) or die "$!";

        my $lineFile1;
        my $lineFile2;

        # compare line by line of both files
        # stop if either file "runs out of lines"
        # assumes that say: line 232 of file 1 goes with line 232 of file 2
        while (defined ($lineFile1=<$fh1>) and defined ($lineFile2=<$fh2>))
        {
            my ($line1Col1, $line1Col2, $line1Col3) = split ' ', $lineFile1;
            my ($line2Col1, $line2Col2, @file2rest) = split ' ', $lineFile2;

            if ($line1Col1 == $line2Col1 and $line1Col2 == $line2Col2)
            {
                print "$line1Col1 $line1Col2 ";
                print join" ", grep{ $_>=$line1Col3 }@file2rest;
                print "\n";
            }
            else
            {
                # col1 and col2 didn't match, so we do nothing
                # you can delete this else clause entirely
            }
        }
        __END__
        prints:
        1 19002930 0.84 0.94
        1 19002931 0 .12
        This could be written shorter, but I think that this is what you want, and it will run very quickly (compared with R).