extract values from file if greater than value

mulder4786 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: extract values from file if greater than value by Cristoforo (Curate) on Jun 15, 2016 at 18:18 UTC
Not knowing whether the fields may repeat in either file, as Anonymous Monk asked, I can only guess at a solution. The best approach is probably a lookup hash (as in my solution below), provided the first two fields in file 1 don't repeat.

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh1, '<', \<<EOF or die $!;
    1 19002930 0.74
    1 19002931 -0.12
    EOF

    my %data;
    while (<$fh1>) {
        next if /CHR/;
        my ($chr, $bp, $min) = split;
        $data{$chr,$bp} = $min;
    }

    open my $fh2, '<', \<<EOF or die $!;
    1 19002930 0.84 0.12 0.94
    1 19002931 0 -.20 .12
    EOF

    while (<$fh2>) {
        my ($chr, $bp, @rest) = split;
        if (defined(my $min = $data{$chr,$bp})) {
            print join(" ", $chr, $bp, grep $_ >= $min, @rest), "\n";
        }
    }

This only prints the values; I'm not sure how you want to store them in an array (as you mentioned).

Update: Added the `defined` operator to the if statement so that a 'min' value of '0' will be accepted. Without the test for defined, a '0' value would make the if statement wrongly false. Also, as in Marshall's solution, the in-memory files I created are just for this example. You would need to open your files in the normal way: `open my $fh1, '<', 'yourfilename' or die $!`.

Update 2: Noting Marshall's comment on separating `$data{"$chr$bp"}` with a space to be safer, `$data{"$chr $bp"}`, I used the seldom-used idiom `$data{$chr,$bp}`, where a comma-separated list of terms used as a hash key is joined together by the `$SUBSCRIPT_SEPARATOR`, `$;`. Also, I'm wondering what the purpose of `next if /CHR/;` is in the original poster's code. It is hard to tell without a better data sample for the file being read.
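The two updates above can be illustrated with a minimal sketch. It assumes nothing beyond core Perl, and the sample ID and the stored minimum of `0` are invented for illustration. It shows that a comma-separated subscript produces the same key as an explicit `join` on `$;`, and that a minimum of `0` fails a plain truth test but passes `defined`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %data;
my ($chr, $bp) = (1, 19002931);    # illustrative ID, not real data

# A comma-separated subscript is joined with $; (default "\034") into one key.
$data{$chr, $bp} = 0;
my $same_key = exists $data{ join($;, $chr, $bp) };
print "same key: ", ($same_key ? "yes" : "no"), "\n";

# A stored minimum of 0 is false in boolean context but still defined.
my $min = $data{$chr, $bp};
print "truth test passes:   ", ($min ? "yes" : "no"), "\n";
print "defined test passes: ", (defined $min ? "yes" : "no"), "\n";
```

This is why the lookup is wrapped in `defined(...)` rather than tested directly: a row whose minimum is exactly `0` would otherwise be silently skipped.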
Re^2: extract values from file if greater than value by Marshall (Canon) on Jun 16, 2016 at 04:23 UTC
I didn't see your post++ before posting my own revised solution according to the changing requirement specs! I encourage posters to be as clear as possible about the requirements; that makes a big difference! If the code doesn't work and the requirements aren't clear either, then that is a mess! A very small nit: in `$data{"$chr$bp"} = $min;`, I added a space between the values, `"$chr $bp"`, to prevent possible collisions between these two things. Update: Cristoforo is right about this seldom-used idiom of commas in hash keys. I also wondered about `/CHR/`.
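A small sketch of the collision being guarded against, using two made-up ID pairs that differ but concatenate to the same string. The hash names `%no_sep` and `%with_space` are hypothetical and exist only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two different (chr, bp) pairs; these values are invented for illustration.
my @ids = ([1, 19002930], [11, 9002930]);

my (%no_sep, %with_space);
for my $id (@ids) {
    my ($chr, $bp) = @$id;
    $no_sep{"$chr$bp"}++;         # both pairs become "119002930"
    $with_space{"$chr $bp"}++;    # "1 19002930" and "11 9002930"
}

printf "keys without separator: %d\n", scalar keys %no_sep;
printf "keys with a space:      %d\n", scalar keys %with_space;
```

Any separator that cannot occur inside the fields works; a space is safe here because the fields themselves come from a whitespace split.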
Re: extract values from file if greater than value by mulder4786 (Novice) on Jun 15, 2016 at 16:22 UTC
I started to write this script, but am stuck:

    #! perl -w
    use strict;
    use warnings;

    my @fst;
    open (my $fst_in, "<", "file1") or die $!;
    while (<$fst_in>){
        my ( $chr, $bp, $fst ) = split;
        next if m/CHR/;
        push @fst, [$chr, $bp, $fst];
    }
    close $fst_in;

    my$F=shift@ARGV;
    open IN, "$F";
    while (<IN>){
        my@L=split;
        if($L[0]=~ /^$fst->[0]$/ and $L[1]=~ /^$fst->[1]$/ ){
            if( # compare every field starting at field 3
Re: extract values from file if greater than value by Anonymous Monk on Jun 15, 2016 at 17:14 UTC
It sounds like the first two fields form a kind of ID. Can you count on the IDs in the two files being in the same order? Can the same ID appear more than once in either file?
Re^2: extract values from file if greater than value by mulder4786 (Novice) on Jun 15, 2016 at 18:25 UTC
Yes, they do form a unique ID, but no, I cannot count on them being in the same order. In fact, the second file will be a subset of the first file in terms of ID.
Re^3: extract values from file if greater than value by Marshall (Canon) on Jun 16, 2016 at 03:26 UTC
So I see the requirements are becoming more refined. Ok, now...a re-write of my previous post...

    #!/usr/bin/perl
    use warnings;
    use strict;

    # this uses a "trick" to open an in memory file
    # like a file on the disk for testing purposes

    my $file1 =<<END;
    1 19002930 0.74
    1 19002931 -0.12
    END

    my $file2 =<<END;
    1 19002930 0.84 0.12 0.94
    1 19002931 0 -.20 .12
    END

    open (my $fh1, '<', \$file1) or die "$!";
    open (my $fh2, '<', \$file2) or die "$!";

    my %hash;

    #generate hash table from file 1
    while (my $line=<$fh1>) {
        my ($Col1, $Col2, $Col3) = (split ' ', $line);
        $hash{"$Col1 $Col2"} = $Col3;
    }

    #if col1,2 from file 2 match, then output
    #col1,2 and all fields >= the col3 field from file1
    while (my $line=<$fh2>) {
        my ($Col1, $Col2, @file2rest) = split ' ', $line;
        if (defined $hash{"$Col1 $Col2"}) {
            print "$Col1 $Col2 ";
            print join" ", grep{ $_>=$hash{"$Col1 $Col2"}}@file2rest;
            print "\n";
        }
    }
    __END__
    prints:
    1 19002930 0.84 0.94
    1 19002931 0 .12
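If the filtered values need to be kept in an array rather than printed, the same hash lookup can be wrapped in a helper. The following is only a sketch under that assumption: `filter_rows` and `%min_for` are hypothetical names, the sample data is the thread's example with the rows of file 2 deliberately reversed, and the lookup does not depend on the two files being in the same order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one array ref per matching row of file 2,
# holding the two ID fields followed by every value >= that row's minimum.
sub filter_rows {
    my ($fh1, $fh2) = @_;

    # Build the lookup from file 1: "chr bp" => minimum.
    my %min_for;
    while (my $line = <$fh1>) {
        my ($chr, $bp, $min) = split ' ', $line;
        next unless defined $min;          # skip blank or short lines
        $min_for{"$chr $bp"} = $min;
    }

    my @rows;
    while (my $line = <$fh2>) {
        my ($chr, $bp, @rest) = split ' ', $line;
        next unless defined $bp;           # skip blank or short lines
        my $min = $min_for{"$chr $bp"};
        next unless defined $min;          # ID not present in file 1
        push @rows, [ $chr, $bp, grep { $_ >= $min } @rest ];
    }
    return @rows;
}

# In-memory sample files; file 2 is in a different order than file 1.
my $file1 = "1 19002930 0.74\n1 19002931 -0.12\n";
my $file2 = "1 19002931 0 -.20 .12\n1 19002930 0.84 0.12 0.94\n";

open my $fh1, '<', \$file1 or die $!;
open my $fh2, '<', \$file2 or die $!;

my @rows = filter_rows($fh1, $fh2);
print join(' ', @$_), "\n" for @rows;
```

Each element of `@rows` is an array reference, so the kept values for a given row are `@{$row}[2 .. $#$row]`.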
Re: extract values from file if greater than value by Marshall (Canon) on Jun 15, 2016 at 17:57 UTC
I don't understand your requirements statement. You have file1 with 3 fields and file2 with 10K+ fields. The output winds up with either I guess 10K fields or 4 fields. Can you explain the requirements again?
Re^2: extract values from file if greater than value by mulder4786 (Novice) on Jun 15, 2016 at 18:28 UTC
The output can be anywhere from 3 to 10k fields, depending on how many of the columns for that row meet the requirement of being greater than or equal to the third column of file 1.
Re^3: extract values from file if greater than value by Marshall (Canon) on Jun 15, 2016 at 19:43 UTC
Ok, here is something to consider that produces your output, to the best of my understanding at the moment:

    #!/usr/bin/perl
    use warnings;
    use strict;

    # this uses a "trick" to open an in memory file
    # like a file on the disk for testing purposes

    my $file1 =<<END;
    1 19002930 0.74
    1 19002931 -0.12
    END

    my $file2 =<<END;
    1 19002930 0.84 0.12 0.94
    1 19002931 0 -.20 .12
    END

    open (my $fh1, '<', \$file1) or die "$!";
    open (my $fh2, '<', \$file2) or die "$!";

    my $lineFile1;
    my $lineFile2;

    # compare line by line of both files
    # stop if either file "runs out of lines"
    # assumes that say: line 232 of file 1 goes with line 232 of file 2
    while (defined ($lineFile1=<$fh1>) and defined ($lineFile2=<$fh2>)) {
        my ($line1Col1, $line1Col2, $line1Col3) = split ' ', $lineFile1;
        my ($line2Col1, $line2Col2, @file2rest) = split ' ', $lineFile2;

        if ($line1Col1 == $line2Col1 and $line1Col2 == $line2Col2) {
            print "$line1Col1 $line1Col2 ";
            print join" ", grep{ $_>=$line1Col3 }@file2rest;
            print "\n";
        }
        else {
            # col1 and col2 didn't match, so we do nothing
            # you can delete this else clause entirely
        }
    }
    __END__
    prints:
    1 19002930 0.84 0.94
    1 19002931 0 .12

This could be written shorter, but I think that this is what you want, and it will run very quickly (compared with R).