Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv?

pickleswarlz has asked for the wisdom of the Perl Monks concerning the following question:

I am pretty new to Perl, but have what I think is a complex problem that is way beyond my abilities.

I have one .csv file that contains the locations of genes in the genome that looks like this:

Chromosome    Start      End 

chr1          220       250
chr1          259       298
chr2          225       263
chr3          310       344
[...]
[download]

where the first column indicates the chromosome number that the locations map to, second column is the start position of the gene in the genome, and the third column is the end position of the gene in the genome.

I have a second .csv file that gives a set of other locations in the genome with identification numbers:

ID#    position

ID1    214
ID2    222
ID3    258

[...]
[download]

where the first column is the identifier (alphanumeric), and the second column is the position in the genome.

What I would like to do is find the two locations from the second file that are closest to either side of location given in the first file, and output these to a new file. For example, for this line in .csv file 1:

chr1 220 250

search CSV file 2 for the two closest values on either side of 220 AND the two closest values on either side of 250, given the following output .csv file:

Chromosome  Start  End  Start(-)  Start(+)  End (-) End (+)

chr1        220    250   214      222       222     258
[download]

I'm a Windows user, and relatively competent with the command line. I'll have to run this Perl script on multiple files, so I'd love to be able to specify the input and output files in the command line as I go rather than specifying them in the script itself. The files containing the ID information is huge, around 500,000 lines, so I'm wondering if it might be better to transform that file into a hash first.

Thank you!

Comment on Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv? Select or Download Code

Replies are listed 'Best First'.
Re: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv? by choroba (Cardinal) on Feb 23, 2013 at 01:44 UTC
No need for a hash, an array should be enough. Try the following: #!/usr/bin/perl use warnings; use strict; open my $GENES, '<', 'genes' or die $!; open my $LOCATIONS, '<', 'locations' or die $!; chomp(my @locations = map { (split ' ')[1] } <$LOCATIONS>); # If IDs are not already sorted, uncomment the following line: # @locations = sort { $a <=> $b } @locations; for (<$GENES>) { my ($chromosome, $start, $end) = split ' '; print "$chromosome\t$start\t$end"; my $idx = 0; # For $end, start searching where you left for + $start. my $correction = 0; # Needed for Start(-) == Start and End(+) == E +nd. for my $pos ($start, $end) { $idx++ while $locations[$idx] <= $pos - $correction and $idx <= $#locations; die "No numbers around $pos ($idx)\n" if $idx == 0 or $idx > $#locations; print "\t$locations[$idx-1]\t$locations[$idx]"; $correction = 1; } print "\n"; } [download] Update: Fixed border cases. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l]
Re^2: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv? by pickleswarlz (Initiate) on Feb 26, 2013 at 19:38 UTC
Thank you very much! I am using the first suggestion, as I don't really understand the binary search yet, and so far it doesn't seem to have problems with searching the 500,000 rows. I am now running into the problem of the results being output on different lines in the resulting CSV file for each print command. Here is my code: #!/usr/bin/perl use warnings; use strict; open my $GENES, '<', 'chr1data.csv' or die $!; open my $LOCATIONS, '<', 'chr1snps.csv' or die $!; chomp(my @locations = map { (split ',')[2] } <$LOCATIONS>); # If IDs are not already sorted, uncomment the following line: # @locations = sort { $a <=> $b } @locations; for (<$GENES>) { my ($chromosome, $start, $end) = split ','; print "$chromosome,$start,$end"; my $idx = 0; # For $end, start searching where you left for + $start. my $correction = 0; # Needed for Start(-) == Start and End(+) == E +nd. for my $pos ($start, $end) { $idx++ while $locations[$idx] <= $pos - $correction and $idx <= $#locations; die "No numbers around $pos ($idx) \n" if $idx == 0 or $idx > $#locations; print ",$locations[$idx-1],$locations[$idx]"; $correction = 1; } print "\n"; } [download] Printing `print ",$locations[$idx-1],$locations[$idx]";` puts this information on a new line. I'd like it to come out on the same line as `print "$chromosome,$start,$end";` for each search. Do I have a `\n` in the wrong place?	[reply] [d/l] [select]
Re^3: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv? by choroba (Cardinal) on Feb 26, 2013 at 22:42 UTC
Your $end probably contains newline (I split on ' ', which removed it, you split on a comma). Just `chomp $end;` [download] before printing it. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l]
Re^4: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv? by pickleswarlz (Initiate) on Feb 27, 2013 at 20:53 UTC
Re: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv? by 7stud (Deacon) on Feb 23, 2013 at 22:22 UTC
I would think a binary search of the array would be a lot more efficient than a linear pass when n=500,000, which might take on average 250,000 comparisons to locate the range. pickleswarlz, If the term 'binary search' scares you, there is an exercise in "Learning Perl" where you write a program that picks a number between 0-100 and then invites a user to guess the number. If the user starts off by picking 50, then the program should respond by saying 'higher' or 'lower'. What is the quickest way to guess the number that the program picked? Well, if the computer says higher, you should pick 75 next. Then if the computer says lower, you should pick 62 or 63 next. Do you see how halving the range each time will arrive at guessing the correct number the quickest? Try it with a friend. It will only take you 4-5 tries to guess the number if each time you guess the middle of the range that you know contains the number. That process is known as a 'binary search'. The function below uses a binary search to return the (lowest) numbers in an array that bracket a specified number. Once you have all your ID's in a sorted array, you can search the array for the two numbers that bracket a given number: sub binary_search { #Called like this: #binary_search(\@data, $#data, $num); my ($aref, $high_index, $target) = @_; my $low_index = 0; if ( $target < $aref->[$low_index] or $target > $aref->[$high_index] ) { die "Target not in range." } my ($middle_index, $middle_value); while ($low_index <= $high_index) { $middle_index = int( ($high_index + $low_index)/2 ); $middle_value = $aref->[$middle_index]; print '.'; #print a dot for each attempt to locate target if ($middle_value < $target) { $low_index = $middle_index + 1; } elsif ($middle_value > $target) { $high_index = $middle_index - 1; } elsif ($middle_value == $target) { print ""; #Indicates the target is actually in @data return $middle_index == 0 ? ($middle_value, $aref->[1]) : ($aref->[$middle_index-1], $middle_value) ; } } while ($aref->[$low_index] >= $target) { --$low_index; } return $aref->[$low_index], $aref->[$low_index+1]; } say ''; my @data = (1, 5, 6, 7, 8, 9, 10, 20, 30, 33, 34, 38, 41, 100, 101, 10 +2, 110); for my $num (1 .. $data[-1]) { my @results = binary_search(\@data, $#data, $num); say "$num: @results"; } --output:-- ....1: 1 5 ....2: 1 5 ....3: 1 5 ....4: 1 5 ...5: 1 5 ....6: 5 6 ..7: 6 7 ....8: 7 8 ...9: 8 9 ....10: 9 10 .....11: 10 20 .....12: 10 20 .....13: 10 20 .....14: 10 20 .....15: 10 20 .....16: 10 20 .....17: 10 20 .....18: 10 20 .....19: 10 20 .....20: 10 20 .....21: 20 30 .....22: 20 30 .....23: 20 30 .....24: 20 30 .....25: 20 30 .....26: 20 30 .....27: 20 30 .....28: 20 30 .....29: 20 30 .30: 20 30 ....31: 30 33 ....32: 30 33 ....33: 30 33 ...34: 33 34 ....35: 34 38 ....36: 34 38 ....37: 34 38 ....38: 34 38 ....39: 38 41 ....40: 38 41 ..41: 38 41 ....42: 41 100 ....43: 41 100 ....44: 41 100 ....45: 41 100 ....46: 41 100 ....47: 41 100 ....48: 41 100 ....49: 41 100 ....50: 41 100 ....51: 41 100 ....52: 41 100 ....53: 41 100 ....54: 41 100 ....55: 41 100 ....56: 41 100 ....57: 41 100 ....58: 41 100 ....59: 41 100 ....60: 41 100 ....61: 41 100 ....62: 41 100 ....63: 41 100 ....64: 41 100 ....65: 41 100 ....66: 41 100 ....67: 41 100 ....68: 41 100 ....69: 41 100 ....70: 41 100 ....71: 41 100 ....72: 41 100 ....73: 41 100 ....74: 41 100 ....75: 41 100 ....76: 41 100 ....77: 41 100 ....78: 41 100 ....79: 41 100 ....80: 41 100 ....81: 41 100 ....82: 41 100 ....83: 41 100 ....84: 41 100 ....85: 41 100 ....86: 41 100 ....87: 41 100 ....88: 41 100 ....89: 41 100 ....90: 41 100 ....91: 41 100 ....92: 41 100 ....93: 41 100 ....94: 41 100 ....95: 41 100 ....96: 41 100 ....97: 41 100 ....98: 41 100 ....99: 41 100 ....100: 41 100 ...101: 100 101 ....102: 101 102 .....103: 102 110 .....104: 102 110 .....105: 102 110 .....106: 102 110 .....107: 102 110 .....108: 102 110 .....109: 102 110 .....110: 102 110 [download] And after reading up on the cpan module List::BinarySearch, whose author appears to really know how to construct efficient search algorithms, I would consider using that module's bsearch_num_pos() function, which returns the optimal insertion index for a given number: # The following `[function returns]` an optimal insert point if no match. my @num_array = ( 100, 200, 300, 400 ); $index = bsearch_num_pos 500, @num_array; # Returns 4 - Best insert-at position. Given the index for the insertion, you can get the range the number falls in with $data`[$index-1]`and $data`[$index]`. You just have to adjust for the case when the insertion point is 0, like I did in my code. use List::BinarySearch qw(bsearch_num_pos); ... ... sub cpan_binary_search { my ($aref, $high_index, $target) = @_; my $low_index = 0; if ( $target < $aref->[$low_index] or $target > $aref->[$high_index] ) { die "Target not in range." } my $index = bsearch_num_pos($target, @$aref); return $index == 0 ? ($aref->[0], $aref->[1]) : ($aref->[$index-1], $aref->[$index]); } my @data = (1, 5, 6, 7, 8, 9, 10, 20, 30, 33, 34, 38, 41, 100, 101, 10 +2, 110); for my $num (1 .. $data[-1]) { my @results = cpan_binary_search(\@data, $#data, $num); say "$num: @results"; } [download] I'll have to run this Perl script on multiple files, so I'd love to be able to specify the input and output files in the command line as I go rather than specifying them in the script itself. Any command line arguments can be found in the @ARGV array: `use strict; use warnings; use 5.012; say for @ARGV; say "\n"; my ($in, $out) = @ARGV; say $in; say $out; --output:-- $ perl prog1.pl infile_name outfile_name 10 20 infile_name outfile_name 10 20 infile_name outfile_name` [download]	[reply] [d/l] [select]