Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

How to extract the best match from matrix

by jnarayan81 (Sexton)
on Mar 11, 2017 at 21:41 UTC ( #1184298=perlquestion: print w/replies, xml ) Need Help??

jnarayan81 has asked for the wisdom of the Perl Monks concerning the following question:

I have a matrix with human features (f1 .. f4)

Name f1 f2 f3 f4 africa1 18 1 48 23 usa2 48 23 60 23 africa2 17 3 49 25 africa3 20 6 52 30 usa1 55 20 56 25 china1 35 37 55 87 china2 40 33 50 73

I wanted to query with selected feature (f1 ..f4), and print the best match.

use strict; use warnings; my @array = (37,35,59,70); #Are in the same order as in matrix f1 f2 f +3 f4 for(my $i = 1; $i <= 4; $i++) { my %hash; my @allval; my $fh=read_fh($ARGV[0]); #Matrix file here while(<$fh>) { chomp; next if $. ==1; #Ignore header my @val = split('\t', $_); $hash{$val[0]}=$val[$i]; push (@allval, $val[$i]); } close $fh; #foreach (sort keys %hash) { print "$_ : $hash{$_}\n";} my @allval_sorted = sort {$a <=> $b} @allval; my $find =$array[$i-1]; #my $nearest = @{nearest(\@allval_sorted)}[0]; use List::Util 'max'; my $selected = max grep { $find >= $_ } @allval_ +sorted; my $nearest=$selected; my @keys = grep { $hash{$_} == $nearest } keys %hash; print "nearest to $find in array is: $nearest and name is $keys[0]\n" +; undef @allval; undef %hash; sub nearest { my ($a) = @_; my $size = @$a; return $a if $size == 1; my $mid = int(($size-1) / 2); my $test = @$a[$mid]; return $test <= $find ? (abs($test-$find)<abs(@$a[$mid+1]-$find) ? [$test] : $find <= @$a[$mid+1] ? [@$a[$mid+1]] : nearest([@$a[$mid+1 .. $ +size-1]])) : (abs($test-$find)<abs(@$a[$mid-1]-$find) ? [$test] : $find >= @$a[$mid-1] ? [@$a[$mid-1]] : nearest([@$a[0 .. $mid]] +)); } } #Open and Read a file sub read_fh { my $filename = shift @_; my $filehandle; if ($filename =~ /gz$/) { open $filehandle, "gunzip -dc $filename |" or die $!; } else { open $filehandle, "<$filename" or die $!; } return $filehandle; }

I wrote script and it print followings; It seems nearest function does not work !!!

WITH NEAREST sub jitendra@jitendra-Aspire-S3-391[test] perl testMAT.pl mat.txt +[10:17PM] nearest to 37 in array is: 35 and name is china1 nearest to 35 in array is: 37 and name is china1 nearest to 59 in array is: 48 and name is africa1 nearest to 70 in array is: 30 and name is africa3 WITH use List::Util 'max' jitendra@jitendra-Aspire-S3-391[test] perl testMAT.pl mat.txt +[10:19PM] nearest to 37 in array is: 35 and name is china1 nearest to 35 in array is: 33 and name is china2 nearest to 59 in array is: 56 and name is usa1 nearest to 70 in array is: 30 and name is africa3

As you can see, I am opening and closing the matrix file four time, which would be really time consuming on long file.

Is there any other SMART way to achieve the best result? Thanks for you time.

Replies are listed 'Best First'.
Re: How to extract the best match from matrix
by kevbot (Priest) on Mar 12, 2017 at 01:24 UTC
    Hi jnarayan81,

    As pme and choroba mentioned, you will get more/better help if you describe the objective of your nearest function.

    I did not closely inspect your code, but I took a guess at what you were trying to do and came up with the following example.

    #!/usr/bin/env perl use strict; use warnings; use Data::Table; # Create a Data::Table with headers (assuming data is tab-delimited) my $dt = Data::Table::fromTSV( 'mat.txt', 1 ); # Get the number of rows in the Data::Table my $n_rows = $dt->nofRow; my $query = [ 37,35,59,70 ]; my $nearest_name = ''; my $min_dist; foreach my $i (0..$n_rows - 1){ my $row_ref = $dt->rowRef($i); # Get row of Data::Table as an ARRA +Y REF my $name = shift @{$row_ref}; # The name is in the first column my $dist = dist($query, $row_ref); $min_dist = !defined($min_dist) ? $dist : $dist < $min_dist ? $dist : $min_dist; $nearest_name = $dist <= $min_dist ? $name : $nearest_name; } print "The nearest to: "; print join(", ", @{$query}); print " is: $nearest_name\n"; exit; # Calculate the Euclidean distance between two vectors sub dist { my ($x, $y) = @_; unless(ref($x) eq 'ARRAY' and ref($y) eq 'ARRAY'){ die "Vectors must be given as array references" } unless (scalar @{$x} == scalar @{$y}) { die "Vectors are not of equal length"; } my $sum_sq = 0; my $len = scalar @{$x}; foreach my $i (0..$len - 1) { $sum_sq += ($x->[$i] - $y->[$i])**2; } return sqrt($sum_sq); }

    I like the Data::Table module for manipulating tabular data, but there are many other ways to load/manipulate your data. If your definition of distance or best match is not euclidean distance then modify the dist subroutine accordingly.

    UPDATE: Originally the dist sub returned $sum_sq. I changed it to return the correct euclidean distance which is sqrt($sum_sq).

      You are so to the point. Exactly, this is what I wanted to do.

Re: How to extract the best match from matrix
by choroba (Cardinal) on Mar 11, 2017 at 23:39 UTC
    Can you describe what the function is supposed to do?

    Using strict is good, but declaring sub nearest inside of a for loop with the variable $find used in the subroutine while declared in the outer scope nearly spurns all its benefits.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: How to extract the best match from matrix
by Marshall (Canon) on Mar 12, 2017 at 23:23 UTC
    Hi jnarayan81,

    Yes, in general re-reading an input file multiple times is a bad idea. I/O is very "expensive" both in terms of CPU time, but also in terms of clock time.

    A simple way to read the input file once, could be:

    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; my @matrix; # Array of Array, ["africa1",18,1,48,23] <DATA>; # throw away first line while (<DATA>) { next if /^\s*$/; # skip blank lines push (@matrix, [split(/\s+/,$_)]); # create AoA (2-D matrix) } print Dumper \@matrix; __DATA__ Name f1 f2 f3 f4 africa1 18 1 48 23 usa2 48 23 60 23 africa2 17 3 49 25 africa3 20 6 52 30 usa1 55 20 56 25 china1 35 37 55 87 china2 40 33 50 73 africa4 18 2 47 23
    The above may or may not be the best data structure for your search requirements. But I suggest reading the input file only once.

    It appears to me that you are comparing columns. I think that you want an output that is based "best match(es)" for each each column. I added "africa4" above.

    What is supposed to happen if: my @array = (18,35,59,70); Should this first array value of 18 result in two matches in the first data column?, africa1 and africa4??

    It is completely plausible that 16 and 20 are both equally close to 18. What should the output be in those cases?

Re: How to extract the best match from matrix
by pme (Monsignor) on Mar 11, 2017 at 22:25 UTC
    Hi jnarayan81,

    Would you tell us how to interpret 'best match' and 'nearest' in this context?

      Nearest: Closest match to the other individual features.

      query:f1->56

      check in f1 column and extract the closest match values and its name.

      BestMatch: Where does my query

      my @array = (37,35,59,70); #Are in the same order as in matrix f1 f2 f +3 f4

      fall in matrix. Are features belongs to USA or Africa, China?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1184298]
Approved by stevieb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2022-12-07 23:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?