Jalcock501 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys

I need a little nudge in the right direction. I have two different .csv files: one contains GPS co-ordinates and a matching address, whilst the other contains various other bits of information as well as the GPS co-ordinates.

What I need is to take the address from the first .csv and match it to the GPS co-ordinates in the second .csv file.

Here is some example info

This block contains the .csv with various info, but the last 2 columns contain the latitude and longitude.
Seq#,TransType,LocnNum,TankNum,Date,TimeStart,TimeEnd,TktVol,AltVol,TotalSrt,TotalEnd,Driver,Trip,Route,MeanDegC,MaxDegC,ColCd,Ccode,TestType,TrpSdate,GPSlat,GPSlon
174,0,291061000,59,141130,104945,105755,3355,0,255094551,255094551,5005,2,505,3.0,3.5,00,00,00,141130,5140596,645572
175,0,250401500,133,141130,110343,110858,760,0,255102140,255102140,5005,2,505,3.0,3.2,00,00,00,141130,5141449,647452
176,0,250400800,29,141130,115529,120130,1695,0,255119079,255119079,5005,2,505,3.7,4.2,00,00,00,141130,5147668,656139
And this is the cross reference file for the addresses
LocnID,Lat,Lon,TestCo#,LastVol,LocnName,LocnAdr
291061000,5140598,645573,205,3568,Evans Marcross,
250401500,5141451,647444,205,857,Thomas And Ptns Somerset,
250400800,5147664,656137,205,1727,Thomas J Newton House,
Being GPS co-ordinates, there is some margin of difference of +/-10.

Any help would be greatly appreciated

Thanks

Jim

Replies are listed 'Best First'.
Re: CSV Cross Referencing
by Tux (Canon) on Dec 03, 2014 at 12:42 UTC

    Doesn't look too complicated. Modify to your needs:

use 5.16.2;
use warnings;
use Data::Peek;
use Text::CSV_XS qw( csv );

my $extra = csv (in => \<<"EOC", headers => "auto");
LocnID,Lat,Lon,TestCo#,LastVol,LocnName,LocnAdr
291061000,5140598,645573,205,3568,Evans Marcross,
250401500,5141451,647444,205,857,Thomas And Ptns Somerset,
250400800,5147664,656137,205,1727,Thomas J Newton House,
EOC

my $base = csv (in => \<<"EOC", key => "Seq#");
Seq#,TransType,LocnNum,TankNum,Date,TimeStart,TimeEnd,TktVol,AltVol,TotalSrt,TotalEnd,Driver,Trip,Route,MeanDegC,MaxDegC,ColCd,Ccode,TestType,TrpSdate,GPSlat,GPSlon
174,0,291061000,59,141130,104945,105755,3355,0,255094551,255094551,5005,2,505,3.0,3.5,00,00,00,141130,5140596,645572
175,0,250401500,133,141130,110343,110858,760,0,255102140,255102140,5005,2,505,3.0,3.2,00,00,00,141130,5141449,647452
176,0,250400800,29,141130,115529,120130,1695,0,255119079,255119079,5005,2,505,3.7,4.2,00,00,00,141130,5147668,656139
EOC

foreach my $seq (sort keys %$base) {
    my @extra = grep {
        abs ($_->{Lat} - $base->{$seq}{GPSlat}) < 10 &&
        abs ($_->{Lon} - $base->{$seq}{GPSlon}) < 10
        } @$extra;
    @extra > 1 and warn "Multiple matches!\n";
    my $ref = $extra[0] // { Lat => 0, Lon => 0, LocnName => "?" };
    printf "%5d %10d %10d %10d %10d %s\n", $seq,
        $base->{$seq}{GPSlat}, $base->{$seq}{GPSlon},
        $ref->{Lat}, $ref->{Lon}, $ref->{LocnName};
    }

=>
  174    5140596     645572    5140598     645573 Evans Marcross
  175    5141449     647452    5141451     647444 Thomas And Ptns Somerset
  176    5147668     656139    5147664     656137 Thomas J Newton House

    Of course you can write a sub that finds the closest match and return the extra information. You could also hang that sub on a callback for on_in in the csv (on_in => \&foo), but that is completely up to you.
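    A minimal sketch of such a closest-match sub, reusing the $extra array of hashrefs built above; the sub name closest_match and the squared-distance measure are illustrative choices, not part of the original code:

sub closest_match {
    my ($extra, $lat, $lon) = @_;
    my ($best, $best_d2);
    for my $loc (@$extra) {
        # squared distance is enough to pick the nearest entry
        my $d2 = ($loc->{Lat} - $lat) ** 2 + ($loc->{Lon} - $lon) ** 2;
        ($best, $best_d2) = ($loc, $d2)
            if !defined $best_d2 || $d2 < $best_d2;
        }
    return $best;
    }

# e.g. inside the loop above:
# my $ref = closest_match ($extra, $base->{$seq}{GPSlat}, $base->{$seq}{GPSlon});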


    Enjoy, Have FUN! H.Merijn
Re: CSV Cross Referencing
by hippo (Archbishop) on Dec 03, 2014 at 11:53 UTC
    What I need is to take the address from the first .csv and match it to the GPS co-ordinates in the second .csv file.
    ...
    Being GPS co-ordinates, there is some margin of difference of +/-10.

    So you want to merge (JOIN) two data sets, keyed on integer data within a range. That sounds very much like a job for a database. I would expect any decent RDBMS to handle this task with ease (and most will have a load-from-CSV option saving you effort there too). Is there anything you aren't telling us which would preclude solving this in the database and then using Perl to process the results?
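    To make that concrete, here is a rough sketch of such a range-keyed join using an in-memory SQLite database via DBI and DBD::SQLite; the module choice, file names and reduced column lists are assumptions for illustration, not anything the OP specified:

use strict;
use warnings;
use DBI;
use Text::CSV_XS qw( csv );

my $dbh = DBI->connect ("dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1, AutoCommit => 1 });

$dbh->do ("create table trans (LocnNum integer, GPSlat integer, GPSlon integer)");
$dbh->do ("create table locn  (LocnID  integer, Lat integer, Lon integer, LocnName text)");

# load just the columns we need from each CSV
my $ins_t = $dbh->prepare ("insert into trans values (?, ?, ?)");
$ins_t->execute (@{$_}{qw( LocnNum GPSlat GPSlon )})
    for @{csv (in => "transaction.csv", headers => "auto")};

my $ins_l = $dbh->prepare ("insert into locn values (?, ?, ?, ?)");
$ins_l->execute (@{$_}{qw( LocnID Lat Lon LocnName )})
    for @{csv (in => "location.csv", headers => "auto")};

# the "join keyed on a range": match within +/-10 on both coordinates
my $sth = $dbh->prepare (q{
    select t.LocnNum, t.GPSlat, t.GPSlon, l.LocnName
      from trans t
      join locn  l on abs (t.GPSlat - l.Lat) <= 10
                  and abs (t.GPSlon - l.Lon) <= 10
    });
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    print join (",", @row), "\n";
    }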

Re: CSV Cross Referencing
by gerangelov (Sexton) on Dec 03, 2014 at 13:00 UTC
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
use Data::Dumper;

my $loc   = "location.csv";
my $trans = "transaction.csv";
my $csv   = Text::CSV_XS->new();
my %locations;

# error margin
my $delta = 10;

sub lookup_name {
    my ($href, $lat, $long, $delta) = @_;
    for my $k (keys %$href) {
        my ($lat2, $long2) = @{$href->{$k}};
        return $k if abs($lat - $lat2) < $delta && abs($long - $long2) < $delta;
    }
    return "Unknown";
}

open my $fh, "<", $loc or die "Can't open $loc: $!\n";
# skip header line
<$fh>;
while (my $rec = $csv->getline($fh)) {
    my ($lat, $long, $name) = (@$rec)[1, 2, 5];
    push @{$locations{$name}}, $lat, $long;
}
print Dumper( \%locations );
close($fh);

$csv = Text::CSV_XS->new();
open $fh, "<", $trans or die "Can't open $trans: $!\n";
# skip header line
<$fh>;
while (my $rec = $csv->getline($fh)) {
    my @fields = @$rec;
    my ($lat, $long) = @fields[20, 21];
    my $name = lookup_name(\%locations, $lat, $long, $delta);
    {
        local $" = ", ";
        print "$name -> @fields\n";
    }
}
    This assumes you have two CSV files called location.csv and transaction.csv. The output is:
Evans Marcross -> 174, 0, 291061000, 59, 141130, 104945, 105755, 3355, 0, 255094551, 255094551, 5005, 2, 505, 3.0, 3.5, 00, 00, 00, 141130, 5140596, 645572
Thomas And Ptns Somerset -> 175, 0, 250401500, 133, 141130, 110343, 110858, 760, 0, 255102140, 255102140, 5005, 2, 505, 3.0, 3.2, 00, 00, 00, 141130, 5141449, 647452
Thomas J Newton House -> 176, 0, 250400800, 29, 141130, 115529, 120130, 1695, 0, 255119079, 255119079, 5005, 2, 505, 3.7, 4.2, 00, 00, 00, 141130, 5147668, 656139
Re: CSV Cross Referencing
by ww (Archbishop) on Dec 03, 2014 at 12:19 UTC

    I suspect hippo's suggestion that you use a RDBMS' native capabilities is a better idea than this regex alternate... but nonetheless:

    1. If the (supposedly matching) lat/lon values are inconsistent by no worse than 9, simply compare the initial six digits of each lat and the first five digits of each lon (in each case with an unspecified trailing digit and an end-of-string marker, $). That could produce false positives, but if my fingers-and-toes math is correct, the addresses would have to be phone-booth-sized properties to match (a rough sketch follows after this list).
    2. Similarly, but using your variance of 10, use five digits of lat and four of lon with two unspecified trailing digits. Caution: this is going to be a lot more squiggly.
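    A rough sketch of option 1; the example numbers are taken from the sample rows above and the variable names are illustrative:

my ($lat1, $lon1) = (5140596, 645572);   # from the transaction file
my ($lat2, $lon2) = (5140598, 645573);   # from the address file

# first six digits of the lat, first five of the lon, each followed by
# one unspecified digit and an end-of-string anchor
my $lat_pre = substr ($lat1, 0, 6);
my $lon_pre = substr ($lon1, 0, 5);

if ($lat2 =~ /^$lat_pre\d$/ && $lon2 =~ /^$lon_pre\d$/) {
    print "prefix match\n";
}
# Note the caveat in the reply below: 5000000 vs 4999998 share no common
# prefix, so genuinely close pairs can be missed.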

      It is a nice thought, but I can see a problem with false negatives too. Suppose the lat in one set is 5000000 and in the other it's 4999998 - those are not going to be found using pattern-matching. You would have to both add and subtract the maximum error and then compare both of those outer limits to the second set. It's do-able in Perl but the RDBMS seems like the better way to me.

        AMEN!

        At some point, be it as in your example or otherwise, there are going to be edge cases that will require a lot of effort to resolve.

        Thanks for making an important point I missed.

Re: CSV Cross Referencing
by james28909 (Deacon) on Dec 03, 2014 at 10:47 UTC
Re: CSV Cross Referencing
by KurtSchwind (Chaplain) on Dec 03, 2014 at 14:55 UTC

    So, it looks like you have some nice answers as to how to do a lookup/join across two delimited files. So I won't enter that part of the discussion.

    Instead let's focus on the next part: instead of a straight join, it appears you really need a distance function. Depending on how many entries you have, you could solve this in a number of ways. The brute-force method is to use the following formula: sqrt((lat1-lat2)^2 + (long1-long2)^2). That'll give you a course distance. (Course, the earth is round, not flat, and there are better formulas, but I think this will be good enough for our discussion.)

    So you'll want to compare the distance to each entry. Pick the smallest distance (note that an exact match would equal 0). Compare that distance to some $threshold value and see if it's "close enough". You'll notice that this method is very different from joining or doing lookups per se. And it's possible you could sort/store your file in a way that easily filters out a lot of GPS locations immediately because they are so far off.
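    A small sketch of that brute-force nearest-entry check; the location list, names and $threshold value here are placeholders, not the OP's real data:

use strict;
use warnings;

my @locations = (
    { name => "Evans Marcross",           lat => 5140598, lon => 645573 },
    { name => "Thomas And Ptns Somerset", lat => 5141451, lon => 647444 },
);
my ($lat, $lon) = (5140596, 645572);   # point we want to label
my $threshold   = 15;                  # "close enough" cut-off

my ($best, $best_dist);
for my $loc (@locations) {
    my $dist = sqrt (($lat - $loc->{lat}) ** 2 + ($lon - $loc->{lon}) ** 2);
    ($best, $best_dist) = ($loc, $dist)
        if !defined $best_dist || $dist < $best_dist;
}
print defined $best && $best_dist <= $threshold
    ? "closest: $best->{name} (distance $best_dist)\n"
    : "nothing within $threshold\n";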

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.

      Did you mean 'coarse'?

      As a note; there is no need to sqrt(...) for all N^2 of the distances. Instead, simply square your threshold; that is a much cheaper operation, only needing to be done once.
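      In code, that would look something like this (the variable names are just for illustration):

my ($lat1, $lon1, $lat2, $lon2) = (5140596, 645572, 5140598, 645573);
my $threshold  = 15;
my $threshold2 = $threshold ** 2;               # square once, up front

my $d2 = ($lat1 - $lat2) ** 2 + ($lon1 - $lon2) ** 2;
print "close enough\n" if $d2 <= $threshold2;   # no sqrt needed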

      If you have a large number of coordinates, you could also break the world up into a grid of buckets. Each entry then only needs to check distance to the entries in the nearest four adjacent buckets. Use a 2D hash of buckets since most buckets will be unused/empty out in the countryside.
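      A loose sketch of that bucket idea; the bucket size, field names and data structures are my own choices, and for safety this version scans the full 3x3 neighbourhood of buckets rather than only four:

use strict;
use warnings;

my @locations = (
    { lat => 5140598, lon => 645573 },
    { lat => 5141451, lon => 647444 },
    { lat => 5147664, lon => 656137 },
);

my $bucket = 10;   # bucket edge >= the +/-10 error margin
my %grid;          # 2D hash of buckets: only populated cells take memory

for my $loc (@locations) {
    my ($bx, $by) = (int ($loc->{lat} / $bucket), int ($loc->{lon} / $bucket));
    push @{$grid{$bx}{$by}}, $loc;
}

# candidates within one bucket of ($lat, $lon); only these need a distance check
sub candidates {
    my ($lat, $lon) = @_;
    my ($bx, $by) = (int ($lat / $bucket), int ($lon / $bucket));
    my @cand;
    for my $dx (-1 .. 1) {
        for my $dy (-1 .. 1) {
            push @cand, @{ $grid{$bx + $dx}{$by + $dy} // [] };
        }
    }
    return @cand;
}

my @near = candidates (5140596, 645572);
print scalar @near, " candidate location(s) to distance-check\n";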

        Still cheaper (which Tux did in his code), just compare the Lat and Lon separately. Only need to calculate the distance (or square of distance) when there are multiple matches.

Re: CSV Cross Referencing
by kevbot (Vicar) on Dec 04, 2014 at 06:29 UTC
    Here is a way of doing it with a combination of Data::Table and Math::Round. As others have mentioned, there may be a better algorithm for determining a match for lat/lon coordinates based on distances. However, the method below can be adjusted using the $tol variable. Using the two data tables that you provided, with $tol = 10 I get one match in the joined table; with $tol = 50 all three match.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Table;
use Math::Round;

my $tol = 50; # Tolerance, i.e. round values to the nearest multiple of $tol

my $t1 = Data::Table::fromCSV("data1.txt", 1, undef);
# Delete the lat & lon columns and replace them with rounded values
my $gps_lat_col = $t1->delCol("GPSlat");
my $gps_lon_col = $t1->delCol("GPSlon");
my @nGPSlat = map { nearest($tol, $_) } @{$gps_lat_col};
my @nGPSlon = map { nearest($tol, $_) } @{$gps_lon_col};
$t1->addCol(\@nGPSlat, "GPSlat");
$t1->addCol(\@nGPSlon, "GPSlon");
print "\nTable 1\n";
print $t1->csv;

my $t2 = Data::Table::fromCSV("data2.txt", 1, undef);
# Delete the lat & lon columns and replace them with rounded values
my $lat_col = $t2->delCol("Lat");
my $lon_col = $t2->delCol("Lon");
my @nlat = map { nearest($tol, $_) } @{$lat_col};
my @nlon = map { nearest($tol, $_) } @{$lon_col};
$t2->addCol(\@nlat, "Lat");
$t2->addCol(\@nlon, "Lon");
print "\nTable 2\n";
print $t2->csv;

# Join the tables for cases where the rounded lat/lon pairs are exact matches
my $t3 = $t1->join($t2, Data::Table::INNER_JOIN, ['GPSlat', 'GPSlon'], ['Lat', 'Lon']);
print "\nJoined Table\n";
print $t3->csv;

exit;
    Output:
Table 1
Seq#,TransType,LocnNum,TankNum,Date,TimeStart,TimeEnd,TktVol,AltVol,TotalSrt,TotalEnd,Driver,Trip,Route,MeanDegC,MaxDegC,ColCd,Ccode,TestType,TrpSdate,GPSlat,GPSlon
174,0,291061000,59,141130,104945,105755,3355,0,255094551,255094551,5005,2,505,3.0,3.5,00,00,00,141130,5140600,645550
175,0,250401500,133,141130,110343,110858,760,0,255102140,255102140,5005,2,505,3.0,3.2,00,00,00,141130,5141450,647450
176,0,250400800,29,141130,115529,120130,1695,0,255119079,255119079,5005,2,505,3.7,4.2,00,00,00,141130,5147650,656150

Table 2
LocnID,TestCo#,LastVol,LocnName,LocnAdr,Lat,Lon
291061000,205,3568,Evans Marcross,,5140600,645550
250401500,205,857,Thomas And Ptns Somerset,,5141450,647450
250400800,205,1727,Thomas J Newton House,,5147650,656150

Joined Table
Seq#,TransType,LocnNum,TankNum,Date,TimeStart,TimeEnd,TktVol,AltVol,TotalSrt,TotalEnd,Driver,Trip,Route,MeanDegC,MaxDegC,ColCd,Ccode,TestType,TrpSdate,GPSlat,GPSlon,LocnID,TestCo#,LastVol,LocnName,LocnAdr
176,0,250400800,29,141130,115529,120130,1695,0,255119079,255119079,5005,2,505,3.7,4.2,00,00,00,141130,5147650,656150,250400800,205,1727,Thomas J Newton House,
175,0,250401500,133,141130,110343,110858,760,0,255102140,255102140,5005,2,505,3.0,3.2,00,00,00,141130,5141450,647450,250401500,205,857,Thomas And Ptns Somerset,
174,0,291061000,59,141130,104945,105755,3355,0,255094551,255094551,5005,2,505,3.0,3.5,00,00,00,141130,5140600,645550,291061000,205,3568,Evans Marcross,
Re: CSV Cross Referencing
by i5513 (Pilgrim) on Dec 07, 2014 at 20:56 UTC

    Hello,

    This is, again, off-topic, but I think it could be useful for you:

    q

    Maybe some monk with time could port that project to perl ;)

    Do you know an alternative perl implementation for "q" ?

    Regards!

      No need to port the project, it already exists: DBD::CSV
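      For reference, a rough (untested) sketch of what a DBD::CSV join over the two files could look like; the file names, the table mapping and the simple equi-join on LocnNum = LocnID are assumptions based on the sample data, the awkward column names like Seq# may need renaming, and the +/-10 GPS fuzziness would still be handled in Perl afterwards:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect ("dbi:CSV:", undef, undef, {
    f_dir      => ".",
    RaiseError => 1,
    }) or die $DBI::errstr;

# map table names onto the two CSV files
$dbh->{csv_tables}{trans} = { f_file => "transaction.csv" };
$dbh->{csv_tables}{locn}  = { f_file => "location.csv"    };

my $sth = $dbh->prepare (
    "select trans.LocnNum, trans.GPSlat, trans.GPSlon, locn.LocnName " .
    "from trans join locn on trans.LocnNum = locn.LocnID");
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    print join (",", @row), "\n";
    }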
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Thanks for reminding me of that module. I didn't know it could join tables, but I see it can (there is a list_tables sub which helps to make that work).

        Thank you!!