in reply to multi column multi file comparison
    use Modern::Perl;
    use Number::Range;
    use Text::CSV::Auto qw( process_csv );
    use Data::Dump qw /dump/;

    my $debug = 0;
    my %database;

    # Load the primary file: name => list of [start, end] pairs
    process_csv('./primary.txt',
        sub {
            my $row = shift;
            push @{$database{$row->{name}}}, [$row->{start}, $row->{end}];
        }
    );
    say dump(\%database) if $debug;

    analyse ('./secondary_1.txt');
    analyse ('./secondary_2.txt');

    sub analyse {
        my $file = shift;
        my %data;

        # Load the secondary file into the same structure
        process_csv($file,
            sub {
                my $row = shift;
                push @{$data{$row->{name}}}, [$row->{start}, $row->{end}];
            }
        );
        say dump(\%data) if $debug;

        for my $name (sort keys %data) {
            unless ($database{$name}) {
                say "Unknown name '$name'";
                next;
            }
            # Test every secondary range against every primary range of this name
            for my $range_ref (@{$database{$name}}) {
                print "$name: $range_ref->[0] $range_ref->[1] ";
                my $range = Number::Range->new($range_ref->[0] .. $range_ref->[1]);
                for my $testrange_ref (@{$data{$name}}) {
                    # "present" only if both endpoints fall inside the primary range
                    if ($range->inrange(@$testrange_ref)) {
                        print "present $testrange_ref->[0] $testrange_ref->[1] ";
                    }
                    else {
                        print "absent 0 0 ";
                    }
                }
                print "\n";
            }
        }
    }
Output:
    Alex: 3 44 absent 0 0 absent 0 0
    Alex: 124 175 absent 0 0 present 134 155
    Barry: 2 44 present 12 24
    James: 6 45 absent 0 0
    Alex: 3 44 absent 0 0
    Alex: 124 175 present 154 174
    Drew: 9 43 present 19 54 absent 0 0
    James: 6 45 present 29 45
This solution suffers from the same problem you already mentioned: it checks every range in a secondary file against every range with the same name in the primary file, so a range is reported as absent whenever it falls outside the particular primary range being tested (see the data for "Alex"). But unless you find a good rule to define which ranges in the secondary files are to be checked against which ranges in the primary file, there is no way to solve this problem.
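To make the idea of a "rule" concrete, here is a minimal sketch of one possible pairing rule: match each secondary range to the primary range it overlaps most. This rule is purely an assumption for illustration and does not come from your question. The helpers `overlap` and `best_primary` are hypothetical names, and the numbers are Alex's ranges as they appear in the output above.

```perl
use strict;
use warnings;

# Hypothetical helper: number of integers shared by two closed
# ranges [start, end]; 0 when they do not intersect.
sub overlap {
    my ($p, $q) = @_;
    my $lo = $p->[0] > $q->[0] ? $p->[0] : $q->[0];
    my $hi = $p->[1] < $q->[1] ? $p->[1] : $q->[1];
    return $hi >= $lo ? $hi - $lo + 1 : 0;
}

# Hypothetical pairing rule (an assumption, not from the original
# question): pick the primary range with the largest overlap.
# Returns undef when the secondary range overlaps none of them.
sub best_primary {
    my ($secondary, @primaries) = @_;
    my ($best, $best_len) = (undef, 0);
    for my $primary (@primaries) {
        my $len = overlap($primary, $secondary);
        ($best, $best_len) = ($primary, $len) if $len > $best_len;
    }
    return $best;
}

# Alex's primary ranges and the two secondary ranges reported as present.
my @primary = ([3, 44], [124, 175]);
for my $secondary ([134, 155], [154, 174]) {
    my $match = best_primary($secondary, @primary);
    print join(' ', @$secondary), ' -> ',
        ($match ? join(' ', @$match) : 'no match'), "\n";
}
```

With a rule like this, each secondary range would be tested against one primary range only, so the spurious "absent 0 0" entries would go away. Whether "largest overlap" is the right rule depends entirely on what your data means.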
Oh and BTW, it is "Perl" (the language) or "perl" (the executable), but never ever "PERL". :)
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James