onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Since I am a PERL newbie, I thought of seeking wisdom before inadvertently converting my original problem into an X-Y problem!

I have a master list with names in the 1st column followed by 2 more columns of numbers, the 1st number smaller than the 2nd. The names can be repeated in this list and are not sorted in any order. The two numbers associated with a name can differ between repeated occurrences of that name, but not necessarily. Like so:

Alex 3 44
Barry 2 44
James 6 45
Drew 9 43
Alex 124 175

Though it may be obvious: there is only ONE master list/file, and its name is the first element in @ARGV, passed in from bash.
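
So on the command line I picture something like this (the script and file names below are just placeholders for illustration):

perl compare.pl master.txt secondary_1.txt secondary_2.txt

and inside the script the arguments would be split up along these lines:

my ($master_file, @secondary_files) = @ARGV;    # master first, then one or more secondary files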

Then I have multiple secondary files (could be just one, could be several; I don't know a priori), in the same format as the master list, i.e. also containing names that can be the same as, or more commonly a subset of, the names in the master list. These files also have 3 columns: a name, followed by 2 columns of numbers, the 1st smaller than the 2nd. For example, the 1st secondary file's contents could be, in no particular alphabetical or numerical order:

James 1 22
Alex 89 120
Alex 134 155
Barry 12 24

Likewise, the 2nd secondary file's contents could be:

Alex 154 174
James 29 45
Drew 19 54
Drew 139 154

My final output needs to contain the following information in grid form:

For each name from the primary file: IF it is present in the secondary files, AND the secondary numerical range is equal to or within the primary's numerical range, indicate it as present and include the secondary range's start and end numbers.

Else the fields for that name should be marked absent, with the range start and end filled with zeroes or simply left empty.

Based on the rules above, my output should look as below, with some informative headers for the output columns that I made up casually:

name   1'start 1'end  file#1   #1start #1end  file#2   #2start #2end
Alex   3       44     absent   0       0      absent   0       0
Barry  2       44     present  12      24     absent   0       0
James  6       45     present  1       22     present  29      45
Drew   9       43     absent   0       0      absent   0       0
Alex   124     175    present  134     155    present  154     174

Dear Monks - how should I go about doing this? This problem is a little too tricky for me because of the repetitive nature of the names combined with the possibility of a different numerical range for each occurrence of a repeated name. This means that I might mistakenly match the wrong secondary range to a primary range and conclude that a match does NOT exist, when in reality I have compared ranges that should NOT have been compared and should instead have looked at the numerical range of the other instance(s) of the name. Does that make sense? Perhaps I am obfuscating by typing more than I should...

Thanks in advance for your advice, have a nice weekend!

Replies are listed 'Best First'.
Re: multi column multi file comparison
by GrandFather (Saint) on May 21, 2011 at 09:20 UTC

    What have you tried? There is a neat trick you can use in Perl to make it easy to mock up a few files that may help you get started on a solution. Consider:

use strict;
use warnings;

my $file1Str = <<FILEDATA;
Alex 3 44
Barry 2 44
James 6 45
Drew 9 43
Alex 124 175
FILEDATA

my $file2Str = <<FILEDATA;
James 1 22
Alex 89 120
Alex 134 155
Barry 12 24
FILEDATA

my $file3Str = <<FILEDATA;
Alex 154 174
James 29 45
Drew 19 54
Drew 139 154
FILEDATA

open my $fileIn, '<', \$file1Str;
print "Prim: $_" for <$fileIn>;
close $fileIn;

for my $secFile ($file2Str, $file3Str) {
    open my $secIn, '<', \$secFile;
    print "Sec: $_" for <$secIn>;
}

    Prints:

Prim: Alex 3 44
Prim: Barry 2 44
Prim: James 6 45
Prim: Drew 9 43
Prim: Alex 124 175
Sec: James 1 22
Sec: Alex 89 120
Sec: Alex 134 155
Sec: Barry 12 24
Sec: Alex 154 174
Sec: James 29 45
Sec: Drew 19 54
Sec: Drew 139 154

    The trick is using a string as an input file by passing a reference to it in place of a file name in the open statement. That lets you include several test "files" directly in your source code so you can easily play around with test data. It also makes it really easy for us to reproduce your results when you ask a question concerning your code.

    True laziness is hard work
Re: multi column multi file comparison
by tospo (Hermit) on May 21, 2011 at 12:11 UTC
    The way to get started with something like this is to write an algorithm for solving the problem in plain English and then try to break the problem into chunks that you can tackle one by one.
    In this case, your first chunk to solve would be how to read all the data from the master file into a format that allows you to compare each entry from the other files with the master file. What you want for this is a Hash of Arrays like so:
    my %master = (
        Alex  => [ [3,44], [124,175] ],
        Barry => [ [2,44] ],
        # more data here #
    );
    Here, I associate every name with a list of ranges stored as references to arrays. Each range in turn is also a reference to an array of two elements: the start and end of the range. Of course you could also store the range as a string; it wouldn't really make much of a difference in this case.
    Now you can access data for a person by name and iterate over all their ranges like so:
    my $some_name = 'Alex';
    foreach my $range ( @{ $master{$some_name} } ) {
        my ($start, $end) = @$range;
        print "$start, $end\n";
    }
    What you need to figure out now is how to read your master file into the %master hash, how to get $some_name, and how to compare the ranges. Have a go at that and let us know if you run into problems.
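    As a starting point, a minimal sketch of that reading step could look like this (assuming whitespace-separated columns and the master file name already in $master_file; untested):
my %master;
open my $master_fh, '<', $master_file or die "Cannot open $master_file: $!";
while ( my $line = <$master_fh> ) {
    chomp $line;
    next unless $line =~ /\S/;                      # skip blank lines
    my ($name, $start, $end) = split ' ', $line;    # 3 whitespace-separated columns
    push @{ $master{$name} }, [ $start, $end ];     # keep every occurrence of a repeated name
}
close $master_fh;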
Re: multi column multi file comparison
by CountZero (Bishop) on May 22, 2011 at 09:01 UTC
    With the use of some modules (Text::CSV::Auto to read in the data and Number::Range to process the ranges), the following works ... in a way.

use Modern::Perl;
use Number::Range;
use Text::CSV::Auto qw( process_csv );
use Data::Dump qw/dump/;

my $debug = 0;
my %database;

process_csv(
    './primary.txt',
    sub {
        my $row = shift;
        push @{ $database{ $row->{name} } }, [ $row->{start}, $row->{end} ];
    }
);
say dump(\%database) if $debug;

analyse('./secondary_1.txt');
analyse('./secondary_2.txt');

sub analyse {
    my $file = shift;
    my %data;
    process_csv(
        $file,
        sub {
            my $row = shift;
            push @{ $data{ $row->{name} } }, [ $row->{start}, $row->{end} ];
        }
    );
    say dump(\%data) if $debug;

    for my $name (sort keys %data) {
        unless ($database{$name}) {
            say "Unknown name '$name'";
            next;
        }
        for my $range_ref (@{ $database{$name} }) {
            print "$name: $range_ref->[0] $range_ref->[1] ";
            my $range = Number::Range->new($range_ref->[0] .. $range_ref->[1]);
            for my $testrange_ref (@{ $data{$name} }) {
                if ($range->inrange(@$testrange_ref)) {
                    print "present $testrange_ref->[0] $testrange_ref->[1] ";
                }
                else {
                    print "absent 0 0 ";
                }
            }
            print "\n";
        }
    }
}

    Output:

Alex: 3 44 absent 0 0 absent 0 0
Alex: 124 175 absent 0 0 present 134 155
Barry: 2 44 present 12 24
James: 6 45 absent 0 0
Alex: 3 44 absent 0 0
Alex: 124 175 present 154 174
Drew: 9 43 present 19 54 absent 0 0
James: 6 45 present 29 45

    This solution suffers from the same problem you already mentioned: it checks all ranges and finds that some ranges in the secondary files are outside of the ranges in the primary file (see the data for "Alex"). But unless you find a good rule to define which ranges in the secondary files are to be checked against which ranges in the primary files, there is no way to solve this problem.

    Oh and BTW, it is "Perl" (the language) or "perl" (the executable), but never ever "PERL". :)

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Hi CountZero,

      Thanks for your posting. I think it's pretty close to the output I was hoping to produce.

      Since I've never used such complicated code before, I was not very successful at modifying it for my exact purpose (see below). Could you please help out again with a modification of your Perl script? (not PERL) :)

      It looks like, in the output you generated, you did more than compare the 1 primary file to all the secondary files, but I can't say for sure.

      Could you help out by modifying your code so that the output is in tabular form for ONLY the primary-to-EACH-secondary-file comparison, with no reciprocal comparisons of secondary to primary, or of one secondary file to another?
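
      To be concrete, I imagine each output row being printed with something along these lines (the variable names and column widths below are just made up for illustration):

printf "%-8s %7d %7d %-8s %7d %7d %-8s %7d %7d\n",
    $name,    $p_start,  $p_end,
    $status1, $s1_start, $s1_end,
    $status2, $s2_start, $s2_end;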

      Also, I think in your example output, for Drew and Barry the comparisons against one of the secondary files are missing; only one set of entries is present. Is that easy to fix?

      In terms of the range comparison, it is very simple math, and I think you have it implemented right. Anyway, it is described below:

      In the original data files, as a rule, ALWAYS p1 < p2, S1-1 < S1-2, and S2-1 < S2-2.

      If the primary range is numerically p1 to p2, and the secondary ranges are S1-1 to S1-2 (for the 1st secondary file) and S2-1 to S2-2 (for the 2nd secondary file), then simply:

      S1-1 >= p1
      S1-1 <= p2
      S1-2 >= p1
      S1-2 <= p2

      This means the range S1-1 to S1-2 is nested within the range p1 to p2 (it may also be exactly the same), but any extension past the primary range is disallowed. Likewise for any other range, e.g. S2-1 to S2-2 from another file. Therefore, again (a small Perl rendering of this test is sketched right after these conditions):

      S2-1 >= p1
      S2-1 <= p2
      S2-2 >= p1
      S2-2 <= p2
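
      In Perl terms, the whole test boils down to something like this (just a sketch; $p1, $p2, $s1 and $s2 are hypothetical variables holding the primary and secondary start/end values for one name):

      my $is_nested = ( $s1 >= $p1 and $s1 <= $p2 and $s2 >= $p1 and $s2 <= $p2 );
      # i.e. the secondary range [$s1, $s2] must lie entirely within the primary range [$p1, $p2]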

      Thanks again CountZero

        I see where I went wrong.

        I checked the secondary files against the primary file (analyse subroutine), rather than the other way around. Easy enough to switch that:

use Modern::Perl;
use Number::Range;
use Text::CSV::Auto qw( process_csv );
use Data::Dump qw/dump/;

my $debug = 0;
my %database;

process_csv(
    './primary.txt',
    sub {
        my $row = shift;
        push @{ $database{ $row->{name} } }, [ $row->{start}, $row->{end} ];
    }
);
say dump(\%database) if $debug;

analyse('./secondary_1.txt');
analyse('./secondary_2.txt');

sub analyse {
    my $file = shift;
    say "--Checking $file--";
    my %data;
    process_csv(
        $file,
        sub {
            my $row = shift;
            push @{ $data{ $row->{name} } }, [ $row->{start}, $row->{end} ];
        }
    );
    say dump(\%data) if $debug;

    for my $name (sort keys %database) {
        unless ($data{$name}) {
            for my $range_ref (@{ $database{$name} }) {
                say "$name: $range_ref->[0] $range_ref->[1] absent 0 0";
            }
            next;
        }
        for my $range_ref (@{ $database{$name} }) {
            print "$name: $range_ref->[0] $range_ref->[1] ";
            my $range = Number::Range->new($range_ref->[0] .. $range_ref->[1]);
            for my $testrange_ref (@{ $data{$name} }) {
                if ($range->inrange(@$testrange_ref)) {
                    print "present $testrange_ref->[0] $testrange_ref->[1] ";
                }
                else {
                    print "absent 0 0 ";
                }
            }
            print "\n";
        }
    }
    say '------------------------------';
}
        Which produces the following output:
--Checking ./secondary_1.txt--
Alex: 3 44 absent 0 0 absent 0 0
Alex: 124 175 absent 0 0 present 134 155
Barry: 2 44 present 12 24
Drew: 9 43 absent 0 0
James: 6 45 absent 0 0
------------------------------
--Checking ./secondary_2.txt--
Alex: 3 44 absent 0 0
Alex: 124 175 present 154 174
Barry: 2 44 absent 0 0
Drew: 9 43 present 19 54 absent 0 0
James: 6 45 present 29 45
------------------------------

        The whole range checking is done thanks to the Number::Range module. Please check out its documentation, and more specifically the inrange method called in scalar context. I hand it the start and end points of the range to be tested; if both are within the range being checked against, it returns true, hence the whole range must lie within.
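
        For illustration, a minimal stand-alone check along the same lines as the call in the script above (the numbers are taken from the James row of the example data; this assumes Number::Range is installed):

use Number::Range;

my $primary = Number::Range->new(6 .. 45);                     # James' primary range (6 45)
print $primary->inrange(29, 45) ? "present\n" : "absent\n";    # both endpoints inside -> present
print $primary->inrange(1, 22)  ? "present\n" : "absent\n";    # 1 lies outside -> absent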

        The analyse subroutine in the script above takes as its only argument the filename of a "secondary" file, and the primary file is checked against that secondary file. There is no secondary-against-secondary checking, nor a reciprocal secondary-against-primary check. Look at the main for-loop inside this subroutine: it iterates over the %database hash, which has been populated with the data from the primary file.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James