help requested with collating data from two files

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have two files - one which looks like this (FILE1)

1.10.10
1.10.1040
1.10.150
1.10.220

...etc

and another that looks like this (FILE2)

1.10.10.640
1.10.10.650
1.10.10.660
1.10.1040.20
1.10.150.290
1.10.150.300
1.10.150.310
1.10.220.80

...etc

I'm wanting to match the first three numbers from the 4 number strings in FILE2 (each number being separated by a '.') with the corresponding three number strings in FILE1 and then count how many matches there are.

For example for the three number string 1.10.10 in FILE1 there are 3 matches in FILE2.

I want the output file to look like this:

1.10.10 3
1.10.1040 1
1.10.150 3
1.10.220 1

I've got the following script

use strict;

my $file1 = shift;

my $file2 = shift;

open(FILE1, $file1) or die "Cant open $file1:$!\n";

open(FILE2, $file2) or die "Cant open $file2:$!\n";

my @file1 =<FILE1>;

my @file2 =<FILE2>;

#print @file1;
#print @file2;

for(my $i=0; $i<@file1; $i++)
{
    chop($file1[$i]);
    #print "F $file1[$i]";

    for(my $j=0; $j<@file2; $j++)
    {
    chop($file2[$j]);

    my @array = split(/\./, $file2[$j]);

    my $cat = "$array[0]" . "." . "$array[1]" . "." . "$array[2]";

    if("$file1[$i]" eq "$cat")
    {
        print "$file1[$i] $cat\n";
    }
    
    }
}

But it's simply not working properly: it's not finding all the matches. I've given only example files here; the actual files are much bigger. Some matches are recognised, but not all. Any pointers in the right direction much appreciated!!!

Re: help requested with collating data from two files by ikegami (Patriarch) on Sep 13, 2010 at 21:07 UTC
You're chop'ing the elements of @file2 waaaay too many times. Had you used the recommended chomp, all you would have lost is CPU cycles. With chop, you're losing data.

use strict;
use warnings;

my $file1 = shift;
my $file2 = shift;

open(my $fh1, '<', $file1) or die "Can't open $file1: $!\n";
open(my $fh2, '<', $file2) or die "Can't open $file2: $!\n";

chomp( my @file1 = <$fh1> );
chomp( my @file2 = <$fh2> );

my %counts;
for my $base (@file1) {
    my $re = qr/^\Q$base\E(?:\.|\z)/;
    for my $node (@file2) {
        ++$counts{$base} if $node =~ /$re/;
    }
}

for my $base (keys(%counts)) {
    print("$base: $counts{$base}\n");
}

or even

...
my %counts;
for my $base (@file1) {
    my $re = qr/^\Q$base\E(?:\.|\z)/;
    $counts{$base} += grep /$re/, @file2;
}
...
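To make the data loss concrete, here is a minimal sketch. The sample value and the four passes are assumptions, chosen to mirror the four FILE1 lines in the question: each FILE2 line gets visited once per FILE1 line, so it is chopped four times.

```perl
use strict;
use warnings;

# Hypothetical sample line, as read from FILE2 with its newline attached.
my $chopped = "1.10.10.640\n";
my $chomped = "1.10.10.640\n";

# The original script revisits every FILE2 line once per FILE1 line,
# so simulate four passes of the outer loop.
for my $pass (1 .. 4) {
    chop  $chopped;    # removes the last character, whatever it is
    chomp $chomped;    # removes a trailing newline only, if there is one
}

print "after chop:  '$chopped'\n";    # '1.10.10.'
print "after chomp: '$chomped'\n";    # '1.10.10.640'
```

The first chop removes the newline; every later one eats a digit, which is why only some matches were found.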
Re: help requested with collating data from two files by kennethk (Abbot) on Sep 13, 2010 at 21:01 UTC
I suspect the issue is because you are using chop instead of chomp. If your input files are not newline terminated, your chops will remove data values from the terminating entry. The code:

#!/usr/bin/perl
use strict;
use warnings;

my @file1 = qw( 1.10.10
                1.10.1040
                1.10.150
                1.10.220
                );

my @file2 = qw( 1.10.10.640
                1.10.10.650
                1.10.10.660
                1.10.1040.20
                1.10.150.290
                1.10.150.300
                1.10.150.310
                1.10.220.80
                );

for(my $i=0; $i<@file1; $i++)
{
    chomp($file1[$i]);
    #print "F $file1[$i]";

    for(my $j=0; $j<@file2; $j++)
    {
        chomp($file2[$j]);

        my @array = split(/\./, $file2[$j]);

        my $cat = "$array[0]" . "." . "$array[1]" . "." . "$array[2]";

        if("$file1[$i]" eq "$cat")
        {
            print "$file1[$i] $cat\n";
        }
    }
}

(with null-op chomps) yields the result

1.10.10 1.10.10
1.10.10 1.10.10
1.10.10 1.10.10
1.10.1040 1.10.1040
1.10.150 1.10.150
1.10.150 1.10.150
1.10.150 1.10.150
1.10.220 1.10.220

which seems to my eyes to be the spec. There are some stylistic modifications I would implement (Foreach Loops, hash instead of iterating with `eq`), but this should fix your bug.

Update: I'd missed that you were chopping in an inner loop, as ikegami notes below. The bug fix still holds.
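For what it's worth, the hash variant might look something like the sketch below. This is only one way to do it, not anything posted in the thread: `count_prefixes` is a hypothetical helper name, the data is inlined rather than read from files, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# count_prefixes is a hypothetical helper, not code from the thread.
# It takes two array refs: the three-number strings (FILE1) and the
# four-number strings (FILE2), and returns [base, count] pairs in FILE1 order.
sub count_prefixes {
    my ($bases, $nodes) = @_;

    my %counts;
    for my $node (@$nodes) {
        my @parts = split /\./, $node;
        next if @parts < 4;                       # skip malformed lines
        $counts{ join '.', @parts[0 .. 2] }++;    # key on the first three numbers
    }

    # Bases with no match get a count of 0 (needs Perl 5.10+ for //).
    return map { [ $_, $counts{$_} // 0 ] } @$bases;
}

my @file1 = qw( 1.10.10 1.10.1040 1.10.150 1.10.220 );
my @file2 = qw( 1.10.10.640 1.10.10.650 1.10.10.660 1.10.1040.20
                1.10.150.290 1.10.150.300 1.10.150.310 1.10.220.80 );

print "$_->[0] $_->[1]\n" for count_prefixes(\@file1, \@file2);
```

This makes a single pass over FILE2 instead of one pass per FILE1 line, and because the key is compared as a whole string, 1.10.10 cannot be confused with 1.10.1040.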
Re^2: help requested with collating data from two files by ikegami (Patriarch) on Sep 13, 2010 at 21:10 UTC
"If your input files are not newline terminated, your chops will remove data values from the terminating entry."

If the files weren't newline terminated, `@file1` and `@file2` would only have one element.
Re: help requested with collating data from two files by dasgar (Priest) on Sep 13, 2010 at 21:10 UTC
I think kennethk is probably right about what may be causing your issue. However, I'd like to offer a tip on debugging.

The Data::Dumper module is a very handy tool in debugging. If you encounter a problem with regular expressions with variables not matching as you want, try using Data::Dumper (or a simple print) just before the regex to find out for sure what is being stored in the variable(s). If the variable(s) don't have what you think they should, then you know where to begin to look for the source of the problem.
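As a rough illustration of that tip, the sketch below dumps a sample value before and after a couple of chops. The sample string is an assumption; `$Data::Dumper::Useqq` and `$Data::Dumper::Terse` are standard Data::Dumper settings that print strings double-quoted with escapes and drop the `$VAR1 =` prefix.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # show control characters such as "\n" as escapes
$Data::Dumper::Terse = 1;    # print the bare value, without "$VAR1 = "

# Hypothetical sample line, as read from FILE2 with its newline attached.
my $line = "1.10.10.640\n";

print "as read:         ", Dumper($line);

chop $line;                  # first chop takes the newline
chop $line;                  # second chop takes a digit

print "after two chops: ", Dumper($line);
```

With the newline shown as an escape, it is easy to spot both a stray line ending and a value that has lost characters.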
Re^2: help requested with collating data from two files by planetscape (Chancellor) on Sep 14, 2010 at 03:29 UTC
Or see How can I visualize my complex data structure? The Basic debugging checklist may also help.

HTH,
planetscape
Re: help requested with collating data from two files by Utilitarian (Vicar) on Sep 14, 2010 at 08:37 UTC
The remaining issue will be that `1.10.10` also matches lines that match `1.10.1040`. To resolve this, try something like the following:

use strict;
use warnings;

open(my $data_1 ,"<", "data1.dat");
my %stubs;
while (<$data_1>){
    chomp;
    $stubs{$_}=0;
}
close $data_1;

open(my $data_2 ,"<", "data2.dat");
while (<$data_2>){
    chomp;
    for my $stub (reverse sort keys %stubs){
        if ($_=~ /\Q$stub\E/){
            $stubs{$stub}++;
            last
        }
    }
}

for my $stub (sort {$stubs{$a} <=> $stubs{$b}} (keys %stubs)){
    printf "%-10s: %d\n",$stub, $stubs{$stub};
}
__OUTPUT__
1.10.220  : 1
1.10.1040 : 1
1.10.10   : 3
1.10.150  : 3

By using the reverse sort on the keys of `%stubs` you can guarantee a "longest match only".

print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
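Another way to avoid the prefix clash, sketched here as an alternative rather than a correction, is to anchor the pattern so the base must be followed by a dot or the end of the string. `is_child_of` is a hypothetical helper name and the sample values come from the question.

```perl
use strict;
use warnings;

# is_child_of is a hypothetical helper: true when $node starts with $base
# and $base is followed by a '.' or by the end of the string.
sub is_child_of {
    my ($node, $base) = @_;
    return $node =~ /^\Q$base\E(?:\.|\z)/ ? 1 : 0;
}

my $base = '1.10.10';

for my $node ('1.10.10.640', '1.10.1040.20') {
    my $loose  = $node =~ /\Q$base\E/ ? 1 : 0;    # unanchored substring match
    my $strict = is_child_of($node, $base);
    print "$node  unanchored=$loose  anchored=$strict\n";
}
# 1.10.10.640   unanchored=1  anchored=1
# 1.10.1040.20  unanchored=1  anchored=0
```

With the anchored form the order in which the bases are tried no longer matters, so no sorting is needed.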
Re: help requested with collating data from two files by Angharad (Pilgrim) on Sep 13, 2010 at 21:15 UTC
Great, you have all been really helpful. Thank you. Amazing how one silly error can mess up everything. And I'll have a look at Data::Dumper too :)

