PerlMonks  

help requested with collating data from two files

by Angharad (Pilgrim)
on Sep 13, 2010 at 20:47 UTC ( [id://860029]=perlquestion )

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

I have two files - one which looks like this (FILE1)
1.10.10
1.10.1040
1.10.150
1.10.220
...etc

and another that looks like this (FILE2)

1.10.10.640
1.10.10.650
1.10.10.660
1.10.1040.20
1.10.150.290
1.10.150.300
1.10.150.310
1.10.220.80
...etc

I'm wanting to match the first three numbers from the 4 number strings in FILE2 (each number being separated by a '.') with the corresponding three number strings in FILE1 and then count how many matches there are.

For example for the three number string 1.10.10 in FILE1 there are 3 matches in FILE2.

I want the output file to look like this:

1.10.10 3
1.10.1040 1
1.10.150 3
1.10.220 1

I've got the following script:

    use strict;

    my $file1 = shift;
    my $file2 = shift;

    open(FILE1, $file1) or die "Cant open $file1:$!\n";
    open(FILE2, $file2) or die "Cant open $file2:$!\n";

    my @file1 =<FILE1>;
    my @file2 =<FILE2>;

    #print @file1;
    #print @file2;

    for(my $i=0; $i<@file1; $i++)
    {
        chop($file1[$i]);
        #print "F $file1[$i]";
        for(my $j=0; $j<@file2; $j++)
        {
            chop($file2[$j]);
            my @array = split(/\./, $file2[$j]);
            my $cat = "$array[0]" . "." . "$array[1]" . "." . "$array[2]";
            if("$file1[$i]" eq "$cat")
            {
                print "$file1[$i] $cat\n";
            }
        }
    }

But it's simply not working properly - it's not finding all the matches. I've given only example files here - the actual files are much bigger - some matches are recognised but not all. Any pointers in the right direction much appreciated!!!

Replies are listed 'Best First'.
Re: help requested with collating data from two files
by ikegami (Patriarch) on Sep 13, 2010 at 21:07 UTC
    You're chop'ing the elements in @file2 waaaay too many times. Had you used the recommended chomp, all you would have lost is CPU cycles. With chop, you're losing data.

        use strict;
        use warnings;

        my $file1 = shift;
        my $file2 = shift;

        open(my $fh1, '<', $file1) or die "Cant open $file1: $!\n";
        open(my $fh2, '<', $file2) or die "Cant open $file2: $!\n";

        chomp( my @file1 = <$fh1> );
        chomp( my @file2 = <$fh2> );

        my %counts;
        for my $base (@file1) {
            my $re = qr/^\Q$base\E(\.|\z)/;
            for my $node (@file2) {
                ++$counts{$base} if $node =~ /$re/;
            }
        }

        for my $base (keys(%counts)) {
            print("$base: $counts{$base}\n");
        }

    or even
        ...
        my %counts;
        for my $base (@file1) {
            my $re = qr/^\Q$base\E(\.|\z)/;
            $counts{$base} += grep /$re/, @file2;
        }
        ...

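
The chop-versus-chomp point can be seen in isolation with a small sketch. The helper name `strip_repeatedly` is made up for illustration; it just mimics what happens to one FILE2 line when the inner loop touches it again on every pass of the outer loop. chomp only ever removes a trailing newline, so repeating it is harmless, while each extra chop eats one more character of data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply chop or chomp to a copy
# of $line $times times and return what is left.
sub strip_repeatedly {
    my ($op, $line, $times) = @_;
    for (1 .. $times) {
        if ($op eq 'chop') { chop $line } else { chomp $line }
    }
    return $line;
}

my $line = "1.10.10.640\n";

# Three passes, as if FILE1 had three lines.
print "chop  x3: '", strip_repeatedly('chop',  $line, 3), "'\n";   # '1.10.10.6'
print "chomp x3: '", strip_repeatedly('chomp', $line, 3), "'\n";   # '1.10.10.640'
```

With enough FILE1 lines the chops eventually reach the third number, which would explain why some matches are found and others are not.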
Re: help requested with collating data from two files
by kennethk (Abbot) on Sep 13, 2010 at 21:01 UTC
    I suspect the issue is because you are using chop instead of chomp. If your input files are not newline terminated, your chops will remove data values from the terminating entry. The code:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @file1 = qw( 1.10.10 1.10.1040 1.10.150 1.10.220 );
        my @file2 = qw( 1.10.10.640 1.10.10.650 1.10.10.660 1.10.1040.20
                        1.10.150.290 1.10.150.300 1.10.150.310 1.10.220.80 );

        for(my $i=0; $i<@file1; $i++)
        {
            chomp($file1[$i]);
            #print "F $file1[$i]";
            for(my $j=0; $j<@file2; $j++)
            {
                chomp($file2[$j]);
                my @array = split(/\./, $file2[$j]);
                my $cat = "$array[0]" . "." . "$array[1]" . "." . "$array[2]";
                if("$file1[$i]" eq "$cat")
                {
                    print "$file1[$i] $cat\n";
                }
            }
        }

    (with null-op chomps) yields the result

        1.10.10 1.10.10
        1.10.10 1.10.10
        1.10.10 1.10.10
        1.10.1040 1.10.1040
        1.10.150 1.10.150
        1.10.150 1.10.150
        1.10.150 1.10.150
        1.10.220 1.10.220

    which seems to my eyes to be the spec. There are some stylistic modifications I would implement (Foreach Loops, hash instead of iterating with eq), but this should fix your bug.

    Update: I'd missed that you were chopping in an inner loop, as ikegami notes below. The bug fix still holds.
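
For what it's worth, the "hash instead of iterating with eq" idea might look roughly like the sketch below. The sub name `count_by_prefix` and its array-ref interface are assumptions made for the example, not code from the thread; it also assumes the lines have already been chomped. FILE2 is walked once and each three-number prefix is looked up directly in the hash.

```perl
use strict;
use warnings;

# Hypothetical helper: $bases and $nodes are array refs of chomped lines.
# Returns a hash ref mapping each base to the number of nodes whose first
# three dot-separated fields equal that base.
sub count_by_prefix {
    my ($bases, $nodes) = @_;
    my %counts = map { $_ => 0 } @$bases;
    for my $node (@$nodes) {
        my @parts = split /\./, $node;
        next if @parts < 3;                      # skip malformed lines
        my $key = join '.', @parts[0 .. 2];
        $counts{$key}++ if exists $counts{$key};
    }
    return \%counts;
}

my @file1 = qw( 1.10.10 1.10.1040 1.10.150 1.10.220 );
my @file2 = qw( 1.10.10.640 1.10.10.650 1.10.10.660 1.10.1040.20
                1.10.150.290 1.10.150.300 1.10.150.310 1.10.220.80 );

my $counts = count_by_prefix(\@file1, \@file2);
print "$_ $counts->{$_}\n" for @file1;           # keeps FILE1 order
```

On the sample data this prints the four lines of the requested output (1.10.10 3, 1.10.1040 1, 1.10.150 3, 1.10.220 1), and bases with no matches are reported with a count of 0.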

      If your input files are not newline terminated, your chops will remove data values from the terminating entry. The code:

      If the files weren't newline terminated, @file1 and @file2 would only have one element.
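
Both readings of "not newline terminated" can be checked with an in-memory filehandle. This is only a sketch; `read_lines` is a made-up helper and the strings stand in for the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: read all lines from a string via an in-memory filehandle.
sub read_lines {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!\n";
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Only the last line lacks a newline: still three elements.
my @last_unterminated = read_lines("1.10.10\n1.10.1040\n1.10.150");

# No newlines at all: everything arrives as a single element.
my @no_newlines = read_lines("1.10.10 1.10.1040 1.10.150");

printf "last line unterminated: %d element(s)\n", scalar @last_unterminated;
printf "no newlines at all:     %d element(s)\n", scalar @no_newlines;
```

In the first case the final element is "1.10.150" with no trailing newline, so a chop on it removes the last digit, whereas chomp leaves it alone.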

Re: help requested with collating data from two files
by dasgar (Priest) on Sep 13, 2010 at 21:10 UTC

    I think kennethk is probably right about what may be causing your issue.

    However, I'd like to offer a tip on debugging. The Data::Dumper module is a very handy tool in debugging. If you encounter a problem with regular expressions with variables not matching as you want, try using Data::Dumper (or a simple print) just before the regex to find out for sure what is being stored in the variable(s). If the variable(s) don't have what you think they should, then you know where to begin to look for the source of the problem.
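
A minimal sketch of that tip, assuming the suspect value is a line read from one of the files. The `show` wrapper is just for the example; `$Data::Dumper::Useqq` and `$Data::Dumper::Terse` are the module's own configuration variables. With Useqq set, a stray trailing newline shows up as a visible escape instead of an invisible line break.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # print control characters such as newline as escapes
$Data::Dumper::Terse = 1;   # leave out the "$VAR1 = " prefix

# Hypothetical wrapper: return a printable dump of one value.
sub show {
    my ($value) = @_;
    return Dumper($value);
}

my $raw     = "1.10.10\n";  # as read from a file
my $chomped = $raw;
chomp $chomped;

print "raw:     ", show($raw);
print "chomped: ", show($chomped);
```

Dumping the value just before the comparison makes it obvious whether it still carries a newline or has lost characters.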

Re: help requested with collating data from two files
by Utilitarian (Vicar) on Sep 14, 2010 at 08:37 UTC
    The remaining issue will be that 1.10.10 also matches lines that match 1.10.1040. To resolve this, try something like the following:

        use strict;
        use warnings;

        open(my $data_1 ,"<", "data1.dat");
        my %stubs;
        while (<$data_1>){
            chomp;
            $stubs{$_}=0;
        }
        close $data_1;

        open(my $data_2 ,"<", "data2.dat");
        while (<$data_2>){
            chomp;
            for my $stub (reverse sort keys %stubs){
                if ($_=~ /\Q$stub\E/){
                    $stubs{$stub}++;
                    last
                }
            }
        }

        for my $stub (sort {$stubs{$a} <=> $stubs{$b}} (keys %stubs)){
            printf "%-10s: %d\n",$stub, $stubs{$stub};
        }
        __OUTPUT__
        1.10.220  : 1
        1.10.1040 : 1
        1.10.10   : 3
        1.10.150  : 3

    By using the reverse sort on the keys of %stubs you can guarantee a "longest match only".
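
The prefix collision itself can be shown in a few lines. The sketch below contrasts an unanchored pattern with the anchored form ikegami used earlier in the thread; the sub name `has_base` is made up for the example. Anchoring at the start and requiring a "." or end of string after the base is another way to avoid the false match, independent of key order.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $node begins with $base followed by "." or
# the end of the string (the anchoring from ikegami's reply).
sub has_base {
    my ($base, $node) = @_;
    return $node =~ /^\Q$base\E(?:\.|\z)/ ? 1 : 0;
}

my $node = '1.10.1040.20';

print "unanchored 1.10.10:  ", ($node =~ /\Q1.10.10\E/ ? "match" : "no match"), "\n";
print "anchored   1.10.10:  ", (has_base('1.10.10',   $node) ? "match" : "no match"), "\n";
print "anchored   1.10.1040:", (has_base('1.10.1040', $node) ? " match" : " no match"), "\n";
```

The unanchored pattern matches because "1.10.10" is a substring of "1.10.1040.20"; the anchored check only accepts 1.10.1040 for that line.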

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: help requested with collating data from two files
by Angharad (Pilgrim) on Sep 13, 2010 at 21:15 UTC
    Great, you have all been really helpful. Thank you. Amazing how one silly error can mess up everything. And I'll have a look at Data::Dumper too :)

Node Type: perlquestion [id://860029]
Approved by kennethk