Compare Partial Lines of 2 Text Files

Knoperl has asked for the wisdom of the Perl Monks concerning the following question:

I beg your monkness with a basic problem. I have two text files:
<file#1>

applebananapearcarrotcarrotbeardeerdeer
goatcowduckswanchickenmouseratbirdmouse
chocolatedogdogfishmousecatdeerbird
newyorkcalifornianewjerseymousecatdeerbird

<file#2>

monksbicyclewindbikecars
computercomputerprinters
hellicopterairplaneshelf
chocolatedogdogfishmouse
printerprintermousecouch

.
.
goes on for another 600,000 lines

The output should be:
catdeerbird
Which is the 3rd line in <File#1> and the 4th line of <File#2>.
I anticipate that several thousand lines will partially match. Whenever the first 24 characters of a line in <File#1> are the same as the first 24 characters of a line in <File#2>, I want to output everything after those 24 characters from that line of <File#1>.

I did a Super Search and asked in the CB whether I could temporarily mask everything in <File#1> after the 24th character and, once the unmasked part matches a line in <File#2>, print out those masked characters. Anno said I should use substr. I have put together a basic version but obviously I really need help with this.

#!/usr/bin/perl -w
use strict;
my %lines;
my %lines2;
my $str2;
open(UNIQUES,"<$ARGV[0]");
open(ALL, "<$ARGV[1]");

while (<UNIQUES>)
    {
    $lines{$str1} = 1;
    }

while (<ALL>)
    {
    print $str1 if substr( $str1, 0, 24) eq ($str2); 
    next ($str2);
    }

Thank you for your kind assistance. Perl Monks Rule!


Re: Compare Partial Lines of 2 Text Files by GrandFather (Saint) on Jul 31, 2007 at 00:00 UTC
Parse the file containing the smaller number of lines and build a hash. Then parse the larger file and match lines using a hash lookup:

use strict;
use warnings;

my $file1 = <<FILE;
applebananapearcarrotcarrotbeardeerdeer
goatcowduckswanchickenmouseratbirdmouse
chocolatedogdogfishmousecatdeerbird
newyorkcalifornianewjerseymousecatdeerbird
FILE

my $file2 = <<FILE;
monksbicyclewindbikecars
computercomputerprinters
hellicopterairplaneshelf
chocolatedogdogfishmouse
printerprintermousecouch
FILE

my %f1Lines;

# Index file1 by its first 24 characters; keep the tail and line number.
open IN, '<', \$file1;
while (<IN>) {
    my ($key, $tail) = m/(.{24})(.*)/;
    push @{$f1Lines{$key}}, [$tail, $.];
}
close IN;

# Look up the first 24 characters of each file2 line in the hash.
open IN, '<', \$file2;
while (<IN>) {
    my ($key, $tail) = m/(.{24})(.*)/;
    next unless exists $f1Lines{$key};
    my @matches = @{$f1Lines{$key}};
    print "Line $. of file2 ($key$tail) matches:\n";
    print " line $_->[1] of file1 ($key$_->[0])\n" for @matches;
}
close IN;

Prints:

Line 4 of file2 (chocolatedogdogfishmouse) matches:
 line 3 of file1 (chocolatedogdogfishmousecatdeerbird)

DWIM is Perl's answer to Gödel
Re^2: Compare Partial Lines of 2 Text Files by Knoperl (Acolyte) on Jul 31, 2007 at 00:33 UTC
Thank you very much, GrandFather, but I think I did not give a clear example of the output I wanted:

File#1

abcd
efgh
ijkl
mnop

File#2

qq
rr
ij
st
mn
jj
rr

Output I would want:

kl
op

In this example the key is the first 2 characters of each line. I want to match only those first 2 characters between the 2 files, and the part I want printed is whatever comes after them in File#1. In the original question I wrote 24 characters, which is what I do want, but I realize that is confusing. Also, I would like it to read external files and not have the data embedded inside the Perl program. I again greatly appreciate any further assistance, either from you or from other people who would love to join in the fun here at PerlMonks.com!
Re^3: Compare Partial Lines of 2 Text Files by GrandFather (Saint) on Jul 31, 2007 at 01:14 UTC
Reread my sample and use that thing on your shoulders that prevents your hair falling down your throat and forming a hair ball.

My sample isn't intended to be a complete answer to your problem. It is intended to show you some tools and an approach using those tools to solve your problem. It is also intended to be self-contained so that you can easily reproduce the output I indicated that it generates. It should be pretty obvious how you plug in your own local files in place of the "internal files" used in the sample.

The sample prints out more information than you asked for because that demonstrates how to store information (such as line number) for the data in the hash and how to access that ancillary information. Because you don't tell us the back story and don't provide context details such as "duplicate keys can/can't exist", the sample code assumes not only that duplicate keys may exist, but that their context is important.

You may wish to consult perllol to gain some insight into how the hash of arrays (HoA) works if you've not encountered it before (or take a trip to the Tutorials section).

DWIM is Perl's answer to Gödel
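For readers following along, here is a minimal sketch of what "plugging in your own local files" could look like with the same hash-lookup approach, printing only the tails the original question asked for. It is an illustration under assumptions, not GrandFather's code: the helper name `matching_tails`, the hash `%tails_for`, the optional third command-line argument for the prefix length, and the choice to report each matching prefix only once are all made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the tails of the lines in $file1 whose first
# $len characters also begin some line in $file2.
sub matching_tails {
    my ($file1, $file2, $len) = @_;

    # Index file 1 by prefix. Every tail is kept, since duplicate keys may exist.
    my %tails_for;
    open my $in1, '<', $file1 or die "Cannot open $file1: $!";
    while (my $line = <$in1>) {
        chomp $line;
        next if length($line) < $len;    # too short to have a full key
        push @{ $tails_for{ substr($line, 0, $len) } }, substr($line, $len);
    }
    close $in1;

    # Stream file 2 line by line and collect the tails for each matching prefix.
    # Assumption: a prefix seen several times in file 2 is reported only once.
    my (@found, %seen);
    open my $in2, '<', $file2 or die "Cannot open $file2: $!";
    while (my $line = <$in2>) {
        chomp $line;
        next if length($line) < $len;
        my $key = substr($line, 0, $len);
        next unless exists $tails_for{$key};
        next if $seen{$key}++;
        push @found, @{ $tails_for{$key} };
    }
    close $in2;

    return @found;
}

# Usage: perl tails.pl file1.txt file2.txt [prefix_length]
if (@ARGV >= 2) {
    my ($file1, $file2, $len) = @ARGV;
    $len = 24 unless defined $len;       # 24 characters, as in the question
    print "$_\n" for matching_tails($file1, $file2, $len);
}
```

Only the smaller file is held in memory here; the 600,000-line file is read one line at a time. Passing 2 as the prefix length corresponds to the two-character example given earlier in the thread.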
Re: Compare Partial Lines of 2 Text Files by Anonymous Monk on Jul 31, 2007 at 07:46 UTC
my solution is almost the same as GrandFather's, except for the use of `unpack()` instead of a regex to split the lines. this may be noticeably faster with large files; however, the perl regex engine is often so well optimized that, in the case of a simple regex like the one below, the code generated may actually be an `unpack()` call or something very like it!

three variations are presented; all produce the same output. only the section of the code that differs is presented below.

(and, yes, you should have considered GrandFather's response more carefully before replying.)

use constant UNPACKER => 'A24 A*';

open IN, '<', \$file1;
while (<IN>) {
    # my ($key, $tail) = m/(.{24})(.*)/;
    # my ($key, $tail) = unpack('A24 A*', $_);
    # my ($key, $tail) = unpacker();
    my ($key, $tail) = unpack(UNPACKER, $_);
    push @{$f1Lines{$key}}, [$tail, $.];
}
close IN;

open IN, '<', \$file2;
while (<IN>) {
    # my ($key, $tail) = m/(.{24})(.*)/;
    # my ($key, $tail) = unpack('A24 A*', $_);
    # my ($key, $tail) = unpacker();
    my ($key, $tail) = unpack(UNPACKER, $_);
    next unless exists $f1Lines{$key};
    my @matches = @{$f1Lines{$key}};
    print "Line $. of file2 ($key$tail) matches:\n";
    print " line $_->[1] of file1 ($key$_->[0])\n" for @matches;
}
close IN;

sub unpacker { unpack('A24 A*', $_) }
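Whether `unpack()` is actually faster than the regex is easy to measure rather than guess. The following is a rough sketch using the core `Benchmark` module; the sub names `by_regex`, `by_unpack` and `by_substr` are invented for this example, the `substr` variant is added only because it was suggested in the original question, and the timings will depend on the Perl build and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $len  = 24;
my $line = "chocolatedogdogfishmousecatdeerbird\n";

# Three ways to split a line into (key, tail). Note that the 'A' format of
# unpack strips trailing whitespace, so the variants can differ on lines
# that end in spaces.
sub by_regex {
    my ($key, $tail) = $_[0] =~ m/(.{24})(.*)/;
    return ($key, $tail);
}

sub by_unpack {
    return unpack('A24 A*', $_[0]);
}

sub by_substr {
    my $text = $_[0];
    chomp $text;
    return (substr($text, 0, $len), substr($text, $len));
}

# Sanity check: all three agree on the sample line before timing anything.
for my $split (\&by_regex, \&by_unpack, \&by_substr) {
    my ($key, $tail) = $split->($line);
    die "unexpected split\n"
        unless $key eq 'chocolatedogdogfishmouse' && $tail eq 'catdeerbird';
}

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    regex  => sub { my @parts = by_regex($line) },
    unpack => sub { my @parts = by_unpack($line) },
    substr => sub { my @parts = by_substr($line) },
});
```

With only 600,000 lines the difference between the variants may well be lost in the time spent reading the file, so it is worth timing the whole script too before choosing one for speed alone.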
Re^2: Compare Partial Lines of 2 Text Files by Knoperl (Acolyte) on Jul 31, 2007 at 20:21 UTC
I would like to express my gratitude to GrandFather and Anonymous Monk for their assistance with this code. As to my reply, and whether I have a brain or should have been more careful, I really had good intentions. I realized that I could have made the question more detailed and was concerned that I had been unclear and thus confusing. That is why, when I saw there was no file handling in GrandFather's code, I felt that it was my fault for not telling him exactly what I was looking for. I do apologize and hope that with my public display of penance I will be forgiven.