A few things I notice from your code:
1. you store all the fields of your child file in @carray. Since you only need the second field, you could have each element of @carray hold just that field.
2. for every line in the parent (M lines), you search the entire child list (N elements). That's N*M comparisons. And from the way you've arranged @carray, duplicates aren't consolidated, so you're potentially doing extra searches.
3. you call chomp() more than you need to. Call chomp once upon reading a line from <PRFILE> (prior to splitting). Do the same when you load your field into @carray and you won't have to call chomp during your foreach (@carray) loop:
while (my $pline = <PRFILE>) {
    chomp $pline;                        # chomp once, before splitting
    $parrecord = $pline;
    my @parfields = split(/\|/, $pline); # split the already-chomped line
    ...
}
I see a couple of approaches you could take:
ARRAY approach
1. Create a non-redundant array (@carray) of field #2 from your child file (File1)
- either add each field #2 as a key of a hash; extract the keys with keys(); undef the hash or let it go out of scope to free the memory back to perl
or
- push each field #2 onto @carray directly; sort; remove duplicates
2. Search <PRFILE> for each $_ in @carray (a fuller sketch follows the pseudocode below):
CHILD: foreach (@carray) {
    # search <PRFILE> line-by-line for a match of $_ to field #5
    # if each field #5 in <PRFILE> is guaranteed unique, you can go
    # to the next CHILD element once a match is found
}
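Here's a minimal sketch of the ARRAY approach, using the hash-keys route from step 1. The file names (child.txt, parent.txt, matches.txt) and the pipe delimiter are assumptions on my part, and I've inverted the loop nesting relative to the pseudocode above so that <PRFILE> is only read once:

use strict;
use warnings;

open CHFILE, '<', 'child.txt' or die "Can't read child.txt: $!";
my %seen;
while (my $line = <CHFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    $seen{ $fields[1] } = undef;      # duplicates collapse into one key
}
close CHFILE;
my @carray = keys %seen;              # non-redundant list of field #2
undef %seen;                          # hand the memory back to perl

open PRFILE, '<', 'parent.txt'       or die "Can't read parent.txt: $!";
open my $OUTFILE, '>', 'matches.txt' or die "Can't write matches.txt: $!";
PARENT: while (my $line = <PRFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    foreach my $child (@carray) {
        if ($child eq $fields[4]) {   # field #5 of the parent line
            print $OUTFILE $line."\n";
            next PARENT;              # one match per parent line is enough
        }
    }
}
close PRFILE;
close $OUTFILE;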
HASH approach
In my own experience, I've noticed that the number of keys in a hash can grow to about 280,000 before lookups start slowing down considerably. If your 10M or so records contain fewer than ~300,000 unique fields, then this approach should be fine. If not, then you'll either have to go with the ARRAY approach above, or break your key up into a prefix and a suffix and store it that way. I'll give an example below.
1. Create a hash of all the elements:
my %lookupHash = ();
while (my $line = <CHFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    $lookupHash{ $fields[1] } = undef;   # $fields[1] = 2nd field
    #$lookupHash{ $fields[1] } = $line;  # need the entire line?
}
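To check whether you're inside that ~300,000-key comfort zone before going further, you can count the keys right after loading (keys() in scalar context is a cheap, exact count):

my $unique = scalar keys %lookupHash;
print STDERR "loaded $unique unique child keys\n";
# well past ~300_000? consider the prefix/suffix split described below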
2. Search <PRFILE>, line-by-line, to see if any element of %lookupHash is present:
while (my $line = <PRFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    print $OUTFILE $line."\n" if exists $lookupHash{ $fields[4] };
}
If your performance with the above approaches really is like molasses, continue reading...
****** It gets more complicated below here ******
Now in the worst-case scenario, if your 1GB file of 10M or so records truly has 10M *unique* records, then this hash lookup will be slow as balls. At the cost of extra space, though hopefully not in excess of your system RAM, you could break your hash key up into a prefix and a suffix to effectively create subsets of your data (as a hash-of-hashes) so you don't have to search the entire 10M records every time you do an exists(). For example, you could break "1400000042597061" down like one of these:
field              pre    suf                 $HoH{$pre}->{$suf} = ()
----------------   ----   -----------------   ---------------------------------
1400000042597061 = "1"  + "400000042597061"   $HoH{1}->{400000042597061} = ();
                 = "14" + "00000042597061"    $HoH{14}->{00000042597061} = ();
...
This way, you could do an exists() on a much smaller subset of the data. Unfortunately, 10M records covers at most 8 digits of variety (10,000,000 => 8 digits) while your field here has 16 digits, so the leading digits will be mostly identical across records; I would therefore store the suffix as the primary key, with the prefix behind the suffix. In addition, you would do well to keep the suffix down to 5 digits or less (at most 100,000 keys) rather than 6 digits (at most 1,000,000 keys, which could be too slow). With 10M unique records spread over at most 100,000 suffix buckets, each inner hash averages only ~100 keys. Example:
field              prefix          suffix    $HoH{$suf}->{$pre} = ();
----------------   -------------   -------   ---------------------------------
1400000042597061 = "14000000425" + "97061"   $HoH{97061}->{14000000425} = ();
Assuming $myField holds the field of interest, e.g. "1400000042597061", you can break it into prefix/suffix segments with substr():
my $prefix = substr($myField, 0, 11);   # positions 0..10 (first 11 digits)
my $suffix = substr($myField, 11);      # position 11 and on (last 5 digits)
Now load it up in your 2D hash (hash-of-hash):
$hash{$suffix}->{$prefix} = undef;
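A quick membership test against that structure then looks like this (the values are just illustrative):

if (exists $hash{'97061'} && exists $hash{'97061'}{'14000000425'}) {
    print "1400000042597061 is present\n";
}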
So this is how you would generate your lookup hash with <CHFILE>:
my %lookupHash = ();
while (my $line = <CHFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    my $prefix = substr($fields[1], 0, 11);   # positions 0..10 (11 digits)
    my $suffix = substr($fields[1], 11);      # position 11 and on
    $lookupHash{$suffix}->{$prefix} = undef;
    #$lookupHash{$suffix}->{$prefix} = $line; # need the entire line?
}
%lookupHash is the lookup you'll use to search your parent file.
Note that I didn't put in *any* error checking to see if your field conforms to what you're expecting. If you need to keep track of the entire $line itself, use the alternate (commented-out) %lookupHash assignment above.
Here's how you would search your parent file:
LINE: while (my $line = <PRFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    # do suffix first; skip the parent line if the suffix is absent
    my $suffix = substr($fields[4], 11);
    next LINE if !exists $lookupHash{$suffix};
    # reach here only if the suffix exists; now check the prefix
    my $prefix = substr($fields[4], 0, 11);
    next LINE if !exists $lookupHash{$suffix}->{$prefix};
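    # (checking the suffix on its own first also keeps this exists()
    # from autovivifying $lookupHash{$suffix} on a miss)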
    # Output the parent line
    print $OUTFILE $line."\n";
    # ... or output the child line if you stored it
    #print $OUTFILE $lookupHash{$suffix}->{$prefix}."\n";
}
This should speed up your program and reduce memory use without any additional data profiling.
Good luck!