A few things I notice from your code:
1. you store all the fields of your child file in @carray. Since you only need the second field, you could have each element of @carray hold just that field.
2. for every line in the parent (M lines), you search the entire child list (N elements). That's N*M comparisons. And from the way you've arranged @carray, duplicates aren't consolidated, so you're potentially doing extra searches.
3. you call chomp() more than you need to. Call chomp once upon reading a line from <PRFILE> (prior to splitting). Do the same when you load your field into @carray and you won't have to call chomp during your foreach (@carray) loop:
while (my $pline = <PRFILE>) {
    chomp $pline;                        # chomp once, before splitting
    $parrecord = $pline;
    my @parfields = split(/\|/, $pline); # split the already-chomped line
    ...
}
I see a couple of approaches you could take:
ARRAY approach
1. Create a non-redundant array (@carray) of field #2 from your child file (File1)
- either add each field #2 as a key of a hash; extract the keys with keys(); undef the hash or let it go out of scope to free the memory back to perl
or
- push each field #2 onto @carray directly; sort; remove duplicates
2. Search <PRFILE> for each $_ in @carray (a fuller sketch follows the pseudocode below):
CHILD: foreach (@carray) {
    # search <PRFILE> line-by-line for a match of $_ to field #5
    # if each field #5 in <PRFILE> is guaranteed unique, you can go
    # to the next CHILD element once a match is found
}
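Here's a minimal sketch of the ARRAY approach, using the hash-keys route from step 1. The file names (child.txt, parent.txt, matches.txt) and the pipe delimiter are assumptions on my part, and I've inverted the loop nesting relative to the pseudocode above so that <PRFILE> is only read once:

use strict;
use warnings;

open CHFILE, '<', 'child.txt' or die "Can't read child.txt: $!";
my %seen;
while (my $line = <CHFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    $seen{ $fields[1] } = undef;      # duplicates collapse into one key
}
close CHFILE;
my @carray = keys %seen;              # non-redundant list of field #2
undef %seen;                          # hand the memory back to perl

open PRFILE, '<', 'parent.txt'       or die "Can't read parent.txt: $!";
open my $OUTFILE, '>', 'matches.txt' or die "Can't write matches.txt: $!";
PARENT: while (my $line = <PRFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    foreach my $child (@carray) {
        if ($child eq $fields[4]) {   # field #5 of the parent line
            print $OUTFILE $line."\n";
            next PARENT;              # one match per parent line is enough
        }
    }
}
close PRFILE;
close $OUTFILE;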
HASH approach
In my own experience, I've noticed that the number of keys in a hash can grow to about 280,000 before lookups start slowing down considerably. If your 10M or so records contain fewer than ~300,000 unique fields, then this approach should be fine. If not, then you'll either have to go with the ARRAY approach above, or break your key up into a prefix and a suffix and store it that way. I'll give an example below.
1. Create a hash of all the elements:
my %lookupHash = ();
while (my $line = <CHFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    $lookupHash{ $fields[1] } = undef;   # $fields[1] = 2nd field
    #$lookupHash{ $fields[1] } = $line;  # need the entire line?
}
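To check whether you're inside that ~300,000-key comfort zone before going further, you can count the keys right after loading (keys() in scalar context is a cheap, exact count):

my $unique = scalar keys %lookupHash;
print STDERR "loaded $unique unique child keys\n";
# well past ~300_000? consider the prefix/suffix split described below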
2. Search <PRFILE>, line-by-line, to see if any element of %lookupHash is present:
while (my $line = <PRFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    print $OUTFILE $line."\n" if exists $lookupHash{ $fields[4] };
}
If your performance with the above approaches really is like molasses, continue reading...
****** It gets more complicated below here ******
Now in the worst-case scenario, if your 1GB file of 10M or so records truly has 10M *unique* records, then this hash lookup will be slow as balls. At the cost of extra space, though hopefully not in excess of your system RAM, you could break your hash key up into a prefix and a suffix to effectively create subsets of your data (as a hash-of-hashes) so you don't have to search the entire 10M records every time you do an exists(). For example, you could break "1400000042597061" down like one of these:
field              pre    suf                 $HoH{$pre}->{$suf} = ()
----------------   ----   -----------------   ---------------------------------
1400000042597061 = "1"  + "400000042597061"   $HoH{1}->{400000042597061} = ();
                 = "14" + "00000042597061"    $HoH{14}->{00000042597061} = ();
...
This way, you could do an exists() on a much smaller subset of the data. Unfortunately, 10M records covers at most 8 digits of variety (10,000,000 => 8 digits) while your field here has 16 digits, so the leading digits will be mostly identical across records; I would therefore store the suffix as the primary key, with the prefix behind the suffix. In addition, you would do well to keep the suffix down to 5 digits or less (at most 100,000 keys) rather than 6 digits (at most 1,000,000 keys, which could be too slow). With 10M unique records spread over at most 100,000 suffix buckets, each inner hash averages only ~100 keys. Example:
field              prefix          suffix    $HoH{$suf}->{$pre} = ();
----------------   -------------   -------   ---------------------------------
1400000042597061 = "14000000425" + "97061"   $HoH{97061}->{14000000425} = ();
Assuming $myField holds the field of interest, e.g. "1400000042597061", you can break it into prefix/suffix segments with substr():
my $prefix = substr($myField, 0, 11);   # positions 0..10 (first 11 digits)
my $suffix = substr($myField, 11);      # position 11 and on (last 5 digits)
Now load it up in your 2D hash (hash-of-hash):
$hash{$suffix}->{$prefix} = undef;
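A quick membership test against that structure then looks like this (the values are just illustrative):

if (exists $hash{'97061'} && exists $hash{'97061'}{'14000000425'}) {
    print "1400000042597061 is present\n";
}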
So this is how you would generate your lookup hash with <CHFILE>:
my %lookupHash = ();
while (my $line = <CHFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    my $prefix = substr($fields[1], 0, 11);   # positions 0..10 (11 digits)
    my $suffix = substr($fields[1], 11);      # position 11 and on
    $lookupHash{$suffix}->{$prefix} = undef;
    #$lookupHash{$suffix}->{$prefix} = $line; # need the entire line?
}
%lookupHash is the lookup you'll use to search your parent file.
Note that I didn't put in *any* error checking to see if your field conforms to what you're expecting. If you need to keep track of the entire $line itself, use the alternate (commented-out) %lookupHash assignment above.
Here's how you would search your parent file:
LINE: while (my $line = <PRFILE>) {
    chomp $line;
    my @fields = split(/\|/, $line);
    # do suffix first; skip the parent line if the suffix is absent
    my $suffix = substr($fields[4], 11);
    next LINE if !exists $lookupHash{$suffix};
    # reach here only if the suffix exists; now check the prefix
    my $prefix = substr($fields[4], 0, 11);
    next LINE if !exists $lookupHash{$suffix}->{$prefix};
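    # (checking the suffix on its own first also keeps this exists()
    # from autovivifying $lookupHash{$suffix} on a miss)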
    # Output the parent line
    print $OUTFILE $line."\n";
    # ... or output the child line if you stored it
    #print $OUTFILE $lookupHash{$suffix}->{$prefix}."\n";
}
This should speed up your program and reduce memory use without any additional data profiling.
Good luck!