Danu has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

First, let me thank you for accepting my registration and giving me the chance to interact with people like you, who are guiding us to become Monks.

Thank you once again.

Let me start my first interaction by asking for suggestions rather than offering them. I have a question about file handling. I have two files named DUMP_ID and DUMP_CARD_NO, and both files contain around 37 million records (this may increase or decrease somewhat each quarter).

The scenario is this: DUMP_ID contains the unique id of each card member, and DUMP_CARD_NO contains the unique id and the corresponding card account number. A unique id can have more than one card account number, depending on the region/locale. I have to take each record from DUMP_ID and compare its unique id against DUMP_CARD_NO's unique ids; if it matches, the unique id and the corresponding account number have to be written to another file (remember, a single unique id can have more than one card account number).

Currently, I take all the IDs from DUMP_ACCT_NO and move them into a hash:

    while (<DUMP_ACCT_NO>) {
        chomp;
        my ($id, $accno) = split /\|/, $_;
        push @{$hashList{$guid}}, $accno;
    }

Then I take each ID from DUMP_ID and compare it against this hash:

    while (<DUMP_ID>) {
        chomp;
        s/\s+//g;
        if ($hashList{$_}) {
            for my $accno (@{$hashList{$_}}) {
                print FINAL_FILE "$_|$accno\n";
            }
        }
    }

The problem is that we are running out of memory, which makes it impossible to complete the execution. Kindly guide me toward the right approach.

Thank you.

Replies are listed 'Best First'.
Re: Help on file comparison
by BrowserUk (Patriarch) on May 28, 2011 at 14:32 UTC

    A couple of questions:

    1. What do the unique ids look like? How many of what characters?
    2. What do the account numbers look like? How many of what characters?

    You say: "both files contain around 37 million records" and "A unique id can have more than one card account number"; these cannot both be completely correct, because if both files contained the same number of records, there would have to be a one-to-one correspondence between unique ids and account numbers.

    So which is it?

    • Are there 37 million unique ids and more than 37 million records in the second file?

      If so, how many records in the second file?

    • Or are there 37 million records in the second file and fewer unique ids?

      If so, how many?

    At what point do you run out of memory? Is it when building the hash of unique ids, or whilst populating the arrays with the corresponding account numbers?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi,

      The unique id is a 32-character alphanumeric value, and the card number is a 15-digit numeric value.

      DUMP_ID contains values like,

      aer7893jhufn3ko9ij8omnu89koi8hyt

      aer7893jhufn7ko9ij8omnu89koi8hnh

      DUMP_ACCT_NO contains records like,

      aer7893jhufn7ko9ij8omnu89koi8hnh|675634902349287

      aer7893jhufn7ko9ij8omnu89koi8hnh|324634902349287

      Here, DUMP_ID contains more records than DUMP_ACCT_NO. We get the out-of-memory error while pushing into the hash.

        There are various things you could do to reduce the memory requirements. For example, you could build up a string of account numbers rather than an array. This would save quite a lot of space:

        C:\test>p1
        @a = map int rand( 1e16 ), 1 .. 10;;
        print total_size \@a;;
        496

        $s = join ' ', @a;;
        print total_size $s;;
        216

        Multiply that saving by 37 million and you might avoid the problem. Take it a step further and pack the account numbers and you can save even more:

        $a = pack 'Q*', @a;;
        print total_size $a;;
        136
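
        Applied to your own loops, that idea might look something like the following rough, untested sketch. It assumes a 64-bit perl (so pack 'Q' is available), that every account number fits in an unsigned 64-bit integer, and that the filehandles are opened as in your original code; the hash name %accts is just illustrative.

            use strict;
            use warnings;

            my %accts;    # one packed string of account numbers per unique id

            while ( <DUMP_ACCT_NO> ) {
                chomp;
                my( $id, $accno ) = split /\|/, $_;
                $accts{ $id } .= pack 'Q', $accno;    # append 8 bytes per account number
            }

            while ( <DUMP_ID> ) {
                chomp;
                s/\s+//g;
                next unless exists $accts{ $_ };
                for my $accno ( unpack 'Q*', $accts{ $_ } ) {
                    print FINAL_FILE "$_|$accno\n";
                }
            }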

        But having looked back at your OP, what you are doing makes no sense at all.

        It makes no sense to even read the second file, as you only output records that are already in the hash built from the first file. In other words, having built the hash from the first file, all you need to do is dump its contents and ignore the second file completely.

        But as your final output file is identical to the first of your input files, except that all the records with the same unique id are grouped together, the simplest, fastest way to achieve that is to just sort that file.
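
        For example, something along these lines would do it (a sketch only; it assumes a Unix-like sort(1) is on your PATH with enough temporary disk space, and uses your file names as literal paths):

            # Sort the id|accno file on its first field so that records
            # sharing a unique id end up adjacent in the output.
            system( 'sort', '-t', '|', '-k1,1', '-o', 'FINAL_FILE', 'DUMP_ACCT_NO' ) == 0
                or die "sort failed: $?";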

        Originally you call your files "DUMP_ID and DUMP_CARD_NO" and then later you talk about "DUMP_ACCT_NO". That, combined with the inconsistencies in your posted code, makes me think that this question is a plant.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Help on file comparison
by davido (Cardinal) on May 28, 2011 at 16:08 UTC

    You could tie your data structures to disk rather than trying to hold them all in memory, but that may be too lightweight for your ultimate goals. I don't know what you'll be doing with the data, and how long you need to hold onto it. But your problem may be a good candidate for a database.
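
    A minimal sketch of the tie idea, using DB_File (assuming DB_File and Berkeley DB are available; the file name acct_by_id.db is made up here, and because a DB_File hash can only store flat scalars, the account numbers are appended into one space-separated string per id):

        use strict;
        use warnings;
        use DB_File;
        use Fcntl;

        # The hash lives in an on-disk Berkeley DB file instead of RAM.
        tie my %acct_for, 'DB_File', 'acct_by_id.db', O_RDWR | O_CREAT, 0644, $DB_HASH
            or die "Cannot tie hash: $!";

        while ( <DUMP_ACCT_NO> ) {
            chomp;
            my( $id, $accno ) = split /\|/, $_;
            $acct_for{ $id } = defined $acct_for{ $id }
                             ? "$acct_for{$id} $accno"
                             : $accno;
        }

        untie %acct_for;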

    With a relational database such as MySQL (or many others), you could create a table of user IDs and card members, and another table of IDs and account numbers. In the first table your IDs will be the primary (and unique) keys. In the second table the primary unique key will be the account numbers (since presumably one account number cannot belong to more than one ID). Your IDs have a 1:1 relationship with individuals. Your IDs have a 1:M relationship (one to many) with account numbers. Your account numbers have an M:1 relationship (many to one) with IDs. It would obviously make more sense if I could draw a relationship sketch.

    You would interact with the database solution using DBI or Class::DBI, the latter of which provides an object-oriented model for your database.
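
    As a rough illustration of the DBI route, here is a sketch using DBD::SQLite (the database file, table names, and column names are all made up here, and the dump filehandles are assumed to be opened as in the OP):

        use strict;
        use warnings;
        use DBI;

        # Load both dumps into SQLite, then let a single SQL join do the "comparison".
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=cards.db', '', '',
                                { RaiseError => 1, AutoCommit => 0 } );

        $dbh->do( 'CREATE TABLE IF NOT EXISTS ids      ( id TEXT PRIMARY KEY )' );
        $dbh->do( 'CREATE TABLE IF NOT EXISTS accounts ( accno TEXT PRIMARY KEY, id TEXT )' );

        my $ins_id   = $dbh->prepare( 'INSERT OR IGNORE INTO ids      VALUES ( ? )' );
        my $ins_acct = $dbh->prepare( 'INSERT OR IGNORE INTO accounts VALUES ( ?, ? )' );

        while ( <DUMP_ID> )      { chomp; s/\s+//g; $ins_id->execute( $_ ); }
        while ( <DUMP_ACCT_NO> ) { chomp; my( $id, $accno ) = split /\|/; $ins_acct->execute( $accno, $id ); }
        $dbh->commit;

        # Stream the joined rows straight to the output file, one at a time.
        my $sth = $dbh->prepare( 'SELECT a.id, a.accno FROM accounts a JOIN ids i ON i.id = a.id' );
        $sth->execute;
        while ( my( $id, $accno ) = $sth->fetchrow_array ) {
            print FINAL_FILE "$id|$accno\n";
        }
        $dbh->disconnect;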


    Dave

Re: Help on file comparison
by graff (Chancellor) on May 28, 2011 at 16:26 UTC
    If the snippets in the OP are what you're actually running, then I assume that either you did not 'use strict', or else you have a variable declared elsewhere called '$guid' and you are using it where you shouldn't be. Here's the problem in your first while loop:
    my ($id, $accno) = split /\|/, $_;
    push @{$hashList{$guid}}, $accno;
    I think you want:   push @{$hashList{$id}}, $accno;

    As it is, all records in the first input are being pushed onto a single array. Then, in the second while loop, you may be hitting the one string that matches the one hash key containing the one big array. And this statement:

    for my $accno (@{$hashList{$_}})
    tries to make a copy of the array. Boom. (I can't be sure this is what's happening based on the small amount of info you gave, but it looks plausible.)

    UPDATE: Actually, I'm not sure that the for-loop expression there is copying the array -- seems like it shouldn't have to. Still, you need to fix how the hash gets loaded.
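
    Putting that fix together, the two loops might look roughly like this under strict (a sketch only; the bareword filehandles are kept from the OP and assumed to be opened elsewhere):

        use strict;
        use warnings;

        my %hashList;

        while ( <DUMP_ACCT_NO> ) {
            chomp;
            my( $id, $accno ) = split /\|/, $_;
            push @{ $hashList{$id} }, $accno;    # key on $id, not an undeclared $guid
        }

        while ( <DUMP_ID> ) {
            chomp;
            s/\s+//g;
            next unless exists $hashList{$_};
            for my $accno ( @{ $hashList{$_} } ) {
                print FINAL_FILE "$_|$accno\n";
            }
        }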