sivaraman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a question about file handling. I have two files named DUMP_A and DUMP_B, and both files have close to 5 million records each. The scenario is: DUMP_A contains only a unique id, and DUMP_B contains a unique id and an account number. I have to take a record from DUMP_A and compare its unique id with the DUMP_B unique id; if it matches, then the unique id and the corresponding account number have to be written into another file. A unique id can have more than one account number. In my code, the comparison takes more than 12 hours and the script is still running. Kindly look into the script and guide me on this.
open(A, "<" . DUMP_A) || die("Could not open file \n"); open(B, "<" . DUMP_B) || die("Could not open file \n"); open(OUTPUT,">" . INACTIVE_LIST) || die ("Could not write to file \n") +; my $readA; my $readB; while(defined($readA = <A>)){ $readA =~ s!\s+!!g; #chomp($readA); my $a_ID = $readA; while(defined($readB = <B>)){ chomp($readB); my ($b_ID, $b_acctno)=split(/\|/,$readB); # Check both the GUIDs are same if($a_ID =~ m/^$b_ID/i){ #if($a_ID eq $b_ID){ print OUTPUT "$a_ID|$b_ID \n"; } } seek(B,0,0); } close(OUTPUT); close(A); close(B);
- Thank you.

Replies are listed 'Best First'.
Re: Reg: Performance
by Marshall (Canon) on Oct 28, 2010 at 05:46 UTC
    Use the command line sort to sort both files. Then you can "walk" through both of them in one pass without having to read file B 5 million times! B appears to be small enough to fit into memory, so you could just make a hash; but I take it that the "unique" ids really aren't unique and you can have multiple values for each id. If so, just make a HoA (hash of arrays).
      Thank you Marshall for your quick reply. Can you please tell me the command to sort both files, as I am new to Perl. - Thank you.

        It's not a Perl command; it's a system command: sort. It's common on *nix and available for Windows.
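        For example (file names taken from your post; these are standard options of the *nix sort, and GNU sort on Windows accepts the same ones):

        sort DUMP_A > DUMP_A.sorted
        sort -t'|' -k1,1 DUMP_B > DUMP_B.sorted

        The first sorts the id-only file; the second sorts DUMP_B on its first pipe-delimited field, the id.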

Re: Reg: Performance
by salva (Canon) on Oct 28, 2010 at 08:57 UTC
    Nowadays 5 million entries is nothing!

    Just load DUMP_B into a hash and then go over DUMP_A using the hash for fast lookup.

    The unique number can have more than one account number

    To cope with that you will have to use a hash of arrays:

    # untested!
    my %b;
    open(B, "<" . DUMP_B) || die("Could not open file \n");
    while (<B>) {
        chomp;
        my ($id, $ac) = split /\|/, $_;
        push @{$b{$id}}, $ac;
    }

    open(A, "<" . DUMP_A) || die("Could not open file \n");
    while (<A>) {
        chomp;
        my $ac = $b{$_} || [];
        print "$_: ", join(', ', @$ac), "\n";
    }

      Salva, this is working fine, but it prints all the matching and non-matching information to the console. Please help me: I want to print only the matches into the file.

      Thank you.

      Thank you Salva. The problem is that it takes more time when writing the output to a file. Please tell me, in your scenario, how to write only the matched data into a file instead of to the console. - Thank you.

        from shell
        perl script.pl > file.txt
        or in code
        open OUT, '>', 'file.txt'; ... print OUT ...

      Thank you Salva. Now I have code like this:

      my %b;
      while (<DUMP_B>) {
          chomp;
          my ($id, $ac) = split /\|/, $_;
          push @{$b{$id}}, $ac;
      }

      my $readID;
      while (defined($readID = <DUMP_A>)) {
          #chomp;
          $readID =~ s!\s+!!g;
          my $ac = $b{$readID} || [];
          #print "$readID: ", join(', ', @$ac), "\n";
          my $arrSize = scalar @$ac;
          if ($arrSize > 0) {
              for (my $i = 0; $i < $arrSize; $i++) {
                  print OUTPUT "$readID|@$ac[$i] \n";
              }
          }
      }
      This gives the expected output. Could you please explain 'my $ac = $b{$readID} || [];'? Thank you once again, and thanks to all the Monks.

        $b{$readID} || []
        When there are no entries for $readID in DUMP_B, $b{$readID} is undef. Appending || [] handles this special case by replacing undef with a reference to an empty array, so that we can dereference it later as @$ac.
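        For instance (a minimal illustration, not taken from your data):

        my %b = ( 'id1' => [ '111', '222' ] );

        my $found   = $b{'id1'} || [];   # arrayref with two account numbers
        my $missing = $b{'id9'} || [];   # $b{'id9'} is undef, so we get a reference to an empty array

        print scalar(@$found), "\n";     # prints 2
        print scalar(@$missing), "\n";   # prints 0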

        A simpler version of the code follows:

        my %b;
        open(B, "<" . DUMP_B) || die("Could not open file \n");
        while (<B>) {
            chomp;
            my ($id, $ac) = split /\|/, $_;
            push @{$b{$id}}, $ac;
        }

        open(OUTPUT, '>', 'INACTIVE_LIST') || die "Could not write to file\n";
        open(A, "<" . DUMP_A) || die("Could not open file \n");
        while (<A>) {
            chomp;
            s/\s+//g;
            if ($b{$_}) {
                for my $ac (@{$b{$_}}) {
                    print OUTPUT "$_|$ac\n";
                }
            }
        }
        BTW, when programming in Perl it is very unusual to use C-like (or Java-like) for loops. Instead, for (@array) { ... } and for (0..$top) { ... } are used.
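        For instance, the inner loop from your code could be written either way (both print the same lines, using your $readID, $ac and OUTPUT):

        # C-style, as in your code:
        for (my $i = 0; $i < scalar @$ac; $i++) {
            print OUTPUT "$readID|$ac->[$i]\n";
        }

        # idiomatic Perl:
        for my $acct (@$ac) {
            print OUTPUT "$readID|$acct\n";
        }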
Re: Reg: Performance
by Khen1950fx (Canon) on Oct 28, 2010 at 06:09 UTC
    This prints $a_ID and $b_ID to the INACTIVE_LIST just as you told it to do.
    #!/usr/bin/perl
    use strict;
    use warnings;

    open( A, '<', 'DUMP_A' ) or die "Could not open file\n";
    open( B, '<', 'DUMP_B' ) or die "Could not open file\n";
    open( OUTPUT, '>', 'INACTIVE_LIST' ) or die "Could not write to file\n";

    my $readA;
    my $readB;
    while ( defined( $readA = <A> ) ) {
        $readA =~ s/\s+//g;
        my $a_ID = $readA;
        while ( defined( $readB = <B> ) ) {
            chomp($readB);
            my ( $b_ID, $b_acctno ) = split( /\|/, $readB );
            if ( $a_ID =~ /^$b_ID/i ) {
                print OUTPUT "$a_ID|$b_ID \n";
            }
        }
        seek( B, 0, 0 );
    }
    close OUTPUT;
    close A;
    close B;
Re: Reg: Performance
by mjscott2702 (Pilgrim) on Oct 28, 2010 at 08:15 UTC
    Sounds like an ideal job for Tie::File::AsHash?

    Anyway, you really want to avoid reading DUMP_B millions of times, once for every line in DUMP_A. It should be possible to read both files once and create a hash for each, keyed by the ID field, then iterate through the hash for DUMP_A and look up the corresponding entry in the hash for DUMP_B. This is the reason I suggested the above module (haven't actually tried it, shame on me): you should be able to treat each dump file like a hash and split on whatever character you want.
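    Without the module, a plain-hash sketch of that idea might look like this (untested; file names and the pipe delimiter are assumed from the original post):

    use strict;
    use warnings;

    # hash of arrays for DUMP_B: id => list of account numbers
    my %acct_for;
    open my $b_fh, '<', 'DUMP_B' or die "Could not open DUMP_B: $!";
    while (my $line = <$b_fh>) {
        chomp $line;
        my ($id, $acct) = split /\|/, $line;
        push @{ $acct_for{$id} }, $acct;
    }
    close $b_fh;

    # the ids in DUMP_A may repeat, so collect them as hash keys to dedup
    my %seen;
    open my $a_fh, '<', 'DUMP_A' or die "Could not open DUMP_A: $!";
    while (my $id = <$a_fh>) {
        $id =~ s/\s+//g;
        $seen{$id} = 1;
    }
    close $a_fh;

    # print only the ids present in both files, one line per account number
    open my $out, '>', 'INACTIVE_LIST' or die "Could not write INACTIVE_LIST: $!";
    for my $id (keys %seen) {
        next unless $acct_for{$id};
        print $out "$id|$_\n" for @{ $acct_for{$id} };
    }
    close $out;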

      Greetings,

      I realize that it is your intention to use Perl for this task, and while I haven't seen either dump_a || dump_b,

      I can't help but wonder if

      cat | sed | sort | uniq

      might not be of great help here.

      I run a _huge_ RBL with _millions_ of IP addresses.

      I constantly need to parse logs, and add/remove results from the block lists.

      While I began my strategy using Perl scripts, I ultimately found that

      cat | sed | sort | uniq

      would accomplish the task in seconds as opposed to minutes/hours.

      Perhaps it's my Perl skills, but I just thought it was worth mentioning.

      HTH

      --Chris

      Shameless self promotion follows
      PerlWatch
Re: Reg: Performance
by locked_user sundialsvc4 (Abbot) on Oct 28, 2010 at 13:40 UTC

    Yup... do it the COBOL way. Do it just like they did it before digital computers existed.

    Sort the two files. Now you can use the diff command, or logic identical to that command, to locate the records that are common and the records (in each file) that are different.

    As a test, while writing this, I generated a disk-file of five million random strings and disk-sorted it. Nine seconds. Hmm... kinda slow...

    Let’s say that each of these is ... not a file of millions of records, but two packs of playing-cards that you accidentally dropped on the floor. (Or, as in my case, a great big box of 80-column punched cards that you just dropped onto the floor.) You possess a “magic sorting-box” that can sort a deck of cards in a blink of an eye. Put both decks, in turn, into the box and push the button. Take the two now-sorted decks and turn over the top card in each. Because you know that the two decks are sorted identically, you can answer a great many questions just by looking at the next card (if there is one, and remembering what the preceding card was) in each of the two decks. You do not have to “search,” nor do you have to go backward. In just one sequential pass through the two files in parallel, you can merge them, compare them, detect whether records are missing, identify gaps and their size, and so on. (Almost) no memory required.

    In the days of reel-to-reel magnetic tapes and punched cards, yup, that’s what they were doing. And it worked. Still does. Long before IBM got into the business of renting computers, they rented punched-card sorters and collating machines. (And sold punchcards by the truckload.)

      Dear Friend, I am a little bit confused here. In DUMP_A I have a unique id, and in DUMP_B I have a unique id along with an ACCT #. Consider these as my inputs:

      DUMP_A:
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c20
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c20
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29

      DUMP_B:
      ffe47aadf1add54e3a8e925b40530c20|323568945210360
      ffe47aadf1add54e3a8e925b40530c20|323568945210361
      ffe47aadf1add54e3a8e925b40530c20|323568945210362
      ffe47aadf1add54e3a8e925b40530c20|323568945210363
      ffe47aadf1add54e3a8e925b40530c20|323568945210364
      It should take the unique id from DUMP_A and check it against DUMP_B; if it matches, then both values have to be written into the new file. The output should be as below for the above inputs:
      ffe47aadf1add54e3a8e925b40530c20|323568945210360
      ffe47aadf1add54e3a8e925b40530c20|323568945210361
      ffe47aadf1add54e3a8e925b40530c20|323568945210362
      ffe47aadf1add54e3a8e925b40530c20|323568945210363
      ffe47aadf1add54e3a8e925b40530c20|323568945210364

      - Thank you.
        BTW, the id in DUMP_A is not unique.
        Nevertheless, you can still use the algorithm: Sort both input files. Take the first id from DUMP_A. Process DUMP_B while it contains the same id. Move to the next id in DUMP_A, repeat until the end of DUMP_A.
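        In Perl, that parallel walk might look roughly like this (untested sketch; it assumes both files were pre-sorted into DUMP_A.sorted and DUMP_B.sorted with the same byte order, e.g. LC_ALL=C sort, and that DUMP_B.sorted is sorted on its first pipe-delimited field):

        use strict;
        use warnings;

        open my $a_fh, '<', 'DUMP_A.sorted' or die "Could not open DUMP_A.sorted: $!";
        open my $b_fh, '<', 'DUMP_B.sorted' or die "Could not open DUMP_B.sorted: $!";
        open my $out,  '>', 'INACTIVE_LIST' or die "Could not write INACTIVE_LIST: $!";

        my $b_line = <$b_fh>;
        my $prev_id = '';

        while (my $a_id = <$a_fh>) {
            $a_id =~ s/\s+//g;
            next if $a_id eq $prev_id;       # DUMP_A ids repeat; handle each id once
            $prev_id = $a_id;

            # advance through DUMP_B: skip ids that sort before $a_id,
            # print the ones that match, stop at the first id that sorts after
            while (defined $b_line) {
                chomp $b_line;
                my ($b_id, $acct) = split /\|/, $b_line;
                last if $b_id gt $a_id;      # keep this line for the next DUMP_A id
                print $out "$b_id|$acct\n" if $b_id eq $a_id;
                $b_line = <$b_fh>;
            }
        }

        close $_ for $a_fh, $b_fh, $out;

        Each file is read exactly once, so the cost is dominated by the two external sorts.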