sivaraman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a question about file handling. I have two files named DUMP_A and DUMP_B, and both files have close to 5 million records each. The scenario is: DUMP_A contains only a unique id, and DUMP_B contains a unique id and an account number. I have to take a record from DUMP_A and compare its unique id with the DUMP_B unique id; if it matches, then the unique id and the corresponding account number have to be written into another file. A unique id can have more than one account number. In my code, the comparison takes more than 12 hours and the script is still running. Kindly look into the script and guide me on this.
open(A, "<" . DUMP_A) || die("Could not open file \n"); open(B, "<" . DUMP_B) || die("Could not open file \n"); open(OUTPUT,">" . INACTIVE_LIST) || die ("Could not write to file \n") +; my $readA; my $readB; while(defined($readA = <A>)){ $readA =~ s!\s+!!g; #chomp($readA); my $a_ID = $readA; while(defined($readB = <B>)){ chomp($readB); my ($b_ID, $b_acctno)=split(/\|/,$readB); # Check both the GUIDs are same if($a_ID =~ m/^$b_ID/i){ #if($a_ID eq $b_ID){ print OUTPUT "$a_ID|$b_ID \n"; } } seek(B,0,0); } close(OUTPUT); close(A); close(B);
- Thank you.

Replies are listed 'Best First'.
Re: Reg: Performance
by Marshall (Canon) on Oct 28, 2010 at 05:46 UTC
    Use the command line sort to sort both files. Then you can "walk" through both of them in one pass without having to read file B 5 million times! B appears to be small enough to fit into memory, so you could just make a hash; but I take it that the "unique" ids really aren't unique and you can have multiple values for each id. If so, just make a HoA (hash of arrays).
      Thank you Marshall for your quick reply. Can you please tell me the command to sort both files, as I am new to Perl. - Thank you.

        It's not a Perl command; it's a system command: sort. It's common on *nix and available for Windows.
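        For example (file names taken from your post; these are standard options of the *nix sort, and GNU sort on Windows accepts the same ones):

        sort DUMP_A > DUMP_A.sorted
        sort -t'|' -k1,1 DUMP_B > DUMP_B.sorted

        The first sorts the id-only file; the second sorts DUMP_B on its first pipe-delimited field, the id.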

Re: Reg: Performance
by salva (Canon) on Oct 28, 2010 at 08:57 UTC
    Nowadays 5 million entries is nothing!

    Just load DUMP_B into a hash and then go over DUMP_A using the hash for fast lookup.

    The unique number can have more than one account number

    To cope with that you will have to use a hash of arrays:

    # untested!
    my %b;
    open(B, "<" . DUMP_B) || die("Could not open file \n");
    while (<B>) {
        chomp;
        my ($id, $ac) = split /\|/, $_;
        push @{$b{$id}}, $ac;
    }

    open(A, "<" . DUMP_A) || die("Could not open file \n");
    while (<A>) {
        chomp;
        my $ac = $b{$_} || [];
        print "$_: ", join(', ', @$ac), "\n";
    }

      Salva, this is working fine, but it prints all the matching and non-matching information to the console. Please help me: I want to print only the matches into the file.

      Thank you.

      Thank you Salva. The problem is that it takes more time when writing the output to a file. Please tell me, in your scenario, how to write only the matched data into a file instead of to the console. - Thank you.

        from shell
        perl script.pl > file.txt
        or in code
        open OUT, '>', 'file.txt'; ... print OUT ...

      Thank you Salva. Now I have code like this:

      my %b;
      while (<DUMP_B>) {
          chomp;
          my ($id, $ac) = split /\|/, $_;
          push @{$b{$id}}, $ac;
      }

      my $readID;
      while (defined($readID = <DUMP_A>)) {
          #chomp;
          $readID =~ s!\s+!!g;
          my $ac = $b{$readID} || [];
          #print "$readID: ", join(', ', @$ac), "\n";
          my $arrSize = scalar @$ac;
          if ($arrSize > 0) {
              for (my $i = 0; $i < $arrSize; $i++) {
                  print OUTPUT "$readID|@$ac[$i] \n";
              }
          }
      }
      This gives the expected output. Could you please explain 'my $ac = $b{$readID} || [];'? Thank you once again, and thanks to all the Monks.

        $b{$readID} || []
        When there are no entries for $readID in DUMP_B, $b{$readID} is undef. Appending || [] handles this special case by replacing undef with a reference to an empty array, so that we can dereference it later as @$ac.
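        For instance (a minimal illustration, not taken from your data):

        my %b = ( 'id1' => [ '111', '222' ] );

        my $found   = $b{'id1'} || [];   # arrayref with two account numbers
        my $missing = $b{'id9'} || [];   # $b{'id9'} is undef, so we get a reference to an empty array

        print scalar(@$found), "\n";     # prints 2
        print scalar(@$missing), "\n";   # prints 0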

        A simpler version of the code follows:

        my %b;
        open(B, "<" . DUMP_B) || die("Could not open file \n");
        while (<B>) {
            chomp;
            my ($id, $ac) = split /\|/, $_;
            push @{$b{$id}}, $ac;
        }

        open(OUTPUT, '>', 'INACTIVE_LIST') || die "Could not write to file\n";
        open(A, "<" . DUMP_A) || die("Could not open file \n");
        while (<A>) {
            chomp;
            s/\s+//g;
            if ($b{$_}) {
                for my $ac (@{$b{$_}}) {
                    print OUTPUT "$_|$ac\n";
                }
            }
        }
        BTW, when programming in Perl it is very unusual to use C-like (or Java-like) for loops. Instead, for (@array) { ... } and for (0..$top) { ... } are used.
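        For instance, the inner loop from your code could be written either way (both print the same lines, using your $readID, $ac and OUTPUT):

        # C-style, as in your code:
        for (my $i = 0; $i < scalar @$ac; $i++) {
            print OUTPUT "$readID|$ac->[$i]\n";
        }

        # idiomatic Perl:
        for my $acct (@$ac) {
            print OUTPUT "$readID|$acct\n";
        }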
Re: Reg: Performance
by Khen1950fx (Canon) on Oct 28, 2010 at 06:09 UTC
    This prints $a_ID and $b_ID to the INACTIVE_LIST just as you told it to do.
    #!/usr/bin/perl
    use strict;
    use warnings;

    open( A, '<', 'DUMP_A' ) or die "Could not open file\n";
    open( B, '<', 'DUMP_B' ) or die "Could not open file\n";
    open( OUTPUT, '>', 'INACTIVE_LIST' ) or die "Could not write to file\n";

    my $readA;
    my $readB;
    while ( defined( $readA = <A> ) ) {
        $readA =~ s/\s+//g;
        my $a_ID = $readA;
        while ( defined( $readB = <B> ) ) {
            chomp($readB);
            my ( $b_ID, $b_acctno ) = split( /\|/, $readB );
            if ( $a_ID =~ /^$b_ID/i ) {
                print OUTPUT "$a_ID|$b_ID \n";
            }
        }
        seek( B, 0, 0 );
    }
    close OUTPUT;
    close A;
    close B;
Re: Reg: Performance
by mjscott2702 (Pilgrim) on Oct 28, 2010 at 08:15 UTC
    Sounds like an ideal job for Tie::File::AsHash?

    Anyway, you really want to avoid reading DUMP_B millions of times, once for every line in DUMP_A. It should be possible to read both files once and create a hash for each, keyed by the ID field, then iterate through the hash for DUMP_A and look up the corresponding entry in the hash for DUMP_B. This is the reason I suggested the above module (haven't actually tried it, shame on me): you should be able to treat each dump file like a hash and split on whatever character you want.
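    Without the module, a plain-hash sketch of that idea might look like this (untested; file names and the pipe delimiter are assumed from the original post):

    use strict;
    use warnings;

    # hash of arrays for DUMP_B: id => list of account numbers
    my %acct_for;
    open my $b_fh, '<', 'DUMP_B' or die "Could not open DUMP_B: $!";
    while (my $line = <$b_fh>) {
        chomp $line;
        my ($id, $acct) = split /\|/, $line;
        push @{ $acct_for{$id} }, $acct;
    }
    close $b_fh;

    # the ids in DUMP_A may repeat, so collect them as hash keys to dedup
    my %seen;
    open my $a_fh, '<', 'DUMP_A' or die "Could not open DUMP_A: $!";
    while (my $id = <$a_fh>) {
        $id =~ s/\s+//g;
        $seen{$id} = 1;
    }
    close $a_fh;

    # print only the ids present in both files, one line per account number
    open my $out, '>', 'INACTIVE_LIST' or die "Could not write INACTIVE_LIST: $!";
    for my $id (keys %seen) {
        next unless $acct_for{$id};
        print $out "$id|$_\n" for @{ $acct_for{$id} };
    }
    close $out;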

      Greetings,

      I realize that it is your intention to use Perl for this task, and while I haven't seen either dump_a || dump_b,

      I can't help but wonder if

      cat | sed | sort | uniq

      might not be of great help here.

      I run a _huge_ RBL with _millions_ of IP addresses.

      I constantly need to parse logs, and add/remove results from the block lists.

      While I began my strategy using Perl scripts, I ultimately found that

      cat | sed | sort | uniq

      would accomplish the task in seconds as opposed to minutes/hours.

      Perhaps it's my Perl skills, but I just thought it was worth mentioning.

      HTH

      --Chris

      Shameless self promotion follows
      PerlWatch
Re: Reg: Performance
by locked_user sundialsvc4 (Abbot) on Oct 28, 2010 at 13:40 UTC

    Yup... do it the COBOL way. Do it just like they did it before digital computers existed.

    Sort the two files. Now you can use the diff command, or logic identical to that command, to locate the records that are common and the records (in each file) that are different.

    As a test, while writing this, I generated a disk-file of five million random strings and disk-sorted it. Nine seconds. Hmm... kinda slow...

    Let’s say that each of these is ... not a file of millions of records, but two packs of playing-cards that you accidentally dropped on the floor. (Or, as in my case, a great big box of 80-column punched cards that you just dropped onto the floor.) You possess a “magic sorting-box” that can sort a deck of cards in a blink of an eye. Put both decks, in turn, into the box and push the button. Take the two now-sorted decks and turn over the top card in each. Because you know that the two decks are sorted identically, you can answer a great many questions just by looking at the next card (if there is one, and remembering what the preceding card was) in each of the two decks. You do not have to “search,” nor do you have to go backward. In just one sequential pass through the two files in parallel, you can merge them, compare them, detect whether records are missing, identify gaps and their size, and so on. (Almost) no memory required.

    In the days of reel-to-reel magnetic tapes and punched cards, yup, that’s what they were doing. And it worked. Still does. Long before IBM got into the business of renting computers, they rented punched-card sorters and collating machines. (And sold punchcards by the truckload.)

      Dear Friend, I am a little bit confused here. In DUMP_A I have a unique id, and in DUMP_B I have a unique id along with an ACCT #. Consider these as my inputs:

      DUMP_A:
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c20
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c20
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29
      ffe47aadf1add54e3a8e925b40530c29

      DUMP_B:
      ffe47aadf1add54e3a8e925b40530c20|323568945210360
      ffe47aadf1add54e3a8e925b40530c20|323568945210361
      ffe47aadf1add54e3a8e925b40530c20|323568945210362
      ffe47aadf1add54e3a8e925b40530c20|323568945210363
      ffe47aadf1add54e3a8e925b40530c20|323568945210364
      It should take the unique id from DUMP_A and check it against DUMP_B; if it matches, then both values have to be written into the new file. The output should be as below for the above inputs:
      ffe47aadf1add54e3a8e925b40530c20|323568945210360
      ffe47aadf1add54e3a8e925b40530c20|323568945210361
      ffe47aadf1add54e3a8e925b40530c20|323568945210362
      ffe47aadf1add54e3a8e925b40530c20|323568945210363
      ffe47aadf1add54e3a8e925b40530c20|323568945210364

      - Thank you.
        BTW, the id in DUMP_A is not unique.
        Nevertheless, you can still use the algorithm: Sort both input files. Take the first id from DUMP_A. Process DUMP_B while it contains the same id. Move to the next id in DUMP_A, repeat until the end of DUMP_A.
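        In Perl, that parallel walk might look roughly like this (untested sketch; it assumes both files were pre-sorted into DUMP_A.sorted and DUMP_B.sorted with the same byte order, e.g. LC_ALL=C sort, and that DUMP_B.sorted is sorted on its first pipe-delimited field):

        use strict;
        use warnings;

        open my $a_fh, '<', 'DUMP_A.sorted' or die "Could not open DUMP_A.sorted: $!";
        open my $b_fh, '<', 'DUMP_B.sorted' or die "Could not open DUMP_B.sorted: $!";
        open my $out,  '>', 'INACTIVE_LIST' or die "Could not write INACTIVE_LIST: $!";

        my $b_line = <$b_fh>;
        my $prev_id = '';

        while (my $a_id = <$a_fh>) {
            $a_id =~ s/\s+//g;
            next if $a_id eq $prev_id;       # DUMP_A ids repeat; handle each id once
            $prev_id = $a_id;

            # advance through DUMP_B: skip ids that sort before $a_id,
            # print the ones that match, stop at the first id that sorts after
            while (defined $b_line) {
                chomp $b_line;
                my ($b_id, $acct) = split /\|/, $b_line;
                last if $b_id gt $a_id;      # keep this line for the next DUMP_A id
                print $out "$b_id|$acct\n" if $b_id eq $a_id;
                $b_line = <$b_fh>;
            }
        }

        close $_ for $a_fh, $b_fh, $out;

        Each file is read exactly once, so the cost is dominated by the two external sorts.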