Re: Reg: Performance
by Marshall (Canon) on Oct 28, 2010 at 05:46 UTC
Use the command line sort to sort both files. Then you can "walk" through both of them in one pass without having to read file B 5 million times! B appears to be small enough to fit into memory, so you could also just make a hash; but I take it that the "unique ids" really aren't unique and you can have multiple values for each id. If so, just make a HoA (hash of arrays).
Thank you Marshall for your quick reply.
Can you please tell me the command to sort both files, as I am new to Perl?
- Thank you.
Re: Reg: Performance
by salva (Canon) on Oct 28, 2010 at 08:57 UTC
Nowadays 5 million entries is nothing!
Just load DUMP_B into a hash and then go over DUMP_A using the hash for fast lookup.
Since one unique number can have more than one account number, you will have to use a hash of arrays:
# untested!
my %b;
open(B, '<', 'DUMP_B') || die("Could not open DUMP_B: $!\n");
while (<B>) {
    chomp;
    my ($id, $ac) = split /\|/, $_;
    push @{$b{$id}}, $ac;      # one id may map to several account numbers
}
open(A, '<', 'DUMP_A') || die("Could not open DUMP_A: $!\n");
while (<A>) {
    chomp;
    my $ac = $b{$_} || [];     # default to an empty array ref when the id is absent
    print "$_: ", join(', ', @$ac), "\n";
}
Thank you Salva.
The problem is that it takes more time when writing the output to a file. Please tell me how, in your scenario, to write only the matched data to a file instead of to the console.
- Thank you.
perl script.pl > file.txt
or in code
open OUT, '>', 'file.txt';
...
print OUT ...
my %b;
open(DUMP_B, '<', 'DUMP_B')        || die("Could not open DUMP_B: $!\n");
open(DUMP_A, '<', 'DUMP_A')        || die("Could not open DUMP_A: $!\n");
open(OUTPUT, '>', 'INACTIVE_LIST') || die("Could not write to INACTIVE_LIST: $!\n");
while (<DUMP_B>) {
    chomp;
    my ($id, $ac) = split /\|/, $_;
    push @{$b{$id}}, $ac;
}
my $readID;
while (defined($readID = <DUMP_A>)) {
    #chomp;
    $readID =~ s!\s+!!g;
    my $ac = $b{$readID} || [];
    #print "$readID: ", join(', ', @$ac), "\n";
    my $arrSize = scalar @$ac;
    if ($arrSize > 0) {
        for (my $i = 0; $i < $arrSize; $i++) {
            print OUTPUT "$readID|@$ac[$i] \n";
        }
    }
}
This gives the expected output.
Could you please explain 'my $ac = $b{$readID} || [];' to me?
Thank you once again, and thanks to all the Monks.
my %b;
open(B, '<', 'DUMP_B') || die("Could not open DUMP_B: $!\n");
while (<B>) {
    chomp;
    my ($id, $ac) = split /\|/, $_;
    push @{$b{$id}}, $ac;
}
open(OUTPUT, '>', 'INACTIVE_LIST') || die("Could not write to INACTIVE_LIST: $!\n");
open(A, '<', 'DUMP_A') || die("Could not open DUMP_A: $!\n");
while (<A>) {
    chomp;
    s/\s+//g;
    if ($b{$_}) {
        for my $ac (@{$b{$_}}) {
            print OUTPUT "$_|$ac\n";
        }
    }
}
BTW, when programming in Perl it is very unusual to use C-like (or Java-like) for loops. Instead, for (@array) { ... } and for (0..$top) { ... } are used.
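As for my $ac = $b{$readID} || []; : the hash lookup returns undef when the id is not in %b, and || [] substitutes a reference to an empty array in that case, so the later @$ac dereference just yields an empty list instead of dying with "Can't use an undefined value as an ARRAY reference". A tiny illustration (made up for this note, not taken from your data):
my %b = ( known_id => [ '111', '222' ] );
for my $id ('known_id', 'missing_id') {
    my $ac = $b{$id} || [];                  # empty array ref when the id is absent
    print "$id: ", join(', ', @$ac), "\n";   # missing_id prints nothing after the colon
}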
Re: Reg: Performance
by Khen1950fx (Canon) on Oct 28, 2010 at 06:09 UTC
This prints $a_ID and $b_ID to the INACTIVE_LIST just as you told it to do.
#!/usr/bin/perl
use strict;
use warnings;

open( A, '<', 'DUMP_A' )
    or die "Could not open file\n";
open( B, '<', 'DUMP_B' )
    or die "Could not open file\n";
open( OUTPUT, '>', 'INACTIVE_LIST' )
    or die "Could not write to file\n";

my $readA;
my $readB;
while ( defined( $readA = <A> ) ) {
    $readA =~ s/\s+//g;
    my $a_ID = $readA;
    while ( defined( $readB = <B> ) ) {
        chomp($readB);
        my ( $b_ID, $b_acctno ) = split( /\|/, $readB );
        if ( $a_ID =~ /^$b_ID/i ) {
            print OUTPUT "$a_ID|$b_ID \n";
        }
    }
    seek( B, 0, 0 );
}
close OUTPUT;
close A;
close B;
Re: Reg: Performance
by mjscott2702 (Pilgrim) on Oct 28, 2010 at 08:15 UTC
Sounds like an ideal job for Tie::File::AsHash?
Anyway, you really want to avoid reading DUMP_B millions of times, once for every line in DUMP_A. It should be possible to read each file once, build a hash for each keyed by the ID field, then iterate through the hash for DUMP_A and look up the corresponding entry in the hash for DUMP_B. That is why I suggested the above module (I haven't actually tried it, shame on me): you should be able to treat each dump file like a hash, splitting on whatever character you want.
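For reference, the same two-hash idea in plain Perl (without the module) looks roughly like this; an untested sketch, where the file names come from the thread and the handle names are just placeholders:
#!/usr/bin/perl
use strict;
use warnings;

my %a_ids;
open my $a_fh, '<', 'DUMP_A' or die "Could not open DUMP_A: $!";
while (my $id = <$a_fh>) {
    $id =~ s/\s+//g;
    $a_ids{$id} = 1;                  # DUMP_A holds only ids, so a set is enough
}

my %b_accts;
open my $b_fh, '<', 'DUMP_B' or die "Could not open DUMP_B: $!";
while (<$b_fh>) {
    chomp;
    my ($id, $acct) = split /\|/;
    push @{ $b_accts{$id} }, $acct;   # one id can map to several account numbers
}

open my $out_fh, '>', 'INACTIVE_LIST' or die "Could not write INACTIVE_LIST: $!";
for my $id (sort keys %a_ids) {
    next unless $b_accts{$id};
    print $out_fh "$id|$_\n" for @{ $b_accts{$id} };
}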
Greetings,
I realize that it is your intention to use perl for this task, and while I haven't seen either dump_a || dump_b, I can't help but wonder if cat | sed | sort | uniq might not be of great help here. I run a _huge_ RBL with _millions_ of IP addresses and constantly need to parse logs and add/remove results from the block lists. While I began my strategy using perl scripts, I ultimately found that cat | sed | sort | uniq would accomplish the task in seconds as opposed to minutes/hours. Perhaps it's just my perl skills, but I thought it was worth mentioning.
HTH
--Chris
Re: Reg: Performance
by locked_user sundialsvc4 (Abbot) on Oct 28, 2010 at 13:40 UTC
Yup... do it the COBOL way. Do it just like they did it before digital computers existed.
Sort the two files. Now you can use the diff command, or logic identical to that command, to locate the records that are common and the records (in each file) that are different.
As a test, while writing this, I generated a disk-file of five million random strings and disk-sorted it. Nine seconds. Hmm... kinda slow...
Let’s say that these are ... not files of millions of records, but two packs of playing-cards that you accidentally dropped on the floor. (Or, as in my case, a great big box of 80-column punched cards that you just dropped onto the floor.) You possess a “magic sorting-box” that can sort a deck of cards in the blink of an eye. Put both decks, in turn, into the box and push the button. Take the two now-sorted decks and turn over the top card in each. Because you know that the two decks are sorted identically, you can answer a great many questions just by looking at the next card in each of the two decks (if there is one, and remembering what the preceding card was). You do not have to “search,” nor do you have to go backward. In just one sequential pass through the two files in parallel, you can merge them, compare them, detect whether records are missing, identify gaps and their size, and so on. (Almost) no memory required.
In the days of reel-to-reel magnetic tapes and punched cards, yup, that’s what they were doing. And it worked. Still does. Long before IBM got into the business of renting computers, they rented punched-card sorters and collating machines. (And sold punchcards by the truckload.)
DUMP_A:
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c20
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c20
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c29
ffe47aadf1add54e3a8e925b40530c29
DUMP_B:
ffe47aadf1add54e3a8e925b40530c20|323568945210360
ffe47aadf1add54e3a8e925b40530c20|323568945210361
ffe47aadf1add54e3a8e925b40530c20|323568945210362
ffe47aadf1add54e3a8e925b40530c20|323568945210363
ffe47aadf1add54e3a8e925b40530c20|323568945210364
It should take each unique id from DUMP_A and check it against DUMP_B; if it matches, both values have to be written into the new file.
The output should be as below for the above inputs:
ffe47aadf1add54e3a8e925b40530c20|323568945210360
ffe47aadf1add54e3a8e925b40530c20|323568945210361
ffe47aadf1add54e3a8e925b40530c20|323568945210362
ffe47aadf1add54e3a8e925b40530c20|323568945210363
ffe47aadf1add54e3a8e925b40530c20|323568945210364
- Thank you.
BTW, the id in DUMP_A is not unique. Nevertheless, you can still use the algorithm: Sort both input files. Take the first id from DUMP_A. Process DUMP_B while it contains the same id. Move to the next id in DUMP_A, repeat until the end of DUMP_A.
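An untested sketch of that merge in Perl, assuming both files have already been sorted on the id (DUMP_A.sorted and DUMP_B.sorted are placeholder names for the sorted copies) and that the ids compare as plain strings:
#!/usr/bin/perl
use strict;
use warnings;

open my $a_fh,   '<', 'DUMP_A.sorted' or die "Could not open DUMP_A.sorted: $!";
open my $b_fh,   '<', 'DUMP_B.sorted' or die "Could not open DUMP_B.sorted: $!";
open my $out_fh, '>', 'INACTIVE_LIST' or die "Could not write INACTIVE_LIST: $!";

my $b_line = <$b_fh>;                           # current DUMP_B record, undef at end of file

my $prev_id = '';
while (my $a_id = <$a_fh>) {
    $a_id =~ s/\s+//g;
    next if $a_id eq '' or $a_id eq $prev_id;   # ids in DUMP_A repeat; handle each id once
    $prev_id = $a_id;

    # skip DUMP_B records whose id sorts before the current DUMP_A id
    while (defined $b_line and (split /\|/, $b_line)[0] lt $a_id) {
        $b_line = <$b_fh>;
    }

    # emit every DUMP_B record carrying the current id
    while (defined $b_line) {
        chomp(my $rec = $b_line);
        my ($b_id, $acct) = split /\|/, $rec;
        last if $b_id ne $a_id;
        print $out_fh "$b_id|$acct\n";
        $b_line = <$b_fh>;
    }
}
This keeps (almost) nothing in memory; the cost is the two up-front sorts.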