in reply to Re: Needed Performance improvement in reading and fetching from a file
in thread Needed Performance improvement in reading and fetching from a file


The data in the reference file looks like this:
UTRIR8709990166-PRIS UTRIR8709990166-IONJ UTRIR8709990166-SONIC UTRIR8709990166-INTR UTRIR8709990166-MNSS UTRIR8709990166-POIO and so on

I am storing the UTR (payment reference number) along with the client, separated by a hyphen.
I read the reference file containing the above information, split each entry, and store the UTR numbers in the @sentUTRs array.
For background, this is what I am doing:
Every 15 minutes I receive a flat text file containing payments, with fields separated by ~ on each line.
I run this Perl script and consider only new payments, i.e. I check the reference file for payments that were already processed, ignore those, process only the new ones, and append them to the reference file at the end.
All the new payments are written to a CSV file per client.
So to achieve this, I just open the flat file, read each line, split it, take the second field (which contains the UTR number), match it against the ones already recorded in the reference file, and do further processing; a sketch of that approach is below.
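For illustration (not part of the original post), a minimal sketch of that matching step as described above; the file names, the whitespace-separated reference-file layout, and the field positions are assumptions:

use strict;
use warnings;

# Load previously sent "UTR-CLIENT" entries and keep just the UTR part.
open my $ref, '<', 'sent_utrs.txt' or die "Can't open reference file: $!";
my @sentUTRs = map { (split /-/)[0] } split ' ', do { local $/; <$ref> };
close $ref;

open my $flat, '<', 'payments.txt' or die "Can't open flat file: $!";
while (my $line = <$flat>) {
    chomp $line;
    my @fields = split /~/, $line;
    (my $utr = $fields[1]) =~ s/\s+//g;    # second field holds the UTR

    # Linear scan of the array: O(n) per payment, so O(n * m) overall,
    # which is why this slows down as the reference file grows.
    next if grep { $_ eq $utr } @sentUTRs;

    # ... new payment: write it to the per-client CSV ...
    push @sentUTRs, $utr;
}
close $flat;

This array scan is exactly the pattern that the reply below replaces with a hash lookup.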

Re^3: Needed Performance improvement in reading and fetching from a file
by GrandFather (Saint) on Oct 08, 2008 at 20:16 UTC

    There are many options that may help solve your problem. For a start, if it is the same file every 15 minutes, you can remember (perhaps in a configuration file) how far through the file you had processed last time and continue from that point this time - no searching required at all! A sketch of that idea follows.
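    A minimal sketch of that idea (not from the original reply), assuming new payment lines are only ever appended to the flat file between runs; the state-file and input-file names are hypothetical:

use strict;
use warnings;

my $state_file = 'payments.offset';    # hypothetical file holding the saved offset

# Read the byte offset saved by the previous run (0 on the first run).
my $offset = 0;
if (open my $state, '<', $state_file) {
    $offset = <$state> || 0;
    chomp $offset;
    close $state;
}

open my $flat, '<', 'payments.txt' or die "Can't open flat file: $!";
seek $flat, $offset, 0;    # jump straight past everything already processed

while (my $line = <$flat>) {
    chomp $line;
    # ... process only the genuinely new payment lines here ...
}

# Remember how far we got, ready for the run in 15 minutes' time.
my $pos = tell $flat;
close $flat;

open my $state, '>', $state_file or die "Can't save offset: $!";
print $state $pos;
close $state;

    Note this only works while the file is appended to; if the file is replaced wholesale each time, fall back to the hash approach below.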

    The absolute standard fix to your immediate problem is to store your payment numbers (keys) in a hash, then use a very fast constant-time lookup (that is what hashes give you when you supply a key and ask for a value) for your match check. Consider:

use strict;
use warnings;

my @refnos = ();

# Payment numbers already processed, one per line. A string is used as
# the "reference file" so the example is self-contained.
my $old = <<OLD;
UTRIR8709990166
ZPHLHLKJ87
OLD

my %oldPayments;

# Load the old payment numbers as hash keys for constant time lookup.
open my $payments, '<', \$old;
%oldPayments = map {$_ => undef} grep {chomp; length} <$payments>;
close $payments;

print "Reading UTR Payment numbers \n";

while (<DATA>) {
    chomp;
    my @data = split (/~/, $_);

    # Normalise the UTR: upper case, all whitespace removed.
    (my $utr = uc $data[1]) =~ s/\s*//g;

    # Constant time check against payments already seen.
    next if exists $oldPayments{$utr};

    $oldPayments{$utr} = $data[1];
    print "Payment $utr received of $data[2]\n";
}

# Write the updated reference data back, one key per line.
open $payments, '>', \$old;
print $payments join "\n", (sort keys %oldPayments), '';
close $payments;

# New payments are the ones whose values were set in the loop above.
print "New payments are:\n ";
print join "\n ", grep {defined $oldPayments{$_}} sort keys %oldPayments;

__DATA__
0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906472983472834HJR~UTRIN9080980866~ 1222706~INR~20080623~NI209960~AMEX0888888 ~FRAGNOS EXPRESS - TRS CARD S DIVISI~4578962 ~/BNF/9822644928 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906568946748922INR~ZP HLHLKJ87 ~ 1437865.95~INR~20080623~NI209969~HSBC0560002 ~MOTOSPECT UNILEVER LIMITED ~1234567 ~/INFO/ATTN: ~//REF 1104210 PLEASE FIND THE DET ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906506749056822INR~Q08709798905745~ 5960.74~INR~20080623~NI209987~ ~SDV AIR LINK REVOS LIMITED ~458ss453 ~ ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906503389054302INR~UTRI790898U0166~ 2414~INR~20080623~NI209976~ ~FRAGNOS EXPRESS - TRS CARD S DIVISI~ ~/BNF/9826805798 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1

    Prints:

Reading UTR Payment numbers 
Payment UTRIN9080980866 received of  1222706
Payment Q08709798905745 received of  5960.74
Payment UTRI790898U0166 received of  2414
New payments are:
 Q08709798905745
 UTRI790898U0166
 UTRIN9080980866

    I've used a variable as a file here to avoid needing a disk-based file for the example, but in practice you would, of course, use a real file on disk.

    However, if your data set gets very large (millions of entries perhaps) you should seriously consider using a database instead of a flat file if at all possible.
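    As an illustration of that suggestion (not part of the original reply), a minimal sketch using DBI with the SQLite driver; the database, table, and column names are all hypothetical, and the payment lines are assumed to arrive on standard input:

use strict;
use warnings;
use DBI;

# Keep processed UTRs in SQLite instead of a flat file. Assumes DBI and
# DBD::SQLite are installed; all names here are made up for the example.
my $dbh = DBI->connect('dbi:SQLite:dbname=payments.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One row per processed UTR; the PRIMARY KEY is indexed, so the
# existence check stays fast even with millions of rows.
$dbh->do('CREATE TABLE IF NOT EXISTS sent_utrs (utr TEXT PRIMARY KEY)');

my $seen   = $dbh->prepare('SELECT 1 FROM sent_utrs WHERE utr = ?');
my $record = $dbh->prepare('INSERT OR IGNORE INTO sent_utrs (utr) VALUES (?)');

while (my $line = <STDIN>) {
    chomp $line;
    my $utr = uc( (split /~/, $line)[1] );
    $utr =~ s/\s+//g;

    $seen->execute($utr);
    next if $seen->fetchrow_array;    # already handled in an earlier run

    # ... write the new payment to the per-client CSV here ...
    $record->execute($utr);
}

    The database also takes care of the write-back step: there is no reference file to rewrite at the end of each run.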


    Perl reduces RSI - it saves typing