Re: Needed Performance improvement in reading and fetching from a file
by CountZero (Bishop) on Oct 08, 2008 at 05:33 UTC
Let's optimize the split function and see how fast we can get:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all) ;
# Set up an array of data to test
my $data = '0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1';
my @data;
push @data, $data for 1 .. 1000;
cmpthese(1000, {
   'Full split'      => sub {my @refnos; for (@data){my @splitted = split(/~/,$_); push(@refnos,$splitted[1]);}},
   'Limited split'   => sub {my @refnos; for (@data){my @splitted = split(/~/,$_,3); push(@refnos,$splitted[1]);}},
   'Optimized split' => sub {my @refnos; for (@data){push @refnos, (split(/~/,$_,3))[1];}},
});
Results:
                   Rate Full split Limited split Optimized split
Full split       23.1/s         --          -82%            -93%
Limited split     127/s       452%            --            -63%
Optimized split   343/s      1387%          169%              --
Some things to remember:
- Don't split into more fields than needed.
- Avoid assigning to an array if you only need one field.
- Even better, use split with an index into its results directly as the argument to push.
By using these simple rules, we get an almost 14-fold speed increase!
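As a quick illustration of those rules applied to the loop body (a minimal sketch, reusing the variable names from the benchmark above):
# before: full split into a temporary array
my @splitted = split /~/, $_;
push @refnos, $splitted[1];
# after: limited split, indexed directly, no temporary array
push @refnos, (split /~/, $_, 3)[1];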
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
I wondered how a plain regex would fare. Probably not enough in it to be significant.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all) ;
# Set up an array of data to test
my $data = '0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1';
my @data;
push @data, $data for 1 .. 1000;
cmpthese(
1000,
{
    'Full split'      => sub {my @refnos; for (@data){my @splitted = split(/~/,$_); push(@refnos,$splitted[1]);}},
    'Limited split'   => sub {my @refnos; for (@data){my @splitted = split(/~/,$_,3); push(@refnos,$splitted[1]);}},
    'Optimized split' => sub {my @refnos; for (@data){push @refnos, (split(/~/,$_,3))[1];}},
    'regex'           => sub {my @refnos; for (@data){push @refnos, /~([^~]+)~/;}},
}
);
                   Rate Full split Limited split Optimized split regex
Full split       23.3/s         --          -81%            -90%  -91%
Limited split     120/s       413%            --            -51%  -54%
Optimized split   244/s       946%          104%              --   -7%
regex             262/s      1024%          119%              7%    --
Re: Needed Performance improvement in reading and fetching from a file
by smiffy (Pilgrim) on Oct 08, 2008 at 05:16 UTC
I can't honestly see anything wrong with doing that split, nor any reason why you should have a performance issue with only 20k records.
However...
- chomp is a superfluous operation - why strip from the end of the line when you are only reading the second field?
- I know that you say this is a code fragment. One thing to consider, if it were not a fragment, is that you first populate an array and then run through it to print. That push() could have been print "$data[1]\n"; rather than running through two loops (see the sketch after this list).
- Are you absolutely sure that your performance issues are coming from here and not elsewhere in your code?
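A minimal sketch of that point, assuming the only goal is to print the second field (the file name is a placeholder):
use strict;
use warnings;

open my $fh, '<', 'payments.txt' or die "Cannot open payments.txt: $!";
while (my $line = <$fh>) {
    # no chomp needed: we only look at the second ~-separated field
    print +(split /~/, $line, 3)[1], "\n";
}
close $fh;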
Hope this helps.
Sorry - obviously didn't make myself clear. I meant looping through the @refnos array (with foreach) to print, not looping through the @data array; in the example, the @refnos array is superfluous although it is probably doing more in the whole application - or at least I hope it is.
I wasn't actually aware - or had forgotten - that split could be limited as in CountZero's example. Every day I learn something (or am reminded of it) is a day not wasted. Or something.
Re: Needed Performance improvement in reading and fetching from a file
by perrin (Chancellor) on Oct 08, 2008 at 05:33 UTC
It looks like that program is very fast in terms of handling huge files; can I customize it to my needs?
Re: Needed Performance improvement in reading and fetching from a file
by gone2015 (Deacon) on Oct 08, 2008 at 12:02 UTC
The problem appears to come in two parts:
- extracting the "UTR" from the input line
- testing whether that "UTR" is one of the "sent UTR"s
You say this is critical to performance, so I'm assuming that this is going to filter out a lot of records which don't need further processing.
Extracting the "UTR" is an SMOC, and the only question is what Perl will do quickest. I found (see below) that good old fashioned index/substr did the trick -- that small part of the puzzle runs ~9 times faster. (If the first field is fixed length, you could do better still.) Note that for records you do want to process you'll need to do the split as well -- so the actual saving depends on what proportion of records are being filtered out.
Testing whether the "UTR" is one of the "sent UTR"s requires some sort of search/match. The grep in the code is running linear search along @sentUTRs, what's more, it processes every entry even if there's been a match already. I suggest a hash would be a better choice.
CAVEAT: I have just realised that the test grep($r =~ /$_/, @sentUTRs) is, of course, not grep($r eq $_, @sentUTRs) -- if partial matches are essential, then a hash won't cut it :-(
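If partial matches really are essential, one alternative worth trying (my suggestion, not part of the benchmark below) is to precompile the sent UTRs into a single alternation, so each record is scanned once rather than once per entry of @sentUTRs:
# quotemeta protects any regex metacharacters in the sent UTRs
my $sent_re = do {
    my $alt = join '|', map { quotemeta } @sentUTRs;
    qr/$alt/ ;
} ;

# then, inside the per-record loop:
next if $r =~ $sent_re ;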
Code below. Using index/substr and a hash ran ~16 times faster on my artificial test. Benchmark output (edited for clarity):
Benchmark: timing 400000 iterations
Split : 33.39 usr + 0.01 sys = 33.40 CPU @ 11976.05/s
Regex_b: 5.56 usr + 0.00 sys = 5.56 CPU @ 71942.45/s
Regex_a: 5.33 usr + 0.01 sys = 5.34 CPU @ 74906.37/s
Index : 3.87 usr + 0.00 sys = 3.87 CPU @ 103359.17/s
Benchmark: timing 200000 iterations
Split Grep: 55.82 usr + 0.02 sys = 55.84 CPU @ 3581.66/s
Index Grep: 39.86 usr + 0.01 sys = 39.87 CPU @ 5016.30/s
Split Hash: 18.12 usr + 0.01 sys = 18.13 CPU @ 11031.44/s
Index Hash: 3.30 usr + 0.00 sys = 3.30 CPU @ 60606.06/s
YMMV.
As you'd expect, fiddling with the coding to optimize the extraction of the "UTR" makes only a modest difference. Changing the algorithm for searching the "sentUTRs" makes a rather bigger difference.
Update: added the essential exists to the hash lookups, and updated the benchmark timings.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark () ;
# Gather in the data
my @input = <DATA> ;
# Extracting the 'UTR'
print "Testing the 'UTR' extraction\n" ;
for (@input) {
my $r_s = by_split() ;
my $r_a = by_regex_a() ;
my $r_b = by_regex_b() ;
my $r_i = by_index() ;
my $s = "" ;
if ($r_a ne $r_s) { $s .= " BUT \$r_a='$r_a'" ; } ;
if ($r_b ne $r_s) { $s .= " BUT \$r_b='$r_b'" ; } ;
if ($r_i ne $r_s) { $s .= " BUT \$r_i='$r_i'" ; } ;
print " $r_s", ($s ? $s : " OK"), "\n" ;
} ;
Benchmark::timethese(400000, {
'Split ' => sub { by_split() for (@input) ; },
'Regex_a' => sub { by_regex_a() for (@input) ; },
'Regex_b' => sub { by_regex_b() for (@input) ; },
'Index ' => sub { by_index() for (@input) ; },
});
sub by_split {
my @data = split(/~/, $_) ; return $data[1] ;
} ;
sub by_regex_a {
m/~(.*?)~/ ; return $1 ;
} ;
sub by_regex_b {
m/~([^~]*)~/ ; return $1 ;
} ;
sub by_index {
my $i = index($_, '~') + 1 ;
return substr($_, $i, index($_, '~', $i) - $i) ;
} ;
# Testing for existing 'UTR'
my @received = map by_split(), @input ;
my @sentUTRs = ('ffsdahgdf', 'hjgfsdfghgaghsfd', $received[3],
'ppuiwdwsc', '4155dvcs7', $received[1]) ;
my %sentUTRs ; @sentUTRs{@sentUTRs} = undef ;
Benchmark::timethese(200000, {
'Split Grep' => sub { for (@input) {
my $r = by_split() ;
next if grep($r =~ /$_/, @sentUTRs) ;
$r .= $r ;
} ;
},
'Split Hash' => sub { for (@input) {
my $r = by_split() ;
next if exists $sentUTRs{$r} ;
$r .= $r ;
} ;
},
'Index Grep' => sub { for (@input) {
my $r = by_index() ;
next if grep($r =~ /$_/, @sentUTRs) ;
$r .= $r ;
} ;
},
'Index Hash' => sub { for (@input) {
my $r = by_index() ;
next if exists $sentUTRs{$r} ;
$r .= $r ;
} ;
},
});
__DATA__
0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906472983472834HJR~UTRIN9080980866~ 1222706~INR~20080623~NI209960~AMEX0888888 ~FRAGNOS EXPRESS - TRS CARD S DIVISI~4578962 ~/BNF/9822644928 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IO CL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906568946748922INR~ZP HLHLKJ87 ~ 1437865.95~INR~20080623~NI209969~HSBC0560002 ~MOTOSPECT UNILEVER LIMITED ~1234567 ~/INFO/ATTN: ~//REF 1104210 PLEASE FIND THE DET ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906506749056822INR~Q08709798905745~ 5960.74~INR~20080623~NI209987~ ~SDV AIR LINK REVOS LIMITED ~458ss453 ~ ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906503389054302INR~UTRI790898U0166~ 2414~INR~20080623~NI209976~ ~FRAGNOS EXPRESS - TRS CARD S DIVISI~ ~/BNF/9826805798 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
I am really happy to see the above. I will try it; it's not only a performance improvement for my program but also a very good learning experience.
You are right: index + hash turned out to be the best combination, and I saw speed as below.
Split : 107 wallclock secs (49.84 usr + 1.81 sys = 51.65 CPU) @ 0.02/s (n=1)~
I considered 2.6 lakh (260,000) records and timed the whole program, hash versus grep. Believe me, grep is out of the question; it takes about as long as counting from 1 to 2.6 lakh.
I guess I made a poor choice using grep in this scenario; indeed, grep cannot be used when it comes to huge data.
Re: Needed Performance improvement in reading and fetching from a file
by fmerges (Chaplain) on Oct 08, 2008 at 05:44 UTC
Hi,
I agree with smiffy. Even splitting only up to the fields you need would not make that much difference; pushing to an array and later looping over it again for printing is superfluous. If you really need to get at the data that way from another part of the program, you could also think of using an iterator/closure (see the sketch below). And again, are you really sure it's this section that is taking all the time? Have you used Devel::DProf or Benchmark::Stopwatch?
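A minimal sketch of the iterator/closure idea (the file name and field position are assumptions based on the thread, not fmerges' own code):
use strict;
use warnings;

# Returns a closure that hands back the next UTR on each call,
# instead of building the whole @refnos array up front.
sub make_utr_iterator {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    return sub {
        defined(my $line = <$fh>) or return;
        return (split /~/, $line, 3)[1];
    };
}

my $next_utr = make_utr_iterator('payments.txt');
while (defined(my $utr = $next_utr->())) {
    print "$utr\n";
}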
Regards,
fmerges at irc.freenode.net
I do have other blocks in the program, but I am not really sure how to measure performance, so I suspected this splitting operation.
In fact, my requirement is a little bit complex.
Based on the second field fetched, I will do some comparisons which might then need all the fields, which I use for CSV conversion (to a CSV file) in a later part. So it is wise to get only the second field initially (using the optimized split as said above) and then, if needed, do the full split later (see the sketch below).
I will try to see if other parts of the program are really hurting performance.
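A minimal sketch of that two-stage approach, assuming the comparison only needs the UTR; %wanted_utr is a hypothetical stand-in for whatever the real check is:
use strict;
use warnings;

my %wanted_utr = map { $_ => 1 } qw(UTRIR8709990166 UTRIN9080980866);   # hypothetical filter

while (my $line = <STDIN>) {
    # cheap pass: pull out only the second field
    my $utr = (split /~/, $line, 3)[1];
    next unless $wanted_utr{$utr};

    # expensive pass: split the whole record only for lines we keep
    my @fields = split /~/, $line;
    print join(',', @fields);          # stand-in for the real CSV conversion
}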
"but not really sure to how to find performance"
It sounds like you would benefit from benchmarking/profiling your code. See Debugging and Optimization in the tutorials section of this site; also worth looking at is the Devel::NYTProf module, which I mention here.
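For example, a typical Devel::NYTProf run looks like this (the script name is a placeholder):
perl -d:NYTProf your_script.pl    # run the script under the profiler, writes nytprof.out
nytprofhtml                       # convert nytprof.out into an HTML report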
Hope this helps
Martin
Re: Needed Performance improvement in reading and fetching from a file
by GrandFather (Saint) on Oct 08, 2008 at 09:43 UTC
If you need a bigger answer you need to paint a bigger picture. There are many ways to speed up searches, and they all depend on what you are searching for, how big the haystack is, what you are going to do when you find each item and how many items there are. Good techniques also depend on whether the file is sorted, how big the keys are, how much ancillary data there is, whether the file is across a network, on local disk or in memory, and probably a slew of other things.
The best thing you can do is describe the bigger picture - 20,000 records are not very many, so most likely there is something in your bigger process that you are doing inefficiently. It's unlikely to be a Perl construct; it's much more likely to be something like needlessly nested loops (quite possibly the grep you mentioned).
We have a sample of your data. Now describe what you need to do with it.
Perl reduces RSI - it saves typing
The data in the reference file is as below.
UTRIR8709990166-PRIS
UTRIR8709990166-IONJ
UTRIR8709990166-SONIC
UTRIR8709990166-INTR
UTRIR8709990166-MNSS
UTRIR8709990166-POIO
and so on
I am storing the UTR (payment reference number) along with the client, separated by '-'.
I will read the reference file containing the above information, split each line, and take the UTR number into the @sentUTRs array.
For information, I am doing the below:
- receive a flat text file every 15 minutes containing payments, with fields separated by ~ on each line
- run this Perl script and consider only new payments, i.e. check the reference file for already-processed payments, ignore those, process the new ones, and update the reference file at the end
- write all the new payments into a CSV file per client
So to achieve the above, I just open the flat file, read each line, split it, take the second field containing the UTR number, match it against the ones already recorded in the reference file, and do further processing.
There are many options that may help solve your problem. For a start, if it is the same file every 15 minutes then you can remember (possibly in a configuration file) where you had processed up to last time and continue from that point this time - no searching required at all!
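A minimal sketch of that idea, assuming the flat file only ever grows between runs and the offset is kept in a small state file (both file names are placeholders):
use strict;
use warnings;

my $data_file  = 'payments.txt';   # the flat file that grows every 15 minutes
my $state_file = 'payments.pos';   # remembers how far we got last run

# Read the saved offset, if any
my $offset = 0;
if (open my $state, '<', $state_file) {
    $offset = <$state> // 0;
    chomp $offset;
    close $state;
}

open my $fh, '<', $data_file or die "Cannot open $data_file: $!";
seek $fh, $offset, 0;              # jump past everything already processed

while (my $line = <$fh>) {
    my $utr = (split /~/, $line, 3)[1];
    print "New payment: $utr\n";   # stand-in for the real processing
}

# Save the new offset for the next run
open my $state, '>', $state_file or die "Cannot write $state_file: $!";
print {$state} tell($fh), "\n";
close $state;
close $fh;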
The absolute standard fix to your immediate problem is to store your payment numbers (keys) in a hash then use a very fast constant time lookup (that's what hashes do when you give them a key and ask for a value) for your match check. Consider:
use strict;
use warnings;
my @refnos = ();
my $old = <<OLD;
UTRIR8709990166
ZPHLHLKJ87
OLD
my %oldPayments;
open my $payments, '<', \$old;
%oldPayments = map {$_ => undef} grep {chomp; length} <$payments>;
close $payments;
print "Reading UTR Payment numbers \n";
while (<DATA>) {
chomp;
my @data = split (/~/, $_);
(my $utr = uc $data[1]) =~ s/\s*//g;
next if exists $oldPayments{$utr};
$oldPayments{$utr} = $data[1];
print "Payment $utr received of $data[2]\n";
}
open $payments, '>', \$old;
print $payments join "\n", sort keys %oldPayments, '';
close $payments;
print "New payments are:\n ";
print join "\n ", grep {defined $oldPayments{$_}} sort keys %oldPaym
+ents;
__DATA__
0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906472983472834HJR~UTRIN9080980866~ 1222706~INR~20080623~NI209960~AMEX0888888 ~FRAGNOS EXPRESS - TRS CARD S DIVISI~4578962 ~/BNF/9822644928 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IO CL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906568946748922INR~ZP HLHLKJ87 ~ 1437865.95~INR~20080623~NI209969~HSBC0560002 ~MOTOSPECT UNILEVER LIMITED ~1234567 ~/INFO/ATTN: ~//REF 1104210 PLEASE FIND THE DET ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906506749056822INR~Q08709798905745~ 5960.74~INR~20080623~NI209987~ ~SDV AIR LINK REVOS LIMITED ~458ss453 ~ ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906503389054302INR~UTRI790898U0166~ 2414~INR~20080623~NI209976~ ~FRAGNOS EXPRESS - TRS CARD S DIVISI~ ~/BNF/9826805798 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
Prints:
Reading UTR Payment numbers
Payment UTRIN9080980866 received of 1222706
Payment Q08709798905745 received of 5960.74
Payment UTRI790898U0166 received of 2414
New payments are:
Q08709798905745
UTRI790898U0166
UTRIN9080980866
Of course I've used a variable as a file to save needing a disk-based file for the example; in practice you would use a real file.
However, if your data set gets very large (millions of entries perhaps) you should seriously consider using a database instead of a flat file if at all possible.
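Should it come to that, here is a minimal sketch of the database approach using DBI with SQLite (database, table and column names are my own assumptions, not part of the thread):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=payments.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# A primary key on the UTR makes the duplicate check an index lookup
$dbh->do('CREATE TABLE IF NOT EXISTS sent_utrs (utr TEXT PRIMARY KEY)');

my $insert = $dbh->prepare('INSERT OR IGNORE INTO sent_utrs (utr) VALUES (?)');

while (my $line = <STDIN>) {
    my $utr = (split /~/, $line, 3)[1];
    # execute() reports 0 rows affected when the UTR was already there, 1 when it is new
    my $added = $insert->execute($utr);
    print "New payment: $utr\n" if $added > 0;
}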
Perl reduces RSI - it saves typing
Re: Needed Performance improvement in reading and fetching from a file
by NiJo (Friar) on Oct 09, 2008 at 17:48 UTC
Your code is working at a low level with many Perl instructions. It would become much simpler, and should become a lot faster, using C-based modules; DBD::CSV seems to be the way to go (see the sketch below). I suspect that you have to use the other fields somewhere else.
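A minimal, hedged sketch of the DBD::CSV idea; the file name, the generated column names, the field count and the assumption that the flat file has no header row are all mine, so treat this as a starting point rather than working code:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_dir        => '.',     # directory holding the flat file
    csv_sep_char => '~',     # fields are separated by ~
    RaiseError   => 1,
});

# The file has no header line, so supply column names ourselves
$dbh->{csv_tables}{payments} = {
    f_file    => 'payments.txt',
    col_names => [ map { "col$_" } 1 .. 21 ],   # adjust 21 to the real field count
};

my $sth = $dbh->prepare('SELECT col2, col3 FROM payments');
$sth->execute;
while (my ($utr, $amount) = $sth->fetchrow_array) {
    print "$utr => $amount\n";
}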
Re: Needed Performance improvement in reading and fetching from a file
by aufflick (Deacon) on Oct 11, 2008 at 10:09 UTC
So what you are basically doing is:
1. Checking if column 2 has been seen already - if so, next line
2. Else, do some processing on the row and record the value of column 2 as seen
This is a pretty common thing to do and can be super fast. As already pointed out, the easiest win is to use a hash for maintaining the record of what col 2 values have been seen. That will get your check nearer O(1) than O(N).
20k isn't a lot - this is all you should have to do. If you find yourself dealing with a LOT of records (say half a million) you can get really cheap use of multiple CPUs/cores (assuming you have them) by writing two scripts: one to strip out all the lines with duplicated col 2 values, so the second can skip that step. Then pipe the output of one script to the input of the other and Unix will run the two processes in parallel for you - assuming you are using a Unix OS, that is. Something like:
cat the_file.txt | remove_duplicates.pl | process_data.pl
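A minimal sketch of what the hypothetical remove_duplicates.pl filter could look like:
#!/usr/bin/perl
use strict;
use warnings;

# pass through only the first line seen for each column-2 value
my %seen;
while (my $line = <STDIN>) {
    my $col2 = (split /~/, $line, 3)[1];
    print $line unless $seen{$col2}++;
}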