Re: Needed Performance improvement in reading and fetching from a file
by CountZero (Bishop) on Oct 08, 2008 at 05:33 UTC
Let's optimize the split function and see how fast we can get:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all) ;
# Set up an array of data to test
my $data = '0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1';
my @data;
push @data, $data for 1 .. 1000;
cmpthese(1000, {
   'Full split'      => sub {my @refnos; for (@data){my @splitted = split(/~/,$_); push(@refnos,$splitted[1]);}},
   'Limited split'   => sub {my @refnos; for (@data){my @splitted = split(/~/,$_,3); push(@refnos,$splitted[1]);}},
   'Optimized split' => sub {my @refnos; for (@data){push @refnos, (split(/~/,$_,3))[1];}},
});
Results:
                   Rate Full split Limited split Optimized split
Full split       23.1/s         --          -82%            -93%
Limited split     127/s       452%            --            -63%
Optimized split   343/s      1387%          169%              --
Some things to remember:
- Don't split into more fields than needed.
- Avoid assigning to an array if you only need one field.
- Even better, use split with an index into its results directly as the argument to push.
By using these simple rules, we get an almost 14-fold speed increase!
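As a quick illustration of those rules applied to the loop body (a minimal sketch, reusing the variable names from the benchmark above):
# before: full split into a temporary array
my @splitted = split /~/, $_;
push @refnos, $splitted[1];
# after: limited split, indexed directly, no temporary array
push @refnos, (split /~/, $_, 3)[1];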
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
I wondered how a plain regex would fare. Probably not enough in it to be significant.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all) ;
# Set up an array of data to test
my $data = '0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1';
my @data;
push @data, $data for 1 .. 1000;
cmpthese(
1000,
{
    'Full split'      => sub {my @refnos; for (@data){my @splitted = split(/~/,$_); push(@refnos,$splitted[1]);}},
    'Limited split'   => sub {my @refnos; for (@data){my @splitted = split(/~/,$_,3); push(@refnos,$splitted[1]);}},
    'Optimized split' => sub {my @refnos; for (@data){push @refnos, (split(/~/,$_,3))[1];}},
    'regex'           => sub {my @refnos; for (@data){push @refnos, /~([^~]+)~/;}},
}
);
                   Rate Full split Limited split Optimized split regex
Full split       23.3/s         --          -81%            -90%  -91%
Limited split     120/s       413%            --            -51%  -54%
Optimized split   244/s       946%          104%              --   -7%
regex             262/s      1024%          119%              7%    --
Re: Needed Performance improvement in reading and fetching from a file
by smiffy (Pilgrim) on Oct 08, 2008 at 05:16 UTC
I can't honestly see anything wrong with doing that split, nor any reason why you should have a performance issue with only 20k records.
However...
- chomp is a superfluous operation - why strip from the end of the line when you are only reading the second field?
- I know that you say this is a code fragment. One thing to consider, if it were not a fragment, is that you first populate an array and then run through it to print. That push() could have been print "$data[1]\n"; rather than running through two loops (see the sketch after this list).
- Are you absolutely sure that your performance issues are coming from here and not elsewhere in your code?
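A minimal sketch of that point, assuming the only goal is to print the second field (the file name is a placeholder):
use strict;
use warnings;

open my $fh, '<', 'payments.txt' or die "Cannot open payments.txt: $!";
while (my $line = <$fh>) {
    # no chomp needed: we only look at the second ~-separated field
    print +(split /~/, $line, 3)[1], "\n";
}
close $fh;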
Hope this helps.
Sorry - obviously didn't make myself clear. I meant looping through the @refnos array (with foreach) to print, not looping through the @data array; in the example, the @refnos array is superfluous although it is probably doing more in the whole application - or at least I hope it is.
I wasn't actually aware - or had forgotten - that split could be limited as in CountZero's example. Every day I learn something (or am reminded of it) is a day not wasted. Or something.
Re: Needed Performance improvement in reading and fetching from a file
by perrin (Chancellor) on Oct 08, 2008 at 05:33 UTC
It looks like that program is very fast in terms of handling huge files; can I customize it to my needs?
Re: Needed Performance improvement in reading and fetching from a file
by gone2015 (Deacon) on Oct 08, 2008 at 12:02 UTC
The problem appears to come in two parts:
- extracting the "UTR" from the input line
- testing whether that "UTR" is one of the "sent UTR"s
You say this is critical to performance, so I'm assuming that this is going to filter out a lot of records which don't need further processing.
Extracting the "UTR" is an SMOC, and the only question is what Perl will do quickest. I found (see below) that good old fashioned index/substr did the trick -- that small part of the puzzle runs ~9 times faster. (If the first field is fixed length, you could do better still.) Note that for records you do want to process you'll need to do the split as well -- so the actual saving depends on what proportion of records are being filtered out.
Testing whether the "UTR" is one of the "sent UTR"s requires some sort of search/match. The grep in the code is running linear search along @sentUTRs, what's more, it processes every entry even if there's been a match already. I suggest a hash would be a better choice.
CAVEAT: I have just realised that the test grep($r =~ /$_/, @sentUTRs) is, of course, not grep($r eq $_, @sentUTRs) -- if partial matches are essential, then a hash won't cut it :-(
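If partial matches really are essential, one alternative worth trying (my suggestion, not part of the benchmark below) is to precompile the sent UTRs into a single alternation, so each record is scanned once rather than once per entry of @sentUTRs:
# quotemeta protects any regex metacharacters in the sent UTRs
my $sent_re = do {
    my $alt = join '|', map { quotemeta } @sentUTRs;
    qr/$alt/ ;
} ;

# then, inside the per-record loop:
next if $r =~ $sent_re ;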
Code below. Using index/substr and a hash ran ~16 times faster on my artificial test. Benchmark output (edited for clarity):
Benchmark: timing 400000 iterations
Split : 33.39 usr + 0.01 sys = 33.40 CPU @ 11976.05/s
Regex_b: 5.56 usr + 0.00 sys = 5.56 CPU @ 71942.45/s
Regex_a: 5.33 usr + 0.01 sys = 5.34 CPU @ 74906.37/s
Index : 3.87 usr + 0.00 sys = 3.87 CPU @ 103359.17/s
Benchmark: timing 200000 iterations
Split Grep: 55.82 usr + 0.02 sys = 55.84 CPU @ 3581.66/s
Index Grep: 39.86 usr + 0.01 sys = 39.87 CPU @ 5016.30/s
Split Hash: 18.12 usr + 0.01 sys = 18.13 CPU @ 11031.44/s
Index Hash: 3.30 usr + 0.00 sys = 3.30 CPU @ 60606.06/s
YMMV.
As you'd expect, fiddling with the coding to optimize the extraction of the "UTR" makes only a modest difference. Changing the algorithm for searching the "sentUTRs" makes a rather bigger difference.
Update: added the essential exists to the hash lookups, and updated the benchmark timings.
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark () ;
# Gather in the data
my @input = <DATA> ;
# Extracting the 'UTR'
print "Testing the 'UTR' extraction\n" ;
for (@input) {
my $r_s = by_split() ;
my $r_a = by_regex_a() ;
my $r_b = by_regex_b() ;
my $r_i = by_index() ;
my $s = "" ;
if ($r_a ne $r_s) { $s .= " BUT \$r_a='$r_a'" ; } ;
if ($r_b ne $r_s) { $s .= " BUT \$r_b='$r_b'" ; } ;
if ($r_i ne $r_s) { $s .= " BUT \$r_i='$r_i'" ; } ;
print " $r_s", ($s ? $s : " OK"), "\n" ;
} ;
Benchmark::timethese(400000, {
'Split ' => sub { by_split() for (@input) ; },
'Regex_a' => sub { by_regex_a() for (@input) ; },
'Regex_b' => sub { by_regex_b() for (@input) ; },
'Index ' => sub { by_index() for (@input) ; },
});
sub by_split {
my @data = split(/~/, $_) ; return $data[1] ;
} ;
sub by_regex_a {
m/~(.*?)~/ ; return $1 ;
} ;
sub by_regex_b {
m/~([^~]*)~/ ; return $1 ;
} ;
sub by_index {
my $i = index($_, '~') + 1 ;
return substr($_, $i, index($_, '~', $i) - $i) ;
} ;
# Testing for existing 'UTR'
my @received = map by_split(), @input ;
my @sentUTRs = ('ffsdahgdf', 'hjgfsdfghgaghsfd', $received[3],
'ppuiwdwsc', '4155dvcs7', $received[1]) ;
my %sentUTRs ; @sentUTRs{@sentUTRs} = undef ;
Benchmark::timethese(200000, {
'Split Grep' => sub { for (@input) {
my $r = by_split() ;
next if grep($r =~ /$_/, @sentUTRs) ;
$r .= $r ;
} ;
},
'Split Hash' => sub { for (@input) {
my $r = by_split() ;
next if exists $sentUTRs{$r} ;
$r .= $r ;
} ;
},
'Index Grep' => sub { for (@input) {
my $r = by_index() ;
next if grep($r =~ /$_/, @sentUTRs) ;
$r .= $r ;
} ;
},
'Index Hash' => sub { for (@input) {
my $r = by_index() ;
next if exists $sentUTRs{$r} ;
$r .= $r ;
} ;
},
});
__DATA__
0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906472983472834HJR~UTRIN9080980866~ 1222706~INR~20080623~NI209960~AMEX0888888 ~FRAGNOS EXPRESS - TRS CARD S DIVISI~4578962 ~/BNF/9822644928 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IO CL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906568946748922INR~ZP HLHLKJ87 ~ 1437865.95~INR~20080623~NI209969~HSBC0560002 ~MOTOSPECT UNILEVER LIMITED ~1234567 ~/INFO/ATTN: ~//REF 1104210 PLEASE FIND THE DET ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906506749056822INR~Q08709798905745~ 5960.74~INR~20080623~NI209987~ ~SDV AIR LINK REVOS LIMITED ~458ss453 ~ ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906503389054302INR~UTRI790898U0166~ 2414~INR~20080623~NI209976~ ~FRAGNOS EXPRESS - TRS CARD S DIVISI~ ~/BNF/9826805798 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
I am really happy to see the above. I will try it; it's not only a performance improvement for my program but also a very good learning experience.
You are right: index + hash turned out to be the best combination, and I saw speed as below.
Split : 107 wallclock secs (49.84 usr + 1.81 sys = 51.65 CPU) @ 0.02/s (n=1)~
I considered 2.6 lakh (260,000) records and timed the whole program, hash versus grep. Believe me, grep is out of the question; it takes about as long as counting from 1 to 2.6 lakh.
I guess I made a poor choice using grep in this scenario; indeed, grep cannot be used when it comes to huge data.
Re: Needed Performance improvement in reading and fetching from a file
by fmerges (Chaplain) on Oct 08, 2008 at 05:44 UTC
Hi,
I agree with smiffy. Even splitting only up to the fields you need would not make that much difference; pushing to an array and later looping over it again for printing is superfluous. If you really need to get at the data that way from another part of the program, you could also think of using an iterator/closure (see the sketch below). And again, are you really sure it's this section that is taking all the time? Have you used Devel::DProf or Benchmark::Stopwatch?
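A minimal sketch of the iterator/closure idea (the file name and field position are assumptions based on the thread, not fmerges' own code):
use strict;
use warnings;

# Returns a closure that hands back the next UTR on each call,
# instead of building the whole @refnos array up front.
sub make_utr_iterator {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    return sub {
        defined(my $line = <$fh>) or return;
        return (split /~/, $line, 3)[1];
    };
}

my $next_utr = make_utr_iterator('payments.txt');
while (defined(my $utr = $next_utr->())) {
    print "$utr\n";
}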
Regards,
fmerges at irc.freenode.net
I do have other blocks in the program, but I am not really sure how to measure performance, so I suspected this splitting operation.
In fact, my requirement is a little bit complex.
Based on the second field fetched, I will do some comparisons which might then need all the fields, which I use for CSV conversion (to a CSV file) in a later part. So it is wise to get only the second field initially (using the optimized split as said above) and then, if needed, do the full split later (see the sketch below).
I will try to see if other parts of the program are really hurting performance.
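A minimal sketch of that two-stage approach, assuming the comparison only needs the UTR; %wanted_utr is a hypothetical stand-in for whatever the real check is:
use strict;
use warnings;

my %wanted_utr = map { $_ => 1 } qw(UTRIR8709990166 UTRIN9080980866);   # hypothetical filter

while (my $line = <STDIN>) {
    # cheap pass: pull out only the second field
    my $utr = (split /~/, $line, 3)[1];
    next unless $wanted_utr{$utr};

    # expensive pass: split the whole record only for lines we keep
    my @fields = split /~/, $line;
    print join(',', @fields);          # stand-in for the real CSV conversion
}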
"but not really sure to how to find performance"
It sounds like you would benefit from benchmarking/profiling your code. See Debugging and Optimization in the tutorials section of this site; also worth looking at is the Devel::NYTProf module, which I mention here.
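For example, a typical Devel::NYTProf run looks like this (the script name is a placeholder):
perl -d:NYTProf your_script.pl    # run the script under the profiler, writes nytprof.out
nytprofhtml                       # convert nytprof.out into an HTML report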
Hope this helps
Martin
Re: Needed Performance improvement in reading and fetching from a file
by GrandFather (Saint) on Oct 08, 2008 at 09:43 UTC
If you need a bigger answer you need to paint a bigger picture. There are many ways to speed up searches, and they all depend on what you are searching for, how big the haystack is, what you are going to do when you find each item and how many items there are. Good techniques also depend on whether the file is sorted, how big the keys are, how much ancillary data there is, whether the file is across a network, on local disk or in memory, and probably a slew of other things.
The best thing you can do is describe the bigger picture - 20,000 records are not very many, so most likely there is something in your bigger process that you are doing inefficiently. It's unlikely to be a Perl construct; it's much more likely to be something like needlessly nested loops (quite possibly the grep you mentioned).
We have a sample of your data. Now describe what you need to do with it.
Perl reduces RSI - it saves typing
The data in the reference file is as below.
UTRIR8709990166-PRIS
UTRIR8709990166-IONJ
UTRIR8709990166-SONIC
UTRIR8709990166-INTR
UTRIR8709990166-MNSS
UTRIR8709990166-POIO
and so on
I am storing the UTR (payment reference number) along with the client, separated by '-'.
I will read the reference file containing the above information, split each line, and take the UTR number into the @sentUTRs array.
For information, I am doing the below:
- receive a flat text file every 15 minutes containing payments, with fields separated by ~ on each line
- run this Perl script and consider only new payments, i.e. check the reference file for already-processed payments, ignore those, process the new ones, and update the reference file at the end
- write all the new payments into a CSV file per client
So to achieve the above, I just open the flat file, read each line, split it, take the second field containing the UTR number, match it against the ones already recorded in the reference file, and do further processing.
There are many options that may help solve your problem. For a start, if it is the same file every 15 minutes then you can remember (possibly in a configuration file) where you had processed up to last time and continue from that point this time - no searching required at all!
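A minimal sketch of that idea, assuming the flat file only ever grows between runs and the offset is kept in a small state file (both file names are placeholders):
use strict;
use warnings;

my $data_file  = 'payments.txt';   # the flat file that grows every 15 minutes
my $state_file = 'payments.pos';   # remembers how far we got last run

# Read the saved offset, if any
my $offset = 0;
if (open my $state, '<', $state_file) {
    $offset = <$state> // 0;
    chomp $offset;
    close $state;
}

open my $fh, '<', $data_file or die "Cannot open $data_file: $!";
seek $fh, $offset, 0;              # jump past everything already processed

while (my $line = <$fh>) {
    my $utr = (split /~/, $line, 3)[1];
    print "New payment: $utr\n";   # stand-in for the real processing
}

# Save the new offset for the next run
open my $state, '>', $state_file or die "Cannot write $state_file: $!";
print {$state} tell($fh), "\n";
close $state;
close $fh;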
The absolute standard fix to your immediate problem is to store your payment numbers (keys) in a hash then use a very fast constant time lookup (that's what hashes do when you give them a key and ask for a value) for your match check. Consider:
use strict;
use warnings;
my @refnos = ();
my $old = <<OLD;
UTRIR8709990166
ZPHLHLKJ87
OLD
my %oldPayments;
open my $payments, '<', \$old;
%oldPayments = map {$_ => undef} grep {chomp; length} <$payments>;
close $payments;
print "Reading UTR Payment numbers \n";
while (<DATA>) {
chomp;
my @data = split (/~/, $_);
(my $utr = uc $data[1]) =~ s/\s*//g;
next if exists $oldPayments{$utr};
$oldPayments{$utr} = $data[1];
print "Payment $utr received of $data[2]\n";
}
open $payments, '>', \$old;
print $payments join "\n", sort keys %oldPayments, '';
close $payments;
print "New payments are:\n ";
print join "\n ", grep {defined $oldPayments{$_}} sort keys %oldPaym
+ents;
__DATA__
0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906472983472834HJR~UTRIN9080980866~ 1222706~INR~20080623~NI209960~AMEX0888888 ~FRAGNOS EXPRESS - TRS CARD S DIVISI~4578962 ~/BNF/9822644928 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IO CL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906568946748922INR~ZP HLHLKJ87 ~ 1437865.95~INR~20080623~NI209969~HSBC0560002 ~MOTOSPECT UNILEVER LIMITED ~1234567 ~/INFO/ATTN: ~//REF 1104210 PLEASE FIND THE DET ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906506749056822INR~Q08709798905745~ 5960.74~INR~20080623~NI209987~ ~SDV AIR LINK REVOS LIMITED ~458ss453 ~ ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906503389054302INR~UTRI790898U0166~ 2414~INR~20080623~NI209976~ ~FRAGNOS EXPRESS - TRS CARD S DIVISI~ ~/BNF/9826805798 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
Prints:
Reading UTR Payment numbers
Payment UTRIN9080980866 received of 1222706
Payment Q08709798905745 received of 5960.74
Payment UTRI790898U0166 received of 2414
New payments are:
Q08709798905745
UTRI790898U0166
UTRIN9080980866
Of course I've used a variable as a file to save needing a disk-based file for the example; in practice you would use a real file.
However, if your data set gets very large (millions of entries perhaps) you should seriously consider using a database instead of a flat file if at all possible.
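Should it come to that, here is a minimal sketch of the database approach using DBI with SQLite (database, table and column names are my own assumptions, not part of the thread):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=payments.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# A primary key on the UTR makes the duplicate check an index lookup
$dbh->do('CREATE TABLE IF NOT EXISTS sent_utrs (utr TEXT PRIMARY KEY)');

my $insert = $dbh->prepare('INSERT OR IGNORE INTO sent_utrs (utr) VALUES (?)');

while (my $line = <STDIN>) {
    my $utr = (split /~/, $line, 3)[1];
    # execute() reports 0 rows affected when the UTR was already there, 1 when it is new
    my $added = $insert->execute($utr);
    print "New payment: $utr\n" if $added > 0;
}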
Perl reduces RSI - it saves typing
Re: Needed Performance improvement in reading and fetching from a file
by NiJo (Friar) on Oct 09, 2008 at 17:48 UTC
Your code is working at a low level with many Perl instructions. It would become much simpler, and should become a lot faster, using C-based modules; DBD::CSV seems to be the way to go (see the sketch below). I suspect that you have to use the other fields somewhere else.
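A minimal, hedged sketch of the DBD::CSV idea; the file name, the generated column names, the field count and the assumption that the flat file has no header row are all mine, so treat this as a starting point rather than working code:
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    f_dir        => '.',     # directory holding the flat file
    csv_sep_char => '~',     # fields are separated by ~
    RaiseError   => 1,
});

# The file has no header line, so supply column names ourselves
$dbh->{csv_tables}{payments} = {
    f_file    => 'payments.txt',
    col_names => [ map { "col$_" } 1 .. 21 ],   # adjust 21 to the real field count
};

my $sth = $dbh->prepare('SELECT col2, col3 FROM payments');
$sth->execute;
while (my ($utr, $amount) = $sth->fetchrow_array) {
    print "$utr => $amount\n";
}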
Re: Needed Performance improvement in reading and fetching from a file
by aufflick (Deacon) on Oct 11, 2008 at 10:09 UTC
So what you are basically doing is:
1. Checking if column 2 has been seen already - if so, next line
2. Else, do some processing on the row and record the value of column 2 as seen
This is a pretty common thing to do and can be super fast. As already pointed out, the easiest win is to use a hash for maintaining the record of what col 2 values have been seen. That will get your check nearer O(1) than O(N).
20k isn't a lot - this is all you should have to do. If you find yourself dealing with a LOT of records (say half a million) you can get really cheap use of multiple CPUs/cores (assuming you have them) by writing two scripts: one to strip out all the lines with duplicated col 2 values, so the second can skip that step. Then pipe the output of one script to the input of the other and Unix will run the two processes in parallel for you - assuming you are using a Unix OS, that is. Something like:
cat the_file.txt | remove_duplicates.pl | process_data.pl
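A minimal sketch of what the hypothetical remove_duplicates.pl filter could look like:
#!/usr/bin/perl
use strict;
use warnings;

# pass through only the first line seen for each column-2 value
my %seen;
while (my $line = <STDIN>) {
    my $col2 = (split /~/, $line, 3)[1];
    print $line unless $seen{$col2}++;
}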