in reply to Needed Performance improvement in reading and fetching from a file

Problem appears to come in two parts:

You say this test is critical to performance, so I'm assuming it is going to filter out a lot of records which don't need further processing.

First: extracting the "UTR" is an SMOC, and the only question is what Perl will do quickest. I found (see below) that good old-fashioned index/substr did the trick -- that small part of the puzzle runs ~9 times faster. (If the first field is fixed length, you could do better still.) Note that for records you do want to process you'll still need the split as well -- so the actual saving depends on what proportion of records are being filtered out.
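To make the index/substr idea concrete, here's a minimal sketch of the extraction on its own (the sub name and sample record are mine, chosen to match the data below; it just pulls out the second '~'-delimited field without splitting the whole record):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Find the first '~', then grab everything up to the second '~'.
# Avoids splitting the entire (long) record into a list.
sub utr_by_index {
    my ($line) = @_;
    my $i = index($line, '~') + 1;                       # just past first '~'
    return substr($line, $i, index($line, '~', $i) - $i);
}

my $rec = "0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~";
print utr_by_index($rec), "\n";   # prints "UTRIR8709990166"
```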

Second: testing whether the "UTR" is one of the "sent UTR"s requires some sort of search/match. The grep in your code runs a linear search along @sentUTRs; what's more, it processes every entry even if there has already been a match. I suggest a hash would be a better choice.
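The hash version looks like this -- a small sketch with made-up UTR values; build the hash once up front, then each membership test is a single O(1) lookup instead of a scan of the whole array:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the lookup hash once. The hash slice creates every key
# with an undef value; we only care about key existence.
my @sentUTRs = ('UTRIR8709990166', 'UTRIN9080980866');
my %sentUTRs;
@sentUTRs{@sentUTRs} = ();

for my $utr ('UTRIR8709990166', 'ZPHLHLKJ87') {
    if (exists $sentUTRs{$utr}) {
        print "$utr already sent - skip\n";
    }
    else {
        print "$utr needs processing\n";
    }
}
```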

CAVEAT: I have just realised that the test grep($r =~ /$_/, @sentUTRs) is, of course, not grep($r eq $_, @sentUTRs) -- if partial matches are essential, then a hash won't cut it :-(
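If partial matches really are required, one possible middle ground (my suggestion, not something from the code above) is to join the sent UTRs into a single precompiled alternation, so each record costs one regex match instead of a grep that re-runs a match per entry:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compile all sent UTRs into one alternation, once.
# quotemeta protects any regex metacharacters in the UTR strings.
my @sentUTRs = ('UTRIR8709990166', '9080980866');
my $sent_re  = do {
    my $alt = join '|', map quotemeta, @sentUTRs;
    qr/$alt/;
};

# '9080980866' matches as a substring of this record's UTR.
my $r = 'UTRIN9080980866';
print "matched\n" if $r =~ $sent_re;
```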

Code below. Using index/substr and a hash ran ~16 times faster on my artificial test. Benchmark output (edited for clarity):

Benchmark: timing 400000 iterations
     Split  : 33.39 usr +  0.01 sys = 33.40 CPU @  11976.05/s
     Regex_b:  5.56 usr +  0.00 sys =  5.56 CPU @  71942.45/s
     Regex_a:  5.33 usr +  0.01 sys =  5.34 CPU @  74906.37/s
     Index  :  3.87 usr +  0.00 sys =  3.87 CPU @ 103359.17/s
Benchmark: timing 200000 iterations
  Split Grep: 55.82 usr +  0.02 sys = 55.84 CPU @   3581.66/s
  Index Grep: 39.86 usr +  0.01 sys = 39.87 CPU @   5016.30/s
  Split Hash: 18.12 usr +  0.01 sys = 18.13 CPU @  11031.44/s
  Index Hash:  3.30 usr +  0.00 sys =  3.30 CPU @  60606.06/s
YMMV.

As you'd expect, fiddling with the coding to optimize the extraction of the "UTR" makes only a modest difference. Changing the algorithm for searching the "sentUTRs" makes a rather bigger difference.

Update: added the essential exists to the hash lookups, and updated the benchmark timings.


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark ();

# Gather in the data
my @input = <DATA>;

# Extracting the 'UTR'
print "Testing the 'UTR' extraction\n";
for (@input) {
    my $r_s = by_split();
    my $r_a = by_regex_a();
    my $r_b = by_regex_b();
    my $r_i = by_index();
    my $s = "";
    if ($r_a ne $r_s) { $s .= " BUT \$r_a='$r_a'"; }
    if ($r_b ne $r_s) { $s .= " BUT \$r_b='$r_b'"; }
    if ($r_i ne $r_s) { $s .= " BUT \$r_i='$r_i'"; }
    print "  $r_s", ($s ? $s : " OK"), "\n";
}

Benchmark::timethese(400000, {
    'Split  ' => sub { by_split()   for (@input); },
    'Regex_a' => sub { by_regex_a() for (@input); },
    'Regex_b' => sub { by_regex_b() for (@input); },
    'Index  ' => sub { by_index()   for (@input); },
});

sub by_split   { my @data = split(/~/, $_); return $data[1]; }
sub by_regex_a { m/~(.*?)~/;   return $1; }
sub by_regex_b { m/~([^~]*)~/; return $1; }
sub by_index {
    my $i = index($_, '~') + 1;
    return substr($_, $i, index($_, '~', $i) - $i);
}

# Testing for existing 'UTR'
my @received = map by_split(), @input;
my @sentUTRs = ('ffsdahgdf', 'hjgfsdfghgaghsfd', $received[3],
                'ppuiwdwsc', '4155dvcs7', $received[1]);
my %sentUTRs;
@sentUTRs{@sentUTRs} = undef;

Benchmark::timethese(200000, {
    'Split Grep' => sub {
        for (@input) {
            my $r = by_split();
            next if grep($r =~ /$_/, @sentUTRs);
            $r .= $r;
        }
    },
    'Split Hash' => sub {
        for (@input) {
            my $r = by_split();
            next if exists $sentUTRs{$r};
            $r .= $r;
        }
    },
    'Index Grep' => sub {
        for (@input) {
            my $r = by_index();
            next if grep($r =~ /$_/, @sentUTRs);
            $r .= $r;
        }
    },
    'Index Hash' => sub {
        for (@input) {
            my $r = by_index();
            next if exists $sentUTRs{$r};
            $r .= $r;
        }
    },
});

__DATA__
0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906472983472834HJR~UTRIN9080980866~ 1222706~INR~20080623~NI209960~AMEX0888888 ~FRAGNOS EXPRESS - TRS CARD S DIVISI~4578962 ~/BNF/9822644928 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906568946748922INR~ZP HLHLKJ87 ~ 1437865.95~INR~20080623~NI209969~HSBC0560002 ~MOTOSPECT UNILEVER LIMITED ~1234567 ~/INFO/ATTN: ~//REF 1104210 PLEASE FIND THE DET ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906506749056822INR~Q08709798905745~ 5960.74~INR~20080623~NI209987~ ~SDV AIR LINK REVOS LIMITED ~458ss453 ~ ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
0906503389054302INR~UTRI790898U0166~ 2414~INR~20080623~NI209976~ ~FRAGNOS EXPRESS - TRS CARD S DIVISI~ ~/BNF/9826805798 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1

Re^2: Needed Performance improvement in reading and fetching from a file
by harishnuti (Beadle) on Oct 08, 2008 at 12:56 UTC

    I am really happy to see the above -- it's a very good lesson. I will try it; it's not only a performance improvement in my program but a very good piece of learning.
Re^2: Needed Performance improvement in reading and fetching from a file
by harishnuti (Beadle) on Oct 09, 2008 at 03:05 UTC

    You are right: index + hash turned out to be the best combination, and I saw speed as below.
    Split : 107 wallclock secs (49.84 usr + 1.81 sys = 51.65 CPU) @ 0.02/s (n=1)

    I used 2.6 lakh (260,000) records, and the whole program was timed with hash versus grep. Believe me, grep is out of the question -- it takes about as long as counting from 1 to 2.6 lakh.
    I guess I made a poor choice using grep in this scenario; grep really cannot be used when it comes to huge data.