in reply to Needed Performance improvement in reading and fetching from a file
Problem appears to come in two parts:
Extracting the "UTR" is an SMOC, and the only question is what Perl will do quickest. I found (see below) that good old-fashioned index/substr did the trick: that small part of the puzzle runs ~9 times faster than split. (If the first field is fixed length, you could do better still.) Note that for the records you do want to process you'll need to do the split as well, so the actual saving depends on what proportion of records are being filtered out.
Testing whether the "UTR" is one of the "sent UTR"s requires some sort of search/match. The grep in the code runs a linear search along @sentUTRs and, what's more, it processes every entry even if there has already been a match. I suggest a hash would be a better choice.
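To make the point about grep concrete, here is a minimal sketch that counts how many entries each approach actually tests. It assumes exact (eq) matching is what is wanted, uses a made-up list of 1000 keys, and brings in any from List::Util purely for comparison; none of that is from the original code.

```perl
use strict;
use warnings;
use List::Util qw(any);

# Made-up data for illustration: 1000 keys, and a probe that
# matches the very first one.
my @sentUTRs = map { "UTR$_" } 1 .. 1000;
my %sentUTRs = map { $_ => 1 } @sentUTRs;
my $r        = 'UTR1';

# grep always walks the whole list, even after a match.
my $grep_tests = 0;
my $grep_hit   = grep { $grep_tests++; $r eq $_ } @sentUTRs;

# any() stops at the first match, but is still a linear search.
my $any_tests = 0;
my $any_hit   = any { $any_tests++; $r eq $_ } @sentUTRs;

# A hash lookup does not scan the list at all.
my $hash_hit = exists $sentUTRs{$r};

print "grep tested $grep_tests entries, any tested $any_tests\n";
```

With the match at the front of the list, grep still tests all 1000 entries while any tests one. In the worst case (no match) both scan everything, which is why the hash is the better fit.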
CAVEAT: I have just realised that the test grep($r =~ /$_/, @sentUTRs) is, of course, not grep($r eq $_, @sentUTRs) -- if partial matches are essential, then a hash won't cut it :-(
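If partial matches really are needed, one possible alternative to the grep (not something I have benchmarked against the data above) is to build a single alternation regex from @sentUTRs once, outside the loop. The sketch below is an assumption-laden illustration: the list contents are made up, already_sent is a hypothetical helper, and quotemeta makes every entry match literally, which differs from the original /$_/ if any entry was meant to be a pattern.

```perl
use strict;
use warnings;

# Made-up partial keys for illustration.
my @sentUTRs = ('UTRIR87', 'Q0870', 'ppuiwdwsc');

# Escape each entry so it matches literally, then join into one pattern.
# Compiled once, so the per-record cost is a single regex match.
my $alt     = join '|', map { quotemeta($_) } @sentUTRs;
my $sent_re = qr/$alt/;

# Hypothetical helper: true if any sent key occurs inside $utr.
sub already_sent {
    my ($utr) = @_;
    return $utr =~ $sent_re ? 1 : 0;
}

for my $utr ('UTRIR8709990166', 'UTRIN9080980866', 'Q08709798905745') {
    printf "%-16s %s\n", $utr, already_sent($utr) ? 'skip' : 'process';
}
```

Beware the empty list: if @sentUTRs has no entries the pattern is empty and matches everything, so that case needs its own check.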
Code below. Using index/substr and a hash ran ~16 times faster on my artificial test. Benchmark output (edited for clarity):
Benchmark: timing 400000 iterations
Split : 33.39 usr + 0.01 sys = 33.40 CPU @ 11976.05/s
Regex_b: 5.56 usr + 0.00 sys = 5.56 CPU @ 71942.45/s
Regex_a: 5.33 usr + 0.01 sys = 5.34 CPU @ 74906.37/s
Index : 3.87 usr + 0.00 sys = 3.87 CPU @ 103359.17/s
Benchmark: timing 200000 iterations
Split Grep: 55.82 usr + 0.02 sys = 55.84 CPU @ 3581.66/s
Index Grep: 39.86 usr + 0.01 sys = 39.87 CPU @ 5016.30/s
Split Hash: 18.12 usr + 0.01 sys = 18.13 CPU @ 11031.44/s
Index Hash: 3.30 usr + 0.00 sys = 3.30 CPU @ 60606.06/s
YMMV.
As you'd expect, fiddling with the coding to optimize the extraction of the "UTR" makes only a modest difference. Changing the algorithm for searching the "sentUTRs" makes a rather bigger difference.
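For what it's worth, the "fixed length first field" remark can be sketched like this. It assumes the first field is always 19 characters, as it happens to be in the sample records; $KEY_LEN and utr_fixed are names made up for illustration, and I have not timed this variant.

```perl
use strict;
use warnings;

# Assumption: the first field is always 19 characters wide, so the
# second field starts at offset 20 (just past the first '~').
my $KEY_LEN = 19;

# General case: find both delimiters with index().
sub utr_by_index {
    my ($line) = @_;
    my $i = index($line, '~') + 1;
    return substr($line, $i, index($line, '~', $i) - $i);
}

# Fixed-width case: only the closing delimiter has to be searched for.
sub utr_fixed {
    my ($line) = @_;
    my $i = $KEY_LEN + 1;
    return substr($line, $i, index($line, '~', $i) - $i);
}

my $line = '0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~';
print utr_by_index($line), "\n";
print utr_fixed($line), "\n";
```

This saves one index call per record, so expect a small gain at best, in line with the modest difference noted above.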
Update: added the essential exists to the hash lookups, and updated the benchmark timings.
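Why the exists is essential: a hash slice assignment like the one in the code creates the keys but leaves every value undef, so testing the value is always false. A minimal sketch, with a made-up two-entry list:

```perl
use strict;
use warnings;

# Made-up keys for illustration.
my @sentUTRs = ('UTRIR8709990166', 'Q08709798905745');

# Hash slice: creates every key; all the values are undef.
my %sentUTRs;
@sentUTRs{@sentUTRs} = ();

my $r = 'UTRIR8709990166';

# Testing the value is always false, because the value is undef.
my $by_value = $sentUTRs{$r} ? 1 : 0;

# Testing for the key gives the intended answer.
my $by_exists = exists($sentUTRs{$r}) ? 1 : 0;

print "value test:  $by_value\n";
print "exists test: $by_exists\n";
```

The alternative is to store true values (for example with map { $_ => 1 }), at which point a plain truth test works too.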
    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark () ;

    # Gather in the data
    my @input = <DATA> ;

    # Extracting the 'UTR'
    print "Testing the 'UTR' extraction\n" ;
    for (@input) {
      my $r_s = by_split() ;
      my $r_a = by_regex_a() ;
      my $r_b = by_regex_b() ;
      my $r_i = by_index() ;
      my $s = "" ;
      if ($r_a ne $r_s) { $s .= " BUT \$r_a='$r_a'" ; } ;
      if ($r_b ne $r_s) { $s .= " BUT \$r_b='$r_b'" ; } ;
      if ($r_i ne $r_s) { $s .= " BUT \$r_i='$r_i'" ; } ;
      print " $r_s", ($s ? $s : " OK"), "\n" ;
    } ;

    Benchmark::timethese(400000, {
      'Split  ' => sub { by_split()   for (@input) ; },
      'Regex_a' => sub { by_regex_a() for (@input) ; },
      'Regex_b' => sub { by_regex_b() for (@input) ; },
      'Index  ' => sub { by_index()   for (@input) ; },
    });

    sub by_split {
      my @data = split(/~/, $_) ;
      return $data[1] ;
    } ;

    sub by_regex_a {
      m/~(.*?)~/ ;
      return $1 ;
    } ;

    sub by_regex_b {
      m/~([^~]*)~/ ;
      return $1 ;
    } ;

    sub by_index {
      my $i = index($_, '~') + 1 ;
      return substr($_, $i, index($_, '~', $i) - $i) ;
    } ;

    # Testing for existing 'UTR'
    my @received = map by_split(), @input ;
    my @sentUTRs = ('ffsdahgdf', 'hjgfsdfghgaghsfd', $received[3],
                    'ppuiwdwsc', '4155dvcs7', $received[1]) ;
    my %sentUTRs ;
    @sentUTRs{@sentUTRs} = undef ;

    Benchmark::timethese(200000, {
      'Split Grep' => sub {
        for (@input) {
          my $r = by_split() ;
          next if grep($r =~ /$_/, @sentUTRs) ;
          $r .= $r ;
        } ;
      },
      'Split Hash' => sub {
        for (@input) {
          my $r = by_split() ;
          next if exists $sentUTRs{$r} ;
          $r .= $r ;
        } ;
      },
      'Index Grep' => sub {
        for (@input) {
          my $r = by_index() ;
          next if grep($r =~ /$_/, @sentUTRs) ;
          $r .= $r ;
        } ;
      },
      'Index Hash' => sub {
        for (@input) {
          my $r = by_index() ;
          next if exists $sentUTRs{$r} ;
          $r .= $r ;
        } ;
      },
    });

    __DATA__
    0906928472847292INR~UTRIR8709990166~ 700000~INR~20080623~RC425484~IFSCSEND001 ~Remiter Details ~1000007 ~TEST RTGS TRF7 ~ ~ ~ ~RTGS~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
    0906472983472834HJR~UTRIN9080980866~ 1222706~INR~20080623~NI209960~AMEX0888888 ~FRAGNOS EXPRESS - TRS CARD S DIVISI~4578962 ~/BNF/9822644928 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IO CL ~09065010889~0906501088900122INR~ 7~ 1~ 1
    0906568946748922INR~ZP HLHLKJ87 ~ 1437865.95~INR~20080623~NI209969~HSBC0560002 ~MOTOSPECT UNILEVER LIMITED ~1234567 ~/INFO/ATTN: ~//REF 1104210 PLEASE FIND THE DET ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
    0906506749056822INR~Q08709798905745~ 5960.74~INR~20080623~NI209987~ ~SDV AIR LINK REVOS LIMITED ~458ss453 ~ ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1
    0906503389054302INR~UTRI790898U0166~ 2414~INR~20080623~NI209976~ ~FRAGNOS EXPRESS - TRS CARD S DIVISI~ ~/BNF/9826805798 ~ ~ ~ ~NEFT~REVOSN OIL CORPORATION ~IOCL ~09065010889~0906501088900122INR~ 7~ 1~ 1