As others have mentioned, you're a bit short on details. But I had a bit of time this morning, so I looked over the code briefly. I noticed that in the first subroutine you have a nested loop in which you make a pass through the file for each pattern. This is normally less efficient than making a single pass through the file and checking each line against every pattern, since I/O time is generally "expensive" compared to scanning a string. (I swapped the inner and outer loops in getCsvHash2, shown in the code listing below.)
Also, since you're looping through a small set of patterns, you may be spending a fair amount of time recompiling the regular expression on each pass through the loop. If you're only using a small number of patterns, it may be worth it (performance-wise) to let Perl compile the regular expressions only once. I rearranged the code a bit and came up with the function getCsvHash3. (Note: I may have error(s) in this, so you'll want to test it to ensure that you get the same results.)
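getCsvHash3 gets part of its speed by hard-coding the three patterns. If you need to keep the pattern list data-driven, another option (a rough sketch, untested against your data) is to precompile each pattern with qr// before the read loop and still make only one pass over the file:

    # Rough sketch, untested: precompile each pattern once with qr//,
    # then reuse the compiled regexes during a single pass over the file.
    my @Patterns = ("xxx", "SSS", "s:S");
    my (%compiled, %master);

    for my $wlp (@Patterns) {
        (my $key = $wlp) =~ s/[\s+|:]/_/g;   # same key-munging as your code
        $compiled{$key} = qr/"\Q$wlp\E/;     # compiled once; \Q escapes any metachars
    }

    open my $fh, '<', 'csv_file.csv' or die $!;
    while (my $line = <$fh>) {
        my @csv = split ",", $line;
        for my $key (keys %compiled) {
            push @{ $master{$key} }, $line if $csv[1] =~ $compiled{$key};
        }
    }
    close $fh;

I haven't benchmarked that variant, but since each qr// is compiled only once, I'd expect it to land closer to getCsvHash3 than to the string-interpolated versions.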
Of course, the only way to be really sure about the performance of changes is to measure them. So I whipped up a test file and coded up a benchmark to compare the three versions:
    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark qw(cmpthese);

    my @Patterns = ("xxx", "SSS", "s:S");
    my %master;

    cmpthese(50, {
        orig  => \&getCsvHash1,
        swap  => \&getCsvHash2,
        regex => \&getCsvHash3,
    });

    sub getCsvHash1 {
        %master = ();                    # clear the hash
        foreach my $wlp (@Patterns) {
            my $key = $wlp;
            $key =~ s/[\s+|:]/_/g;
            open FILE, "csv_file.csv" or die $!;
            while (<FILE>) {
                my $line = $_;
                my @csv = split(",", $line);
                if ($csv[1] =~ /"$wlp/) {
                    push @{$master{$key}}, $line;   # push as value of a hash
                }
            }                            # while (<FILE>) ends here
            close FILE;
        }                                # foreach $wlp ends here
    }                                    # your original getWhiteListCsvArrays

    sub getCsvHash2 {
        %master = ();                    # clear the hash
        open FILE, "csv_file.csv" or die $!;
        while (<FILE>) {
            my $line = $_;
            foreach my $wlp (@Patterns) {
                my $key = $wlp;
                $key =~ s/[\s+|:]/_/g;
                my @csv = split(",", $line);
                if ($csv[1] =~ /"$wlp/) {
                    push @{$master{$key}}, $line;
                }
            }
        }
        close FILE;
    }

    sub getCsvHash3 {
        %master = ();                    # clear the hash
        open FILE, "csv_file.csv" or die $!;
        while (<FILE>) {
            my $line = $_;
            my @csv = split(",", $line);
            # keys match the munged keys the other two subs produce
            if ($csv[1] =~ /"xxx/) { push @{$master{'xxx'}}, $line; }
            if ($csv[1] =~ /"SSS/) { push @{$master{'SSS'}}, $line; }
            if ($csv[1] =~ /"s:S/) { push @{$master{'s_S'}}, $line; }
        }
        close FILE;
    }
So if all of your time is spent in the getCsvHash routine, then making these changes will help you out: on my machine the third version is nearly three times faster than your original, and more than five times faster than the loop-swapped version. (Interestingly, in this run swapping the loops alone actually came out slower than the original.) But if most of your execution time is spent in other routines, then you'll want to profile them and make improvements as indicated.
    $ time perl pm923771.pl
            Rate  swap  orig regex
    swap  0.544/s    --  -47%  -81%
    orig   1.03/s   89%    --  -64%
    regex  2.87/s  427%  178%    --

    real    2m38.179s
    user    2m34.503s
    sys     0m3.463s
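If most of the time turns out to be spent outside this routine, Devel::NYTProf from CPAN makes it easy to see where it's actually going (the script name below is just a placeholder):

    $ perl -d:NYTProf your_script.pl   # writes profile data to nytprof.out
    $ nytprofhtml                      # turns nytprof.out into an HTML report

Then you can drill into the report to see which subroutines and lines are eating the time.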
...roboticus
When your only tool is a hammer, all problems look like your thumb.