Tara has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have just registered to the site and this is my first post. I am a Perl newbie. I recently wrote a script to create an extract out of an existing extract. The requirement is to pick up a few columns as specified by the client. It sounds simple, but here is the twist... we have to look up more than one extract to create the final file. For example: the account file contains 100 columns; we get a few of them from the main account file itself. Then, for the corresponding account's client, we get the client details from another file. In this way we have to look up 4 to 5 different files for the complete information. The design we suggested: have a configuration file with the field information. For each field it specifies the file name, which column is to be extracted from it, whether this is the main file or a lookup file, and, if a lookup file, the lookup key.
ACCT NUMBER|positionfile.delim|2|6
ACCT SHORT NAME|accountfile.delim|6|6
PO TYPE|positionfile.delim|2|3
CUSIP|securityfile.delim|8|5
LOC CODE|positionfile.delim|2|47
LOC NAME|locationfile.delim|47|4
How to read it: at the start it is specified that the main file is the account file. Start reading that file line by line. Then, for ACCT NUMBER, read the main file; the second column in the main file is the primary key for looking up positionfile.delim. Grep for that key and extract the 6th column. Similarly for LOC NAME: get the 47th column from the main file, use that value as the key in locationfile.delim and extract the 4th column. And so on... The main problem is that, although I have written working code for this, the speed is lethargic. Please suggest how to improve it, or any better design. The delimiter is a pipe, which is provided from another config file so that various types of delimited files can be supported.

###Reading layout
while (<$CONFIG>) {
    chomp;
    my @list = split(/$hash_ref->{DELIMITER}/);
    $file_list = $list[1];
    $file_list =~ s/CCYYMMDD/$date/g;
    push @{$look_info{$list[0]}}, $file_list . "-" . $list[2] . "-" . $list[3];
    push(@header, $list[0]);
}

###reading data file
while (<$SRC_FILE>) {
    @data = split(/$hash_ref->{DELIMITER}/, $_);
    while ((my $column) = each %look_info) {
        foreach my $pos_info (@{$look_info{$column}}) {
            @look_here = split(/\-/, $pos_info);
            #print "look here : @look_here \n";
            $lkp_file_name = $look_here[0];
            $key_location  = $look_here[1];
            $data_location = $look_here[2];
            $key_string    = $data[$key_location - 1];
            chomp $key_string;
            print "key_string : $key_string \n";
            my $key_pattern = "|" . $key_string . "|";
            #print "key string is : $key_pattern \n";
            $final_data = `grep "$key_pattern" "${indir}/${lkp_file_name}" | cut -d"|" -f"$data_location"`;
            #print "data value is : $final_data \n";
            chomp $final_data;
            push @{$final_record{$column}}, $final_data . "|";
        }
    }
}

###printing final file
my $i = 0;
my @currdata;
$headerstring = join('|', @header);
print $OUTPUT_FILE $headerstring . "\n";
while (1) {
    foreach my $head (@header) {
        chomp $head;
        $head =~ s/\\//g;
        @currdata = @{$final_record{$head}};
        print $OUTPUT_FILE "$currdata[$i]";
    }
    print $OUTPUT_FILE "\n";
    $i++;
    if ($i > $#currdata) {
        last;
    }
}

Replies are listed 'Best First'.
Re: Perl Script performance issue
by Athanasius (Archbishop) on Dec 15, 2015 at 12:49 UTC

    Hello Tara, and welcome to the Monastery!

    One obvious slow-down is the shell-out call to system grep:

    $final_data = `grep "$key_pattern" "${indir}/${lkp_file_name}" | cut -d"|" -f"$data_location"`;

    which is expensive in itself and rendered especially so by being nested inside three loops. Try replacing it with a call to Perl’s inbuilt grep and see if that produces a significant speedup.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Perl's in-built grep works on list elements. How do I make it behave the way I intend? Do I have to store the lines in an array or something? Confused... please help.

        Hello again Tara,

        You have the line:

        $final_data = `grep "$key_pattern" "${indir}/${lkp_file_name}" | cut -d"|" -f"$data_location"`;

        Here is one way to write this in pure Perl (untested):

        my $file = $indir . '/' . $lkp_file_name;

        open(my $fh, '<', $file) or die "Cannot open file '$file' for reading: $!";
        my @matched_lines = grep { /\Q$key_pattern/ } <$fh>;
        close $fh or die "Cannot close file '$file': $!";

        my @fields;
        for my $line (@matched_lines) {
            my $field = (split /\|/, $line)[$data_location - 1];
            chomp $field;
            push @fields, $field;
        }

        $final_data = join "\n", @fields;

        (There should also be some error handling to deal with the possibility that a matched line contains an insufficient number of fields.)
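
        For instance (again untested), the extraction loop above could guard against short lines roughly like this; the warning text is only illustrative:

        for my $line (@matched_lines) {
            my @cols = split /\|/, $line;
            # Guard against lines with fewer fields than expected:
            if (@cols < $data_location) {
                warn "Skipping short line in '$file': $line";
                next;
            }
            my $field = $cols[$data_location - 1];
            chomp $field;
            push @fields, $field;
        }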

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Perl Script performance issue
by Cristoforo (Curate) on Dec 15, 2015 at 15:12 UTC
    A comment on your lines that use split(/$hash_ref->{DELIMITER}/...). If the delimiter is a pipe, |, then your split becomes split /|/, .... That will not work as you want, because the split pattern '|' in this case says to split on 'nothing OR nothing' (equivalent to split //, ...).

    To correct this, you should use the \Q escape sequence. That tells Perl to treat the pipe as a regular character rather than as the OR metacharacter in the regular expression.

    I saved a file of what are called the 'dirty dozen' of metacharacters. The pipe is one of the dirty dozen. They are:

    \ | ( ) [ { ^ $ * + ? .

    They all need escaping if they are to be treated as a 'regular' character in a regular expression (not as a metacharacter). So, your split should look like
    split(/\Q$hash_ref->{DELIMITER}\E/

    (There is also the built-in quotemeta function.)
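
    A quick, untested illustration of the difference, using a literal pipe as the delimiter:

    my $delim = '|';

    # Without \Q the pipe acts as regex alternation ('nothing OR nothing'),
    # so the string is split between every character:
    my @wrong = split /$delim/, 'a|b|c';        # ('a', '|', 'b', '|', 'c')

    # With \Q...\E the pipe is taken literally:
    my @right = split /\Q$delim\E/, 'a|b|c';    # ('a', 'b', 'c')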

      See also Quote and Quote-like Operators (about two-thirds of the way through the section) for more info on  \Q and friends. (Update: There are also a few examples of use in perlretut Part 2, in the section "More on characters, strings, and character classes".)


      Give a man a fish:  <%-{-{-{-<

        Thanks for pointing that out. Since we are passing the delimiter from a config file, we pass it as \|, which takes care of the escaping automatically.

Re: Perl Script performance issue
by Laurent_R (Canon) on Dec 15, 2015 at 13:49 UTC
    If the content of the lookup file(s) is not too large for the available memory, then the obvious cure would be to read their content only once, loading it into memory in an appropriate data structure (probably a hash table of some form or other), and then to read the source file line by line and look up the additional data pieces in memory.

    Such a process can often be several orders of magnitude faster (i.e. hundreds or even thousands of times faster, perhaps even more). But this is feasible only if the lookup data (or the part of it that you actually use) is not too large to fit in memory.

    Therefore the question asked by poj about the size of your data and about the current timings is really crucial.

    If the data is too large to fit into memory, there is still the possibility of storing the lookup data in a database to enable indexed access to the pieces of data that you need. The performance gain would be much smaller, but it can still be very significant and might be sufficient for your purpose.
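
    As a rough, untested sketch of the in-memory approach (it assumes, hypothetically, that the lookup key sits in the first column of each lookup file; adjust the index if it lives elsewhere):

    # Preload each lookup file once into a hash of records keyed on column 1.
    my %lookup;    # $lookup{$file}{$key} = arrayref of that record's columns
    for my $lkp_file_name (qw(positionfile.delim accountfile.delim
                              securityfile.delim locationfile.delim)) {
        open my $fh, '<', "$indir/$lkp_file_name"
            or die "Cannot open '$indir/$lkp_file_name': $!";
        while (my $line = <$fh>) {
            chomp $line;
            my @cols = split /\|/, $line;
            $lookup{$lkp_file_name}{ $cols[0] } = \@cols;
        }
        close $fh;
    }

    # Then, inside the main loop, the shell-out to grep becomes a hash lookup:
    # my $record = $lookup{$lkp_file_name}{$key_string};
    # $final_data = $record ? $record->[$data_location - 1] : '';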

      The data files are large. Initially I did consider storing them in memory, but discarded the option due to the large size of the files. The account file is 182777579 bytes; the size varies daily but remains more or less the same. Currently it holds 62394 records.

        ACCT NUMBER read main file, second column in main file is primary key for looking up positionfile.delim.

        If the value is held in the 6th column, which column in positionfile.delim is the primary key? Is it column 1?

        Which columns in the main file are these held in? They can't all be the second column, or am I missing something?

        ACCT NUMBER|positionfile.delim|2|6
        PO TYPE|positionfile.delim|2|3
        LOC CODE|positionfile.delim|2|47
        poj
        The data files are large. Initially I did consider storing them in memory, but discarded the option due to the large size of the files. The account file is 182777579 bytes; the size varies daily but remains more or less the same. Currently it holds 62394 records.
        This is indeed relatively large (about 180 MB, i.e. roughly 3 KB per record), but most probably small enough to fit into memory on a decently modern computer. That's what I would try anyway, especially if you can decide to store in memory only the part of these files that is useful for your process.
Re: Perl Script performance issue
by GrandFather (Saint) on Dec 15, 2015 at 20:11 UTC

    This is a job for a database. In fact, very likely generating a database on the fly from your existing data then querying the database will be quicker and easier to maintain than trying to juggle a plethora of files.

    See Databases made easy to get your eye in with databases. Databases aren't as hard to use for this sort of job as you probably imagine and are much easier than trying to roll your own.

    Premature optimization is the root of all job security

      The problem description was kind of sounding like a database type structure. I haven't used this myself, but I believe that someone could use DBD::CSV to run SQL queries against the original files without needing to import into a database first.
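
      Something along these lines, perhaps (untested; the table name and the acct_number column are made up, and the exact attributes should be checked against the DBD::CSV documentation):

      use DBI;

      # Query the existing pipe-delimited files directly via DBD::CSV.
      my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
          f_dir        => $indir,    # directory holding the .delim files
          csv_sep_char => '|',       # pipe-delimited data
          RaiseError   => 1,
      }) or die $DBI::errstr;

      # Map a table name onto one of the files (column names normally come
      # from the first line of the file, or can be declared explicitly).
      $dbh->{csv_tables}{positions} = { f_file => 'positionfile.delim' };

      my $sth = $dbh->prepare('SELECT * FROM positions WHERE acct_number = ?');
      $sth->execute($key_string);
      my $row = $sth->fetchrow_arrayref;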

        This makes things fairly easy to code, at least for relatively simple cases, but I very much doubt it would solve the performance problem.

        For me, the real solution is using either a hash structure in memory (super fast if feasible) or an actual database such as MySQL or MariaDB (or possibly even SQLite, but that might be a bit more complicated with several files) with full support for indexed data access enabling fast processing.
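
        For example, loading one lookup file into an indexed SQLite table could look roughly like this (untested sketch; the table and column names are made up):

        use DBI;

        # Load a pipe-delimited lookup file into an indexed SQLite table so
        # that each lookup becomes a fast keyed query instead of a file scan.
        my $dbh = DBI->connect('dbi:SQLite:dbname=extract.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE IF NOT EXISTS positions (acct_number TEXT, line TEXT)');
        my $ins = $dbh->prepare('INSERT INTO positions (acct_number, line) VALUES (?, ?)');

        open my $fh, '<', "$indir/positionfile.delim" or die $!;
        while (my $line = <$fh>) {
            chomp $line;
            my @cols = split /\|/, $line;
            $ins->execute($cols[0], $line);    # assumes the key is in column 1
        }
        close $fh;

        $dbh->do('CREATE INDEX IF NOT EXISTS idx_positions_acct ON positions (acct_number)');
        $dbh->commit;

        # Keyed lookup later on:
        my $row = $dbh->selectrow_arrayref(
            'SELECT line FROM positions WHERE acct_number = ?', undef, $key_string);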

Re: Perl Script performance issue
by poj (Abbot) on Dec 15, 2015 at 12:51 UTC
    Please suggest how to improve it or any better design

    What size are these files, approximately? How many records, and how long is each record? How long does the program currently take to execute?

    poj