Tara has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have just registered to the site and this is my first post. I am a Perl newbie. I recently wrote a script to create an extract out of an existing extract. The requirement is to pick up a few columns as specified by the client. It sounds simple, but here is the twist... we have to look up more than one extract to create the final file. For example: the account file contains 100 columns; we get a few of them from the main account file itself. Then, for the corresponding account's client, we get the client details from another file. In this way we have to look up 4 to 5 different files for the complete information. The design we suggested: have a configuration file with the field information. For each field it specifies the file name, which column is to be extracted from it, whether this is the main file or a lookup file, and, if a lookup file, the lookup key.
ACCT NUMBER|positionfile.delim|2|6
ACCT SHORT NAME|accountfile.delim|6|6
PO TYPE|positionfile.delim|2|3
CUSIP|securityfile.delim|8|5
LOC CODE|positionfile.delim|2|47
LOC NAME|locationfile.delim|47|4
How to read it: at the start it is specified that the main file is the account file. Start reading that file line by line. Then, for ACCT NUMBER, read the main file; the second column in the main file is the primary key for looking up positionfile.delim. Grep for that key and extract the 6th column. Similarly for LOC NAME: get the 47th column from the main file, use that value as the key in locationfile.delim and extract the 4th column. And so on... The main problem is that, although I have written working code for this, the speed is lethargic. Please suggest how to improve it, or any better design. The delimiter is a pipe, which is provided from another config file so that various types of delimited files can be supported.

###Reading layout
while (<$CONFIG>) {
    chomp;
    my @list = split(/$hash_ref->{DELIMITER}/);
    $file_list = $list[1];
    $file_list =~ s/CCYYMMDD/$date/g;
    push @{$look_info{$list[0]}}, $file_list . "-" . $list[2] . "-" . $list[3];
    push(@header, $list[0]);
}

###reading data file
while (<$SRC_FILE>) {
    @data = split(/$hash_ref->{DELIMITER}/, $_);
    while ((my $column) = each %look_info) {
        foreach my $pos_info (@{$look_info{$column}}) {
            @look_here = split(/\-/, $pos_info);
            #print "look here : @look_here \n";
            $lkp_file_name = $look_here[0];
            $key_location  = $look_here[1];
            $data_location = $look_here[2];
            $key_string    = $data[$key_location - 1];
            chomp $key_string;
            print "key_string : $key_string \n";
            my $key_pattern = "|" . $key_string . "|";
            #print "key string is : $key_pattern \n";
            $final_data = `grep "$key_pattern" "${indir}/${lkp_file_name}" | cut -d"|" -f"$data_location"`;
            #print "data value is : $final_data \n";
            chomp $final_data;
            push @{$final_record{$column}}, $final_data . "|";
        }
    }
}

###printing final file
my $i = 0;
my @currdata;
$headerstring = join('|', @header);
print $OUTPUT_FILE $headerstring . "\n";
while (1) {
    foreach my $head (@header) {
        chomp $head;
        $head =~ s/\\//g;
        @currdata = @{$final_record{$head}};
        print $OUTPUT_FILE "$currdata[$i]";
    }
    print $OUTPUT_FILE "\n";
    $i++;
    if ($i > $#currdata) {
        last;
    }
}

Replies are listed 'Best First'.
Re: Perl Script performance issue
by Athanasius (Archbishop) on Dec 15, 2015 at 12:49 UTC

    Hello Tara, and welcome to the Monastery!

    One obvious slow-down is the shell-out call to system grep:

    $final_data = `grep "$key_pattern" "${indir}/${lkp_file_name}" | cut -d"|" -f"$data_location"`;

    which is expensive in itself and rendered especially so by being nested inside three loops. Try replacing it with a call to Perl’s inbuilt grep and see if that produces a significant speedup.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Perl's in-built grep works on list elements. How do I make it behave the way I intend? Do I have to store the lines in an array or something? Confused... please help.

        Hello again Tara,

        You have the line:

        $final_data = `grep "$key_pattern" "${indir}/${lkp_file_name}" | cut -d"|" -f"$data_location"`;

        Here is one way to write this in pure Perl (untested):

        my $file = $indir . '/' . $lkp_file_name;

        open(my $fh, '<', $file) or die "Cannot open file '$file' for reading: $!";
        my @matched_lines = grep { /\Q$key_pattern/ } <$fh>;
        close $fh or die "Cannot close file '$file': $!";

        my @fields;
        for my $line (@matched_lines) {
            my $field = (split /\|/, $line)[$data_location - 1];
            chomp $field;
            push @fields, $field;
        }

        $final_data = join "\n", @fields;

        (There should also be some error handling to deal with the possibility that a matched line contains an insufficient number of fields.)
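
        For instance (again untested), the extraction loop above could guard against short lines roughly like this; the warning text is only illustrative:

        for my $line (@matched_lines) {
            my @cols = split /\|/, $line;
            # Guard against lines with fewer fields than expected:
            if (@cols < $data_location) {
                warn "Skipping short line in '$file': $line";
                next;
            }
            my $field = $cols[$data_location - 1];
            chomp $field;
            push @fields, $field;
        }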

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Perl Script performance issue
by Cristoforo (Curate) on Dec 15, 2015 at 15:12 UTC
    A comment on your lines that use split(/$hash_ref->{DELIMITER}/...). If the delimiter is a pipe, |, then your split becomes split /|/, .... That will not work as you want, because the split pattern '|' in this case says to split on 'nothing OR nothing' (equivalent to split //, ...).

    To correct this, you should use the \Q escape sequence. That tells Perl to treat the pipe as a regular character rather than as the OR metacharacter in the regular expression.

    I saved a file of what are called the 'dirty dozen' of metacharacters. The pipe is one of the dirty dozen. They are:

    \ | ( ) [ { ^ $ * + ? .

    They all need escaping if they are to be treated as a 'regular' character in a regular expression (not as a metacharacter). So, your split should look like
    split(/\Q$hash_ref->{DELIMITER}\E/

    (There is also the built-in quotemeta function.)
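
    A quick, untested illustration of the difference, using a literal pipe as the delimiter:

    my $delim = '|';

    # Without \Q the pipe acts as regex alternation ('nothing OR nothing'),
    # so the string is split between every character:
    my @wrong = split /$delim/, 'a|b|c';        # ('a', '|', 'b', '|', 'c')

    # With \Q...\E the pipe is taken literally:
    my @right = split /\Q$delim\E/, 'a|b|c';    # ('a', 'b', 'c')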

      See also Quote and Quote-like Operators (about two-thirds of the way through the section) for more info on  \Q and friends. (Update: There are also a few examples of use in perlretut Part 2, in the section "More on characters, strings, and character classes".)


      Give a man a fish:  <%-{-{-{-<

        Thanks for pointing that out. Since we are passing the delimiter from a config file, we pass it as \|, which takes care of the escaping automatically.

Re: Perl Script performance issue
by Laurent_R (Canon) on Dec 15, 2015 at 13:49 UTC
    If the content of the lookup file(s) is not too large for the available memory, then the obvious cure would be to read their content only once, loading it into memory in an appropriate data structure (probably a hash table of some form or other), and then to read the source file line by line and look up the additional data pieces in memory.

    Such a process can often be several orders of magnitude faster (i.e. hundreds or even thousands of times faster, perhaps even more). But this is feasible only if the lookup data (or the part of it that you actually use) is not too large to fit in memory.

    Therefore the question asked by poj about the size of your data and about the current timings is really crucial.

    If the data is too large to fit into memory, there is still the possibility of storing the lookup data in a database to enable indexed access to the pieces of data that you need. The performance gain would be much smaller, but it can still be very significant and might be sufficient for your purpose.
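
    As a rough, untested sketch of the in-memory approach (it assumes, hypothetically, that the lookup key sits in the first column of each lookup file; adjust the index if it lives elsewhere):

    # Preload each lookup file once into a hash of records keyed on column 1.
    my %lookup;    # $lookup{$file}{$key} = arrayref of that record's columns
    for my $lkp_file_name (qw(positionfile.delim accountfile.delim
                              securityfile.delim locationfile.delim)) {
        open my $fh, '<', "$indir/$lkp_file_name"
            or die "Cannot open '$indir/$lkp_file_name': $!";
        while (my $line = <$fh>) {
            chomp $line;
            my @cols = split /\|/, $line;
            $lookup{$lkp_file_name}{ $cols[0] } = \@cols;
        }
        close $fh;
    }

    # Then, inside the main loop, the shell-out to grep becomes a hash lookup:
    # my $record = $lookup{$lkp_file_name}{$key_string};
    # $final_data = $record ? $record->[$data_location - 1] : '';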

      The data files are large. Initially I did consider storing them in memory, but discarded the option due to the large size of the files. The account file is 182777579 bytes; the size varies daily but remains more or less the same. Currently it holds 62394 records.

        ACCT NUMBER read main file, second column in main file is primary key for looking up positionfile.delim.

        If the value is held in the 6th column, which column in positionfile.delim is the primary key? Is it column 1?

        Which columns in the main file are these held in? They can't all be the second column, or am I missing something?

        ACCT NUMBER|positionfile.delim|2|6
        PO TYPE|positionfile.delim|2|3
        LOC CODE|positionfile.delim|2|47
        poj
        The data files are large. Initially I did consider storing them in memory, but discarded the option due to the large size of the files. The account file is 182777579 bytes; the size varies daily but remains more or less the same. Currently it holds 62394 records.
        This is indeed relatively large (about 180 MB, i.e. roughly 3 KB per record), but most probably small enough to fit into memory on a decently modern computer. That's what I would try anyway, especially if you can decide to store in memory only the part of these files that is useful for your process.
Re: Perl Script performance issue
by GrandFather (Saint) on Dec 15, 2015 at 20:11 UTC

    This is a job for a database. In fact, very likely generating a database on the fly from your existing data then querying the database will be quicker and easier to maintain than trying to juggle a plethora of files.

    See Databases made easy to get your eye in with databases. Databases aren't as hard to use for this sort of job as you probably imagine and are much easier than trying to roll your own.

    Premature optimization is the root of all job security

      The problem description was kind of sounding like a database type structure. I haven't used this myself, but I believe that someone could use DBD::CSV to run SQL queries against the original files without needing to import into a database first.
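
      Something along these lines, perhaps (untested; the table name and the acct_number column are made up, and the exact attributes should be checked against the DBD::CSV documentation):

      use DBI;

      # Query the existing pipe-delimited files directly via DBD::CSV.
      my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
          f_dir        => $indir,    # directory holding the .delim files
          csv_sep_char => '|',       # pipe-delimited data
          RaiseError   => 1,
      }) or die $DBI::errstr;

      # Map a table name onto one of the files (column names normally come
      # from the first line of the file, or can be declared explicitly).
      $dbh->{csv_tables}{positions} = { f_file => 'positionfile.delim' };

      my $sth = $dbh->prepare('SELECT * FROM positions WHERE acct_number = ?');
      $sth->execute($key_string);
      my $row = $sth->fetchrow_arrayref;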

        This makes things fairly easy to code, at least for relatively simple cases, but I very much doubt it would solve the performance problem.

        For me, the real solution is using either a hash structure in memory (super fast if feasible) or an actual database such as MySQL or MariaDB (or possibly even SQLite, but that might be a bit more complicated with several files) with full support for indexed data access enabling fast processing.
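
        For example, loading one lookup file into an indexed SQLite table could look roughly like this (untested sketch; the table and column names are made up):

        use DBI;

        # Load a pipe-delimited lookup file into an indexed SQLite table so
        # that each lookup becomes a fast keyed query instead of a file scan.
        my $dbh = DBI->connect('dbi:SQLite:dbname=extract.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE IF NOT EXISTS positions (acct_number TEXT, line TEXT)');
        my $ins = $dbh->prepare('INSERT INTO positions (acct_number, line) VALUES (?, ?)');

        open my $fh, '<', "$indir/positionfile.delim" or die $!;
        while (my $line = <$fh>) {
            chomp $line;
            my @cols = split /\|/, $line;
            $ins->execute($cols[0], $line);    # assumes the key is in column 1
        }
        close $fh;

        $dbh->do('CREATE INDEX IF NOT EXISTS idx_positions_acct ON positions (acct_number)');
        $dbh->commit;

        # Keyed lookup later on:
        my $row = $dbh->selectrow_arrayref(
            'SELECT line FROM positions WHERE acct_number = ?', undef, $key_string);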

Re: Perl Script performance issue
by poj (Abbot) on Dec 15, 2015 at 12:51 UTC
    Please suggest how to improve it or any better design

    What size are these files, approximately? How many records, and how long is each record? How long does the program currently take to execute?

    poj