tannx has asked for the wisdom of the Perl Monks concerning the following question:

How can I make this script faster? Right now it runs for around 15 minutes. The file is 500MB and contains 30 columns and a million rows. Columns are separated by tabs. The data is text and numbers quoted with "".

#!/usr/local/bin/perl -w
#
$fn1 = '/in.CSV';
open (INST,"$fn1");
open (ABI,">/out.ins");
while (<INST>) {
    s/\õ/\ä/g;
    s/\"-/\"Ä/g;
    s/\--/\-Ä/g;
    s/\ - /\*-*/g;
    s/\ -/\ Ä/g;
    s/\*-*/\ - /g;
    s/\"_/\"Ü/g;
    s/\ _/\ Ü/g;
    s/\__/\_Ü/g;
    s/\³/\ü/g;
    s/\§/\õ/g;
    chomp;
    chop;
    @array = ' ';
    @array = split(/\t/);
    if ($array[0] eq "\"Branchno\"") { next; }
    if ($array[0] eq "\"\"" ) {next;}
    $result = join ("|",@array)."|";
    $result =~ s/\"//g;
    print ABI $result,"\n";
}
close (INST);
close (ABI);

Replies are listed 'Best First'.
Re: script optimization
by MidLifeXis (Monsignor) on Dec 15, 2011 at 14:15 UTC

    A couple of comments first:

    • Are you converting character encodings with your list of s/// statements? There are better ways of doing that if you are. On the other hand, if you already have some form of mangled data, this may be the best that can be done.
    • Your formatting makes the block structure of your code unclear.
    • You are processing what looks like CSV data by hand. Take a look at Text::CSV and other modules that already handle this type of data much better.
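    A minimal sketch of what the Text::CSV approach could look like, based on the description of the input (tab-separated, fields quoted with ""); the file names are the ones from the original post:

```perl
use strict;
use warnings;
use Text::CSV;    # or Text::CSV_XS for speed

# Tab-separated input, fields quoted with ""
my $csv = Text::CSV->new({
    sep_char  => "\t",
    binary    => 1,
    auto_diag => 1,
});

open my $in,  '<', '/in.CSV'  or die "in.CSV: $!";
open my $out, '>', '/out.ins' or die "out.ins: $!";

while (my $row = $csv->getline($in)) {
    # getline() already strips the quotes, so no s/"//g pass is needed
    next if $row->[0] eq 'Branchno' or $row->[0] eq '';
    print $out join('|', @$row), "|\n";
}
close $in;
close $out;
```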

    #!/usr/local/bin/perl -w
    #
    $fn1 = '/in.CSV';

    # Useless stringification, and check your errors
    # open (INST,"$fn1")
    open(INST, '<', $fn1)
        or die "Unable to open '$fn1' for reading: $!";

    # Preference, but possibly a better habit
    # consider: open(ABI, ">$foo") where $foo
    #           contains ">blah".
    # Also - check your errors
    # open (ABI,">/out.ins");
    open(ABI, ">", "/out.ins")
        or die "Unable to open /out.ins: $!";

    while (<INST>) {
        # Moved from below - fail fast [1]
        next if /^"Branchno"\t/;
        next if /^""\t/;

        # Not necessary [3]
        # chomp;

        # Fishy - does this remove the last \t? [2]
        # chop;

        s/\õ/\ä/g;
        s/\"-/\"Ä/g;
        s/\--/\-Ä/g;
        s/\ - /\*-*/g;
        s/\ -/\ Ä/g;
        s/\*-*/\ - /g;
        s/\"_/\"Ü/g;
        s/\ _/\ Ü/g;
        s/\__/\_Ü/g;
        s/\³/\ü/g;
        s/\§/\õ/g;

        # @array = ' ';   # Not necessary, useless

        # no longer necessary [1,2]
        # @array = split(/\t/);

        # Moved to top of loop - fail fast [1]
        # if ($array[0] eq "\"Branchno\"") { next; }
        # if ($array[0] eq "\"\"" ) {next;}

        # No longer necessary, assuming chop above removed \t [2]
        # $result = join ("|",@array)."|";

        # Replace split / join with another s/// [2]
        s/\t/|/g;

        # Work on $_, no longer need $result
        # $result =~ s/\"//g;
        # print ABI $result,"\n";
        s/"//g;

        # newline not necessary since chomp removed [3]
        print ABI $_;
    }
    close (INST);
    close (ABI);

    1. If you fail fast, you can avoid doing the s/// on the discarded lines.
    2. If "\t" was removed by the chop, this will handle it as well
    3. If you don't chomp, you don't need to print the newline, as it is still there

    There are also some tr/// uses that could possibly make this faster (see trizen's post above). However given the type of input data I am assuming from your code, this is a very fragile solution.
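    For reference, the two remaining single-character passes from the rewrite above can be expressed as tr/// instead of s///; a small self-contained sketch (the sample line is made up for illustration):

```perl
use strict;
use warnings;

my $line = qq{"Branchno"\t"123"\t"text"};

$line =~ tr/\t/|/;   # every tab becomes a pipe (replaces s/\t/|/g)
$line =~ tr/"//d;    # delete all double quotes (replaces s/"//g)

print "$line\n";     # Branchno|123|text
```

tr/// does no pattern matching at all, which is why it tends to beat an equivalent s///g on fixed single-character work.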

    --MidLifeXis

Re: script optimization
by BrowserUk (Patriarch) on Dec 15, 2011 at 13:42 UTC

    A few lines of sample input would allow us to check that changes don't screw things up.



Re: script optimization
by marto (Cardinal) on Dec 15, 2011 at 13:46 UTC
Re: script optimization
by Anonymous Monk on Dec 15, 2011 at 13:45 UTC
Re: script optimization
by trizen (Hermit) on Dec 15, 2011 at 14:01 UTC
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $fn1 = '/in.CSV';
    open INST, '<', $fn1 or die $!;
    open ABI, '>', '/out.ins' or die $!;

    while (defined($_ = <INST>)) {
        next if substr($_, 0, 3) eq qq{""\t};
        next if substr($_, 0, 11) eq qq{"Branchno"\t};
        tr/õ³§/äüõ/;
        s/(["-])-/$1Ä/g;
        s/ - /*-*/g;
        s/ -/ Ä/g;
        s/\*-\*/ - /g;
        s/([" _])_/$1Ü/g;
        chomp $_;
        chop $_;
        tr/"//d;
        tr/\t/|/;
        print ABI "${_}|\n";
    }
    close INST;
    close ABI;
Re: script optimization
by thargas (Deacon) on Dec 15, 2011 at 13:53 UTC

    Your script isn't doing much, so I'd guess you have a lot of data or a slow machine to run it on.

    Please provide:

    • sample data
    • expected output for that data
    • size of input file (bytes and records)

Re: script optimization
by Tux (Canon) on Dec 15, 2011 at 14:19 UTC

    Your loop does chomp AND chop. Why? Is there a trailing character that needs to be removed?

    Your loop initializes @array twice in every iteration. That takes unneeded time (you asked for speedups).

    You escape too many characters that do not need escaping.

    You can combine single-character replacements into a single tr/// call:

    use strict;
    use warnings;

    my ($fi, $fo) = ("/in.CSV", "/out.ins");
    open my $hi, "<", $fi or die "$fi: $!\n";
    open my $ho, ">", $fo or die "$fo: $!\n";

    while (<$hi>) {
        chomp;
        chop;   # <-- is this really needed?

        tr/õ³§/äüõ/;
        s/ - /*-*/g;
        s/\*-*/ - /g;   # <-- is this really what you want?
        s/([" _])-/$1Ä/g;
        s/([" _])_/$1Ü/g;

        my @array = split /\t/ => $_, -1;
        $array[0] =~ m/^"(?:Branchno)?"$/ and next;

        (my $result = join "|" => @array, "") =~ tr/"//d;
        print $ho $result, "\n";
    }
    close $hi;
    close $ho;

    Enjoy, Have FUN! H.Merijn
      my ($fi, $fo) = ("/in.CSV", "/out.ins");
      open my $hi, "<", $fi or die "$fi: $!\n";
      open my $ho, ">", $fo or die "$fo: $!\n";

      just my 2 cents... apart from the chomp/chop issue and the other things already said, I can't see the point of creating two extra variables here. Why not simply do this?

      open my $hi, "<", "/in.CSV" or die $!; open my $ho, ">", "/out.ins" or die $!;

      You could also do something with this:

      if ($array[0] eq "\"Branchno\"") { next; }
      if ($array[0] eq "\"\"" ) {next;}

      You want to discard these lines? Then do it as early as you can: put both checks before the substitution lines. In a very big file this can make a difference.

      while (<INST>) {
          next if /^\"(Branchno|)\"/;
          s/...
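      A self-contained sketch of that fail-fast ordering (the sample lines here are made up for illustration):

```perl
use strict;
use warnings;

my @lines = (
    qq{"Branchno"\t"Name"},   # header line  - skip before doing any work
    qq{""\t"empty"},          # empty key    - skip as well
    qq{"001"\t"Tallinn"},     # real data
);

my @kept;
for (@lines) {
    # Discard unwanted lines first, so the substitutions
    # below never run on them.
    next if /^"(?:Branchno)?"\t/;
    tr/\t/|/;
    tr/"//d;
    push @kept, $_;
}
print "$_\n" for @kept;   # 001|Tallinn
```

On a million-row file, skipping the substitution work for discarded lines is exactly the kind of saving the post above is describing.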

        Because the die message does not include the file name.

        Another reason to do so, is that when you would do it really neat, you'd also check the close calls, and you'd still have the filenames ready for the diagnostics if those fail.

        A third reason for this would be that it is now easy to rewrite the script to take arguments from the command line and replace the default names.


        Enjoy, Have FUN! H.Merijn
Re: script optimization
by mctaylor (Novice) on Dec 16, 2011 at 21:55 UTC
    In addition to the other suggestions, I would add that you want to compile the regexes only once, since they appear to be unchanging, at least during the execution of the script. This can be done with the /o modifier, which tells Perl to compile a regex only once, and potentially with the qr// "quote regex" operator (perlop#Regexp-Quote-Like-Operators).