tannx has asked for the wisdom of the Perl Monks concerning the following question:

How can I make this script faster? Right now it runs for around 15 minutes. The file is 500MB and contains 30 columns and a million rows. Columns are separated by tabs. The data is text and numbers quoted with "".

#!/usr/local/bin/perl -w
#
$fn1 = '/in.CSV';
open (INST,"$fn1");
open (ABI,">/out.ins");
while (<INST>) {
    s/\õ/\ä/g;
    s/\"-/\"Ä/g;
    s/\--/\-Ä/g;
    s/\ - /\*-*/g;
    s/\ -/\ Ä/g;
    s/\*-*/\ - /g;
    s/\"_/\"Ü/g;
    s/\ _/\ Ü/g;
    s/\__/\_Ü/g;
    s/\³/\ü/g;
    s/\§/\õ/g;
    chomp;
    chop;
    @array = ' ';
    @array = split(/\t/);
    if ($array[0] eq "\"Branchno\"") { next; }
    if ($array[0] eq "\"\"" ) {next;}
    $result = join ("|",@array)."|";
    $result =~ s/\"//g;
    print ABI $result,"\n";
}
close (INST);
close (ABI);

Replies are listed 'Best First'.
Re: script optimization
by MidLifeXis (Monsignor) on Dec 15, 2011 at 14:15 UTC

    A couple of comments first:

    • Are you converting character encodings with your list of s/// statements? There are better ways of doing that if you are. On the other hand, if you already have some form of mangled data, this may be the best that can be done.
    • Your formatting makes the block structure of your code unclear.
    • You are processing what looks like CSV data by hand. Take a look at Text::CSV and other modules that already handle this type of data much better.
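    A minimal sketch of what the Text::CSV approach could look like, based on the description of the input (tab-separated, fields quoted with ""); the file names are the ones from the original post:

```perl
use strict;
use warnings;
use Text::CSV;    # or Text::CSV_XS for speed

# Tab-separated input, fields quoted with ""
my $csv = Text::CSV->new({
    sep_char  => "\t",
    binary    => 1,
    auto_diag => 1,
});

open my $in,  '<', '/in.CSV'  or die "in.CSV: $!";
open my $out, '>', '/out.ins' or die "out.ins: $!";

while (my $row = $csv->getline($in)) {
    # getline() already strips the quotes, so no s/"//g pass is needed
    next if $row->[0] eq 'Branchno' or $row->[0] eq '';
    print $out join('|', @$row), "|\n";
}
close $in;
close $out;
```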

    #!/usr/local/bin/perl -w
    #
    $fn1 = '/in.CSV';

    # Useless stringification, and check your errors
    # open (INST,"$fn1")
    open(INST, '<', $fn1)
        or die "Unable to open '$fn1' for reading: $!";

    # Preference, but possibly a better habit
    # consider: open(ABI, ">$foo") where $foo
    #           contains ">blah".
    # Also - check your errors
    # open (ABI,">/out.ins");
    open(ABI, ">", "/out.ins")
        or die "Unable to open /out.ins: $!";

    while (<INST>) {
        # Moved from below - fail fast [1]
        next if /^"Branchno"\t/;
        next if /^""\t/;

        # Not necessary [3]
        # chomp;

        # Fishy - does this remove the last \t? [2]
        # chop;

        s/\õ/\ä/g;
        s/\"-/\"Ä/g;
        s/\--/\-Ä/g;
        s/\ - /\*-*/g;
        s/\ -/\ Ä/g;
        s/\*-*/\ - /g;
        s/\"_/\"Ü/g;
        s/\ _/\ Ü/g;
        s/\__/\_Ü/g;
        s/\³/\ü/g;
        s/\§/\õ/g;

        # @array = ' ';   # Not necessary, useless

        # no longer necessary [1,2]
        # @array = split(/\t/);

        # Moved to top of loop - fail fast [1]
        # if ($array[0] eq "\"Branchno\"") { next; }
        # if ($array[0] eq "\"\"" ) {next;}

        # No longer necessary, assuming chop above removed \t [2]
        # $result = join ("|",@array)."|";

        # Replace split / join with another s/// [2]
        s/\t/|/g;

        # Work on $_, no longer need $result
        # $result =~ s/\"//g;
        # print ABI $result,"\n";
        s/"//g;

        # newline not necessary since chomp removed [3]
        print ABI $_;
    }
    close (INST);
    close (ABI);

    1. If you fail fast, you can avoid doing the s/// on the discarded lines.
    2. If "\t" was removed by the chop, this will handle it as well
    3. If you don't chomp, you don't need to print the newline, as it is still there

    There are also some tr/// uses that could possibly make this faster (see trizen's post above). However given the type of input data I am assuming from your code, this is a very fragile solution.
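    For reference, the two remaining single-character passes from the rewrite above can be expressed as tr/// instead of s///; a small self-contained sketch (the sample line is made up for illustration):

```perl
use strict;
use warnings;

my $line = qq{"Branchno"\t"123"\t"text"};

$line =~ tr/\t/|/;   # every tab becomes a pipe (replaces s/\t/|/g)
$line =~ tr/"//d;    # delete all double quotes (replaces s/"//g)

print "$line\n";     # Branchno|123|text
```

tr/// does no pattern matching at all, which is why it tends to beat an equivalent s///g on fixed single-character work.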

    --MidLifeXis

Re: script optimization
by BrowserUk (Patriarch) on Dec 15, 2011 at 13:42 UTC

    A few lines of sample input would allow us to check that changes don't screw things up.



Re: script optimization
by marto (Cardinal) on Dec 15, 2011 at 13:46 UTC
Re: script optimization
by Anonymous Monk on Dec 15, 2011 at 13:45 UTC
Re: script optimization
by trizen (Hermit) on Dec 15, 2011 at 14:01 UTC
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $fn1 = '/in.CSV';
    open INST, '<', $fn1 or die $!;
    open ABI, '>', '/out.ins' or die $!;

    while (defined($_ = <INST>)) {
        next if substr($_, 0, 3) eq qq{""\t};
        next if substr($_, 0, 11) eq qq{"Branchno"\t};
        tr/õ³§/äüõ/;
        s/(["-])-/$1Ä/g;
        s/ - /*-*/g;
        s/ -/ Ä/g;
        s/\*-\*/ - /g;
        s/([" _])_/$1Ü/g;
        chomp $_;
        chop $_;
        tr/"//d;
        tr/\t/|/;
        print ABI "${_}|\n";
    }
    close INST;
    close ABI;
Re: script optimization
by thargas (Deacon) on Dec 15, 2011 at 13:53 UTC

    Your script isn't doing much, so I'd guess you have a lot of data or a slow machine to run it on.

    Please provide:

    • sample data
    • expected output for that data
    • size of input file (bytes and records)

Re: script optimization
by Tux (Canon) on Dec 15, 2011 at 14:19 UTC

    Your loop does chomp AND chop. Why? Is there a trailing character that needs to be removed?

    Your loop initializes @array twice in every iteration. That takes unneeded time (you asked for speedups).

    You escape too many characters that do not need escaping.

    You can combine single-character replacements into a single tr/// call:

    use strict;
    use warnings;

    my ($fi, $fo) = ("/in.CSV", "/out.ins");
    open my $hi, "<", $fi or die "$fi: $!\n";
    open my $ho, ">", $fo or die "$fo: $!\n";

    while (<$hi>) {
        chomp;
        chop;   # <-- is this really needed?

        tr/õ³§/äüõ/;
        s/ - /*-*/g;
        s/\*-*/ - /g;   # <-- is this really what you want?
        s/([" _])-/$1Ä/g;
        s/([" _])_/$1Ü/g;

        my @array = split /\t/ => $_, -1;
        $array[0] =~ m/^"(?:Branchno)?"$/ and next;

        (my $result = join "|" => @array, "") =~ tr/"//d;
        print $ho $result, "\n";
    }
    close $hi;
    close $ho;

    Enjoy, Have FUN! H.Merijn
      my ($fi, $fo) = ("/in.CSV", "/out.ins");
      open my $hi, "<", $fi or die "$fi: $!\n";
      open my $ho, ">", $fo or die "$fo: $!\n";

      just my 2 cents... apart from the chomp/chop issue and the other things already said, I can't see the point of creating two extra variables here. Why not simply do this?

      open my $hi, "<", "/in.CSV" or die $!; open my $ho, ">", "/out.ins" or die $!;

      You could also do something with this:

      if ($array[0] eq "\"Branchno\"") { next; }
      if ($array[0] eq "\"\"" ) {next;}

      You want to discard these lines? Then do it as early as you can: put both checks before the substitution lines. In a very big file this can make a difference.

      while (<INST>) {
          next if /^\"(Branchno|)\"/;
          s/...
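      A self-contained sketch of that fail-fast ordering (the sample lines here are made up for illustration):

```perl
use strict;
use warnings;

my @lines = (
    qq{"Branchno"\t"Name"},   # header line  - skip before doing any work
    qq{""\t"empty"},          # empty key    - skip as well
    qq{"001"\t"Tallinn"},     # real data
);

my @kept;
for (@lines) {
    # Discard unwanted lines first, so the substitutions
    # below never run on them.
    next if /^"(?:Branchno)?"\t/;
    tr/\t/|/;
    tr/"//d;
    push @kept, $_;
}
print "$_\n" for @kept;   # 001|Tallinn
```

On a million-row file, skipping the substitution work for discarded lines is exactly the kind of saving the post above is describing.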

        Because the die message does not include the file name.

        Another reason to do so, is that when you would do it really neat, you'd also check the close calls, and you'd still have the filenames ready for the diagnostics if those fail.

        A third reason for this would be that it is now easy to rewrite the script to take arguments from the command line and replace the default names.


        Enjoy, Have FUN! H.Merijn
Re: script optimization
by mctaylor (Novice) on Dec 16, 2011 at 21:55 UTC
    In addition to the other suggestions, I would add that you want to compile the regexes only once, since they appear to be unchanging, at least during the execution of the script. This can be done with the /o modifier, which tells Perl to compile a regex only once, and potentially with the qr// "quote regex" operator (perlop#Regexp-Quote-Like-Operators).