More efficient munging if infile very large

ybiC has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: More efficient munging if infile very large by synapse0 (Pilgrim) on Jul 26, 2001 at 15:40 UTC
Well, I think you're way overdoing it here.. if you're worried about efficiency and resources, you probably shouldn't be reading in the array at all.. there's no need. And, since you've already determined that $uclc is going to be either one of 'uc' or 'lc', you don't need the extra check in there.. Here's how I woulda done it.. #!/usr/bin/perl -w use strict; my $uclc = shift or Usage(); my $infile = shift or Usage(); my $outfile = shift or Usage(); Usage() unless ($uclc eq 'lc' \|\| $uclc eq 'uc'); open (IN, "< $infile") or die "Error opening $infile for read: $!"; open (OUT, "> $outfile") or die "Error opening $outfile for write: $!" +; foreach (<IN>) { # just print directly on reading, easier on the mem.. # quicker too, and we know that $uclc is going to # be either lc or uc at this point. print OUT $uclc eq 'lc' ? lc() : uc(); } close IN or die "Error closing $infile after read: $!"; close OUT or die "Error closing $outfile after write: $ +!"; ###################################################################### +#### sub Usage { die "\n Usage: uclc.pl (lc\|uc) infile outfile\n"; } ###################################################################### +#### [download] -Syn0 Update: didn't look too closely at opening code before, but fixed same bug after Hofmator mentioned it. Also, his elimination of the string equality check in the loop was a good call. ++	[reply] [d/l]
Re: Re: More efficient munging if infile very large by Hofmator (Curate) on Jul 26, 2001 at 16:06 UTC
I would implement the algorithm in nearly the same way, but I think I would implement it as a filter-like command which you can easily call in pipe sequences. So read-in with the magic `<>` (this allows for STDIN or infiles) and print to STDOUT (which can be easily redirected into outfile from the command line). This has the additional advantage of shortening the program a bit: `#!/usr/bin/perl -w use strict; my $uclc = shift or Usage(); # fixed small bug Usage() unless ($uclc eq 'lc' or $uclc eq 'uc'); # to speed up the loop (no string compare on every line) my $lc = $uclc eq 'lc'; while (<>) { print $lc ? lc : uc; } sub Usage { die << EO_USE; Usage: uclc.pl (lc\|uc) [infile] reads from STDIN if no infile given EO_USE }` [download] Update: Oops, thanks tilly, fixed: while instead of foreach. -- Hofmator	[reply] [d/l] [select]
Re (tilly) 3: More efficient munging if infile very large by tilly (Archbishop) on Jul 26, 2001 at 16:16 UTC
Change the foreach to a while. The difference is that foreach puts the file read into list context so the entire file has to be held in memory. The while prints as it reads and so will work much better on large files. Otherwise I like your changes.	[reply]
(code) Re: (2) More efficient munging... (improved lc() uc() snippet) by ybiC (Prior) on Jul 26, 2001 at 18:16 UTC
Thanks, synapse0 (and others who replied also) for the suggestions. 8^) I'm not familiar with the `eq 'lc' ? lc() : uc();` syntax from `print OUT $uclc eq 'lc' ? lc() : uc();`. Would you be so kind as to explain how that works? Updated: 2001-07-29 07:40 CDT Here's what I ended up using. Thanks again, and ++ to all who offered suggestions in nodes and CB! cheers, Don striving toward Perl Adept (it's pronounced "why-bick") #!/usr/bin/perl -w use strict; my $uclc = shift or Usage(); Usage() unless ($uclc eq 'uc' or 'lc'); while(<>) { print $uclc eq 'uc' ? uc() : lc(); } sub Usage { print "\n Usage: uclc.pl (uc\|lc) < infile > outfile\n\n"; exit; } =head1 NAME uclc.pl =head1 SYNOPSIS uclc.pl (uc\|lc) < infile > outfile" Convert alpha characters in text file to uppercase (or lowercase). =head1 UPDATED 2001-07-26 16:30 CDT Simplify code for (in\|out)put of STD(IN\|OUT). Remove unnecessary "or Usage()" from first two input lines. Replace if/else with ternary "?:". my $munged; if ($uclc eq 'uc') { $munged = uc(); } else {$munged = lc(); } print OUT $munged; While instead of for so file read line-by-line not slurp entire fil +e. Post efficiency SoPW to PerlMonks 2001-07-25 Initial working code. =head1 TODOS None that I know of. =head1 TESTED ActivePerl 5.61 on Win2kPro =head1 AUTHOR ybiC =head1 CREDITS Thanks to synapse0, OeufMayo, crazyinsomniac, Masem, MZSanford, virtualsue, Hoffmater, tilly, clemburg, and ichimunki for suggestions. And to davorg for "Data Munging with Perl". Oh yeah, and to some guy named vroom. =cut [download]	[reply] [d/l] [select]
Re3: More efficient munging if infile very large (explain particular syntax?) by Hofmator (Curate) on Jul 26, 2001 at 18:35 UTC
from perlop: `Conditional Operator Ternary "?:" is the conditional operator, just as in C. It works mu +ch like an if-then-else. If the argument before the ? is true, the arg +ument before the : is returned, otherwise the argument after the : is returned. For example: printf "I have %d dog%s.\n", $n, ($n == 1) ? '' : "s";` [download] for the gory details read on there ... -- Hofmator	[reply] [d/l]
(ichimunki) re x 4: More efficient munging if infile very large (explain particular syntax?) by ichimunki (Priest) on Jul 26, 2001 at 18:31 UTC
see perlman:perlop and look for the section on Conditional Operator. It's basically an in-place if-then type of construct that evaluates the EXPR $uclc eq 'lc' and returns lc() if the EXPR is true or uc() if the EXPR is false.	[reply]
Re: More efficient munging if infile very large by OeufMayo (Curate) on Jul 26, 2001 at 15:54 UTC
Okay, here's my attempt at this one, it should be slightly more efficient than using arrays, but you really should benchmark that...: `#!/usr/bin/perl -w use strict; unless ($ARGV[0] eq '-lc' \|\| $ARGV[0] eq '-uc' ){ die "usage: lcuc [-uc \| -lc] file\n"; } my $op = substr(shift @ARGV, 1, 2); while(<>){ no strict 'refs'; print &{$op}($_) } sub lc{ lc shift } sub uc{ uc shift }` [download] TMTOWTDI Update (as suggested by tilly) I'll admit that this one is cleaner than the above code. :) `#!/usr/bin/perl -w use strict; my %ops = ( 'lc' => sub {lc shift }, 'uc' => sub {uc shift }, ); my $case = shift @ARGV; unless (exists $ops{$case}){ die "usage: lcuc [", join(' \| ', keys %ops), "] file\n"; } while(<>){ print $ops{$case}->($_); }` [download] <kbd>-- my $OeufMayo = new PerlMonger::Paris({http => 'paris.mongueurs.net'});</kbd>	[reply] [d/l] [select]
Re: More efficient munging if infile very large by clemburg (Curate) on Jul 26, 2001 at 16:27 UTC
I think your efficiency concern is about the wrong thing. Look at the results of this (call like your program but with one more arg that indicates iterations to perform): #!/usr/bin/perl -w use strict; use Benchmark; my $uclc = shift or Usage(); my $infile = shift or Usage(); my $outfile = shift or Usage(); my $iterations = shift \|\| 10; Usage() unless ($uclc eq 'lc' or 'uc'); timethese($iterations, { original => sub { setup(); original(); teardown(); }, Masem_1 => sub { setup(); Masem_1(); teardown(); }, Masem_2 => sub { setup(); Masem_2(); teardown(); }, clemburg => sub { setup(); clemburg(); teardown(); }, }); sub original { my @in = <IN>; my @munged; for(@in) { my $munged = lc() if ($uclc eq 'lc'); $munged = uc() if ($uclc eq 'uc'); push @munged, $munged; } print OUT for(@munged); } sub Masem_1 { my @in = <IN>; @in = map { $uclc eq 'lc' ? lc : uc } @in; print OUT for(@in); } sub Masem_2 { my @in = <IN>; @in = $uclc eq 'lc' ? map { lc } @in : map { uc } @in; print OUT for(@in); } sub clemburg { while (<IN>) { my $munged = lc() if ($uclc eq 'lc'); $munged = uc() if ($uclc eq 'uc'); print OUT $munged; } } sub setup { open (IN, "< $infile") or die "Error opening $infile for read: $!"; open (OUT, "> $outfile") or die "Error opening $outfile for write: $!"; } sub teardown { close IN or die "Error closing $infile after write: $!"; close OUT or die "Error closing $outfile after write: $!"; } ###################################################### sub Usage { die "\n Usage: uclc.pl (lc\|uc) infile outfile\n"; } ###################################################### [download] The results for 10 iterations on my machine are like this, using a 1MB text file with mixed case and some markup characters: `> perl benchmark.pl lc terms.por terms.try 10 Benchmark: timing 10 iterations of Masem_1, Masem_2, clemburg, origina +l... Masem_1: 14 wallclock secs (11.01 usr + 1.34 sys = 12.36 CPU) Masem_2: 16 wallclock secs ( 9.88 usr + 0.84 sys = 10.72 CPU) clemburg: 10 wallclock secs ( 5.11 usr + 0.75 sys = 5.86 CPU) original: 32 wallclock secs (10.52 usr + 0.98 sys = 11.50 CPU)` [download] Obviously, the approach you and Masem use is very resource-intensive, reading the whole file into memory. That could kill your process or cause worse things if you work with really large files. Christian Lemburg Brainbench MVP for Perl http://www.brainbench.com	[reply] [d/l] [select]
Re: More efficient munging if infile very large by Masem (Monsignor) on Jul 26, 2001 at 15:32 UTC
Why not: `@in = map { $uclc eq 'lc' ? lc : uc } @in;` [download] And avoid the extra variables althogether, and get $_ set for you all at once? Since you've already got $uclc checked for the right state, you don't need the second wasteful if. Or, even better (1 if, instead of N...) `@in = $uclc eq 'lc' ? map { lc } @in : map { uc } @in;` [download] ----------------------------------------------------- Dr. Michael K. Neylon - mneylon-pm@masemware.com \|\| "You've left the lens cap of your mind on again, Pinky" - The Brain	[reply] [d/l] [select]
Re: More efficient munging if infile very large by MZSanford (Curate) on Jul 26, 2001 at 15:34 UTC
I am unclear about the `$munged()` bit of code in your description. `&$munged()` would work, were `$munged` a code reference. As for the effeciency of your approach, i would run it all through Benchmark, for any other opinions would be but opinions. It maybe quicker to do something like `if ($uclc eq 'lc') { ... for loop with lc } elsif ($uclc eq 'uc') { ... for loop with uc } else { die "invalid arg"; }` [download] it may be slightly fatser, but check benchmark Thus spake the Master Programmer: "When you have learned to snatch the error code from the trap frame, it will be time for you to leave." -- The Tao of Programming	[reply] [d/l] [select]