iKnowNothing has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have written a simple script to read a tab-delimited file and write out a few selected columns to another tab-delimited file. Although my script is working OK, it seems to take more time than it should. It took 44 seconds to do its thing on a 4.22 MB file (2796 lines). Any insight would be greatly appreciated. Here's the code:
while (<INFILE>) {
    # get the current line and split it into its columns
    @Line = split /\s+/, $_;
    # print the selected columns to the output
    print OUTFILE join("\t", @Line[@ColumnNumbers]), "\n";
}
The @ColumnNumbers array has been defined previously, and would look something like: (1,10,32,69,200,291)

UPDATE: The problem turned out not to be related to the code above. Thanks for the input though. Turns out I was running another algorithm every time through the loop that I thought I was only running the first time.
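For anyone hitting something similar: a quick way to confirm where the time actually goes is the core Benchmark module. A rough sketch (the test line below is synthetic, not my real data):

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Synthetic stand-ins for one input line and the selected columns
my $line          = join "\t", map { "col$_" } 0 .. 300;
my @ColumnNumbers = (1, 10, 32, 69, 200, 291);

# Time just the split-and-slice step in isolation
timethese(10_000, {
    split_slice => sub {
        my @Line = split /\t/, $line;
        my $out  = join "\t", @Line[@ColumnNumbers];
    },
});
```

If this step is fast on its own, the slowdown is elsewhere in the loop.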

Retitled by davido from 'Why is this so slow?'.

Replies are listed 'Best First'.
Re: Optimizing slow restructuring of delimited files
by davido (Cardinal) on Jan 25, 2005 at 18:01 UTC

    You don't have any seriously slow code in that snippet. One comment, though: the snippet isn't doing what you stated your objective to be. Instead of splitting a tab-delimited file on tabs, it's splitting on any run of any kind of whitespace. I'm not sure that's what you intended.

    For example, if your input string contained:

    Hello world.\tThis is Dave

    Your snippet would create elements in @Line like this:

    @Line = ( 'Hello', 'world.', 'This', 'is', 'Dave' );

    You stated that the objective was to load @Line with:

    @Line = ( 'Hello world.', 'This is Dave' );
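    To illustrate the difference (a quick sketch; the sample string is made up):

```perl
use strict;
use warnings;

my $line = "Hello world.\tThis is Dave";

my @by_whitespace = split /\s+/, $line;  # splits inside the fields too
my @by_tab        = split /\t/,  $line;  # keeps each tab-delimited field whole

print scalar @by_whitespace, "\n";  # 5
print scalar @by_tab,        "\n";  # 2
```

    So if your fields can contain spaces, you want split /\t/ here.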

    Dave

Re: Optimizing slow restructuring of delimited files
by periapt (Hermit) on Jan 25, 2005 at 18:10 UTC
    You didn't give much to go on but my first hypothesis would be disk I/O. If the file size is small, you could try assigning the entire file to an array and parsing that. Something like
    @filetoread = <INFILE>;       # read in file all at once
    my $linestooutput = '';       # place to save output until the end
    foreach (@filetoread) {
        @Line = split /\s+/;      # split defaults to $_
        $linestooutput .= join("\t", @Line[@ColumnNumbers]) . "\n";
    }
    print OUTFILE $linestooutput; # write output

    # or even shorter
    @filetoread = <INFILE>;
    $linestooutput .= join("\t", (split /\s+/)[@ColumnNumbers]) . "\n"
        foreach @filetoread;
    print OUTFILE $linestooutput;
    I'm not sure about the speed impact of interpolated slices. I don't imagine that is the issue, but you could try something like this:
    while (<INFILE>) {
        # get the current line and split it into its columns
        @Line = split /\s+/, $_;
        # build the selected columns into the output line
        my $outline = '';
        $outline .= $Line[$_] . "\t" foreach @ColumnNumbers;
        print OUTFILE $outline, "\n";
    }


    PJ
    use strict; use warnings; use diagnostics;
Re: Optimizing slow restructuring of delimited files
by BrowserUk (Patriarch) on Jan 25, 2005 at 18:43 UTC

    Your program would probably run a little more quickly if you used lexical (my) variables--assuming you are not already doing so.

    You may also gain a little performance from avoiding the join: set $, = "\t"; and just print the slice.
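    Something like this untested sketch (the data and column numbers are purely illustrative):

```perl
use strict;
use warnings;

$, = "\t";    # output field separator: printed between list elements
$\ = "\n";    # output record separator: appended after each print

my @ColumnNumbers = (1, 3);                 # illustrative
my @Line = split /\t/, "a\tb\tc\td\te";     # illustrative input line

# No join needed: print itself separates the slice elements with $,
print @Line[@ColumnNumbers];
```

    That prints "b", a tab, "d", and a newline without building an intermediate string.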


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
Re: Optimizing slow restructuring of delimited files
by Aristotle (Chancellor) on Jan 25, 2005 at 23:22 UTC

    In such restricted cases awk might or might not be a better bet.

    awk -F'[[:space:]]+' 'BEGIN { OFS="\t" } { print $3, $6, $7 }' infile > outfile

    Makeshifts last the longest.

Re: Optimizing slow restructuring of delimited files
by holli (Abbot) on Jan 25, 2005 at 19:34 UTC
    Here is a one-liner for you:
    shell> perl -na -F\s+ -e "BEGIN{@S=(2,5)} print join(qq:\t:, @F[@S]), qq:\n:" infile > outfile
    @S contains the columns to select. But this one should be faster
    shell> perl -na -F\s+ -e "print join(qq:\t:, @F[1,2]), qq:\n:" infile > outfile

    holli, regexed monk
Re: Optimizing slow restructuring of delimited files
by NateTut (Deacon) on Jan 25, 2005 at 19:30 UTC
    You don't mention which platform you're on or how you are executing the script, but I have noticed a significant startup delay with ActiveState perlapp executables.