PerlMonks  

File I/O Slow Down

by agentsim (Initiate)
on Aug 05, 2003 at 01:53 UTC

agentsim has asked for the wisdom of the Perl Monks concerning the following question:

I have written a script which parses a ~50 MB file, re-orders its fields, and performs some normalization.

Unfortunately it slows down a great deal after processing about 2% of the file.

I am certainly not a good Perl programmer, but I cannot see any algorithmic reason for this file I/O to slow down so dramatically...

Any suggestions?
BTW: I use split to separate the fields, and temporarily store the output in a hash before writing it. I have reduced the number of regexps to what I think is the bare minimum.

Cheers,

-tjm

#!/bin/perl
#use strict;

#step 0 : setup some general variables
###Hardcoded names of files -- read in from command line later??
my $InputFeed = "OPMS_List.txt";
my $OutputFeed = "DB.txt";

#step 1 : convert input feed to nicer data
### Requires the InputFeed file to contain the fields labelled below
### Hardcodes the OutputFeed file format
print "Opening Input ($InputFeed) and output ($OutputFeed)\n";
open (INPUT_FEED, $InputFeed) || die "Cannot open Input Feed ($!)\n";

#Some storage variables
my $first = 1;
my @fields;
my %invalid;
my @tmp;
my %PostCodeStrings;
my $counter = 0;
my $max = 1448996;
my $temp;

print "Converting field ordering...\n";
while (<INPUT_FEED>)
{
    if($counter % 1000 == 0)
    {
        $temp = $counter/$max*100;
        print "Have processed $counter lines\t";
        print "$temp\%done\n";
    }
    chomp;
    undef @fields;
    @fields = split($_,/\|/);
    if($first == 1)
    {
        $first = 0;
        $counter++;
        open(INPUT, "invalid") || die "Cannot open invalid input (invalid) for reading ($!)\n";
        while(<INPUT>)
        {
            next if($_ =~ m/^#/);
            my @curr_invalid = split($_,/\t/);
            $invalid{$curr_invalid[1]} = $invalid{$curr_invalid[1]} . "\t" . $curr_invalid[0];
            undef @curr_invalid;
        }
        close INPUT;
        next;
    }
    foreach(keys(%invalid))
    {
        undef @tmp;
        @tmp = split($invalid{$_},/\t/);
        while(@tmp)
        {
            next if($_ =~ m/$fields[$_]/);
        }
    }
    $fields[4] = expand_state($fields[4]);
    $PostCodeStrings{$fields[5]} = $PostCodeStrings{$fields[5]} . $fields[4] . "|" . $fields[5] . "|" . $fields[3] . "|" . $fields[2] . "|" . $fields[1] . "|" . $fields[0] . "\n";
    $counter++;
}
close INPUT_FEED;
print "Done Converting field ordering...\n";

print "Writing new field ordering...\n";
open (OUTPUT_FEED, "> $OutputFeed") || die "Cannot open Output Feed ($!)\n";
foreach (sort(keys(%PostCodeStrings)))
{
    print OUTPUT_FEED $PostCodeStrings{$_};
}
close OUTPUT_FEED;

sub expand_state
{
    my $state = pop(@_);
    if($state =~ m/NSW/) { $state = "New South Wales"; }
    elsif($state =~ m/VIC/) { $state = "Victoria"; }
    elsif($state =~ m/QLD/) { $state = "Queensland"; }
    return $state;
}

Replies are listed 'Best First'.
Re: File I/O Slow Down
by dws (Chancellor) on Aug 05, 2003 at 02:09 UTC
        @fields = split($_,/\|/); isn't doing what you expect it to. split takes the pattern first and the string second, so try reversing the arguments. Then run the script again and see if you get a similar slow-down. Depending on how fast %PostCodeStrings grows, you're going to see some slow-down, but it may be insignificant once you're splitting records correctly.
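
A minimal sketch of the argument order described above. The sample record is made up, since the thread does not show the real feed's contents:

```perl
use strict;
use warnings;

# Hypothetical pipe-delimited record (not taken from the real feed).
my $line = "Smith|12 High St|Sydney|NSW|2000";

# Pattern first, then the string to split.
my @fields = split /\|/, $line;

print scalar(@fields), " fields\n";   # prints "5 fields"
print "$fields[3]\n";                 # prints "NSW"
```

With the arguments the other way round, the record itself is used as the pattern, so the fields never come out as intended.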

      *blush* what a dumb mistake :)
      Thank you... that has worked a charm :)
Re: File I/O Slow Down
by BrowserUk (Patriarch) on Aug 05, 2003 at 02:57 UTC

    dws' observation notwithstanding, there are a couple of other 'peculiarities' in your code.

    The first is this nested loop

    foreach( keys %invalid ) {
        undef @tmp;
        @tmp = split $invalid{$_}, /\t/;
        while( @tmp ) {
            next if $_ =~ m/$fields[$_]/;
        }
    }

    Apart from chewing a potentially large number of cycles, this is doing nothing that I can see. I think I know what you are trying to do, but next only restarts the nearest enclosing loop, which here is the inner while, not the loop over INPUT_FEED.

    If you want to skip to the next line from INPUT_FEED, then you would need to use the next LABEL; form.

    RECORD: while( <INPUT_FEED> ) {
        ....
        foreach my $invalid (keys %invalid ) {
            ...
            while( @tmp ) {
                next RECORD if ...;
            }
        }
    }
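
A small runnable sketch of the labelled next form, assuming made-up records and patterns in place of the feed and the "invalid" file (the names @records, @patterns and @kept are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical records and rejection patterns, for illustration only.
my @records  = ("alpha|1", "bad|2", "gamma|3", "worse|4");
my @patterns = (qr/^bad/, qr/^worse/);
my @kept;

RECORD: for my $record (@records) {
    for my $pattern (@patterns) {
        # Jump to the next record, not merely to the next pattern.
        next RECORD if $record =~ $pattern;
    }
    push @kept, $record;
}

print join(",", @kept), "\n";   # prints "alpha|1,gamma|3"
```

Without the label, next would only move on to the next pattern and every record would be kept.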

    Then there is this statement

    next if $_ =~ /$fields[$_]/;

    You have two references to $_ in that if clause, and I think you are expecting them to refer to different things? They won't!

    You have #use strict; at the top of your code. I cannot recommend enough that you uncomment that and add -w or use warnings. And then shut the compiler up by correcting each problem it finds and yells about.

    The compiler's attention to detail is pretty impeccable. If it tells you there is something wrong, it is usually right :).

    Also,

    $PostCodeStrings{$fields[5]} = $PostCodeStrings{$fields[5]} . $fields[4] . "|" . $fields[5] . "|" . $fields[3] . "|" . $fields[2] . "|" . $fields[1] . "|" . $fields[0] . "\n";

    You could save yourself a lot of typing by using $x .= ... rather than $x = $x . ..., and even more by using join and an array slice on @fields for the rest of the statement. This should be equivalent:

    $PostCodeStrings{$fields[5]} .= join( '|', @fields[4,5,3,2,1,0] ) . "\n";
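
One way to check the equivalence is to build the string both ways and compare. This is only a sketch: the two records and the hash names %long and %short are invented for the test.

```perl
use strict;
use warnings;

# Two hypothetical six-field records sharing one postcode (field 5).
my @records = (
    [qw(f0 f1 f2 f3 Victoria 3000)],
    [qw(g0 g1 g2 g3 Victoria 3000)],
);

my (%long, %short);
for my $rec (@records) {
    my @fields = @$rec;
    my $key    = $fields[5];

    # Long form: explicit concatenation, as in the original script.
    $long{$key} = '' unless defined $long{$key};
    $long{$key} = $long{$key} . $fields[4] . "|" . $fields[5] . "|"
                . $fields[3] . "|" . $fields[2] . "|" . $fields[1] . "|"
                . $fields[0] . "\n";

    # Short form: append operator plus join over an array slice.
    $short{$key} .= join('|', @fields[4,5,3,2,1,0]) . "\n";
}

print $long{3000} eq $short{3000} ? "identical\n" : "different\n";   # prints "identical"
```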

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

      Thanks for the extra pointer -- I fixed those myself once the split(...) stuff was pointed out.

      One thing I did notice that I think is of note. I did as you suggest and used the join command instead of typing the whole thing out. My execution time fell from ~600 seconds to ~60 seconds.

      Is that because Perl does not have to copy the string to memory temporarily and then reassign it when you use join?

Re: File I/O Slow Down
by diotalevi (Canon) on Aug 05, 2003 at 15:07 UTC

    You can't nest implicit while(<FH>) loops - both clobber the global $_ variable. If you're going to use nested while(<FH>) loops then you need to assign to a different variable or localize $_ before entering the inner while(). The easiest (but not the best) thing to do is just to localize $_ for the block. I copied right from your code, put this section in a block and localized both $_ and the filehandle (you know to do that too, right?).

    I also reversed your arguments to split() since you had those backwards and I removed the `undef @ary` line since that just wastes everyone's time. The my() call earlier in the loop already scopes the array to that block and ensures it is new each time the loop restarts.

    {
        local $_;
        local *INPUT;
        open(INPUT, "invalid") || die "Cannot open invalid input (invalid) for reading ($!)\n";
        while(<INPUT>) {
            next if($_ =~ m/^#/);
            my @curr_invalid = split(/\t/, $_);
            $invalid{$curr_invalid[1]} = $invalid{$curr_invalid[1]} . "\t" . $curr_invalid[0];
        }
        close INPUT;
    }
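
The other option mentioned above, assigning each line to a different variable, might look like the following sketch. It reads from in-memory strings so it can run on its own; $feed_text and $invalid_text are hypothetical stand-ins for the feed and the "invalid" file.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the two input files.
my $feed_text    = "header\nrow1\nrow2\n";
my $invalid_text = "# comment\nx\ty\n";

open(my $feed, '<', \$feed_text) or die "Cannot open feed: $!";

my (@rows, @invalid);
my $first = 1;
while (my $line = <$feed>) {            # outer loop uses its own lexical
    chomp $line;
    if ($first) {
        $first = 0;
        open(my $in, '<', \$invalid_text) or die "Cannot open invalid list: $!";
        while (my $entry = <$in>) {     # inner loop cannot clobber $line
            next if $entry =~ /^#/;
            chomp $entry;
            push @invalid, $entry;
        }
        close $in;
        next;                           # skip the header line
    }
    push @rows, $line;
}
close $feed;

print scalar(@rows), " rows, ", scalar(@invalid), " invalid entries\n";
```

Lexical filehandles and named loop variables keep both loops away from the global $_, so neither local $_ nor local *INPUT is needed.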
Re: File I/O Slow Down
by Ryszard (Priest) on Aug 05, 2003 at 06:18 UTC
    Something I recently learned was the 'o' modifier in regexes.

    It tells Perl you're not going to change the regex, so it's only compiled once, not every time (i.e. it's no good if you're using variables in your regex that change).

    Instead of next if($_ =~ m/^#/); try next if($_ =~ m/^#/o);

    Execute your script with time to benchmark the performance, i.e. time parser.pl. If you get better performance, cool, you've got it for almost no effort; if you don't, well, you've spent almost no time on it.

    While it's not exactly I/O related, it may speed up your program.

      Blah, don't use /o, use qr//. His issue was with the split, as pointed out above, and he may also be having an issue with the size of his hashes once he loads all of that data in. Hashes tend to be fairly large in memory in exchange for the speedy lookup: the classic Perl memory-for-speed tradeoff.

      -Waswas
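
A hedged sketch of the qr// suggestion: compile a pattern once, outside the loop, and reuse the compiled object. The prefix list and sample lines are invented; in the script under discussion the patterns would presumably come from the "invalid" file.

```perl
use strict;
use warnings;

# Hypothetical prefixes that mark a line as invalid.
my @prefixes = ('bad', 'worse');

# Build and compile the pattern once; quotemeta escapes any
# regex metacharacters in the prefixes.
my $alternation = join '|', map { quotemeta($_) } @prefixes;
my $invalid_re  = qr/^(?:$alternation)/;

# Hypothetical input lines.
my @lines = ("bad|1", "good|2", "worse|3");
my @kept  = grep { $_ !~ $invalid_re } @lines;

print "@kept\n";   # prints "good|2"
```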
        Dunno about that, I learnt/found out about it via a presentation at YAPC::Europe::Paris. Perhaps I misunderstood the meaning of that part of the presentation...

        Nonetheless, I'm off to look at your node.
