I have written a script which parses a ~50 MB file, re-orders its fields, and performs some normalization.

Unfortunately, it slows down considerably after processing about 2% of the file.

I am certainly not a good Perl programmer, but I cannot see any algorithmic reason for this file I/O to slow down so dramatically...

Any suggestions?
BTW: I use split to separate the fields, and temporarily store the output in a hash before writing it to the file. I have reduced the number of regexps to what I think is the bare minimum.

Cheers,

-tjm

#!/bin/perl
#use strict;

#step 0 : setup some general variables
###Hardcoded names of files -- read in from command line later??
my $InputFeed = "OPMS_List.txt";
my $OutputFeed = "DB.txt";

#step 1 : convert input feed to nicer data
### Requires the InputFeed file to contain the fields labelled below
### Hardcodes the OutputFeed file format
print "Opening Input ($InputFeed) and output ($OutputFeed)\n";
open (INPUT_FEED, $InputFeed) || die "Cannot open Input Feed ($!)\n";

#Some storage variables
my $first = 1;
my @fields;
my %invalid;
my @tmp;
my %PostCodeStrings;
my $counter = 0;
my $max = 1448996;
my $temp;

print "Converting field ordering...\n";
while (<INPUT_FEED>) {
    if($counter % 1000 == 0) {
        $temp = $counter/$max*100;
        print "Have processed $counter lines\t";
        print "$temp\%done\n";
    }
    chomp;
    undef @fields;
    @fields = split($_,/\|/);
    if($first == 1) {
        $first = 0;
        $counter++;
        open(INPUT, "invalid") || die "Cannot open invalid input (invalid) for reading ($!)\n";
        while(<INPUT>) {
            next if($_ =~ m/^#/);
            my @curr_invalid = split($_,/\t/);
            $invalid{$curr_invalid[1]} = $invalid{$curr_invalid[1]} . "\t" . $curr_invalid[0];
            undef @curr_invalid;
        }
        close INPUT;
        next;
    }
    foreach(keys(%invalid)) {
        undef @tmp;
        @tmp = split($invalid{$_},/\t/);
        while(@tmp) {
            next if($_ =~ m/$fields[$_]/);
        }
    }
    $fields[4] = expand_state($fields[4]);
    $PostCodeStrings{$fields[5]} = $PostCodeStrings{$fields[5]} . $fields[4] . "|" . $fields[5] . "|" . $fields[3] . "|" . $fields[2] . "|" . $fields[1] . "|" . $fields[0] . "\n";
    $counter++;
}
close INPUT_FEED;
print "Done Converting field ordering...\n";

print "Writing new field ordering...\n";
open (OUTPUT_FEED, "> $OutputFeed") || die "Cannot open Output Feed ($!)\n";
foreach (sort(keys(%PostCodeStrings))) {
    print OUTPUT_FEED $PostCodeStrings{$_};
}
close OUTPUT_FEED;

sub expand_state {
    my $state = pop(@_);
    if($state =~ m/NSW/) {
        $state = "New South Wales";
    } elsif($state =~ m/VIC/) {
        $state = "Victoria";
    } elsif($state =~ m/QLD/) {
        $state = "Queensland";
    }
    return $state;
}

In reply to File I/O Slow Down by agentsim
