
I have written a script which parses a ~50 MB file, re-orders its fields, and performs some normalization.

Unfortunately, it slows down dramatically after processing about 2% of the file.

I am certainly not a good Perl programmer, but I cannot see any algorithmic reason for this file I/O to slow down so much...

Any suggestions?
BTW: I use split to separate the fields, and temporarily store the output in a hash before writing it out. I have reduced the number of regexps to what I think is the bare minimum.

Cheers,

-tjm



#!/bin/perl

#use strict;

#step 0 : setup some general variables
###Hardcoded names of files -- read in from command line later??
my $InputFeed = "OPMS_List.txt";
my $OutputFeed = "DB.txt";

#step 1 : convert input feed to nicer data
### Requires the InputFeed file to contain the fields labelled below
### Hardcodes the OutputFeed file format

print "Opening Input ($InputFeed) and output ($OutputFeed)\n";

open (INPUT_FEED, $InputFeed) || die "Cannot open Input Feed ($!)\n";

#Some storage variables
my $first = 1;
my @fields;
my %invalid;
my @tmp;

my %PostCodeStrings;

my $counter = 0;
my $max = 1448996;
my $temp;

print "Converting field ordering...\n";

while (<INPUT_FEED>)
{
    if($counter % 1000 == 0)
    {
        $temp = $counter/$max*100;
        print "Have processed $counter lines\t";
        print "$temp\%done\n";
    }

    chomp;

    undef @fields;
    @fields = split($_,/\|/);

    if($first == 1)
    {
        $first = 0;
        $counter++;

        open(INPUT, "invalid") || die "Cannot open invalid input (invalid) for reading ($!)\n";
        while(<INPUT>)
        {
            next if($_ =~ m/^#/);
            my @curr_invalid = split($_,/\t/);
            $invalid{$curr_invalid[1]} = $invalid{$curr_invalid[1]} . "\t" . $curr_invalid[0];
            undef @curr_invalid;
        }
        close INPUT;

        next;
    }

    foreach(keys(%invalid))
    {
        undef @tmp;
        @tmp = split($invalid{$_},/\t/);
        while(@tmp)
        {
            next if($_ =~ m/$fields[$_]/);
        }
    }

    $fields[4] = expand_state($fields[4]);

    $PostCodeStrings{$fields[5]} = 
                $PostCodeStrings{$fields[5]} . 
                $fields[4] . "|" .
                $fields[5] . "|" .
                $fields[3] . "|" .
                $fields[2] . "|" .
                $fields[1] . "|" .
                $fields[0] . "\n";
    $counter++;
}

close INPUT_FEED;

print "Done Converting field ordering...\n";


print "Writing new field ordering...\n";

open (OUTPUT_FEED, "> $OutputFeed") || die "Cannot open Output Feed ($!)\n";
foreach (sort(keys(%PostCodeStrings)))
{
    print OUTPUT_FEED $PostCodeStrings{$_};
}
close OUTPUT_FEED;

sub expand_state
{
    my $state = pop(@_);
    if($state =~ m/NSW/)
    {
        $state = "New South Wales";
    }
    elsif($state =~ m/VIC/)
    {
        $state = "Victoria";
    }
    elsif($state =~ m/QLD/)
    {
        $state = "Queensland";
    }
    return $state;
}
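
For comparison, here is a minimal sketch of the split-and-store step described in the post. Perl's split expects the pattern first and the string second, so split($_,/\|/) as written in the script treats each input line as the pattern, which may be worth checking before looking at the I/O itself. The sample records, the helper name reorder_record, and the hash name %records_by_postcode are hypothetical and only for illustration; the field positions are assumed from the script above, and the state expansion is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample records standing in for lines of the input feed;
# the six-field layout is an assumption based on the script above.
my @lines = (
    "a0|a1|a2|a3|NSW|2000",
    "b0|b1|b2|b3|VIC|3000",
    "c0|c1|c2|c3|QLD|2000",
);

# Hypothetical helper: split one record and return its postcode
# together with the reordered record.
sub reorder_record {
    my ($line) = @_;

    # Pattern first, string second; the -1 limit keeps trailing empty fields.
    my @fields = split /\|/, $line, -1;

    return ($fields[5], join('|', @fields[4, 5, 3, 2, 1, 0]));
}

my %records_by_postcode;

for my $line (@lines) {
    my ($postcode, $record) = reorder_record($line);

    # Collect records in a per-postcode array instead of
    # concatenating onto one growing string per key.
    push @{ $records_by_postcode{$postcode} }, $record;
}

# Emit the records grouped by postcode, in sorted key order.
for my $postcode (sort keys %records_by_postcode) {
    print "$_\n" for @{ $records_by_postcode{$postcode} };
}
```

Storing an array of records per postcode is only one possible layout; whether it makes any difference to the slowdown described above would need to be measured on the real file.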

In reply to File I/O Slow Down by agentsim
