in thread Formatting a large number of records

Thanks pfaut,

The input file contents look like this:
    Ref_no     Supp_co Order_ Cat_num Carr_co Unit_pri Line_pric Ca User_ Auth_dat B
    0312003620 SLUM02  M0551  RT3420  4.04    8.25     8.25      P  UFJGC 04/01/98 E
    0312003619 SLUM02  M0550  RT3420  4.04    8.25     8.25      P  UFJGC 04/01/98 E
    0312003617 SLUM02  M0548  RT3420  4.04    8.25     8.25      P  UFJGC 04/01/98 E
    0312003616 SLUM02  M0547  RT3420  4.04    8.25     8.25      P  UFJGC 04/01/98 E
    0312003684 SLUM02  M0615  RT3420  4.04    11.90    11.90     P  UFJGC 04/01/98 E
    0312003613 SLUM02  M0544  RT3420  4.04    11.90    11.90     P  UFJGC 04/01/98 E
    0312003586 SLUM02  M0517  RT3420  4.04    11.90    11.90     P  UFJGC 04/01/98 E

I have to check each line is a valid record rather than a header line or blank line (either of which appears a few hundred times throughout the file).

The actual record formatting is to remove the decimal places and insert leading zeroes in fields 5, 6 and 7. I also have to interrogate the year in the penultimate field, since the year determines which file the record is written to.
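
A minimal sketch of that formatting step, under some loud assumptions: "remove the decimal places" is read here as dropping the decimal point (so 8.25 becomes 825), the padded width of 7 is invented for illustration, and the helper names pad_amount and year_of are hypothetical rather than anything from the real script.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the decimal point and left-pad with zeroes.
# The default width of 7 is an assumption; the real field width is not given.
sub pad_amount {
    my ($amount, $width) = @_;
    $width = 7 unless defined $width;
    (my $digits = $amount) =~ tr/.//d;    # "11.90" becomes "1190"
    return sprintf '%0*d', $width, $digits;
}

# Hypothetical helper: take the two-digit year from the end of the date field.
sub year_of {
    my ($date) = @_;
    return substr $date, -2;
}

print pad_amount('8.25'), "\n";
print year_of('04/01/98'), "\n";
```

If the real requirement is to truncate the fractional part instead, only the tr/// line would need to change.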

The code I have so far is:-
    while ($line = <INPUT>) {
        chomp $line;
        # Check for lines to be discarded or kept
        if (substr($line,0,9) =~ /[0-9]{9}/) {
            # Lines are valid entries to be written to file if first 9 characters are
            # numeric, file used dependant on date of invoice details.
            ($newline, $year) = validLine($line);
            if ($year ne "02") {
                open (YEAR, ">>".$path."year$year.txt") || die "Cannot open file: $!\n";
                print YEAR "$newline\n";
                $y_count++;
                close YEAR || die "Cannot close file: $!\n";
            }
            else {
                print OUTPUT "$newline\n";
                $o_count++;
            }
        }
        else {
            print DISCARD "$line\n";
            $d_count++;
            next;
        }
    }

This uses a subroutine, validLine(), that breaks each line up with substr to remove the decimals, insert leading zeroes, extract the year and reconstruct the line. (I was splitting on spaces at this point but had to change it: not every record has the same number of fields, and the reconstructed line must take this into account.)

Appreciate any further comments!

elbow

Re: Re: Re: Formatting a large number of records
by gjb (Vicar) on Dec 30, 2002 at 14:35 UTC

    I don't think it's a good idea to open and close the file to write to each time. I'd initialize a hash with the year as key and the handle as value, opening a new one if none exists in the hash for that particular year. To keep things clean I'd use IO::File.

    Also, but this is minor, I'd not use the substr since it is a waste of time: $line =~ /^[0-9]{9}/ should do nicely as the condition for keeping the line.
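
    A quick way to convince yourself that the two tests agree, as a sketch only: the sub names via_substr and via_anchor are made up for the comparison, and the sample lines are modelled on the data in the original post.

```perl
use strict;
use warnings;

# The test from the original post: cut out nine characters, then match.
sub via_substr {
    my ($line) = @_;
    return substr($line, 0, 9) =~ /[0-9]{9}/ ? 1 : 0;
}

# The suggested replacement: anchor the pattern at the start of the line.
sub via_anchor {
    my ($line) = @_;
    return $line =~ /^[0-9]{9}/ ? 1 : 0;
}

my @samples = (
    "0312003620 SLUM02 M0551 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E",
    "Ref_no Supp_co Order_ Cat_num Carr_co Unit_pri Line_pric Ca User_ Auth_dat B",
    "",
);

for my $line (@samples) {
    print via_substr($line), ' ', via_anchor($line), "\n";
}
```

    Both columns should agree for record, header and blank lines alike.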

    Just my 2 cents, -gjb-

Re: Re: Re: Formatting a large number of records
by pfaut (Priest) on Dec 30, 2002 at 14:45 UTC

    You are opening and closing a file each time through the loop when the year isn't '02'. This could have a significant impact on your runtime if there are a lot of these records. It might be best to establish a hash of file handles and keep them open. Something like this (untested):

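        use IO::File;

        my %handles;
        my %count;

        sub get_file_handle {
            my $yr = shift;
            $count{$yr}++;
            return $handles{$yr} if exists $handles{$yr};
            # First record for this year: open in append mode and cache the handle.
            $handles{$yr} = IO::File->new(">>".$path."year$yr.txt")
                or die "Cannot open file: $!\n";
        }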

    I don't know what your OUTPUT handle is assigned to. If you put it in the hash with the key '02', you could remove the if ($year...) test inside your loop. Then, your valid line handling becomes:

    print get_file_handle($year) "$newline\n";

    You do a chomp on the record read from the file only to add a newline when you print to output. If the removal of the newline is not required by your validLine() routine, get rid of both.

    --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';
      Except that will have to read

          print { get_file_handle($year) } "$newline\n";

          $ perl -le'sub x { \*STDOUT }; print x() "blah"'
          String found where operator expected at -e line 1, near ") "blah""
                  (Missing operator before "blah"?)
          syntax error at -e line 1, near ") "blah""
          Execution of -e aborted due to compilation errors.
          $ perl -le'sub x { \*STDOUT }; print { x() } "blah"'
          blah
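
      Putting the cached handles and the block form of print together, a self-contained sketch might look like the following. Everything outside get_file_handle is an assumption for the sake of a runnable example: $path points at a temporary directory, and the (year, record) pairs stand in for whatever validLine() returns.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption: a temporary directory stands in for the poster's $path.
my $path = tempdir(CLEANUP => 1) . '/';

my (%handles, %count);

# Return a cached append-mode handle for the year, opening it on first use.
sub get_file_handle {
    my ($yr) = @_;
    $count{$yr}++;
    return $handles{$yr} ||= do {
        open my $fh, '>>', "${path}year$yr.txt"
            or die "Cannot open ${path}year$yr.txt: $!\n";
        $fh;
    };
}

# Hypothetical (year, record) pairs in place of real validLine() output.
for my $rec (['98', 'first'], ['99', 'second'], ['98', 'third']) {
    my ($year, $newline) = @$rec;
    # The block form is needed because the handle comes from an expression.
    print { get_file_handle($year) } "$newline\n";
}

# Close every handle once, after the loop, instead of once per record.
close $_ or die "Cannot close file: $!\n" for values %handles;
```

      Each year's file is opened exactly once however many records land in it, which is the whole point of the hash.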

      Makeshifts last the longest.

Re^3: Formatting a large number of records
by Aristotle (Chancellor) on Dec 30, 2002 at 16:09 UTC

    You are chomping every line, even though for an invalid line you never look at anything beyond its first field, and then you catenate a newline back onto the end when saving it. You repeatedly open and close files for single records. Your regex has a quantifier of {9} although you have already made sure to look only at the first nine characters; a double negative is more economical in that case (test for the absence of non-digit characters).
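
    A small sketch of the double-negative test, with a hypothetical helper name (looks_like_record) and sample lines modelled on the original post. It also shows one thing to watch: the test relies on the line not having been chomped yet, because an empty string contains no non-digits either.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the first nine characters hold no non-digit.
sub looks_like_record {
    my ($line) = @_;
    return substr($line, 0, 9) !~ /\D/ ? 1 : 0;
}

print looks_like_record("0312003620 SLUM02 M0551\n"), "\n";   # record line
print looks_like_record("Ref_no Supp_co Order_\n"), "\n";     # header line
print looks_like_record("\n"), "\n";                          # blank line, not chomped
print looks_like_record(""), "\n";                            # blank line, chomped
```

    The first three calls behave as wanted, but the chomped blank line is accepted, so either test before chomping or add a length check.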

    Also, you can reduce quite a lot of duplication.

        my (%handle, $fh, $count);
        $handle{'02'} = \*OUTPUT;    # quoted: a bare 02 would become the key "2"

        while (my $line = <INPUT>) {
            if (substr($line, 0, 9) !~ /\D/) {
                chomp $line;
                my $year;
                ($line, $year) = validLine($line);
                $line .= "\n";
                $count = $year ne "02" ? \$y_count : \$o_count;
                # Reuse the cached handle for this year, opening it on first use.
                $fh = $handle{$year} ||= do {
                    open my($newfh), ">>", $path."year$year.txt"
                        or die "Cannot open file: $!\n";
                    $newfh;
                };
            }
            else {
                $fh = \*DISCARD;
                $count = \$d_count;
            }
            ++$$count;
            print $fh $line;
        }

    If you post your validLine, chances are improvements to it can also be suggested.

    Makeshifts last the longest.