elbow has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a text file that has been created from a very old FoxPro database file that I have to format ready for transfer to a mainframe file. The text file holds nearly 1 million records (1 line per record).

I am processing the file using code similar to the following:-
open (INPUT, $input) || die "Cannot open input file: $!\n";

while ($line = <INPUT>) {
    # format the record and write to a new file
}

close INPUT || die "Cannot close input file: $!\n";
Due to the number of records this will take a while. I've thought about writing the file to an array and processing from there - but I run out of memory.

Does anybody have any ideas of how to speed this up a bit?

elbow

Re: Formatting a large number of records
by pfaut (Priest) on Dec 30, 2002 at 14:01 UTC

    If each record needs to be processed on its own, there will be no gain by reading the whole file into memory before doing any processing. That road is more likely to make it take longer since you'll add page faulting and swapping to your existing problems.

    About all you can do is concentrate on what's inside the loop. If you post some code, maybe we can give some advice.

    --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';
      Thanks pfaut,

      The input file contents look like this:-
      Ref_no Supp_co Order_ Cat_num Carr_co Unit_pri Line_pric Ca User_ Auth_dat B
      0312003620 SLUM02 M0551 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
      0312003619 SLUM02 M0550 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
      0312003617 SLUM02 M0548 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
      0312003616 SLUM02 M0547 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
      0312003684 SLUM02 M0615 RT3420 4.04 11.90 11.90 P UFJGC 04/01/98 E
      0312003613 SLUM02 M0544 RT3420 4.04 11.90 11.90 P UFJGC 04/01/98 E
      0312003586 SLUM02 M0517 RT3420 4.04 11.90 11.90 P UFJGC 04/01/98 E
      I have to check each line is a valid record rather than a header line or blank line (either of which appears a few hundred times throughout the file).

      The actual record formatting is to remove the decimal places and insert leading zeroes on fields 5, 6 & 7. I also have to interrogate the year in the penultimate field; the year determines which file the record is written to.

      The code I have so far is:-
      while ($line = <INPUT>) {
          chomp $line;
          # Check for lines to be discarded or kept
          if (substr($line,0,9) =~ /[0-9]{9}/) {
              # Lines are valid entries to be written to file if first 9 characters are
              # numeric, file used dependant on date of invoice details.
              ($newline, $year) = validLine($line);
              if ($year ne "02") {
                  open (YEAR, ">>".$path."year$year.txt") || die "Cannot open file: $!\n";
                  print YEAR "$newline\n";
                  $y_count++;
                  close YEAR || die "Cannot close file: $!\n";
              }
              else {
                  print OUTPUT "$newline\n";
                  $o_count++;
              }
          }
          else {
              print DISCARD "$line\n";
              $d_count++;
              next;
          }
      }
      with a subroutine, validLine(), that breaks each line apart using substr to remove the decimals, insert leading zeroes, get the year and reconstruct the line (I was using a split on spaces at this point, but had to change it because not every record has the same number of fields and the line must be reconstructed to take this into account).
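
      For illustration, a minimal sketch of the kind of thing validLine() might do - the column offsets and widths below are invented placeholders (the real ones come from the fixed-width layout of the export), and the penultimate field is assumed to hold a DD/MM/YY date:

      # Hypothetical validLine(): offsets and widths are placeholders and must
      # be replaced with the real fixed-width layout of the export file.
      sub validLine {
          my ($line) = @_;

          # fields 5, 6 and 7: drop the decimal point and left-pad with zeroes
          for my $offset (46, 56, 66) {            # invented start positions
              my $width = 9;                       # invented field width
              my $value = substr($line, $offset, $width);
              $value =~ tr/0-9//cd;                # keep digits only (drops '.' and padding)
              substr($line, $offset, $width) = sprintf("%0${width}d", $value);
          }

          # penultimate field holds the date (DD/MM/YY); keep only the year
          my ($year) = $line =~ m{/(\d\d)\s+\S+\s*$};

          return ($line, $year);
      }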

      Appreciate any further comments!

      elbow

        I don't think it's a good idea to open and close the file to write to each time. I'd initialize a hash with the year as key and the handle as value, opening a new one if none exists in the hash for that particular year. To keep things clean I'd use IO::File.

        Also, though this is minor, I'd not use the substr since it's a waste of time: $line =~ /^[0-9]{9}/ should do nicely as a condition to keep the line.

        Just my 2 cents, -gjb-

        You are opening and closing a file each time through the loop when the year isn't '02'. This could have a significant impact on your runtime if there are a lot of these records. It might be best to establish a hash of file handles and keep them open. Something like this (untested):

        use IO::File;

        my %handles;
        my %count;

        sub get_file_handle {
            my $yr = shift;
            $count{$yr}++;
            return $handles{$yr} if exists $handles{$yr};
            $handles{$yr} = IO::File->new(">>".$path."year$yr.txt")
                or die "Cannot open file: $!\n";
        }

        I don't know what your OUTPUT handle is assigned to. If you put it in the hash with the key '02', you could remove the if ($year...) test inside your loop. Then, your valid line handling becomes:

        print { get_file_handle($year) } "$newline\n";   # braces are needed around a handle expression

        You do a chomp on the record read from the file only to add a newline when you print to output. If the removal of the newline is not required by your validLine() routine, get rid of both.

        --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';

        You are chomping the line even though, when it's not valid, you never look at anything other than its first nine characters, and then you concatenate a newline back onto the end of it when saving it. You repeatedly open/close files for single records. Your regex has a quantifier of {9} although you already made sure only to look at the first nine characters - a double negative is more economical in that case (test for the absence of non-digit characters).

        Also, you can reduce quite a lot of duplication.

        my (%handle, $fh, $count);
        $handle{'02'} = \*OUTPUT;

        while (my $line = <INPUT>) {
            if (substr($line, 0, 9) !~ /\D/) {
                chomp $line;
                my $year;
                ($line, $year) = validLine($line);
                $line .= "\n";
                $count = $year ne "02" ? \$y_count : \$o_count;
                # open each year's file once and cache the handle
                $fh = $handle{$year} ||= do {
                    open my ($newfh), ">>", $path."year$year.txt"
                        or die "Cannot open file: $!\n";
                    $newfh;
                };
            }
            else {
                $fh = \*DISCARD;
                $count = \$d_count;
            }
            ++$$count;
            print $fh $line;
        }
        If you post your validLine, chances are improvements to it can also be suggested.

        Makeshifts last the longest.

      To be honest, I have no idea what the diamond operator is doing behind the scenes (how much does it read at a time internally?), but I would think there might be some benefit to chunking it into an array - not the whole thing, of course, but perhaps 100-500 records at a time. (It all depends, I suppose, on what buffer sizes the IO layer is working with internally.)
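
      A minimal sketch of that batched approach (the batch size of 500 and the empty processing step are placeholders; INPUT is assumed to be open as in the original code):

      my $batch_size = 500;    # placeholder batch size

      while (1) {
          # slurp up to $batch_size lines into an array, then process them together
          my @lines;
          while (@lines < $batch_size and defined(my $line = <INPUT>)) {
              push @lines, $line;
          }
          last unless @lines;

          for my $line (@lines) {
              # format the record and write to a new file, as in the original loop
          }
      }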

      -Lee

      "To be civilized is to deny one's nature."

        IO::Handle appears to allow access to setbuf() and setvbuf(). From the perldoc:

        If the C functions setbuf() and/or setvbuf() are available, then "IO::Handle::setbuf" and "IO::Handle::setvbuf" set the buffering policy for an IO::Handle. The calling sequences for the Perl functions are the same as their C counterparts--including the constants "_IOFBF", "_IOLBF", and "_IONBF" for setvbuf()--except that the buffer parameter specifies a scalar variable to use as a buffer. You should only change the buffer before any I/O, or immediately after calling flush.
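
        A sketch of how that might look - setvbuf() only exists where the C library provides it and the perl build exposes it (many builds, particularly ones using PerlIO, do not), so the call is wrapped in an eval and the whole thing should be treated as an experiment rather than a guaranteed win:

        use IO::Handle;

        open(my $in, '<', $input) or die "Cannot open input file: $!\n";

        # Try to give the handle a 64K fully buffered read buffer. setvbuf() and
        # the _IOFBF constant exist only where the C-level calls are available,
        # so any failure is trapped and simply reported.
        my $buffer;
        eval {
            $in->setvbuf($buffer, IO::Handle::_IOFBF(), 64 * 1024);
            1;
        } or warn "setvbuf not usable on this perl: $@";

        while (my $line = <$in>) {
            # process the record as before
        }
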
        --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';
Re: Formatting a large number of records
by talexb (Chancellor) on Dec 30, 2002 at 14:39 UTC

    Perhaps this is a silly question, but I have to ask it.

    Why do you need to speed up this process? Don't you have to get it right once only? After that (I presume) you would be done.

    If you are wondering how to do your testing without going through the whole file, I would cut out a 1,000-line chunk of the file (say) and use that for testing. Once everything looks good, run your script against the whole file.
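
    For example, a quick way to carve off such a test sample (the file names here are just placeholders):

    # take the first 1,000 lines of the export as a test file
    perl -ne 'print; last if $. == 1000' full_export.txt > sample.txt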

    As far as I know, some very smart people have worked on getting Perl's "Read in a line and process it" code to run absolutely as fast as possible -- so I don't think tinkering with the buffering is going to have a significant impact on the speed.

    How long does it take to run the entire file? Can you get access to a faster machine, or one with more memory?

    --t. alex
    but my friends call me T.
      It's a valid question t. alex.

      Unfortunately this formatting has to be done between the end of one finance period and the start of the next, and there is only a short window in which to run it (a few hours).

      As for a better machine - not where I work! :-{

      elbow

        Well, is Perl fast enough for the job? If not (your call, I guess), you'll have to code the equivalent functionality in C, which should give you a bit of an improvement in speed.

        What machine are you running this on? How long does the script take to run? I can't remember any benchmarks, but I know I've processed large files before and Perl whizzes through them. What you've discussed is not any heavy processing, so I can't imagine that your script would take more than a few minutes to process 1,000,000 lines.

        --t. alex
        but my friends call me T.
Re: Formatting a large number of records
by Zaxo (Archbishop) on Dec 30, 2002 at 17:14 UTC

    Have you looked at XBase? From the pod:

    This module can read and write XBase database files, known as dbf in dBase and FoxPro world. It also reads memo fields from the dbt and fpt files, if needed. An alpha code of reading index support for ndx, ntx, mdx, idx and cdx is available for testing -- see the DBD::Index(3) man page. Module XBase provides simple native interface to XBase files. For DBI compliant database access, see the DBD::XBase and DBI modules and their man pages.

    It appears that something like:

    use XBase;

    my ($file, $index) = ('/path/to/foo.dbf', 0);
    my ($record_sep, $field_sep) = ("\n", "\t");    # choose whatever separators you need

    my $table = XBase->new($file) or die XBase->errstr;

    {
        local $\ = $record_sep;
        local $, = $field_sep;    # or use a CSV module

        open my $outfile, '>', '/path/to/foo.txt' or die $!;

        do {
            my @foo = $table->get_record($index) or die XBase->errstr;
            print $outfile @foo or die $!;
        } while $index++ < $table->last_record;

        close $outfile or die $!;
    }

    $table->close or die XBase->errstr;
    You will probably want to refine that to check for deleted records, reorder fields, etc.
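
    For example, the deleted-record check could look something like this (reusing the $table and $outfile set up above; get_record() returns the delete flag as the first element of its list):

    for my $index (0 .. $table->last_record) {
        # first returned value is the delete flag, the rest are the field values
        my ($deleted, @fields) = $table->get_record($index) or die XBase->errstr;
        next if $deleted;
        print $outfile @fields or die $!;
    }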

    After Compline,
    Zaxo