Re: Formatting a large number of records
by pfaut (Priest) on Dec 30, 2002 at 14:01 UTC
If each record needs to be processed on its own, there will be no gain by reading the whole file into memory before doing any processing. That road is more likely to make it take longer since you'll add page faulting and swapping to your existing problems.
About all you can do is concentrate on what's inside the loop. If you post some code, maybe we can give some advice.
---
print map { my ($m)=1<<hex($_)&11?' ':'';
$m.=substr('AHJPacehklnorstu',hex($_),1) }
split //,'2fde0abe76c36c914586c';
Thanks pfaut,
The input file contents look like this:
Ref_no Supp_co Order_ Cat_num Carr_co Unit_pri Line_pric Ca User_ Auth_dat B
0312003620 SLUM02 M0551 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
0312003619 SLUM02 M0550 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
0312003617 SLUM02 M0548 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
0312003616 SLUM02 M0547 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E
0312003684 SLUM02 M0615 RT3420 4.04 11.90 11.90 P UFJGC 04/01/98 E
0312003613 SLUM02 M0544 RT3420 4.04 11.90 11.90 P UFJGC 04/01/98 E
0312003586 SLUM02 M0517 RT3420 4.04 11.90 11.90 P UFJGC 04/01/98 E
I have to check that each line is a valid record rather than a header line or blank line (either of which appears a few hundred times throughout the file).
The actual record formatting is to remove the decimal places and insert leading zeroes on fields 5, 6 and 7. I also have to interrogate the year in the penultimate field; the year determines which file the record is written to.
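For the decimal fields, a single sprintf can strip the point and zero-pad in one step. A minimal sketch, assuming (purely as an illustration, since the post doesn't give the target width) eight digits in hundredths:

```perl
use strict;
use warnings;

# Hypothetical target format: strip the decimal point and left-pad
# to eight digits, i.e. '8.25' becomes '00000825'.
my $price     = '8.25';
my $formatted = sprintf '%08d', $price * 100 + 0.5;   # + 0.5 rounds
print "$formatted\n";    # 00000825
```

The real field width will differ, but one sprintf can replace several substr manipulations.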
The code I have so far is:
while ($line = <INPUT>) {
    chomp $line;
    # Check for lines to be discarded or kept
    if (substr($line,0,9) =~ /[0-9]{9}/) {
        # Lines are valid entries to be written to file if the first 9
        # characters are numeric; file used depends on date of invoice details.
        ($newline, $year) = validLine($line);
        if ($year ne "02") {
            open (YEAR, ">>".$path."year$year.txt")
                || die "Cannot open file: $!\n";
            print YEAR "$newline\n";
            $y_count++;
            close YEAR || die "Cannot close file: $!\n";
        } else {
            print OUTPUT "$newline\n";
            $o_count++;
        }
    } else {
        print DISCARD "$line\n";
        $d_count++;
        next;
    }
}
with a subroutine, validLine(), that breaks each line up using substr to remove the decimals, insert leading zeroes, extract the year, and reconstruct the line. (I was using a split on spaces at this point, but had to change that because not every record has the same number of fields, and the line must be reconstructed to take this into account.)
Appreciate any further comments!
elbow
I don't think it's a good idea to open and close the output file for every record. I'd initialize a hash with the year as key and the filehandle as value, opening a new one if none exists in the hash for that particular year. To keep things clean I'd use IO::File.
Also, and this is minor, I'd not use substr, since it's a waste of time: $line =~ /^[0-9]{9}/ should do nicely as the condition to keep a line.
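A quick sanity check of that condition, against sample lines patterned on the posted data (a record, the header, and a blank line):

```perl
use strict;
use warnings;

# Sample lines patterned on the posted data.
my @lines = (
    '0312003620 SLUM02 M0551 RT3420 4.04 8.25 8.25 P UFJGC 04/01/98 E',
    'Ref_no Supp_co Order_ Cat_num Carr_co Unit_pri Line_pric Ca User_ Auth_dat B',
    '',
);

# Only lines opening with nine digits survive.
my @verdict = map { /^[0-9]{9}/ ? 'keep' : 'discard' } @lines;
print "@verdict\n";    # keep discard discard
```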
Just my 2 cents, -gjb-
use IO::File;

my %handles;
my %count;

sub get_file_handle {
    my $yr = shift;
    $count{$yr}++;
    return $handles{$yr} if exists $handles{$yr};
    $handles{$yr} = IO::File->new(">>${path}year$yr.txt")
        or die "Cannot open file: $!\n";
}
I don't know what your OUTPUT handle is assigned to. If you put it in the hash with the key '02', you could remove the if ($year...) test inside your loop. Then, your valid line handling becomes:
print { get_file_handle($year) } "$newline\n";
(The braces are needed: print won't accept an arbitrary expression as a filehandle unless it's wrapped in a block.)
You do a chomp on the record read from the file only to add a newline when you print to output. If the removal of the newline is not required by your validLine() routine, get rid of both.
You are chomping the line even though, when it's not valid, you never look at anything beyond its first nine characters, and then you concatenate a newline back onto the end of it when saving it. You repeatedly open/close files for single records. Your regex has a quantifier of {9} although you already made sure only to look at the first nine characters; a double negative is more economical in that case (test for the absence of non-digit characters).
Also, you can reduce quite a lot of duplication.
my (%handle, $fh, $count);
$handle{'02'} = \*OUTPUT;
while (my $line = <INPUT>) {
    if (substr($line, 0, 9) !~ /\D/) {
        chomp $line;
        my $year;
        ($line, $year) = validLine($line);
        $line .= "\n";
        $count = $year ne "02" ? \$y_count : \$o_count;
        $fh = $handle{$year} ||= do {
            open my ($newfh), ">>", $path . "year$year.txt"
                or die "Cannot open file: $!\n";
            $newfh;
        };
    }
    else {
        $fh = \*DISCARD;
        $count = \$d_count;
    }
    ++$$count;
    print $fh $line;
}
If you post your validLine, chances are improvements to it can also be suggested.
Makeshifts last the longest.
To be honest, I have no idea what the diamond operator is doing behind the scenes (how much does it read at a time internally?), but I would think there might be some benefit to chunking the input into an array. Not the whole file, of course, but perhaps 100-500 records at a time. (It all depends, I suppose, on what buffer sizes the IO layer is working with internally.)
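A minimal sketch of that chunking idea, using a throwaway generated file in place of the real input (Perl already buffers reads internally, so a win is unlikely, but it is easy to measure):

```perl
use strict;
use warnings;

my $chunk_size = 500;

# Throwaway 1200-line stand-in for the real input file, demo only.
open my $gen, '>', 'chunk_demo.txt' or die "Cannot create demo file: $!\n";
print {$gen} "record $_\n" for 1 .. 1200;
close $gen;

my @chunk_sizes;
open my $in, '<', 'chunk_demo.txt' or die "Cannot open input: $!\n";
until (eof $in) {
    my @records;
    while (@records < $chunk_size) {
        my $line = <$in>;
        last unless defined $line;
        push @records, $line;
    }
    # ... process the whole batch in @records here ...
    push @chunk_sizes, scalar @records;
}
close $in;
unlink 'chunk_demo.txt';

print "chunks: @chunk_sizes\n";    # chunks: 500 500 200
```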
-Lee
"To be civilized is to deny one's nature."
Re: Formatting a large number of records
by talexb (Chancellor) on Dec 30, 2002 at 14:39 UTC
Perhaps this is a silly question, but I have to ask it.
Why do you need to speed up this process? Don't you have to get it right once only? After that (I presume) you would be done.
If you are wondering how to do your testing without going through the whole file, I would cut out a 1000-line chunk of the file (say) and use that for testing. Once everything looks good, run your script against the whole file.
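A sketch of carving out such a sample, using a generated stand-in for the real file (swap in the actual filenames):

```perl
use strict;
use warnings;

# Throwaway 5000-line stand-in for the real data file, demo only.
open my $gen, '>', 'data_demo.txt' or die $!;
print {$gen} "line $_\n" for 1 .. 5000;
close $gen;

# Copy the first 1000 lines into sample.txt for quick test runs.
open my $in,  '<', 'data_demo.txt' or die $!;
open my $out, '>', 'sample.txt'    or die $!;
while (my $line = <$in>) {
    print {$out} $line;
    last if $. >= 1000;    # $. is the current input line number
}
close $in;
close $out;
unlink 'data_demo.txt';
```

From the command line the same thing is a one-liner: `perl -ne 'print; exit if $. >= 1000' bigfile > sample.txt`.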
As far as I know, some very smart people have worked on making Perl's "read in a line and process it" code run absolutely as fast as possible, so I don't think tinkering with the buffering is going to have a significant impact on speed.
How long does it take to run the entire file? Can you get access to a faster machine, or one with more memory?
--t. alex
but my friends call me T.
It's a valid question, t. alex.
Unfortunately this formatting has to be done between the end of one finance period and the start of the next, and there is only a short window of time (a few hours) in which to run it.
As for a better machine - not where I work! :-{
elbow
Well, is Perl fast enough for the job? If not (your call, I guess), you'll have to code the equivalent functionality in C, which should give you a bit of an improvement in speed.
What machine are you running this on? How long does the script take to run? I can't remember any benchmarks, but I know I've processed large files before and Perl whizzes through them. What you've discussed is not any heavy processing, so I can't imagine that your script would take more than a few minutes to process 1,000,000 lines.
--t. alex
but my friends call me T.
Re: Formatting a large number of records
by Zaxo (Archbishop) on Dec 30, 2002 at 17:14 UTC
Have you looked at XBase? From the pod:
This module can read and write XBase database files, known as dbf in dBase and FoxPro world. It also reads memo fields from the dbt and fpt files, if needed. An alpha code of reading index support for ndx, ntx, mdx, idx and cdx is available for testing -- see the DBD::Index(3) man page. Module XBase provides simple native interface to XBase files. For DBI compliant database access, see the DBD::XBase and DBI modules and their man pages.
It appears that something like:
use XBase;

my ($file, $index) = ('/path/to/foo.dbf', 0);
my $table = XBase->new($file) or die XBase->errstr;
{
    local $\ = $record_sep;    # output record separator of your choice
    local $, = $field_sep;     # output field separator of your choice
    # or use a CSV module
    open my $outfile, '>', '/path/to/foo.txt' or die $!;
    do {
        my @foo = $table->get_record($index)
            or die XBase->errstr;
        print $outfile @foo
            or die $!;
    } while $index++ < $table->last_record;
    close $outfile or die $!;
}
$table->close or die XBase->errstr;
You will probably want to refine that to check for deleted records, reorder fields, etc.
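On the deleted-record point: get_record returns the delete flag as the first element of the returned list, so the loop body could be adjusted along these lines (fragment of the loop above, untested here):

```perl
# Skip records whose delete flag (the first returned element) is set.
my ($deleted, @fields) = $table->get_record($index)
    or die XBase->errstr;
unless ($deleted) {
    print $outfile @fields or die $!;
}
```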
After Compline, Zaxo