It is unsurprising that you are lost in your code. If there were an Ig Nobel award for software, this would have to be a front-running contender.
Suggestion: Investigate strict and warnings, ditch the double spacing, and realise that you are not limited to six-character variable names. (Why $Lngth and not $Length?)
By the time your code runs clean of the problems the compiler can tell you about, you will probably see a better way of doing what you are doing yourself; if not, someone here will help you once they don't have to do the work you can and should do yourself.
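For instance, with both pragmas turned on, a misspelled variable name becomes a compile-time error instead of a silently empty value -- a minimal sketch:
use strict;
use warnings;

my $Length = 42;                 # declared with "my", so strict is happy
print "Length is $Length\n";

# A typo like $Lngth would abort compilation under strict:
# print "Length is $Lngth\n";
#   --> Global symbol "$Lngth" requires explicit package name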
Well, Mike, while the second Anonymous Monk in this thread was a tad heavy on snideness and trivial complaints, the truth remains that there is a lot more wrong with your code (and your post) than the problem that you sort of describe ("too slow, memory leak").
How big are the data files, and how many files (i.e. how much data, total)? If it's a lot, well, maybe it takes so long simply because there's a lot of data and you are reading it all twice. For that matter, how much elapsed time do you consider to be "slow"? How much faster do you think it should be? (update: and what sort of system are you running on?)
Based on what you've said and shown, I would guess the following:
- There's no evidence here of any sort of "memory leak" or using too much memory, because the code as shown is not keeping anything in memory (except a handful of scalar values).
- You are assigning the value returned by each "read()" call to a variable, but you never check the value. This is just one symptom of a rampant lack of error checking.
- There's no need to read the file twice; whatever you're doing in the second loop over the file, just do that in the first loop instead. It seems like the only information you derive from the first pass is the number of records (quaintly referred to as "$x"), and you don't use that information in the second loop for anything (except to decide how many iterations to make when reading the file -- but that's no different from using "eof" in the first loop; see the sketch after this list).
- You probably don't need to use "sleep(1)" -- maybe that's what's slowing it down? ;-)
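For what it's worth, here is a minimal single-pass sketch (the file name, record size, and decode step are placeholders, not your actual layout) that counts records and processes them in the same loop, so the file is read only once:
my $file    = 'data.atr';   # placeholder name
my $recsize = 20;           # placeholder fixed record size
my $count   = 0;
open my $fh, '<', $file or die "$file: $!";
binmode $fh;
until ( eof $fh ) {
    my $got = read $fh, my $record, $recsize;
    die "$file: bad read at record $count: $!"
        unless defined $got and $got == $recsize;
    $count++;
    # ... decode $record and do the real work here ...
}
close $fh;
print "$count records\n";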
Anyway, why not print some useful messages to STDOUT -- show the current time before reading the file, show it after reading the file, show it again at some other point after you've done some other major processing step, and so on. It's even quite simple to show the elapsed time in seconds:
$start = time;
# do one big step
$now = time;
printf "Step one took %d sec\n", $now - $start;
$start = $now;
# do next big step
$now = time;
printf "Next step took %d sec\n", $now - $start'
# and so on...
That could tell you where most of the time is going, so you'll know which part of the code is taking the longest.
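If whole seconds turn out to be too coarse, the core Time::HiRes module lets the same idiom report fractional seconds:
use Time::HiRes qw( time );   # replaces the built-in time() with a sub-second version

my $start = time;
# do one big step
printf "Step one took %.3f sec\n", time - $start;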
(update: fixed typo in last printf statement)
I appreciate what both of you have said in your posts back to me. On the first post about variable name length: I tend to keep variable names in a shorthand style instead of spelling them out (just one of my quirks). Strict and warnings are in the code, and I have done a lot of clean-up with those two turned on, of course, but that does not eliminate the problem I have run into. I am on a Windows XP machine; let's not hear the grief over that. I found that, using eof in the first loop, the file I am parsing has some violations of its own accord that really send the parser off into la-la land. Getting the record count fixes most of the problems, but I still have a few things to fix in the data record. The checking of the values is in a later part of the routine that goes against the ARCGIS shape files, and boy is that fun, as again the data is not as well adjusted as it should be.
Long story short, I am a software tester and I have been given the dubious task of going through Company data and suggesting where to fix it. So far I have fixed 10 of 50 data streams in under four months, saving the Company 6 million a month in lost revenue. But who cares; back to the task.
The data files are not really that big, about a meg each, and the number of files varies between 8 and 300. Slow for me tends to be taking over a minute to break down one file; I should be able to get through a file every few seconds and have an error file generated in about a minute for all of the files in the dataset.
All in all, I again appreciate what both of you have said, and I will continue to work over the code base. From what has been posted so far, I may have more of an issue with the data presented in the files than with the poor code I have written to parse said files.
Thanks, Mike
This is much better. I took a closer look at your original code and came up with a simplified version. If I understand the problem, you want to read through the "AtrFile" once without any problems, to confirm you can get to the end of it correctly, and if that works, read it again and "decode" some of the content into another file. If so, the biggest issue is to improve your error checking.
Also, I'm no jedi when it comes to pack/unpack, but I was struck by your use of "a" . $Lngth to unpack "$num_block" into "$LblVal". I think this is an unnecessary use of unpack, because the packed and unpacked values turn out to be identical. Consider:
$packed = join "", map { chr() } (0x01..0x20);
$plen = length( $packed );
$tmpl = "a" . $plen;
print "unpacking $plen bytes using (a): ";
$unpacked = unpack( $tmpl, $packed );
$result = ( $packed eq $unpacked ) ? "same" : "different";
print "$result\n";'
For me that outputs "unpacking 32 bytes using (a): same". (Update: you can change the first line to "0x00..0xff", and the result on 256 bytes will still be "same".)
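Put another way, unpacking with a bare "a" count is just a byte-for-byte copy, so plain substr (or the buffer itself) does the same job without a template -- for illustration:
my $LblVal  = unpack( "a$Lngth", $num_block );   # what the original does
my $LblVal2 = substr( $num_block, 0, $Lngth );   # same bytes, no template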
Anyway, as for the simplified version of your code, you'll see that I prefer to write a subroutine rather than write the same block of code more than once. Also, I use "seek" to get back to the start of the file for the second pass, rather than closing and reopening. Finally, I add a lot of error checking, and "die" with suitable messages when things don't go as expected. (There are other ways, but this is a decent start.) I threw in enough bogus variable declarations so as to retain all your original variables and still pass "strict" (but apart from that, it's untested, of course):
my $iIcao = "???";
my $AtrFile = "some_file_name";
open( my $atrfh, '<', $AtrFile ) or die "$AtrFile: $!";
binmode $atrfh;
my $base_readsize = 8;
my $base_block;
if (( read $atrfh, $base_block, $base_readsize ) != $base_readsize ) {
die "$AtrFile: read failed on first $base_readsize bytes: $!";
}
my ($Cksum, $NxIdx) = unpack( 'II', $base_block );
my $offset = $base_readsize;
my ( $lastrec, $bytecount ) = parse_records( $atrfh, $offset, "debug" );
# parse_records will die if there are problems with the file data
print "$AtrFile: $lastrec records, $bytecount bytes read okay\n";
# Now that AtrFile has passed the sanity checks, start printing to Type09File.
# Just rewind $atrfh to the first LBL record and repeat the read loop
print Type09File "$iIcao Set_Header: $Cksum, $NxIdx\n";
seek $atrfh, $base_readsize, 0;
parse_records( $atrfh, $offset, "print" );
close( $atrfh );
sub parse_records
{
    my ( $rfh, $offset, $mode ) = @_;
    my $lbl_readsize = 20;
    my ( $lbl_block, $num_block );
    my $bytes_read;
    my $recid = 0;
    my $lbltmpl = 'i i b8 b16 b8 i C';
    while (( $bytes_read = read $rfh, $lbl_block, $lbl_readsize ) == $lbl_readsize )
    {
        $recid++;
        $offset += $lbl_readsize;
        my ( $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt ) =
            unpack( $lbltmpl, $lbl_block );
        if (( read $rfh, $num_block, $Lngth ) != $Lngth ) {
            die "$AtrFile: can't read $Lngth bytes at rec# $recid (offs: $offset): $!";
        }
        $offset += $Lngth;   # count the label-data bytes too, so the offsets stay honest
        # the following assumes that there is an open file handle
        # called "Type09File" -- might be better to make this a
        # "my" variable and pass it as an arg...
        if ( $mode eq 'print' ) {
            print Type09File "Label_Header: " .
                join( ", ", $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt,
                    "$num_block\n" );
        }
    }
    if ( ! defined $bytes_read ) {
        # read() returns undef on error -- it never goes negative
        die "$AtrFile: read error after rec# $recid (offs: $offset): $!\n";
    }
    elsif ( $bytes_read > 0 ) {
        die "$AtrFile: got $bytes_read bytes, not $lbl_readsize " .
            "after rec# $recid (offs: $offset)\n";
    }
    return ( $recid, $offset );
}
As for improving overall speed, I don't have anything to offer on that -- if it's your code that's causing the delay, I'm guessing the problem is somewhere other than the part you've shown us. Again, if you scatter some timing reports around, you'll get a better idea where to look.