in reply to Re: File Parsing
in thread File Parsing

I appreciate what both of you have said in your replies.

On the first point, about variable-name length: I tend to keep variable names in a shorthand style instead of spelling them out (just one of my quirks). Both strict and warnings are in the code, and I have done a lot of clean-up with those two turned on, but that does not eliminate the problem I have run into. I am on a Windows XP machine; let's not hear any grief over that.

I found that, using eof in the first loop, the file I am parsing has some violations of its own accord that really send the parser off into la-la land. Getting the record count fixes most of the problems, but I still have a few things to fix in the data records. The checking of the values comes in a later part of the routine, which goes against the ARCGIS shape files, and boy is that fun, as again the data is not as well adjusted as it should be.

Long story short: I am a software tester, and I have been given the dubious task of going through company data and suggesting where to fix it. So far I have fixed 10 of 50 data streams in under 4 months, saving the company 6 million a month in lost revenue. But who cares; back to the task.

The data files are not really that big: about a meg each, and the number of files varies between 8 and 300. "Slow" for me means taking more than a minute to break down one file; I should be able to break down a file every few seconds and have an error file generated in about a minute for all of the files in the dataset.

All in all, I again appreciate what both of you have said, and I will continue to work over the code base. From what has been posted so far, I may have more of an issue with the data presented in the files than with the poor code I have written to parse said files.



Thanks, Mike

Re^3: File Parsing
by graff (Chancellor) on Oct 09, 2005 at 14:17 UTC
    This is much better. I took a closer look at your original code, and came up with a simplified version. If I understand the problem, you want to make sure you can read through the "AtrFile" once without any problems, to make sure you can get to the end of it correctly, and if that works, then you want to read it again and "decode" some of the content into another file. If so, then the biggest issue is to improve your error checking.

    Also, I'm no jedi when it comes to pack/unpack, but I was struck by your use of  "a" . $Lngth to unpack "$num_block" into "$LblVal". I think this is an unnecessary use of unpack, because the packed and unpacked values turn out to be identical. Consider:

    $packed = join "", map { chr() } (0x01..0x20);
    $plen = length( $packed );
    $tmpl = "a" . $plen;
    print "unpacking $plen bytes using (a): ";
    $unpacked = unpack( $tmpl, $packed );
    $result = ( $packed eq $unpacked ) ? "same" : "different";
    print "$result\n";

    For me that outputs "unpacking 32 bytes using (a): same". (Update: you can change the first line to "0x00..0xff", and the result on 256 bytes will still be "same".)
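    (Editorial aside, not from the original posts: the reason "a" round-trips unchanged is standard pack/unpack behavior, documented in perldoc -f pack. The "a" template treats the field as raw bytes, while "A" strips trailing spaces and NULs on unpack. A minimal sketch of the difference:)

```perl
use strict;
use warnings;

# "a" = arbitrary binary data: unpack returns the bytes unchanged.
# "A" = ASCII string: unpack strips trailing spaces and NUL bytes.
my $data = "hello\0\0\0";

my $raw     = unpack( "a8", $data );   # identical to $data, NULs kept
my $trimmed = unpack( "A8", $data );   # "hello", trailing NULs stripped

printf "raw length: %d, trimmed length: %d\n",
    length($raw), length($trimmed);    # raw length: 8, trimmed length: 5
```

    So an unpack with "a" . $Lngth on a string that is already exactly $Lngth bytes long is a no-op, which is why the values compare as "same" above.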

    Anyway, as for the simplified version of your code, you'll see that I prefer to write a subroutine rather than write the same block of code more than once. Also, I use "seek" to get back to the start of the file for the second pass, rather than closing and reopening. Finally, I add a lot of error checking, and "die" with suitable messages when things don't go as expected. (There are other ways, but this is a decent start.) I threw in enough bogus variable declarations so as to retain all your original variables and still pass "strict" (but apart from that, it's untested, of course):

    my $iIcao = "???";
    my $AtrFile = "some_file_name";

    # three-arg open is safer than the two-arg form
    open( my $atrfh, '<', $AtrFile ) or die "$AtrFile: $!";
    binmode $atrfh;

    my $base_readsize = 8;
    my $base_block;
    if (( read $atrfh, $base_block, $base_readsize ) != $base_readsize ) {
        die "$AtrFile: read failed on first $base_readsize bytes: $!";
    }
    my ($Cksum, $NxIdx) = unpack( 'II', $base_block );
    my $offset = $base_readsize;

    my ( $lastrec, $bytecount ) = parse_records( $atrfh, $offset, "debug" );
    # parse_records will die if there are problems with the file data

    print "$AtrFile: $lastrec records, $bytecount bytes read okay\n";

    # Now that AtrFile has passed the sanity checks, start print to Type09File;
    # Just rewind $atrfh to the first LBL record and repeat the read loop

    print Type09File "$iIcao Set_Header: $Cksum, $NxIdx\n";
    seek $atrfh, $base_readsize, 0;
    parse_records( $atrfh, $offset, "print" );
    close( $atrfh );

    sub parse_records {
        my ( $rfh, $offset, $mode ) = @_;
        my $lbl_readsize = 20;
        my ( $lbl_block, $num_block );
        my $bytes_read;
        my $recid = 0;
        my $lbltmpl = 'i i b8 b16 b8 i C';

        while (( $bytes_read = read $rfh, $lbl_block, $lbl_readsize ) == $lbl_readsize ) {
            $recid++;
            $offset += $lbl_readsize;
            my ( $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt ) =
                unpack( $lbltmpl, $lbl_block );
            if (( read $rfh, $num_block, $Lngth ) != $Lngth ) {
                die "$AtrFile: can't read $Lngth bytes at rec# $recid (offs: $offset): $!";
            }
            $offset += $Lngth;   # keep the offset in step with the payload just read
            # the following assumes that there is an open file handle
            # called "Type09File" -- might be better to make this a
            # "my" variable and pass it as an arg...
            if ( $mode eq 'print' ) {
                print Type09File "Label_Header: " .
                    join( ", ", $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res,
                          $Lngth, $Ornt, "$num_block\n" );
            }
        }
        if ( ! defined $bytes_read ) {
            # read() returns undef on error; it never returns a negative count
            die "$AtrFile: read error after rec# $recid (offs: $offset): $!\n";
        }
        elsif ( $bytes_read > 0 ) {
            die "$AtrFile: got $bytes_read bytes, not $lbl_readsize ".
                "after rec# $recid (offs: $offset)\n";
        }
        return ( $recid, $offset );
    }
    As for improving overall speed, I don't have anything to offer on that -- if it's your code that's causing the delay, I'm guessing the problem is somewhere other than the part you've shown us. Again, if you scatter some timing reports around, you'll get a better idea where to look.
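    (Editorial aside: one way to scatter those timing reports is with the core Time::HiRes module. The glob pattern and the parse step below are placeholders, not anything from Mike's code; this is just a sketch of the technique:)

```perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

# Hypothetical file list -- substitute the real dataset paths.
my @files = glob("*.atr");

for my $file (@files) {
    my $t0 = [ gettimeofday ];

    # ... parse $file here ...

    my $elapsed = tv_interval( $t0 );   # seconds, with sub-second resolution
    printf STDERR "%s: %.3f seconds\n", $file, $elapsed;
}
```

    A per-file report like this will show quickly whether one pathological file is eating the minute, or whether every file is uniformly slow.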