Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Monks I humbly ask for some help.
I am writing a program that breaks apart files at work. The files are extracts from ARCGIS shape files pulled from the ARCGIS database we have here, and the one I am having trouble with is an attributes file. At this time the only thing I can think of to do is read the file once to get the record count and then read it a second time to dump the records out. What I have been running into looks like a memory leak, in that my computer starts to run really slow. I am sure it has to do with the while loop that I am running and then with the second read; I'm just not at my best right now and I think I have looped myself into a corner. I can post more of the code if need be, but to keep it short: I read in one file to get the file names, and then I open each of the filenames with the appropriate extension and start reading. The files are all in binary, by the way.
binmode AtrFile;

# Assign the variables, Base Atr size is 12
$base_readsize = 8;
$AtrTmp = "I I";
$bytes_read = read AtrFile, $dat_block, $base_readsize;
($Cksum, $NxIdx) = unpack( $AtrTmp, $dat_block );

# Assign the variables, Base Label size is 20
$LblTmp = "i i b8 b16 b8 i C";
$lbl_readsize = 20;
while ($bytes_read = read AtrFile, $lbl_block, $lbl_readsize) {
    $lbl_readsize = 20;
    ( $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt ) =
        unpack( $LblTmp, $lbl_block );
    $num_readsize = $Lngth;
    $NumTmp = "a" . $Lngth;
    read AtrFile, $num_block, $num_readsize;
    ( $LblVal ) = unpack( $NumTmp, $num_block );
    $x++;
}
close( AtrFile );

sleep 1;
print( STDOUT "MADE IT HERE AND $iIcao has $x records\n");

# Open Attribute file for reading
open( AtrFile, "<$AtrFile" );
binmode AtrFile;

# Assign the variables, Base Atr size is 12
$base_readsize = 8;
$AtrTmp = "I I";
$bytes_read = read AtrFile, $dat_block, $base_readsize;
($Cksum, $NxIdx) = unpack( $AtrTmp, $dat_block );
print( Type09File "$iIcao Set_Header: $Cksum, $NxIdx\n");

for ( $v = 0 ; $v < $x ; $v++ ) {
    # Assign the variables, Base Label size is 16
    $lbl_readsize = 20;
    $LblTmp = "i i b8 b16 b8 i C";
    $bytes_read = read AtrFile, $lbl_block, $lbl_readsize;
    ( $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt ) =
        unpack( $LblTmp, $lbl_block );
    print( Type09File "Label_Header: $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt, ");
    $num_readsize = $Lngth;
    $NumTmp = "a" . $Lngth;
    $bytes_read = read AtrFile, $num_block, $num_readsize;
    ( $LblVal ) = unpack( $NumTmp, $num_block );
    print( Type09File "$LblVal\n");
}

# Close data file(s)
close( GeoFile );
close( AtrFile );
}
Thanks for the help, Mike

Re: File Parsing
by Anonymous Monk on Oct 07, 2005 at 20:54 UTC

    It is unsurprising that you are lost in your code. If there were an Ig Nobel Prize for Software, this would have to be a front-running contender.

    Suggestion: Investigate strict and warnings, ditch the double spacing, and realise that you are not limited to six-character variable names. (Why $Lngth and not $Length?)

    By the time you have your code running clean of the problems the compiler can tell you about, you will probably see a better way of doing what you are doing yourself. If not, someone here will help you -- once they don't have to do the work you can and should do yourself.
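
    For the record, that just means starting the script with the standard two-line preamble and then declaring each variable with "my" the first time it appears. A minimal sketch, using one of your own variables as the example:

    use strict;      # makes undeclared variables a compile-time error
    use warnings;    # warns about suspect constructs at run time

    my $lbl_readsize = 20;    # every variable now needs a "my" declaration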

Re: File Parsing
by graff (Chancellor) on Oct 08, 2005 at 02:30 UTC
    Well, Mike, while the second Anonymous Monk in this thread was a tad heavy on snideness and trivial complaints, the truth remains that there is a lot more wrong with your code (and your post) than the problem that you sort of describe ("too slow, memory leak").

    How big are the data files, and how many files (i.e. how much data, total)? If it's a lot, well, maybe it takes so long simply because there's a lot of data and you are reading it all twice. For that matter, how much elapsed time do you consider to be "slow"? How much faster do you think it should be? (update: and what sort of system are you running on?)

    Based on what you've said and shown, I would guess the following:

    • There's no evidence here of any sort of "memory leak" or using too much memory, because the code as shown is not keeping anything in memory (except a handful of scalar values).
    • You are assigning the value returned by each "read()" call to a variable, but you never check the value. This is just one symptom of a rampant lack of error checking.
    • There's no need to read the file twice; whatever you're doing in the second loop over the file, just do it in the first loop instead. It seems the only information you derive from the first pass is the number of records (quaintly referred to as "$x"), and you don't use that information in the second loop for anything except to say how many iterations to do in reading the file -- but that's no different from using "eof" in the first loop. (See the single-pass sketch after this list.)
    • You probably don't need to use "sleep(1)" -- maybe that's what's slowing it down? ;-)
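
    To be concrete about the single-pass point, here is a rough, untested sketch. It assumes your AtrFile handle is already open in binmode, your 20-byte label-header layout, and an open Type09File handle. The read() return values get checked as a side effect of the loop tests, which also addresses the error-checking point above:

    my $recs = 0;
    my ( $lbl_block, $num_block );
    while ( read( AtrFile, $lbl_block, 20 ) == 20 ) {
        my ( $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt ) =
            unpack( 'i i b8 b16 b8 i C', $lbl_block );
        read( AtrFile, $num_block, $Lngth ) == $Lngth
            or die "short read in record payload: $!";
        # count and emit in the same pass -- no second read of the file
        print Type09File "Label_Header: $NumLbl, $LblKnd, $ZmLvl, ",
                         "$FntSz, $res, $Lngth, $Ornt, $num_block\n";
        $recs++;
    }
    print STDOUT "processed $recs records\n";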

    Anyway, why not print some useful messages to STDOUT -- show the current time before reading the file, show it after reading the file, show it again at some other point after you've done some other major processing step, and so on. It's even quite simple to show the elapsed time in seconds:

    $start = time;
    # do one big step
    $now = time;
    printf "Step one took %d sec\n", $now - $start;
    $start = $now;
    # do next big step
    $now = time;
    printf "Next step took %d sec\n", $now - $start;
    # and so on...
    That could tell you where most of the time is going, so you'll know which part of the code is taking the longest.

    (update: fixed typo in last printf statement)
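
    (One more aside: if whole seconds turn out to be too coarse, the core Time::HiRes module makes the same idiom work at sub-second resolution. A minimal sketch:)

    use Time::HiRes qw( time );   # overrides time() with a floating-point version

    my $start = time;
    # ... do one big step ...
    printf "Step one took %.3f sec\n", time - $start;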

      I appreciate what both of you have said in your posts back to me.
      The first post mentioned variable-name length; I tend to keep variable names in a shorthand style instead of spelling them out (just one of my quirks). strict and warnings are in the code, and I have done a lot of clean-up with those two turned on, but that does not eliminate the problem I have run into. I am on a Windows XP machine -- let's not hear the grief over that. I found, when using eof in the first loop, that the file I am parsing has some format violations of its own that really send the parser off into la-la land. Getting the record count fixes most of the problems, but I still have a few things to fix in the data records. The checking of the values happens in a later part of the routine that goes against the ARCGIS shape files, and boy is that fun, as again the data is not as well adjusted as it should be.

      Long story short: I am a Software Tester, and I have been given the dubious task of going through Company data and suggesting where to fix it. So far I have fixed 10 of 50 data streams in under 4 months, saving the Company 6 million a month in lost revenue -- but who cares, back to the task.

      The data files are not really that big, about a meg each, and the number of files varies between 8 and 300. Slow for me means taking over a minute to break one file; I should be able to break a file every few seconds and have an error file gen'ed in about a minute for all of the files in the dataset.

      All in all, I again appreciate what both of you have said, and I will continue to work over the code base. From what has been posted so far, I may have more of an issue with the data in the files than with the poor code I have written to parse them.



      Thanks, Mike
        This is much better. I took a closer look at your original code and came up with a simplified version. If I understand the problem, you want to read through the "AtrFile" once to make sure you can get to the end of it correctly, and if that works, read it again and "decode" some of the content into another file. If so, then the biggest issue is to improve your error checking.

        Also, I'm no jedi when it comes to pack/unpack, but I was struck by your use of  "a" . $Lngth to unpack "$num_block" into "$LblVal". I think this is an unnecessary use of unpack, because the packed and unpacked values turn out to be identical. Consider:

        $packed = join "", map { chr() } (0x01..0x20);
        $plen = length( $packed );
        $tmpl = "a" . $plen;
        print "unpacking $plen bytes using (a): ";
        $unpacked = unpack( $tmpl, $packed );
        $result = ( $packed eq $unpacked ) ? "same" : "different";
        print "$result\n";

        For me that outputs "unpacking 32 bytes using (a): same". (Update: you can change the first line to "0x00..0xff", and the result on 256 bytes will still be "same".)
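
        Which suggests you can skip the unpack step entirely and read straight into the destination variable. Untested, but using the handle and variable names from your original code, it would be something like:

        # read $Lngth bytes directly into $LblVal -- "a"-unpacking it again is a no-op
        read( AtrFile, my $LblVal, $Lngth ) == $Lngth
            or die "short read of $Lngth bytes: $!";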

        Anyway, as for the simplified version of your code, you'll see that I prefer to write a subroutine rather than write the same block of code more than once. Also, I use "seek" to get back to the start of the file for the second pass, rather than closing and reopening. Finally, I add a lot of error checking, and "die" with suitable messages when things don't go as expected. (There are other ways, but this is a decent start.) I threw in enough bogus variable declarations so as to retain all your original variables and still pass "strict" (but apart from that, it's untested, of course):

        my $iIcao = "???"; my $AtrFile = "some_file_name"; open( my $atrfh, $AtrFile ) or die "$AtrFile: $!"; binmode $atrfh; my $base_readsize = 8; my $base_block; if (( read $atrfh, $base_block, $base_readsize ) != $base_readsize ) { die "$AtrFile: read failed on first $base_readsize bytes: $!"; } my ($Cksum, $NxIdx) = unpack( 'II', $base_block ); my $offset = $base_readsize; my ( $lastrec, $bytecount ) = parse_records( $atrfh, $offset, "debug" +); # parse_records will die if there are problems with the file data print "$AtrFile: $lastrec records, $bytecount bytes read okay\n"; # Now that AtrFile has passed the sanity checks, start print to Type09 +File; # Just rewind $atrfh to the first LBL record and repeat the read loop print Type09File "$iIcao Set_Header: $Cksum, $NxIdx\n"; seek $atrfh, $base_readsize, 0; parse_records( $atrfh, $offset, "print" ); close( $atrfh ); sub parse_records { my ( $rfh, $offset, $mode ) = @_; my $lbl_readsize = 20; my ( $lbl_block, $num_block ); my $bytes_read; my $recid = 0; my $lbltmpl = 'i i b8 b16 b8 i C'; while (( $bytes_read = read $rfh, $lbl_block, $lbl_readsize ) == $ +lbl_readsize ) { $recid++; $offset += $lbl_readsize; my ( $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $Lngth, $Ornt ) = unpack( $lbltmpl, $lbl_block ); if (( read $rfh, $num_block, $Lngth ) != $Lngth ) { die "$AtrFile: can't read $Lngth bytes at rec# $recid (off +s: $offset): $!"; } # the following assumes that there is an open file handle # called "Type09File" -- might be better to make this a # "my" variable and pass it as an arg... if ( $mode eq 'print' ) { print Type09File "Label_Header: " . join( ", ", $NumLbl, $LblKnd, $ZmLvl, $FntSz, $res, $L +ngth, $Ornt, "$num_block\n" ); } } if ( $bytes_read > 0 ) { die "$AtrFile: got $bytes_read bytes, not $lbl_readsize ". "after rec# $recid (offs: $offset)\n"; } elsif ( $bytes_read < 0 ) { die "$AtrFile: read error ($bytes_read) after rec# $recid (off +s: $offset)\n"; } return ( $recid, $offset ); }
        As for improving overall speed, I don't have anything to offer on that -- if it's your code that's causing the delay, I'm guessing the problem is somewhere other than the part you've shown us. Again, if you scatter some timing reports around, you'll get a better idea where to look.