THuG has asked for the wisdom of the Perl Monks concerning the following question:

Okay...

I have a script that is parsing through single line files we call DPFs (Data Parsing Failure). It is looking to see why the file failed. In each of these files is a series of records, of different lengths. Each record starts with a three letter identifier. The next four characters give the length of the record.

Some of these DPFs occur because a record gets truncated, or goes beyond its length. This means the next record identifier is not where it should be. I am currently hunting for this by reading the next three characters, checking to see if it is a valid ID, and backing up two spaces (seek $dpf, -2, 1;) if it is not. I continue this until I find the next record.
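
For readers following along, the read-three-and-back-up-two hunt described above might look roughly like this. It is only a sketch: the sample data, the %valid_id lookup and the in-memory filehandle are stand-ins invented for illustration, not part of the real script.

```perl
use strict;
use warnings;

# Hypothetical ID list and sample data, for illustration only.
my %valid_id = map { $_ => 1 } qw(MEH MED MMD MMS CR1 FR1);
my $data = 'MEH0004ABCDxxMED0003XYZ';

# An in-memory filehandle stands in for the real DPF file.
open(my $dpf, '<', \$data) or die "Cannot open in-memory file: $!";

# Pretend the first record (7-byte header + 4 data bytes) was already parsed.
seek($dpf, 11, 0);

my $found;
while ((read($dpf, my $id, 3) // 0) == 3) {
    if ($valid_id{$id}) {
        $found = tell($dpf) - 3;   # offset where the ID starts
        last;
    }
    seek($dpf, -2, 1);             # back up two bytes, i.e. advance by one
}

print defined $found ? "next record ID at offset $found\n"
                     : "no further record ID\n";
```

Each failed probe costs a read and a seek per byte, which is part of why a regex over a string is attractive here.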

I want to use m// to find the next record. I originally didn't use m//g because I misread the docs on pos(). I thought it returned the last match, as though it found every occurrence and returned the last one. Merlyn set me straight: it returns the offset where the last (most recent) m//g match ended.

Well, I start using m//g and now pos() keeps giving me the first match in the file. I don't follow why it doesn't start looking from where the file pointer currently is (from the last read $dpf), but I figure I can get around this. I keep doing m//g in a do {} until pos() >= $lastknownpos. And then try to match one more time.

This of course doesn't work. I tried to use m/\G/g, but I just don't understand how to use it.

Here is what I have... (more or less, the script is very long)
    while ($currpos < $filelen){
        my ($readLen) = read $dpf, my ($recordID), 3;
        if (isValidID($recordID)) {
            #check the record length
            #if it is good, then parse the record for errors there
            #if the record is the wrong length, then skip parsing
            #when we read the next 3 chars we will probably not read a valid ID
            #and should begin hunting
        }
        elsif ($readLen == 3) {
            do {
                $dpf =~ m/\G.*MEH|MED|MMD|MMS|CR1|FR1/ig;
            } until (pos() >= $currpos);
            #this should put me at the record ID of the last
            #record I tried to parse
            #I will imagine there is a better way, particularly if
            #I can start from where the file pointer is and not from the
            #beginning of the file
            if ($dpf =~ m/\G.*MEH|MED|MMD|MMS|CR1|FR1/ig) {
                #matching one more time will (hopefully) match the next record ID
                seek $dpf, pos(), 0;
            }
            #if it doesn't match, then this truncated record is the last
            #record in the file (typical)
            else {
                seek $dpf, 0, 2;
            }
        }
        else {
            #if I didn't read three characters, then I hit EOF, seek there
            #so $currpos will be updated and we will fall out of the while{}
            seek $dpf, 0, 2;
        }
        $currpos = tell $dpf;
    } #end while

Of note:
$dpf is a filehandle open (my $dpf=\*FH, "file.dpf");
I am sure I can put MEH|MED|MMD|MMS|CR1|FR1 in a scalar so I don't have to copy it so often. That would also make future versions easy to update when I add new record formats. But one thing at a time. (I would start playing with m//o if I put MEH|MED|MMD|MMS|CR1|FR1 in a scalar, and I'm liable to break something.)

So, my problem is... after I find the record ID of a bad record (too long or too short), how do I easily find the next record ID (if one exists)?

-Travis

Replies are listed 'Best First'.
RE: Parsing and \G and /g and Stupid User Tricks
by tye (Sage) on Aug 08, 2000 at 19:51 UTC

    Okay, I must be missing something big. You appear to be using $dpf alternately as a file handle (to read()) and as a string (something that you can match regular expressions against). You can't match a filehandle against a regular expression (unless I've missed a major added feature). If you don't pass an argument to pos(), then it works on $_, which you never use anywhere else in your program. I don't see how this can work at all!

    I suggest you read the entire contents of the file into a single string and do your processing on the string. Then you won't need \G nor .*.

            - tye (but my friends call me "Tye")

      I want to avoid reading in the entire file, since I am under the impression that I can save memory that way, and it is very easy to let Perl take care of where in the string I am (tell), let me step through it (read) and jump to where I need to be (seek). All of this is possible, I'm sure, in a string, but I'm guessing not as easy.

      Whether or not you are allowed to match via a pointer, I don't know. It isn't breaking anything, it just isn't working. I am not sure why it is not working, but I am under the impression that it is matching on the first record ID over and over again, and never moving on.

      I could be wrong. I'll play some more and see.

      -Travis

      v2:

      Okay, no, I can't match via the file handle. I can do <FIN>=~m//g;, but I imagine that is because it reads <FIN> into $_ and then does the m//g, which means I am reading in the entire file anyway! Nothing saved there, might as well read it all into $str.

      -Travis

        Yes, you can use tell, seek, and read easily on a file handle. But you can't use a regular expression on a file handle so you need a major change to your approach.

        How big are the files? Perl uses lots of memory for most things, so reading even a moderately large file into memory probably isn't going to make much of a difference to memory usage.

        In any case, you're going to have to read in a big chunk of the file into a string (say $str) so that you can match a regular expression against that chunk. If you use ($str =~ m/.../g), then you can use $pos= pos($str) to see where you currently are in the string and pos($str)= $newpos to change where in the string to start matching against.
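
        A small sketch of reading and assigning pos($str), assuming a made-up sample string (the record contents below are not from a real DPF):

```perl
use strict;
use warnings;

# Hypothetical sample: three records back to back on one line.
my $str = 'MEH0003abcMED0002xyFR10001z';
my $ids = qr/MEH|MED|MMD|MMS|CR1|FR1/;

# Each successful m//g in scalar context advances pos($str).
my @starts;
while ($str =~ /$ids/g) {
    push @starts, pos($str) - 3;   # pos() is the offset just past the match
}
print "IDs start at: @starts\n";

# Assigning to pos($str) moves the point where the next m//g begins.
pos($str) = 11;
my $next = $str =~ /$ids/g ? pos($str) - 3 : undef;
print "first ID at or after offset 11: ",
      defined $next ? $next : 'none', "\n";
```

        Note that a failed m//g resets pos($str), which is why the second search sets it explicitly before matching.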

                - tye (but my friends call me "Tye")
Parsing and \G and /g and Stupid User Tricks: Correction
by THuG (Beadle) on Aug 08, 2000 at 18:38 UTC
    Sorry... the m/\G.*MEH|MED|MMD|MMS|CR1|FR1/ig should read
    m/\G.*[MEH|MED|MMD|MMS|CR1|FR1]/ig

    Minor but.. it does affect things.

    -Travis
      that should be:
      m/\G.*(MEH|MED|MMD|MMS|CR1|FR1)/ig
      or
      m/\G.*(?:MEH|MED|MMD|MMS|CR1|FR1)/ig
      if you don't want to capture it. just a minor point, but that could break the code.
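
      to see why the grouping matters, here is a quick comparison of the three spellings on a made-up string (the sample text is only an illustration):

```perl
use strict;
use warnings;

my $str = 'xxMEDyyH';   # hypothetical sample

# Ungrouped: the alternatives are "\G.*MEH", "MED", "MMD", ...
# so \G and .* belong only to the first alternative.
my ($loose) = $str =~ /(\G.*MEH|MED|MMD|MMS|CR1|FR1)/;

# Grouped: \G.* must be followed by one complete ID.
my ($grouped) = $str =~ /(\G.*(?:MEH|MED|MMD|MMS|CR1|FR1))/;

# Character class: \G.* followed by ONE character out of the set
# M E H | D S C R 1 F, not by a three-letter ID.
my ($class) = $str =~ /(\G.*[MEH|MED|MMD|MMS|CR1|FR1])/;

print "ungrouped: $loose\n";
print "grouped:   $grouped\n";
print "class:     $class\n";
```

      the character class version runs to the trailing H because a lone H is enough to satisfy it.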

      also, instead of iterating repeatedly with the loop to find pos(), you could set pos() to $currpos and work from there (you'd have to match two more times, i think). that'll make it a little faster.
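
      a rough sketch of that idea, assuming the file contents are already in a string. the sample data is invented, and two details differ from the pattern above: the search starts just past the known ID so a single match is enough, and .*? is non-greedy, since a greedy .* would run to the last ID in the string rather than the nearest one.

```perl
use strict;
use warnings;

# Hypothetical sample: the middle record claims 9 bytes but carries only 4.
my $str     = 'MEH0003abcMED0009wxyzFR10001q';
my $currpos = 10;      # offset of the bad record's ID ("MED")
my $ids     = qr/MEH|MED|MMD|MMS|CR1|FR1/;

pos($str) = $currpos + 3;          # start just past the ID we already know
if ($str =~ /\G.*?(?=$ids)/gs) {
    # The lookahead does not consume the ID, so pos() sits right on it.
    printf "next record ID at offset %d\n", pos($str);
} else {
    print "no further record ID; the damaged record is the last one\n";
}
```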

      jeff
        I can set pos() to $currpos? Rock on! I knew there had to be a way to start from where I left off, and not the beginning of the file.

        -Travis
Re: Parsing and \G and /g and Stupid User Tricks
by turnstep (Parson) on Aug 08, 2000 at 20:20 UTC

    > Some of these DPFs occur because a record gets truncated, or goes beyond its length.

    So, basically the length field is worthless, but always there? Why not something like this then:

    $TLI = "MEH|MED|MMD|MMS|CR1|FR1";
    while(m/($TLI)(....)([^$TLI]+)/g) {
      $error=$1;
      $length=$2;
      $text=$3;
      ## Do what you need with them here...
    }
      Ideally, the length field tells the parsing program how long the record is. The file is usually broken into its separate records and put into MQ/Series to be sent to the database.

      There are a few things that will cause the file to be rejected. One is, it reads a record (or what it thinks is a record) and then tries to read the next three characters, expecting the next record ID. If a valid ID is not there, the file is put aside for us to fix.

      Now... given what you are saying... will this work like I expect it to?

      $RID = "MEH|MED|MMD|MMS|CR1|FR1";
      while(<$dpf>) {
        while(m/($RID)(....)(.*)($RID)/g) {
          $RecordID = $1;
          $RecordLen = $2;
          $RecordData = $3;
          if (length($RecordData) != $RecordLen) {
            #ERROR: Record is wrong length
          }
        }
      }

      What I am expecting is: it will step through the file (an example is given below) one record at a time, pulling each record into $RecordData. Do I need to use \G to get it to continue where it left off? Do I need to do anything special for the last record in the file? Why did you use [^$TLI]+?

      -Travis
      PS: Example file:
      MEH0016BUNCHODATA123456MED0019BUNCHMOREDATA456789MED0018MOREDATAAGAIN44568


      v0.2: changed while($dpf) to while(<$dpf>). Which brings us back to reading the file into $_. Thanks, Tye.

        while($dpf) doesn't really do anything. Perhaps you meant while(<$dpf>), but that will only work if you have newlines at appropriate places in the file (which doesn't sound like it is the case) or if you have set $/ to your record separator (but you don't have a record separator, do you?).

                - tye (but my friends call me "Tye")
        that last set of parentheses, ($RID), should be a lookahead assertion, (?=$RID). if you don't do this, the match will consume that ID and skip it the next time you read something. you don't need the \G unless there are pieces in the data that aren't going to match (it seems like everything in the data is a valid piece of data).
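
        here is a sketch of that loop with the lookahead in place. it makes a few assumptions beyond the thread: the length field counts only the data bytes (which is what the example file suggests), .*? is made non-greedy so each match stops at the nearest ID, and \z is added to the lookahead so the last record is picked up too. the sample line is the one from the post with its last record cut short by two bytes.

```perl
use strict;
use warnings;

my $RID  = 'MEH|MED|MMD|MMS|CR1|FR1';
# Example line from the post; the last record is truncated by two bytes
# (a hypothetical corruption, for illustration).
my $line = 'MEH0016BUNCHODATA123456MED0019BUNCHMOREDATA456789MED0018MOREDATAAGAIN445';

my @report;
while ($line =~ /($RID)(\d{4})(.*?)(?=$RID|\z)/gs) {
    my ($id, $len, $data) = ($1, $2, $3);
    # Compare the declared length with what is actually there.
    my $status = length($data) == $len ? 'ok' : 'wrong length';
    push @report, "$id declared=$len actual=" . length($data) . " $status";
}
print "$_\n" for @report;
```

        this still breaks if an ID string can occur inside record data, since the lookahead would stop there.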

        perhaps you want to use split instead of a regex? if every part of your data is getting tested, you probably don't need, or want, a regular expression. split the data on your $rid codes, and test the rest of the data. or since you're reading from a file... read till you hit a $rid code. test what you've read. continue till EOF.
        s///
        by THuG (Beadle) on Aug 08, 2000 at 21:04 UTC
RE: Parsing and \G and /g and Stupid User Tricks: Final Solution, v2
by THuG (Beadle) on Aug 08, 2000 at 21:21 UTC
    Okay, here is what I've settled on doing. I will work on fine tuning it and cutting out unnecessary steps.

    $RID = "MEH|MED|MMD|MMS|CR1|FR1";
    open (FIN, "<file.dpf");
    my $filecontents = <FIN>;
    $filecontents =~ s/($RID)/\n$1/ig;
    my @filecontents = split('\n', $filecontents);
    my $count = 0;
    foreach my $record (@filecontents) {
      print ++$count . "\t$record\n";
    }


    What the code is doing right now is spitting out each record, with a tally to the left, so I can see that it is working. Since each record will be on its own line, I don't have to hunt.

    Now, just for better understanding, can I do the split without the substitution?

    -Travis


    v2: Leave it to Tye to find a problem. Okay, I can't do a global split. I think I have a way around it, let me try something.
      local( $/ )= undef;
      @records= split /(?=$RID)/, <FIN>;

      Note that you need to set $/ if you want to read in the entire file at once.
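
      A self-contained sketch of the slurp-and-split approach, using an in-memory filehandle and invented sample data in place of file.dpf. It also shows one detail worth knowing: capturing parentheses inside a split pattern make split return each captured ID as an extra element.

```perl
use strict;
use warnings;

my $RID  = 'MEH|MED|MMD|MMS|CR1|FR1';
my $data = 'MEH0003abcMED0002xyFR10001z';   # hypothetical sample

# In-memory filehandle standing in for file.dpf.
open(my $fin, '<', \$data) or die "Cannot open: $!";
my $contents = do { local $/; <$fin> };     # undef $/ => read everything at once
close $fin;

# Zero-width lookahead: split in front of each ID without consuming it.
my @records = split /(?=$RID)/, $contents;

# With capturing parentheses, the captured IDs are returned as well.
my @captured = split /(?=($RID))/, $contents;

print scalar(@records), " records: @records\n";
print scalar(@captured), " elements with captures: @captured\n";
```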

      So a record can never have, for example, "MEH" in the middle of it? What kind of data is in these records that you can guarantee that these record IDs never appear in the middle of a record?
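
      If IDs can turn up inside the data, one possible hedge is to trust the length field first and hunt with a regex only when the next ID is not where the length says it should be. The following is a sketch under assumptions, not a tested parser: the sample string is invented, and the length field is assumed to count only the data bytes after the 7-byte header.

```perl
use strict;
use warnings;

my $ids = qr/MEH|MED|MMD|MMS|CR1|FR1/;
# Hypothetical sample: the first record legitimately contains "MEH" in its
# data; the second declares 9 data bytes but carries only 4.
my $str = 'MEH0006abMEHcMED0009wxyzFR10001q';

my @log;
my $pos = 0;
my $eof = length $str;
while ($pos < $eof) {
    pos($str) = $pos;
    my ($id, $len) = $str =~ /\G($ids)(\d{4})/;
    unless (defined $id) {
        push @log, "offset $pos: no valid header, giving up";
        last;
    }
    my $next = $pos + 7 + $len;    # 3-byte ID + 4-byte length + data bytes

    # The record is well formed if it ends at EOF or right before another ID.
    if ($next == $eof
        || ($next < $eof && substr($str, $next, 3) =~ /\A$ids\z/)) {
        push @log, "offset $pos: $id length ok";
        $pos = $next;
        next;
    }

    # Length and data disagree: hunt for the nearest ID after this header.
    pos($str) = $pos + 7;
    if ($str =~ /\G.*?(?=$ids)/gs) {
        push @log, "offset $pos: $id wrong length, resuming at offset " . pos($str);
        $pos = pos($str);
    } else {
        push @log, "offset $pos: $id wrong length, no further record";
        last;
    }
}
print "$_\n" for @log;
```

      Hunting inside a damaged record can still land on a false ID, so the question above stays open for those cases.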

              - tye (but my friends call me "Tye")