
Okay...

I have a script that parses single-line files we call DPFs (Data Parsing Failures) to find out why each file failed. Each of these files holds a series of records of different lengths. Each record starts with a three-letter identifier, and the next four characters give the length of the record.
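To make the record layout concrete, here is a minimal sketch of splitting one record header. The sample header and the assumption that the length field is zero-padded decimal text are illustrative only; the post does not say how the length is encoded or whether it counts the header itself.

```perl
use strict;
use warnings;

# Hypothetical sample header: a three-letter ID followed by a four-character length.
# Assumption: the length field is zero-padded decimal text.
my $header = 'MEH0042';

# A3 = three ASCII characters, A4 = four ASCII characters
my ($id, $len) = unpack 'A3 A4', $header;
$len += 0;    # numify "0042" to 42

print "id=$id len=$len\n";
```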

Some of these DPFs occur because a record gets truncated, or goes beyond its length. This means the next record identifier is not where it should be. I am currently hunting for this by reading the next three characters, checking to see if it is a valid ID, and backing up two spaces (seek $dpf, -2, 1;) if it is not. I continue this until I find the next record.
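The hunting step described above can be sketched roughly as follows. It uses an in-memory filehandle and made-up data so it runs on its own; the `%valid` lookup is a hypothetical stand-in for isValidID().

```perl
use strict;
use warnings;

# Hypothetical stand-in for isValidID(), using the IDs that appear in the post.
my %valid = map { $_ => 1 } qw(MEH MED MMD MMS CR1 FR1);

# Made-up data: three junk bytes, then a record header.
my $contents = 'ab?MED0007';
open my $fh, '<', \$contents or die "open: $!";

my $found;
while (read($fh, my $chunk, 3) == 3) {
    if ($valid{$chunk}) {
        $found = tell($fh) - 3;    # offset where the ID starts
        last;
    }
    seek $fh, -2, 1;               # back up two bytes: net advance of one byte
}

print defined $found ? "next ID at offset $found\n" : "no further ID\n";
```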

I want to use m// to find the next record. I originally didn't use m//g because I misread the docs on pos(). I thought it returned the last match, as though it found every occurrence and returned the last one. Merlyn set me straight: it returns the position where the last (most recent) m//g left off.
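A small illustration of how pos() behaves with m//g, on made-up data. Note that pos() with no argument refers to `$_`, so the variable being matched has to be named explicitly.

```perl
use strict;
use warnings;

my $data = 'xxMEHyyyyMEDzz';    # hypothetical data, not a real DPF
my @report;

while ($data =~ /(MEH|MED)/g) {
    # $-[0] is where the match started; pos($data) is just past its end.
    my ($start, $end) = ($-[0], pos($data));
    push @report, "$1 starts at $start, pos is $end";
}

print "$_\n" for @report;
```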

Well, I started using m//g, and now pos() keeps giving me the first match in the file. I don't follow why it doesn't start looking from where the file pointer currently is (after the last read on $dpf), but I figured I could get around this: I keep doing m//g in a do {} until pos() >= $lastknownpos, and then try to match one more time.

This of course doesn't work. I tried to use m/\G/g, but I just don't understand how to use it.
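For reference, a minimal sketch of what \G does, on a made-up string rather than a filehandle. \G anchors the match at pos() of the string, and the /c modifier keeps pos() in place when a match fails. In a pattern such as `\G.*MEH|MED`, the `\G.*` belongs only to the first alternative, so the alternation is grouped here. The `\d{4}` for the length field is an assumption about the format.

```perl
use strict;
use warnings;

# Made-up data: two well-formed headers, two junk bytes, then another header.
my $data = 'MEH0007MED0007??CR10007';
my @seen;

pos($data) = 0;
# \G forces each match to begin exactly where the previous one ended.
while ($data =~ /\G(MEH|MED|MMD|MMS|CR1|FR1)(\d{4})/gc) {
    push @seen, "$1:$2";
}

# The loop stops at the first offset where no header begins;
# because of /c, pos($data) still points there.
my $stuck = pos($data);
print "parsed @seen, stuck at offset $stuck\n";
```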

Here is what I have... (more or less, the script is very long)

while ($currpos < $filelen) {
    my ($readLen) = read $dpf, my ($recordID), 3;
    if (isValidID($recordID)) {
        #check the record length
        #if it is good, then parse the record for errors there
        #if the record is the wrong length, then skip parsing
        #when we read the next 3 chars we will probably not read a valid ID
        #and should begin hunting
    }
    elsif ($readLen == 3) {
        do {
            $dpf =~ m/\G.*MEH|MED|MMD|MMS|CR1|FR1/ig;
        } until (pos() >= $currpos);
        #this should put me at the record ID of the last
        #record I tried to parse
        #I will imagine there is a better way, particularly if
        #I can start from where the file pointer is and not from the beginning of the file
        if ($dpf =~ m/\G.*MEH|MED|MMD|MMS|CR1|FR1/ig) {
            #matching one more time will (hopefully) match the next record ID
            seek $dpf, pos(), 0;
        }
        else {
            #if it doesn't match, then this truncated record is the last record in the file (typical)
            seek $dpf, 0, 2;
        }
    }
    else {
        #if I didn't read three characters, then I hit EOF, seek there
        #so $currpos will be updated and we will fall out of the while{}
        seek $dpf, 0, 2;
    }
    $currpos = tell $dpf;
}  #end while

Of note:
$dpf is a filehandle: open (my $dpf=\*FH, "file.dpf");
I am sure I can put MEH|MED|MMD|MMS|CR1|FR1 in a scalar so I don't have to copy it so often. That would also make future versions easy to update when I add new record formats. But one thing at a time. (I would start playing with m//o if I put MEH|MED|MMD|MMS|CR1|FR1 in a scalar, and I'm liable to break something.)
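One way the ID list could live in a single place is sketched below. The names `@record_ids`, `$id_re`, and `is_valid_id` are hypothetical. A qr// object is compiled when it is created, which generally removes the motivation for /o.

```perl
use strict;
use warnings;

# Record IDs taken from the pattern in the post; extend this list for new formats.
my @record_ids = qw(MEH MED MMD MMS CR1 FR1);

# Build the alternation once and compile it with qr//.
my $alternation = join '|', map { quotemeta($_) } @record_ids;
my $id_re       = qr/(?:$alternation)/i;

# Hypothetical stand-in for the post's isValidID(): the whole string must be one ID.
sub is_valid_id {
    my ($candidate) = @_;
    return defined $candidate && $candidate =~ /\A$id_re\z/ ? 1 : 0;
}

print is_valid_id('MED') ? "MED is valid\n" : "MED is not valid\n";
print is_valid_id('XYZ') ? "XYZ is valid\n" : "XYZ is not valid\n";
```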

So, my problem is... after I find the record ID of a bad record (too long or too short), how do I easily find the next record ID (if one exists)?
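One possible approach, sketched under the assumption that the file is small enough to read into a scalar first: `=~` applied to a filehandle matches against the handle's stringified form rather than the file contents, so the search has to run on a string. Setting pos() on that string makes m//g start from a known offset instead of from the beginning. The helper name `next_record_offset` and the sample data are hypothetical, and an ID that happens to appear inside a record's payload would be a false positive.

```perl
use strict;
use warnings;

my $id_re = qr/MEH|MED|MMD|MMS|CR1|FR1/;

# Hypothetical helper: offset of the first record ID at or after $from,
# or undef when no further ID exists.
sub next_record_offset {
    my ($data, $from) = @_;
    return if $from > length $data;
    pos($data) = $from;              # start the search here, not at offset 0
    return unless $data =~ /$id_re/g;
    return $-[0];                    # offset where the matched ID begins
}

# Made-up data: a MEH record that runs two bytes past its header, then two more records.
my $data   = 'MEH0009abMED0007CR10007';
my $first  = next_record_offset($data, 3);     # search after the MEH identifier
my $second = next_record_offset($data, 10);    # search after the start of MED
my $none   = next_record_offset($data, 17);    # nothing left to find

print "first=$first second=$second\n";
print defined $none ? "third=$none\n" : "no further record ID\n";
```

With a real filehandle, the returned offset could then be passed to `seek $dpf, $offset, 0` so the existing read loop resumes at the next record.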

-Travis

In reply to Parsing and \G and /g and Stupid User Tricks by THuG
