Survey file parsing

YYCseismic has asked for the wisdom of the Perl Monks concerning the following question:

I'm still working on my survey loading program, but now I've moved on to parsing the survey data files. This may be trivial to some, but I've never really done file parsing before.

The SEG-P1 format specifies that survey headers should be composed of lines that would be matched by the regex /^H/. Unfortunately not all survey companies adhere to this, only putting the 'H' at the start of the first header line. Also, it seems some places make 20-line headers, while others make 22-line headers.

I have two problems, but a single approach may solve both. My question is this: how can I parse out the header block correctly each time, regardless of its length or formatting? I include one example of each type of header below (ignore the number of lines here).

First the format-specified version:

HLINE NUMBER     : ABCDE
HPROJECT ID      : 
HGROUP           : 
HAREA NAME       : *********
HOPERATOR        : *********
HCONTRACTOR      : ENERTEC
HSURVEY AUDITOR  : ACCU-AUDIT
HSURVEY DATE     : *********
HUTM ZONE        : 11
HSURVEY QUALITY  : ASCM,1
HCOMMENTS        : *********
H                : 
H                : 
H                : 
HLINE LENGTH (Km): 2.65
HGRID VERSION    : ATS 2.6
HDATUM           : NAD 27
HAUDIT DATE      : *********
H<....IDENTIFICATION....> <...GEOGRAPHICS...><.....UTMS.....>
H<.....LINE.....><..SP..>I<..LAT..><..LONG..><.EAST.><.NORT.><ELV><COMMENT>

Now the variant version:

HLINE NUMBER     : ABCDE
 PROJECT ID      : 
 GROUP           : 
 AREA NAME       : *********
 OPERATOR        : *********
 CONTRACTOR      : ENERTEC
 SURVEY AUDITOR  : ACCU-AUDIT
 SURVEY DATE     : *********
 UTM ZONE        : 11
 SURVEY QUALITY  : ASCM,1
 COMMENTS        : *********
                 : 
                 : 
                 : 
 LINE LENGTH (Km): 2.65
 GRID VERSION    : ATS 2.6
 DATUM           : NAD 27
 AUDIT DATE      : *********
 <....IDENTIFICATION....> <...GEOGRAPHICS...><.....UTMS.....>
 <.....LINE.....><..SP..>I<..LAT..><..LONG..><.EAST.><.NORT.><ELV><COMMENT>

The actual survey data (point coordinates) come starting on the line after the last line above.

Here's the code I have for getting the first (I'll call it "proper") version (for some reason I can't see, chomping wouldn't work, but push works well enough for me):

while (<IN>) {
     if (/^H/) {     ## Assumes all header lines start with 'H'
          push(@hdr, $_);
          next;      ## skip to next (possibly header) line
     }
     ##
     ## Capture each line of data in file
     ##
}

What can I do to make this work for both kinds of headers?

Update: Here's one

+ H PROSPECT    : ******* + H CONTRACTOR  : ***** + H SURVEY CO.  : ************ + H SURVEY DATE : DEC 1977 + H SURVEYOR    : _N/A + H ------------------------------- + H PRODUCED BY : DIVESTCO GEOMATICS + H  WEBSITE    : ********************** + H  EMAIL      : ********************** +M H DATE        : ************ + H JOB NUMBER  : ************ +*** H FILE NAME   : ******** + H MAPSHEET  : ************* + H ZONE      : Z11N : 117W +*** H GRID REF. : ATS 4.1 + H UNITS     : Decimeters + H ELLIPSOID : GRS 1980 +** H DATA QUALITY : Transcription 2D + H<LINE NAME     ><POINT >< LAT +>< 
H CLIENT      : **********                                            



LINE NAME      : *******  

UNIQUE ID      : *******  

ORIG.LINE NAME : *******  

ENERGY SOURCE  : DYNAMITE 

------------------------------------- ---------- FIRST SP : 101      

LAST SP : 222      

LINE LENGTH : 8.003   K

PROJECT NUMBER :          

AFE NUMBER : *********

CLIENT REFERENCE : *******  

DATUM      : NAD 1983 - Canada    

SOURCE INT.: ***      F STN INT.:  F HTKO :                            

VTKO :                            

SURVEY QUALITY CODE : *********



>< LONG   >< EAST ><NORTH ><ELE><    ><>


Replies are listed 'Best First'.
Re: Survey file parsing by punch_card_don (Curate) on Jun 27, 2008 at 17:56 UTC
Instead of identifying when the Header lines end, can you identify when the Data lines start, and assume everything up until then is a Header line?

Forget that fear of gravity, Get a little savagery in your life.
Re^2: Survey file parsing by YYCseismic (Beadle) on Jun 27, 2008 at 18:09 UTC
That could work, yes. I can't believe I hadn't thought of that. I'll give it a try and get back if I can't make it work, but I'm pretty sure it will. Thanks!
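A minimal sketch of the "find where the data starts" idea. The thread never shows a data line, so the pattern used to recognise one is purely an assumption: the helper `split_at_first_data` and the example pattern (a space, a line name, then a numeric shotpoint) are hypothetical and would need adjusting to real files.

```perl
use strict;
use warnings;

# Hypothetical helper: everything before the first line matching $data_re
# is treated as header; that line and all later lines are treated as data.
sub split_at_first_data {
    my ($data_re, @lines) = @_;
    my (@hdr, @data);
    my $in_data = 0;
    for my $line (@lines) {
        # Once a data line has been seen, stay in data mode.
        $in_data ||= ($line =~ $data_re);
        if ($in_data) { push @data, $line }
        else          { push @hdr,  $line }
    }
    return (\@hdr, \@data);
}

# ASSUMPTION: a data line is a space, a line name, whitespace, then digits.
my $assumed_data_re = qr/^ \w+\s+\d+/;
```

Because the header test is simply "not data yet", this works the same whether the header has 20 or 22 lines and whether or not every header line carries the leading 'H'.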
Re: Survey file parsing by jds17 (Pilgrim) on Jun 27, 2008 at 17:58 UTC
In case no header line starts with whitespace followed by "<", but the first non-header line does, a simple solution would be as follows. (Maybe I misunderstood your notation, and the lines do not really start that way, but then either they must be identifiable using another regex or you must resort to counting lines.)

my $in_header = 1;
while (<IN>) {
    if ($in_header && !/^\s+</) {
        push(@hdr, $_);
    }
    else {
        $in_header = 0;
        # process non-header lines (if needed)
        # ...
    }
}
Re^2: Survey file parsing by YYCseismic (Beadle) on Jun 27, 2008 at 22:02 UTC
For the SEG-P1 format specification, all non-header lines (read: data lines) start with a space. That zero-offset position is reserved for identifying header lines.
Re: Survey file parsing by johngg (Canon) on Jun 27, 2008 at 18:44 UTC
A variant of samtregar's idea: all header lines look like they have a colon at offset 17, except the column headers, which start with ' <', so

while ( <IN> ) {
    chomp;
    if ( substr( $_, 17, 1 ) eq ':' or /^ </ ) {
        # we are in the header
    }
    else {
        # now we are in the data
    }
}

might work for you.

Cheers,
JohnGG

Update: Whoops, noticed that the column header lines actually start with ' <', corrected above.
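The fixed-offset test can be wrapped in a small predicate. This is only a sketch: `is_header_line` is a hypothetical name, the length guard is an addition so that lines shorter than 18 characters do not trigger a "substr outside of string" warning, and accepting 'H<' as well as ' <' for the column-header lines is an assumption based on the two samples in the question.

```perl
use strict;
use warnings;

# Hypothetical predicate: true if the line looks like a header line.
sub is_header_line {
    my ($line) = @_;
    # "key : value" lines in the samples have the colon at offset 17.
    return 1 if length($line) > 17 && substr($line, 17, 1) eq ':';
    # Column-header lines start with 'H<' (spec) or ' <' (variant).
    return 1 if $line =~ /^[H ]</;
    return 0;
}
```

Note that this relies on every company padding the key to the same width; the later samples in this thread do not all do so.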
Re^2: Survey file parsing by YYCseismic (Beadle) on Jun 27, 2008 at 19:42 UTC
Okay. But here's yet another version of a header block:

H--SEISMIC SURVEY DATA--SEG P1--test
 LINE : ************ JOB NO. : *****
 CLIENT : *****************
 PROSPECT : ************
 CONTRACTOR : GEO STRATA RESOURCES INC.
 FILENAME : ******* DATE : SEP 20, 2006
 PROJECTION : U.T.M. , S.F.=0.99960, NAD27, Clarke 1866
 ORIGIN : UTM ZONE 12 REF. MER. : 111.0000W
 0.99960000 DBS VERS. : ATS 2.6
 UNITS : GEOGRAPHICS: D.MS - COORD.: DECIMETERS - ELEV.: DECIMETERS
 SURVEYOR : MERCEDES SURVEYS COMPUTED BY : CAPELLA
 KILOMETERS, LINE: 2.01 GROUP INTERVAL : 12.00 METERS
 INTERPOLATED: ELEVATION = ^ ; HORIZONTAL = # ; BOTH = *
 REM: SURVEY BY RTK GPS





 [ LINE ][ POINT ][ LAT ][ LONG ][ EAST ][ NORTH][ELE] *[COMMENT ]

The lines are padded with white space, but that's no problem. However, not all lines are always used. I don't need to keep blank lines, nor the very last line(s) that start with '<' or '['.
Re^2: Survey file parsing by YYCseismic (Beadle) on Jun 27, 2008 at 22:06 UTC
I did try that one out, and then I discovered the solution I'm using (at least for now) in chapter 6 of the Perl Cookbook, 2nd Edition. Thanks though; your technique generally worked.
Re: Survey file parsing by samtregar (Abbot) on Jun 27, 2008 at 17:57 UTC
It looks like the variant version starts each header line with a space. Is it possible for the data to start with a space? If not:

if (/^H/ or /^ /) {

Alternately you could look for the "key : value" format that all the header lines seem to have:

if (/^[^:]+\s+: \S+/) {

But you'll have to be sure that your data lines can never match that format.

-sam
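A quick way to check the "key : value" pattern against the sample headers is to wrap it in a predicate and try a few lines. In the sketch below, `has_key_value` uses the pattern exactly as posted; `has_key_colon` is a hypothetical relaxed variant, not something proposed in the thread. As posted, the pattern does not match header lines with an empty value, nor the `LINE LENGTH (Km):` line, where no whitespace precedes the colon.

```perl
use strict;
use warnings;

# The pattern as posted: whitespace before the colon and a non-blank value.
sub has_key_value {
    my ($line) = @_;
    return $line =~ /^[^:]+\s+: \S+/ ? 1 : 0;
}

# Hypothetical relaxed variant: any text, then a colon; the value is optional.
sub has_key_colon {
    my ($line) = @_;
    return $line =~ /^[^:]+:/ ? 1 : 0;
}
```

The relaxed form accepts any line containing a colon after the first character, so it is only safe if data lines can never contain one.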
Re: Survey file parsing by YYCseismic (Beadle) on Jun 27, 2008 at 20:53 UTC
Okay, I think I've solved it. I just discovered Recipe 6.8 in the Perl Cookbook 2nd Edition (p. 199), which uses the `..` and `...` operators to extract a range of lines. So long as I know that the first header line will always have 'H' as the first character, and as long as I know what the last header line might look like, then I should have no problem. I've never seen them use anything other than '<' or '[' on that last line. Thanks for your help. I was actually looking through the cookbook for another problem, and this one just popped out. Go figure, eh?
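The poster's actual code is not shown, so the following is only a sketch of how the range (flip-flop) operator could be applied here, not the Cookbook recipe itself. The helper name `split_header` and the ruler pattern are assumptions: the ruler is taken to be a line with 'H' or a space in column 0 followed by '<' or '['.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\@header, \@data) for a list of lines.
sub split_header {
    my @lines = @_;
    # ASSUMPTION: column-ruler lines look like 'H<', ' <' or ' ['.
    my $ruler = qr/^[H ]\s*[<\[]/;
    my (@hdr, @data);
    for my $line (@lines) {
        # The range turns on at a line starting with 'H' and turns off
        # after the next ruler line (tested on the same line with '..').
        if ( ($line =~ /^H/) .. ($line =~ $ruler) ) {
            # Keep header lines, but drop rulers and blank lines.
            push @hdr, $line unless $line =~ $ruler or $line !~ /\S/;
        }
        elsif ( $line =~ $ruler ) {
            next;    # second ruler line in the space-prefixed variant
        }
        else {
            push @data, $line;
        }
    }
    return (\@hdr, \@data);
}
```

One caveat: the flip-flop keeps its on/off state between iterations, so input that never reaches a ruler line would leave the range open, and every remaining line would be treated as header.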