PerlMonks  

Survey file parsing

by YYCseismic (Beadle)
on Jun 27, 2008 at 17:39 UTC ( [id://694401]=perlquestion )

YYCseismic has asked for the wisdom of the Perl Monks concerning the following question:

I'm still working on my survey loading program, but now I've moved on to parsing the survey data files. This may be trivial to some, but I've never really done file parsing before.

The SEG-P1 format specifies that survey headers should be composed of lines that would be matched by the regex /^H/. Unfortunately not all survey companies adhere to this, only putting the 'H' at the start of the first header line. Also, it seems some places make 20-line headers, while others make 22-line headers.

I have two problems, but a single approach may solve both. My question is this: how can I parse out the header block correctly each time, regardless of its length or formatting? I include one example of each type of header below (ignoring the number of lines for now).

First the format-specified version:

    HLINE NUMBER     : ABCDE
    HPROJECT ID      :
    HGROUP           :
    HAREA NAME       : *********
    HOPERATOR        : *********
    HCONTRACTOR      : ENERTEC
    HSURVEY AUDITOR  : ACCU-AUDIT
    HSURVEY DATE     : *********
    HUTM ZONE        : 11
    HSURVEY QUALITY  : ASCM,1
    HCOMMENTS        : *********
    H                :
    H                :
    H                :
    HLINE LENGTH (Km): 2.65
    HGRID VERSION    : ATS 2.6
    HDATUM           : NAD 27
    HAUDIT DATE      : *********
    H<....IDENTIFICATION....> <...GEOGRAPHICS...><.....UTMS.....>
    H<.....LINE.....><..SP..>I<..LAT..><..LONG..><.EAST.><.NORT.><ELV><COMMENT>

Now the variant version:

    HLINE NUMBER     : ABCDE
     PROJECT ID      :
     GROUP           :
     AREA NAME       : *********
     OPERATOR        : *********
     CONTRACTOR      : ENERTEC
     SURVEY AUDITOR  : ACCU-AUDIT
     SURVEY DATE     : *********
     UTM ZONE        : 11
     SURVEY QUALITY  : ASCM,1
     COMMENTS        : *********
                     :
                     :
                     :
     LINE LENGTH (Km): 2.65
     GRID VERSION    : ATS 2.6
     DATUM           : NAD 27
     AUDIT DATE      : *********
     <....IDENTIFICATION....> <...GEOGRAPHICS...><.....UTMS.....>
     <.....LINE.....><..SP..>I<..LAT..><..LONG..><.EAST.><.NORT.><ELV><COMMENT>

The actual survey data (point coordinates) come starting on the line after the last line above.

Here's the code I have for getting the first (I'll call it "proper") version (for some reason I can't see, chomping wouldn't work, but push works well enough for me):

    while (<IN>) {
        if (/^H/) {    ## Assumes all header lines start with 'H'
            push(@hdr, $_);
            next;      ## skip to next (possibly header) line
        }
        ##
        ## Capture each line of data in file
        ##
    }

What can I do to make this work for both kinds of headers?

Update: Here's one more sample header:

    H CLIENT : **********
    H PROSPECT : *******
    H CONTRACTOR : ***** LINE NAME : *******
    H SURVEY CO. : ************ UNIQUE ID : *******
    H SURVEY DATE : DEC 1977 ORIG.LINE NAME : *******
    H SURVEYOR : _N/A ENERGY SOURCE : DYNAMITE
    H ------------------------------------------------------------------------------
    H PRODUCED BY : DIVESTCO GEOMATICS FIRST SP : 101
    H WEBSITE : ********************** LAST SP : 222
    H EMAIL : ********************** LINE LENGTH : 8.003 KM
    H DATE : ************ PROJECT NUMBER :
    H JOB NUMBER : ************ AFE NUMBER : ************
    H FILE NAME : ******** CLIENT REFERENCE : *******
    H MAPSHEET : ************* DATUM : NAD 1983 - Canada
    H ZONE : Z11N : 117W SOURCE INT.: *** F STN INT.: *** F
    H GRID REF. : ATS 4.1 HTKO :
    H UNITS : Decimeters VTKO :
    H ELLIPSOID : GRS 1980 SURVEY QUALITY CODE : ***********
    H DATA QUALITY : Transcription 2D
    H<LINE NAME ><POINT >< LAT >< LONG >< EAST ><NORTH ><ELE>< >< ><>

Replies are listed 'Best First'.
Re: Survey file parsing
by punch_card_don (Curate) on Jun 27, 2008 at 17:56 UTC
    Instead of identifying when the Header lines end, can you identify when the Data lines start, and assume everything up until then is a Header line?




    Forget that fear of gravity,
    Get a little savagery in your life.

      That could work, yes. I can't believe I hadn't thought of that. I'll give it a try and get back if I can't make it work, but I'm pretty sure it will. Thanks!

Re: Survey file parsing
by jds17 (Pilgrim) on Jun 27, 2008 at 17:58 UTC
    If no header line starts with whitespace followed by "<", but the first non-header line does, a simple solution would be as follows. (Maybe I misunderstood your notation and the lines do not really start that way, but then either they must be identifiable using another regex or you must resort to counting lines.)
      my $in_header = 1;
      while (<IN>) {
          if ($in_header && !/^\s+</) {
              push(@hdr, $_);
          }
          else {
              $in_header = 0;
              # process non-header lines (if needed)
              # ...
          }
      }

      For the SEG-P1 format specification, all non-header lines (read: data lines) start with a space. That zero-offset position is reserved for identifying header lines.

Re: Survey file parsing
by johngg (Canon) on Jun 27, 2008 at 18:44 UTC
    A variant of samtregar's idea: all header lines look like they have a colon at offset 17, except the column headers, which start with ' <', so

      while ( <IN> ) {
          chomp;
          if ( substr( $_, 17, 1 ) eq ':' or /^ </ ) {
              # we are in the header
          }
          else {
              # now we are in the data
          }
      }

    might work for you.

    Cheers,

    JohnGG

    Update: Whoops, noticed that the column header lines actually start with ' <', corrected above
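The check above can be sketched as a standalone predicate. This is only an illustration, with some assumptions: `is_header_line` is a hypothetical name, the length guard is added so that short or blank lines do not trigger a "substr outside of string" warning, and the ruler test is widened from ' <' to 'H<' or ' <' so that it also covers the format-specified sample. The sample lines are padded with `sprintf` on the assumption that the key field is 17 characters wide.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a chomped line looks like a header line,
# i.e. it has a colon at offset 17 or it is a column ruler ('H<' or ' <').
sub is_header_line {
    my ($line) = @_;
    # Guard the substr so lines shorter than 18 characters do not warn.
    return 1 if length($line) > 17 && substr($line, 17, 1) eq ':';
    return 1 if $line =~ /^[H ]</;
    return 0;
}

# Sample lines taken from the thread; the column padding is an assumption.
my @sample = (
    sprintf('%-17s: %s', 'HLINE NUMBER', 'ABCDE'),
    sprintf('%-17s: %s', ' DATUM',       'NAD 27'),
    'HLINE LENGTH (Km): 2.65',
    ' <.....LINE.....><..SP..>',
);
print is_header_line($_) ? "header: $_\n" : "data:   $_\n" for @sample;
```

Whether real data lines can ever carry a colon at offset 17 is not something the thread settles, so the predicate should be checked against actual data before relying on it.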

      Okay. But here's yet another version of a header block:

        H--SEISMIC SURVEY DATA--SEG P1--test
         LINE : ************** JOB NO. : *********
         CLIENT : *********************
         PROSPECT : ****************
         CONTRACTOR : GEO STRATA RESOURCES INC.
         FILENAME : ********* DATE : SEP 20, 2006
         PROJECTION : U.T.M. , S.F.=0.99960, NAD27, Clarke 1866
         ORIGIN : UTM ZONE 12 REF. MER. : 111.0000W
         0.99960000 DBS VERS. : ATS 2.6
         UNITS : GEOGRAPHICS: D.MS - COORD.: DECIMETERS - ELEV.: DECIMETERS
         SURVEYOR : MERCEDES SURVEYS COMPUTED BY : CAPELLA
         KILOMETERS, LINE: 2.01 GROUP INTERVAL : 12.00 METERS
         INTERPOLATED: ELEVATION = ^ ; HORIZONTAL = # ; BOTH = *
         REM: SURVEY BY RTK GPS
         
         
         
         
         
         [ LINE ][ POINT ][ LAT ][ LONG ][ EAST ][ NORTH][ELE] *[ COMMENT ]

      The lines are padded with white space, but that's no problem. However, not all lines are always used. I don't need to keep blank lines, nor the very last line(s) that start with '<' or '['.

      I did try that one out, and then I discovered the solution I'm using (at least for now) in chapter 6 of the Perl Cookbook, 2nd Edition. Thanks though; your technique generally worked.

Re: Survey file parsing
by samtregar (Abbot) on Jun 27, 2008 at 17:57 UTC
    It looks like the variant version starts each header line with a space. Is it possible for the data to start with a space? If not:

      if (/^H/ or /^ /) {

    Alternately you could look for the "key : value" format that all the header lines seem to have:

      if (/^[^:]+\s+: \S+/) {

    But you'll have to be sure that your data lines can never match that format.

    -sam
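As a quick check of the "key : value" pattern against lines from the question, here is a small sketch. `matches_key_value` wraps the regex exactly as given above; `matches_key_colon` is a hypothetical relaxed variant (not from the thread) that only requires a colon after a first character of 'H' or space. The padding of the sample lines is an assumption.

```perl
use strict;
use warnings;

# The "key : value" test, with the regex unchanged from the reply.
sub matches_key_value {
    my ($line) = @_;
    return $line =~ /^[^:]+\s+: \S+/ ? 1 : 0;
}

# Hypothetical relaxed variant: first character 'H' or space, then a
# colon anywhere later on the line; the value may be empty.
sub matches_key_colon {
    my ($line) = @_;
    return $line =~ /^[H ][^:]*:/ ? 1 : 0;
}

for my $line (
    sprintf('%-17s: %s', 'HLINE NUMBER', 'ABCDE'),   # key, padding, value
    sprintf('%-17s:',    'HPROJECT ID'),             # key with empty value
    'HLINE LENGTH (Km): 2.65',                       # no space before colon
) {
    printf "%-26s strict=%d relaxed=%d\n",
        $line, matches_key_value($line), matches_key_colon($line);
}
```

The strict pattern needs whitespace before the colon and a non-blank value after it, so header lines with an empty value or a full-width key would be missed. The relaxed variant accepts those, at the cost of also accepting any data line that happens to contain a colon.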

Re: Survey file parsing
by YYCseismic (Beadle) on Jun 27, 2008 at 20:53 UTC

    Okay, I think I've solved it. I just discovered Recipe 6.8 in the Perl Cookbook 2nd Edition (p. 199), which uses the .. and ... operators to extract a range of lines. So long as I know that the first header line will always have 'H' as the first character, and as long as I know what the last header line might look like, then I should have no problem. I've never seen them use anything other than '<' or '[' on that last line.

    Thanks for your help. I was actually looking through the cookbook for another problem, and this one just popped out. Go figure, eh?
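A minimal sketch of the range-operator approach follows. It is not the Cookbook's code, and the details are assumptions: `split_survey` is a hypothetical name, the range opens on a line starting with 'H' and closes on a column ruler whose second character is '<' or '[', and blank lines and rulers are dropped as described earlier in the thread. Because the first two samples have two ruler lines, the range closes at the first one, so the sketch also skips any ruler-looking line that follows.

```perl
use strict;
use warnings;

# Split already-chomped lines into header and data lines.
# Assumptions (not confirmed by any spec): the header opens with a line
# starting with 'H' and closes with a column ruler whose second character
# is '<' or '['; no data line looks like a ruler or starts with 'H'.
sub split_survey {
    my @lines = @_;
    my (@hdr, @data);
    for (@lines) {
        # Scalar-context '..' is true from the first /^H/ line through
        # the first ruler line, then false again.
        if ( /^H/ .. /^[H ][<\[]/ ) {
            next if /^[H ][<\[]/;    # drop the ruler itself
            next if /^\s*$/;         # drop blank or padding-only lines
            push @hdr, $_;
            next;
        }
        next if /^[H ][<\[]/;        # a second ruler after the range closed
        push @data, $_;
    }
    return (\@hdr, \@data);
}

# Usage with header lines from the thread and placeholder data lines
# (the real data layout is not shown in the thread).
my ($hdr, $data) = split_survey(
    'HLINE NUMBER     : ABCDE',
    ' DATUM           : NAD 27',
    ' <....IDENTIFICATION....> <...GEOGRAPHICS...><.....UTMS.....>',
    ' <.....LINE.....><..SP..>',
    ' placeholder data line 1',
    ' placeholder data line 2',
);
printf "%d header lines, %d data lines\n", scalar @$hdr, scalar @$data;
```

The range operator keeps its state between iterations, so a file whose header never reaches a ruler line would leave every remaining line classified as header.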

Node Type: perlquestion [id://694401]
Approved by Corion