Just reading them and then throwing them away doesn't look too perlish.

The data file is stored on disk as just a linear sequence of bytes. The disk/file system doesn't know anything about "skip the first 2 lines" or "skip blank lines" or "skip delimiter lines". All it knows about is reading bytes sequentially from the file.

So one way or the other, the header lines, the \n characters that make up blank lines, and the delimiters all have to be read from the disk. It is not possible to "not read the delimiter lines". Somebody has to decide which lines to "throw away", and that somebody is the user. The only question is which technique and/or Perl module the user wants to use. There isn't a single "right" answer. That's why you got a couple of responses showing different ways. There are also some ways that I consider "less good", which nobody offered as possibilities.

The basic job is to decide whether or not you are inside a record. This means that there has to be some state information that tells you when a new record has started and when that record has ended.

One way, as I showed in Re: Parsing text sections, is to call a subroutine when the record starts and have that subroutine finish reading the record. The fact that you are in the subroutine means that a record has started. A true/false flag like "INSIDE_RECORD?" is not needed, because that state is implicit in being inside the "finish the record" subroutine. This is a common coding pattern for this task and you would see it in other languages like C. I didn't show the code for calling the sub-parser, but obviously you would call it based upon what I called the "header" (the record type info from the @ line).
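Here is a minimal sketch of that pattern. The sample data, the "@ type" start line, the "----" end line and the name finish_record are all my assumptions for illustration; the real file layout is not reproduced in this node.

```perl
use strict;
use warnings;

# Hypothetical sample data: "@ type" starts a record, a line of dashes ends it.
my $data = <<'END';
Header line 1
Header line 2

@ fruit
apple
banana
----

@ color
red
----
END

open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
while (defined(my $line = <$in>)) {
    # Header lines, blank lines and stray delimiters are read here
    # and simply thrown away.
    next unless $line =~ /^\@\s*(\S+)/;
    push @records, { type => $1, body => finish_record($in) };
}
close $in;

# Being inside this sub *is* the "inside a record" state; no flag is needed.
sub finish_record {
    my ($fh) = @_;
    my @body;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        last if $line =~ /^-+$/;     # delimiter line ends the record
        next if $line =~ /^\s*$/;    # ignore blank lines inside the record
        push @body, $line;
    }
    return \@body;
}

print "$_->{type}: @{ $_->{body} }\n" for @records;
```

With a real file you would dispatch to a different sub-parser depending on the captured record type instead of always calling the same one.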

BTW, it wasn't needed here, but if what "ends the record" is the start of a new record, then instead of "unreading" that line in various ways, another way is to set a "noread" flag and loop with: while ($noread || defined($line = <IN>)). When the flag is set the read is skipped, so $line is kept for another iteration of the loop; clear the flag at the top of the loop body. If you are designing the format, avoiding this "start of new record means end of previous record" situation saves grief. In this particular case, having records separated only by an "----@ type" line would have made the record parsing more problematic.
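A sketch of the "noread" idea, assuming a hypothetical format where the only thing that ends a record is the "@ type" line of the next one. The loop condition uses || so that a set flag skips the read and the line already in $line is processed again.

```perl
use strict;
use warnings;

# Hypothetical data: no delimiter lines, a new "@ type" line ends the old record.
my $data = <<'END';
@ fruit
apple
banana
@ color
red
END

open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my (@records, $line);
my $noread = 0;
while ($noread || defined($line = <$in>)) {
    $noread = 0;                         # the kept line is being used now
    next unless $line =~ /^\@\s*(\S+)/;  # not a record start: throw it away
    my %rec = (type => $1, body => []);
    while (defined($line = <$in>)) {
        if ($line =~ /^\@/) {            # next record's start ends this one
            $noread = 1;                 # keep $line for the outer loop
            last;
        }
        chomp $line;
        push @{ $rec{body} }, $line;
    }
    push @records, \%rec;
}
close $in;

print "$_->{type}: @{ $_->{body} }\n" for @records;
```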

You should note that regexes in Perl can be stored in variables (compile them with qr//)!! This is way cool and applicable to all of these techniques.
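For example, the patterns that recognize the record boundaries can be compiled once with qr// and then used by name. The patterns and variable names below are hypothetical; adjust them to the real file format.

```perl
use strict;
use warnings;

# Hypothetical patterns, compiled once and kept in variables.
my $start_re = qr/^\@\s*(\S+)/;   # record start: "@ type"
my $end_re   = qr/^-{4,}\s*$/;    # record end: a line of four or more dashes
my $blank_re = qr/^\s*$/;         # blank line

my @lines = ('Header', '', '@ fruit', 'apple', '----');

my @kinds;
for my $line (@lines) {
    push @kinds,
        $line =~ $start_re ? "start:$1"
      : $line =~ $end_re   ? 'end'
      : $line =~ $blank_re ? 'blank'
      :                      'other';
}

print "@kinds\n";
```

Keeping the patterns in variables means the parsing code does not change when the delimiters do; only the qr// lines need editing.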

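The second way is to use flags to indicate whether or not you are inside the record. You can do the logic for this yourself, which I would consider a "not as good" way. Or, as Grandfather did, use the triple-dot (...) form of the range operator, also known as the "flip-flop" operator. The two-dot (..) form behaves the same way, except that it can test for the end of the record on the same line that started it. Read his node about it: Flipin good, or a total flop?. Read the other posts there on the various ways to exclude the lines that trigger the start and end of the record.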

This very special Perl operator essentially sets up the flags for you to keep track of where you are. This is a cool critter and it takes some experimentation to understand it. If you read the above carefully, you will see that it also keeps track of the line number within the record: its return value is a sequence number that starts at 1, with "E0" appended on the last line of the range. That can sometimes be very helpful.
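A small sketch of the flip-flop doing the bookkeeping, again assuming the hypothetical "@ type" start lines and dash-only end lines. The sequence number it returns is used to drop the two trigger lines and to label the lines that are kept.

```perl
use strict;
use warnings;

# Hypothetical input lines, already chomped.
my @lines = (
    'Header', '',
    '@ fruit', 'apple', 'banana', '----',
    '',
    '@ color', 'red', '----',
);

my @kept;
for (@lines) {
    # In scalar context the range operator is false until the left regex
    # matches, then true up to and including the line matching the right one.
    my $seq = /^\@/ ... /^-+$/;
    next unless $seq;          # outside any record
    next if $seq == 1;         # the "@ type" line that opened the record
    next if $seq =~ /E0$/;     # the delimiter line that closed it
    push @kept, "$seq:$_";     # $seq is the line number within the record
}

print "@kept\n";
```

This keeps "2:apple 3:banana 2:red"; the header, blank, start and delimiter lines are all read but thrown away.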

So this was a long post to say: Yes, all the lines have to be read from the file and the "bad ones" thrown away. This node shows 2 ways to do that, one of which is very Perl specific. Which way you prefer is up to you, and often depends upon hard-to-quantify factors such as who is going to be maintaining this code.


In reply to Re^3: Parsing text sections by Marshall
in thread Parsing text sections by betacentauri
