Re^2: Parsing text sections

Re^3: Parsing text sections
by Marshall (Canon) on Jul 05, 2010 at 10:32 UTC

"Just reading them and then throwing them away doesn't look too perlish."

The data file is stored on disk as just a linear sequence of bytes. The disk/file system doesn't know anything about "skip the first 2 lines" or "skip blank lines" or "skip delimiter lines". All it knows about is reading bytes sequentially from the file.

So one way or the other, the header lines, the \n characters that make up blank lines, and the delimiters all have to be read from the disk. It is not possible to "not read the delimiter lines". Somebody has to decide which lines to "throw away", and that somebody is the user. The only question is which technique and/or Perl module the user wants to use. There isn't a single "right" answer, which is why you got a couple of responses showing different ways. There are also some ways that I consider "less good", and those weren't offered as possibilities.

The basic job is to decide whether or not you are inside a record. This means that there has to be some state information that tells you when a new record has started and when that record has ended.

One way, as I showed in Re: Parsing text sections, is to call a subroutine when the record starts and have that subroutine finish reading the record. The fact that you are in the subroutine means that a record has started. A true/false flag like "INSIDE_RECORD?" is not needed, because that state is implicit in being inside the "finish the record" subroutine. This is a common coding pattern for this task, and you would see it in other languages like C. I didn't show the code for calling the sub-parser, but obviously you would call it based upon what I called the "header" (the record type info from the @ line).
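As a rough sketch of that pattern (not the code from the earlier node), assume a made-up format in which a record starts with an "@ <type>" line and ends with a line of dashes. The names parse_records and finish_record, and the sample data, are hypothetical and only for illustration:

```perl
use strict;
use warnings;

# Assumed format (hypothetical): "@ <type>" starts a record,
# a line consisting only of dashes ends it.

# Reads the rest of one record. Being inside this sub *is* the
# "inside a record" state, so no separate flag is needed.
sub finish_record {
    my ($fh, $type) = @_;
    my @body;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        last if $line =~ /^-+$/;     # end-of-record delimiter
        next if $line =~ /^\s*$/;    # blank line: read it, throw it away
        push @body, $line;
    }
    return { type => $type, lines => \@body };
}

# Outer loop: everything outside a record is read and discarded.
sub parse_records {
    my ($fh) = @_;
    my @records;
    while (defined(my $line = <$fh>)) {
        next unless $line =~ /^\@\s*(\S+)/;   # wait for a record header
        my $type = $1;
        push @records, finish_record($fh, $type);
    }
    return \@records;
}

my $sample = <<'END_SAMPLE';
File header line 1
File header line 2

@ alpha
a1

a2
----

@ beta
b1
----
END_SAMPLE

# An in-memory filehandle keeps the example self-contained.
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $records = parse_records($fh);
close $fh;
printf "%s: %s\n", $_->{type}, join(',', @{ $_->{lines} }) for @$records;
```

A real sub-parser would probably dispatch on the record type captured from the header line instead of treating every record the same way.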

BTW, it wasn't needed here, but if what "ends the record" is the start of a new record, then instead of "unreading" that line in various ways, another way is to set a "noread" flag: while ($noread || defined($line = <IN>)). When the flag is true, the read is skipped, which keeps $line for another iteration of the loop; the loop body has to clear the flag again. If you are designing the format, avoiding this "start of new record means end of previous record" situation saves grief. In this particular case, having records separated only by an "----@ type" line would have made the record parsing more problematic.
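Here is a minimal sketch of the "noread" idea, assuming a hypothetical format where a record simply runs from one "@ <type>" line to the next. The loop condition is written as $noread || defined($line = <$fh>), so a true flag skips the read and the kept line is processed again. The sub name parse_with_noread and the data are made up for illustration:

```perl
use strict;
use warnings;

# Assumed format (hypothetical): a record runs from one "@ <type>"
# line up to, but not including, the next "@ <type>" line.
sub parse_with_noread {
    my ($fh) = @_;
    my (@records, $line);
    my $noread = 0;
    while ($noread || defined($line = <$fh>)) {
        $noread = 0;                            # the kept line is being used now
        next unless $line =~ /^\@\s*(\S+)/;     # not a record start: discard
        my $record = { type => $1, lines => [] };
        while (defined($line = <$fh>)) {
            if ($line =~ /^\@/) {               # the next record starts here
                $noread = 1;                    # keep $line for the outer loop
                last;
            }
            chomp $line;
            push @{ $record->{lines} }, $line if $line =~ /\S/;
        }
        push @records, $record;
    }
    return \@records;
}

my $sample = <<'END_SAMPLE';
some header
@ alpha
a1
a2
@ beta
b1
END_SAMPLE

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $records = parse_with_noread($fh);
close $fh;
printf "%s: %s\n", $_->{type}, join(',', @{ $_->{lines} }) for @$records;
```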

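You should note that regexes in Perl can be stored in variables (for example, compiled with qr//)! This is way cool and applicable to all techniques: the patterns that mark the start and end of a record can be passed around like any other value.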
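For instance, the start and end patterns can be compiled once with qr// and handed to a generic routine. This is only a sketch: lines_between and both patterns are hypothetical, and the routine happens to use the plain do-it-yourself flag to track state:

```perl
use strict;
use warnings;

# Returns the lines found between a start pattern and an end pattern.
# Both patterns arrive as ordinary scalar arguments holding qr// objects.
sub lines_between {
    my ($start_re, $end_re, @lines) = @_;
    my $inside = 0;                  # hand-rolled "inside a record" flag
    my @kept;
    for my $line (@lines) {
        if (!$inside) {
            $inside = 1 if $line =~ $start_re;   # header line itself is dropped
            next;
        }
        if ($line =~ $end_re) {      # delimiter line is dropped too
            $inside = 0;
            next;
        }
        push @kept, $line;
    }
    return @kept;
}

# Hypothetical delimiters; swap in whatever the real file uses.
my $start_re = qr/^\@\s*\S+/;
my $end_re   = qr/^-{4,}\s*$/;

my @kept = lines_between($start_re, $end_re,
    'header', '@ alpha', 'a1', '----', 'junk', '@ beta', 'b1', '----');
print join(',', @kept), "\n";
```

Changing the file format then means changing two variables rather than the parsing logic.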

The second way is to use flags to indicate whether or not you are inside the record. You can do the logic for this yourself, which I would consider a "not as good" way. Or, as Grandfather did, use the triple dot, or "flip-flop", operator. Read his node about it: Flipin good, or a total flop?. Read the other posts on how to exclude the lines that trigger the record in various ways.

This very special Perl operator essentially sets up the flags for you to keep track of where you are. This is a cool critter, and it takes some experimentation to understand it. If you read the above node carefully, you will see that it also keeps track of the line number within the record, which can sometimes be very helpful.
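A small experiment along those lines, again with a made-up format ("@ <type>" starts a record, a line of dashes ends it). In scalar context the operator returns false outside the range and a sequence number inside it, with "E0" appended to the last one, which makes it easy to drop the two trigger lines. The sub name flip_flop_lines is hypothetical:

```perl
use strict;
use warnings;

# Returns "<sequence number>:<line>" for each line strictly inside a record.
sub flip_flop_lines {
    my @lines = @_;
    my @seen;
    for (@lines) {
        # Scalar-context "..." is the flip-flop: it turns on when the left
        # pattern matches $_ and off after the right pattern matches.
        my $seq = /^\@/ ... /^-+$/;
        next unless $seq;                  # outside any record
        next if $seq == 1;                 # the "@ type" line that started it
        next if $seq =~ /E0$/;             # the dashes line that ended it
        push @seen, "$seq:$_";
    }
    return @seen;
}

print "$_\n" for flip_flop_lines(
    'header', '@ alpha', 'a1', 'a2', '----', 'junk', '@ beta', 'b1', '----');
```

One thing to watch for: the operator keeps its own state between calls to the sub that contains it, so input that stops in the middle of a record leaves it "on" for the next call.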

So this was a long post to say: yes, all the lines have to be read from the file and the "bad ones" thrown away. This node shows two ways to do that, one of which is very Perl specific. Which way you prefer is up to you, and often depends upon hard-to-quantify factors like who is going to be maintaining this code.
