in reply to Re^2: How to select specific lines from a file
in thread How to select specific lines from a file

I like your comment, and intend to upvote it. In this specific case it appears that the data set is tame enough that the distinction between fixed-width and space-delimited is moot.

However, one general principle that I try to adhere to is placing the fewest possible demands on a data set. This concept can be generalized from some lessons I learned by reading Effective STL, where Scott Meyers makes some strong cases for why a template container class should place as few requirements as possible on the objects it contains. I'd love to go into the details, but it's a big enough concept that I probably wouldn't do it justice in a simple PerlMonks node.

Let's take it as a given, then, that the generalized practice of placing as few demands as possible on an entity that we don't control is "a good thing". In particular, doing so helps to simplify our parser, allows us to unambiguously reject data that is broken, and probably even makes it easier to generate valid data.

So what is the simplest, least demanding set of requirements that we can place on our OP's data? As we look it over, it becomes pretty obvious that it is fixed-width, and that it is space delimited. ...or is it? What if one of those numeric fields (66, for example) extends to four digits? We already see places in his data set where it extends to three digits. A fourth would cause it to run up against our "[AB]" field. So there's one requirement we have to place on the data set: no column can become filled to the point that it touches the one next to it. 1000 is illegal for the 6th field. Maybe this is reasonable, but I don't know. I do know that as 66 grows to 100, the field widths haven't shifted, so that field size must always be four or less. But I don't know if four digits is a possible in-range value.

What about blank fields? The user's data set example has no blank fields (that I can detect, though there are some big gaps). \s+ delimited data requires that every field contain something. There's another demand placed on our data set, or if not placed on the data set, another ambiguity that our parser must deal with.

Next, by looking at his data it seems obvious that there cannot be embedded spaces. However, that is not just an observation; it's a requirement placed on the data. If a field ever changes such that it allows embedded spaces, our parser breaks. And if that ever happens, we run into all sorts of additional demands for our data: embedded spaces must be escaped or quoted, quotes must be balanced if used, embedded quotes must be escaped, and so on.
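To make those three failure modes concrete, here is a minimal sketch. The layout is purely an assumption for illustration (a 6-column name, a 4-column count, a 1-column flag); it is not the OP's actual layout, and by_space and by_width are hypothetical helper names.

```perl
use strict;
use warnings;

# ASSUMED layout, for illustration only: name (6 cols), count (4 cols), flag (1 col).
my $TEMPLATE = 'A6 A4 A1';

# Treat the record as whitespace-delimited.
sub by_space { my ($line) = @_; return [ split ' ', $line ]; }

# Treat the record as fixed-width; 'A' strips trailing spaces from each field.
sub by_width { my ($line) = @_; return [ unpack $TEMPLATE, $line ]; }

my @samples = (
    [ 'normal'         => 'alpha 66  A' ],
    [ 'full field'     => 'alpha 1000A' ],   # count fills all four columns
    [ 'blank field'    => 'alpha     A' ],   # count is empty
    [ 'embedded space' => 'ab cd 66  A' ],   # name contains a space
);

for my $s (@samples) {
    my ($label, $line) = @$s;
    printf "%-15s split: %d fields, unpack: %d fields\n",
        $label, scalar @{ by_space($line) }, scalar @{ by_width($line) };
}
```

With this made-up layout, unpack returns three fields for every line, while split returns three, two, two, and four fields respectively. Only the first of those matches what the producer meant.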

This will probably never happen with the user's data set; it may never morph into something more complex. Splitting on space may forever be fine. ...it will have to be fine because the parser now demands it. It can never be permitted to morph into something that includes, for example, a notes field (unless it's in the last position, which is another requirement placed on the data and another rule for the parser), completely full fields, or blank fields.

So here are the choices for how we can parse fixed-width data:

  1. As fixed width: Must be fixed width.
  2. As space delimited: No full fields, no blank fields, no embedded spaces.

The first rule seems to be the most likely for this data set. If we treat it as fixed width, we impose only one requirement. And probably that requirement is already part of the implementation of the producer. If we treat fixed width data as space delimited, we impose three additional restrictions on the data. Treat fixed width as fixed width for the most robust solution.
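For what it's worth, the single fixed-width requirement is also easy to enforce. The following is only a sketch under assumed column widths (6, 4, and 1, invented for illustration); parse_record is a hypothetical name, and the real widths would have to come from whoever produces the file.

```perl
use strict;
use warnings;

# ASSUMED column widths, for illustration only.
my @WIDTHS   = (6, 4, 1);
my $TEMPLATE = join ' ', map { "A$_" } @WIDTHS;   # builds 'A6 A4 A1'
my $RECORD_LEN = 0;
$RECORD_LEN += $_ for @WIDTHS;

# Returns an array ref of fields, or undef if the line is not a valid record.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    # The only demand placed on the data: every record has the full width.
    return undef unless length($line) == $RECORD_LEN;
    return [ unpack $TEMPLATE, $line ];
}

while ( my $line = <DATA> ) {
    my $fields = parse_record($line);
    print defined $fields ? join( '|', @$fields ) : 'REJECTED', "\n";
}

__DATA__
alpha 66  A
beta  1000B
broken
```

The length check is what lets the parser reject broken data unambiguously. Note that it assumes the producer pads the last field; if trailing whitespace gets stripped somewhere along the way, the check would need to be relaxed.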


Dave

Re^4: How to select specific lines from a file
by Laurent_R (Canon) on Apr 30, 2014 at 06:39 UTC
    Yes, Dave, you are absolutely right. The thing that I did not say is that, when I have such cases of parameter files or reference data, I am usually extracting the data myself (or it is done by one of my colleagues) from another system, so that I know exactly what I can demand from the data, or we have something called an interface agreement specifying exactly what the data should look like.

    When the data comes from an unknown source or the exact format cannot be certain, then we are left with trying to get the best out of it, and, in the case in point, I fully agree that considering the data as fixed format is the best that can be done on the basis of what the data looks like.