thrixim has asked for the wisdom of the Perl Monks concerning the following question:
Hello folks, I'm trying to learn perl to perform some extraction from text files. Specifically, I have files containing data tables formatted with ASCII special characters and wish to extract the headers and data into a CSV format. Example below of what the text looks like. In the data below, i'd like to locate the "text matching header" in the file, then extract the data into CSV format with the respective TRM items as headers. I'm having a hard time following examples to recreate what I need exactly. any help would be greatly appreciated.
TEXT MATCHING header here! *-------------------*--------*--------* % Trm 4 % Trm 5 % % % *----------*--------* % % % Trm7 % Trm % Trm9 % TrmY0 % % % % % % *==========*========*========*========* % 0.021 % -0.X % 0.0 % 5.X % % -0.063 % -0.4 % 0.0 % 5.6 % % 0.008 % -0.X % 0.0 % 5.8 % % -0.065 % -0.5 % 0.0 % 5.9 % % 0.009 % -0.X % 0.0 % 6.0 % % -0.066 % -0.4 % 0.0 % 6.Y % % 0.007 % -0.X % 0.0 % 6.Y % % -0.065 % -0.5 % 0.0 % 6.X % % 0.006 % -0.X % 0.0 % 6.X % % -0.065 % -0.5 % 0.0 % 6.3 % % 0.005 % -0.3 % 0.0 % 6.3 % % -0.069 % -0.5 % 0.0 % 6.3 % % 0.003 % -0.X % 0.0 % 6.4 % % -0.068 % -0.4 % 0.0 % 6.4 % % 0.003 % -0.3 % 0.0 % 6.4 % % -0.07Y % -0.5 % 0.0 % 6.4 % % 0.00X % -0.X % 0.0 % 6.4 % % -0.07Y % -0.5 % 0.0 % 6.4 % % 0.00Y % -0.3 % 0.0 % 6.4 % % -0.07Y % -0.4 % 0.0 % 6.5 % % 0.003 % -0.X % 0.0 % 6.5 % % -0.07Y % -0.4 % 0.0 % 6.5 % % 0.00X % -0.X % 0.0 % 6.5 % % -0.07Y % -0.5 % 0.0 % 6.5 % % 0.00Y % -0.3 % 0.0 % 6.5 % *----------*--------*--------*--------*
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Sorting through a file with multiple tables and extracting data
by roboticus (Chancellor) on Oct 11, 2014 at 18:47 UTC | |
My apologies in advance: Occasionally, I feel like I almost have something to say, then start rambling for a bazillion pages and wrap up without really having said anything specific. This is one of those times. I didn't really have any particular single point to make. I had several small things I wanted to say. A couple I stated intentionally, a couple I implied, and I likely forgot a couple totally by the time I finished. I just sorta rambled over whatever terrain I saw. I especially apologize for mixing some intermediate-level coding idioms into a beginner-oriented node. I saw it a bit late, and I'm a bit too lazy/tired to fix it at this point. If you need any clarifications, feel free to ask, and either I (or a quicker-on-the-draw monk) will be happy to help out. I have made this disclaimer longer than usual, because I lack the steam to make it short. Generally you'll want to break the problem up into smaller problems until they're simple to solve. Normally, I start in english, and then convert the english into code. You might do it in question and answer format, and convert the result into a procedure. You've already started breaking down your problem statement:: Now your program is going to have to read your data files and make decisions in order to decide which bits are junk to discard, which bits are headings, and which bits are data. Frequently, data files have markers to make it easier for humans or machine to read. In this case, your data has a border which makes things kind of simple: Your headers follow a line containing a lot of sequential dashes, the data follows a line containing a lot of sequential equals signs. So let's expand our problem statement a little: Ok, we're getting there. The "collect heading information" is a bit nebulous though. What do we want to do? I'm thinking that it's probably best to collect all the information together for each column as the column header. Thus, the first one should be "Trm 4 Trm7", the next "Trm 5 Trm", then "Trm9" and "TrmY0". How will our code figure that out? Again, your data file has borders between the columns to help us out, so we'll take advantage of them. The same borders continue through the table, so we'll use the borders to also help parse out our data. At some point, you'll feel comfortable enough with the process you've laid out to start doing some coding:
Once you code it up and run it, you'll frequently notice that you missed some of the finer details:
No big deal though. We just figure out what the details are that we missed, and what to do about it. Don't bother trying to do it all at once, just pick out the biggest offender (in the eye of the beholder), figure out what it is, and what to do about it. In my mind, the first thing that sticks out is all the horizontal whitespace. Why is that there? It's because the table is fixed width, and we split the columns up on the '%' characters. That means that all the whitespace is considered part of the data. Ooops! There are several ways you could fix it. When you're a beginner, you might think of treating the symptom: Once the split operation is done, trim the data you just got. For example, you might add the following two lines of code after the two "my @fields = split..." lines:
Making this change certainly improves things:
As a beginner, it's fine. However, there are better ways of solving the problem. It's better to avoid the symptom entirely than to fix it after the fact. In this case, it's better to think a little more about the tools we're using. If we change the definition of what we're using as our column delimiters, we can avoid the problem altogether--rather than splitting on a '%' character, let's split the data apart on % with an arbitrary number of spaces on each side. So instead of adding those two lines of code in two places, let's change the two lines that split the columns up into this:
Making this simple change gives the same result, without adding four lines of code. What? He's not done yet? when will he STFU? Digression (part 1): Fixing symptoms instead of avoiding them can be costly. First, you'll wind up with more code, and every line of code is costly: It takes time to read it, so maintenance can be slower. It takes space, so it can be harder to see what your program is doing overall. It can put steps further apart, making it easy to miss opportunities to simplify a problem. Don't obsess over it, but keep it in mind as you code an read other peoples code. No-one is always aware that they're treating symptoms, but if you're aware of it, you can avoid it more often. The next most serious problem (in my eyes, anyway) is that we have some "border junk" in our column headers: We don't want the "----" bits as part of our headings[1], so let's explicitly ignore the border junk while we collect our column headings by adding this line just after the for loop inside the "gather the headings" section:
Now when we run it, we get a little closer to the desired result:
The most important problem at this point, I think, is the fact that fields containing letters aren't quoted. For our headers, I assume we can just pretend that they'll always be strings and quote them unconditionally. So we can change the print statement at the end of the headers section to:
In the data section, though, it may be important to leave numbers unquoted where possible. So we might do something a bit more like this in our "extract the data" section:
When we do this, our output improves a little bit:
Now, the only problem left seems to be the empty column to the left. So how are we going to fix that? I don't believe it ... he's still droning on? Digression (part 2): Sometimes you'll find that you're knowingly treating a symptom. Sometimes it's because you can't figure out how to simply treat the problem at the source. (For me, that's the case here.) Sometimes it's because the "cure is worse than the disease"--i.e., avoiding the symptom can be ugly, unmaintainable and/or expensive. For cases like this, I'll frequently just document it and move on--just in case someone comes up with a simple, clear fix. The reason the problem exists is the same reason that made much of the program simple: the borders. When split sees the % at the beginning of the line, it decides that there's an empty field at the start of the line. Altogether reasonable. To cure the symptom, we need only remove the unwanted column 0 before printing, like this:
It's simple, but it treats a symptom. I can't think of a way to treat the problem at the source that's as good, though. I could:
So clearly, removing the column seems like the best solution[2]. While that's OK, we need to remember to apply our band-aid at every point necessary. If we do, then we get something like:
I don't really like this. In one place, I'm removing a column from @headings, and in another place, I'm doing it from @fields. I could move the first one up inside the headings loop so I could do it using the same variable, but it would be further away from the print statement, which is where the problem manifests itself. Yes, for the second one, I put it an entire for loop away from the print statement, where it's similarly manifested. But look at that for loop--it's a fix for *another* problem we had: quoting the string values. If you look at the print statement for the column headings, it has a similar fix in place. Conceptually, the band-aid is in the same location, although it looks far away. Since I don't like it, what am I going to do about it? This is a place where experience will give you insight. We really want to treat the column headings and data the same way: Quote what needs to be quoted, and remove the annoying column. So rather than have those custom bits of code in two places, let's just make a subroutine that combines the operation, and use it where it makes sense. So let's just steal the logic from the second one, and convert it into a subroutine. This gives us our final version of the program:
I'm still unhappy with the band-aid, but I think that this is the best version of the program I'm likely to come up with. I'm sure if other monks write their versions, that many will choose different tradeoffs. Blessed Bovine! Still not done yet? Notes: [1] In this particular case, the ugly bits actually appear only in the column we're going to discard later. I could've attacked that one first, and this one would've magically disappeared. I didn't go into it, but if I didn't fix it, then once we went into production, we'd have a case where a different header would've made the bug appear. So it's often best to fix bugs when you discover them, even if they wouldn't show up normally. Otherwise you'll find that "stuff that never happens" actually happens *all*the*time* when in production. ;^)
[2] Or maybe making a subroutine for reading and splitting up the fields would be better, so I could fix it during the reading. In fact, that is likely to be better, but I'm already (nearly) finished the post. That's the reason for the extra comment in the header to a future maintainer... Finally[3], I've left in a bug--semi-intentionally. (In other words, I saw it before posting, and it's easy enough to fix. However, I ran out of steam, so I'm stopping instead of fixing it. Kind of funny, actually, considering that it would be simpler to fix the bug than write this node. Where is it? Take a look at the header line. [3] It's about freakin' time! Update: Final listing replaced (forgot to put in version with subroutine). Update: Corrected Pascal's quote. (Thanks to Laurent_R. Since he lives in France, and I only know a few words of French, I'll defer to him. Actually--I'm not certain that the words I know as "French" actually are, given that usually when I hear them, they're followed by "If you'll excuse my French". Geez, Mom, what the Hell you teaching your son? Please pardon my French.) Update: Code tag fix, s/do/fix/ in "how are we going to do that" just before Digression 2. Update: Code fix: I forgot to remove "shift @fields" from "extract the data" loop. Also, while I mention that I left in one bug intentionally, that's not to say that there's only one left. There are probably some that I left in unintentionally, too. ...roboticus I have made this letter longer than usual, because I lack the time to make it short. (Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte) Blaise Pascal, Lettres Provinciales (1656-1657), no. 16. | [reply] [d/l] [select] |
by thrixim (Initiate) on Oct 12, 2014 at 13:57 UTC | |
| [reply] |
|
Re: Sorting through a file with multiple tables and extracting data
by LanX (Saint) on Oct 11, 2014 at 16:54 UTC | |
A Perl flip-flop can help identifying the right table and a split on % will extract your data from lines.
Cheers Rolf (addicted to the Perl Programming Language and ☆☆☆☆ :) | [reply] |
|
Re: Sorting through a file with multiple tables and extracting data
by Laurent_R (Canon) on Oct 11, 2014 at 17:29 UTC | |
should give you more or less what you want. | [reply] [d/l] [select] |
|
Re: Sorting through a file with multiple tables and extracting data
by NetWallah (Canon) on Oct 11, 2014 at 18:31 UTC | |
check "perl --help" to review options used. Interchange Single and double quotes, if you want to run under Window$ . (Not sure if it will require the "\\" escape under Windows.) "You're only given one little spark of madness. You mustn't lose it." - Robin Williams | [reply] [d/l] |
|
Re: Sorting through a file with multiple tables and extracting data
by locked_user sundialsvc4 (Abbot) on Oct 13, 2014 at 00:13 UTC | |
When I have to deal with a problem like this one, I like to take “a very awk-inspired approach. (And it should come as no surprise that the Perl language evolved from awk.) The awk way of doing things basically consists of subdividing the problem into: Any particular “type of line” might cause us to “do something,” or it might simply contribute variables that will be used later. In this particular example, I see five “potentially interesting” lines:
*-------------------*--------*--------*
% Trm 4 % Trm 5 % % %
% Trm7 % Trm % Trm9 % TrmY0 %
% -0.063 % -0.4 % 0.0 % 5.6 % The general approach is that of a finite-state machine (FSM). | |