Sorting through a file with multiple tables and extracting data

thrixim has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Sorting through a file with multiple tables and extracting data by roboticus (Chancellor) on Oct 11, 2014 at 18:47 UTC
thrixim: My apologies in advance: Occasionally, I feel like I almost have something to say, then start rambling for a bazillion pages and wrap up without really having said anything specific. This is one of those times. I didn't really have any particular single point to make. I had several small things I wanted to say. A couple I stated intentionally, a couple I implied, and I likely forgot a couple totally by the time I finished. I just sorta rambled over whatever terrain I saw. I especially apologize for mixing some intermediate-level coding idioms into a beginner-oriented node. I saw it a bit late, and I'm a bit too lazy/tired to fix it at this point. If you need any clarifications, feel free to ask, and either I (or a quicker-on-the-draw monk) will be happy to help out. I have made this disclaimer longer than usual, because I lack the steam to make it short. Generally you'll want to break the problem up into smaller problems until they're simple to solve. Normally, I start in english, and then convert the english into code. You might do it in question and answer format, and convert the result into a procedure. You've already started breaking down your problem statement:: i'd like to locate the "text matching header" in the file extract the data with the respective TRM items as headers create a CSV file with the data Now your program is going to have to read your data files and make decisions in order to decide which bits are junk to discard, which bits are headings, and which bits are data. Frequently, data files have markers to make it easier for humans or machine to read. In this case, your data has a border which makes things kind of simple: Your headers follow a line containing a lot of sequential dashes, the data follows a line containing a lot of sequential equals signs. So let's expand our problem statement a little: Locate header: Read lines, discarding them until we find a line containing "-----". Read lines and collect heading information until we find a line containing "=====". Extract data Read lines and process them until we find a line containing "-----". Ok, we're getting there. The "collect heading information" is a bit nebulous though. What do we want to do? I'm thinking that it's probably best to collect all the information together for each column as the column header. Thus, the first one should be "Trm 4 Trm7", the next "Trm 5 Trm", then "Trm9" and "TrmY0". How will our code figure that out? Again, your data file has borders between the columns to help us out, so we'll take advantage of them. The same borders continue through the table, so we'll use the borders to also help parse out our data. Locate header: Until we find the header: Read a line If it contains "-----", we've found the heading area Gather the headings: Read a line If it contains "=====", we've found the end of the headings Split the line into the columns Add the information found to each column heading Extract the data Read a line If it contains "-----", we've found the end of the data Split the line into the columns At some point, you'll feel comfortable enough with the process you've laid out to start doing some coding: `use strict; use warnings; # Locate header: discard lines until we find # a line containing "-----" while (<DATA>) { last if /-----/; } # Gather the headings my @headings; while (<DATA>) { last if /=====/; my @fields = split /%/; for my $i (0 .. $#fields) { $headings[$i] .= $fields[$i]; } } print join(", ", @headings), "\n"; # Extract the data while (<DATA>) { last if /-----/; my @fields = split /%/; print join(", ", @fields), "\n"; }` [download] Once you code it up and run it, you'll frequently notice that you missed some of the finer details: $ perl 1103507.pl ------------------* , Trm 4 Trm7 , + Trm 5 Trm , Trm9 , TrmY0 , , 0.021 , -0.X , 0.0 , 5.X , , -0.063 , -0.4 , 0.0 , 5.6 , , 0.008 , -0.X , 0.0 , 5.8 , , -0.065 , -0.5 , 0.0 , 5.9 , , 0.009 , -0.X , 0.0 , 6.0 , , -0.066 , -0.4 , 0.0 , 6.Y , , 0.007 , -0.X , 0.0 , 6.Y , , -0.065 , -0.5 , 0.0 , 6.X , , 0.006 , -0.X , 0.0 , 6.X , , -0.065 , -0.5 , 0.0 , 6.3 , , 0.005 , -0.3 , 0.0 , 6.3 , , -0.069 , -0.5 , 0.0 , 6.3 , , 0.003 , -0.X , 0.0 , 6.4 , , -0.068 , -0.4 , 0.0 , 6.4 , , 0.003 , -0.3 , 0.0 , 6.4 , , -0.07Y , -0.5 , 0.0 , 6.4 , , 0.00X , -0.X , 0.0 , 6.4 , , -0.07Y , -0.5 , 0.0 , 6.4 , , 0.00Y , -0.3 , 0.0 , 6.4 , , -0.07Y , -0.4 , 0.0 , 6.5 , , 0.003 , -0.X , 0.0 , 6.5 , , -0.07Y , -0.4 , 0.0 , 6.5 , , 0.00X , -0.X , 0.0 , 6.5 , , -0.07Y , -0.5 , 0.0 , 6.5 , , 0.00Y , -0.3 , 0.0 , 6.5 , [download] No big deal though. We just figure out what the details are that we missed, and what to do about it. Don't bother trying to do it all at once, just pick out the biggest offender (in the eye of the beholder), figure out what it is, and what to do about it. In my mind, the first thing that sticks out is all the horizontal whitespace. Why is that there? It's because the table is fixed width, and we split the columns up on the '%' characters. That means that all the whitespace is considered part of the data. Ooops! There are several ways you could fix it. When you're a beginner, you might think of treating the symptom: Once the split operation is done, trim the data you just got. For example, you might add the following two lines of code after the two "my @fields = split..." lines: `s/^\s+// for @fields; # remove leading blanks s/\s+$// for @fields; # remove trailing blanks` [download] Making this change certainly improves things: $ perl 1103507.pl ------------------, Trm 4Trm7, Trm 5Trm, Trm9, TrmY0, , 0.021, -0.X, 0.0, 5.X, , -0.063, -0.4, 0.0, 5.6, , 0.008, -0.X, 0.0, 5.8, , -0.065, -0.5, 0.0, 5.9, , 0.009, -0.X, 0.0, 6.0, , -0.066, -0.4, 0.0, 6.Y, , 0.007, -0.X, 0.0, 6.Y, , -0.065, -0.5, 0.0, 6.X, , 0.006, -0.X, 0.0, 6.X, , -0.065, -0.5, 0.0, 6.3, , 0.005, -0.3, 0.0, 6.3, , -0.069, -0.5, 0.0, 6.3, , 0.003, -0.X, 0.0, 6.4, , -0.068, -0.4, 0.0, 6.4, , 0.003, -0.3, 0.0, 6.4, , -0.07Y, -0.5, 0.0, 6.4, , 0.00X, -0.X, 0.0, 6.4, , -0.07Y, -0.5, 0.0, 6.4, , 0.00Y, -0.3, 0.0, 6.4, , -0.07Y, -0.4, 0.0, 6.5, , 0.003, -0.X, 0.0, 6.5, , -0.07Y, -0.4, 0.0, 6.5, , 0.00X, -0.X, 0.0, 6.5, , -0.07Y, -0.5, 0.0, 6.5, , 0.00Y, -0.3, 0.0, 6.5, [download] As a beginner, it's fine. However, there are better ways of solving the problem. It's better to avoid the symptom entirely than to fix it after the fact. In this case, it's better to think a little more about the tools we're using. If we change the definition of what we're using as our column delimiters, we can avoid the problem altogether--rather than splitting on a '%' character, let's split the data apart on % with an arbitrary number of spaces on each side. So instead of adding those two lines of code in two places, let's change the two lines that split the columns up into this: `my @fields = split /\s%\s/;` [download] Making this simple change gives the same result, without adding four lines of code. What? He's not done yet? when will he STFU?* Digression (part 1): Fixing symptoms instead of avoiding them can be costly. First, you'll wind up with more code, and every line of code is costly: It takes time to read it, so maintenance can be slower. It takes space, so it can be harder to see what your program is doing overall. It can put steps further apart, making it easy to miss opportunities to simplify a problem. Don't obsess over it, but keep it in mind as you code an read other peoples code. No-one is always aware that they're treating symptoms, but if you're aware of it, you can avoid it more often. The next most serious problem (in my eyes, anyway) is that we have some "border junk" in our column headers: We don't want the "----" bits as part of our headings^[1], so let's explicitly ignore the border junk while we collect our column headings by adding this line just after the for loop inside the "gather the headings" section: `next if $fields[$i] =~ /----/;` [download] Now when we run it, we get a little closer to the desired result: $ perl 1103507.pl , Trm 4Trm7, Trm 5Trm, Trm9, TrmY0 , 0.021, -0.X, 0.0, 5.X , -0.063, -0.4, 0.0, 5.6 , 0.008, -0.X, 0.0, 5.8 , -0.065, -0.5, 0.0, 5.9 , 0.009, -0.X, 0.0, 6.0 , -0.066, -0.4, 0.0, 6.Y , 0.007, -0.X, 0.0, 6.Y , -0.065, -0.5, 0.0, 6.X , 0.006, -0.X, 0.0, 6.X , -0.065, -0.5, 0.0, 6.3 , 0.005, -0.3, 0.0, 6.3 , -0.069, -0.5, 0.0, 6.3 , 0.003, -0.X, 0.0, 6.4 , -0.068, -0.4, 0.0, 6.4 , 0.003, -0.3, 0.0, 6.4 , -0.07Y, -0.5, 0.0, 6.4 , 0.00X, -0.X, 0.0, 6.4 , -0.07Y, -0.5, 0.0, 6.4 , 0.00Y, -0.3, 0.0, 6.4 , -0.07Y, -0.4, 0.0, 6.5 , 0.003, -0.X, 0.0, 6.5 , -0.07Y, -0.4, 0.0, 6.5 , 0.00X, -0.X, 0.0, 6.5 , -0.07Y, -0.5, 0.0, 6.5 , 0.00Y, -0.3, 0.0, 6.5 [download] The most important problem at this point, I think, is the fact that fields containing letters aren't quoted. For our headers, I assume we can just pretend that they'll always be strings and quote them unconditionally. So we can change the print statement at the end of the headers section to: `print join(", ", map {qq("$_")} @headings), "\n";` [download] In the data section, though, it may be important to leave numbers unquoted where possible. So we might do something a bit more like this in our "extract the data" section: `# Extract the data while (<DATA>) { last if /-----/; my @fields = split /\s%\s/; for my $i (0 .. $#fields) { if ($fields[$i] =~ /[^-\+\.0-9]/) { # has non-numeric characters, so quote it $fields[$i] = qq("$fields[$i]"); } } print join(", ", @fields), "\n"; }` [download] When we do this, our output improves a little bit: "", "Trm 4Trm7", "Trm 5Trm", "Trm9", "TrmY0" , 0.021, "-0.X", 0.0, "5.X" , -0.063, -0.4, 0.0, 5.6 , 0.008, "-0.X", 0.0, 5.8 , -0.065, -0.5, 0.0, 5.9 , 0.009, "-0.X", 0.0, 6.0 , -0.066, -0.4, 0.0, "6.Y" , 0.007, "-0.X", 0.0, "6.Y" , -0.065, -0.5, 0.0, "6.X" , 0.006, "-0.X", 0.0, "6.X" , -0.065, -0.5, 0.0, 6.3 , 0.005, -0.3, 0.0, 6.3 , -0.069, -0.5, 0.0, 6.3 , 0.003, "-0.X", 0.0, 6.4 , -0.068, -0.4, 0.0, 6.4 , 0.003, -0.3, 0.0, 6.4 , "-0.07Y", -0.5, 0.0, 6.4 , "0.00X", "-0.X", 0.0, 6.4 , "-0.07Y", -0.5, 0.0, 6.4 , "0.00Y", -0.3, 0.0, 6.4 , "-0.07Y", -0.4, 0.0, 6.5 , 0.003, "-0.X", 0.0, 6.5 , "-0.07Y", -0.4, 0.0, 6.5 , "0.00X", "-0.X", 0.0, 6.5 , "-0.07Y", -0.5, 0.0, 6.5 , "0.00Y", -0.3, 0.0, 6.5 [download] Now, the only problem left seems to be the empty column to the left. So how are we going to fix that? I don't believe it ... he's still droning on? Digression (part 2): Sometimes you'll find that you're knowingly treating a symptom. Sometimes it's because you can't figure out how to simply treat the problem at the source. (For me, that's the case here.) Sometimes it's because the "cure is worse than the disease"--i.e., avoiding the symptom can be ugly, unmaintainable and/or expensive. For cases like this, I'll frequently just document it and move on--just in case someone comes up with a simple, clear fix. The reason the problem exists is the same reason that made much of the program simple: the borders. When split sees the % at the beginning of the line, it decides that there's an empty field at the start of the line. Altogether reasonable. To cure the symptom, we need only remove the unwanted column 0 before printing, like this: `shift @fields;` [download] It's simple, but it treats a symptom. I can't think of a way to treat the problem at the source that's as good, though. I could: Alter the split expression to not use the % at the start of the line as a delimiter. However, that gives us the problems of: (1) removing the % and associated whitespace after the fact. This simply trades one symptom for another, and it also complicates the split expression. We could trim the junk off the beginning of the line before splitting, but that's just a different treatment of the symptom. We could ask the originator of the data file to leave off the left and right borders, but that's bad because: (1) we try to foist our problem off on someone else (which may not be possible anyway), (2) may give complications to other consumers of the data file, etc. So clearly, removing the column seems like the best solution^[2]. While that's OK, we need to remember to apply our band-aid at every point necessary. If we do, then we get something like: `. . . snip . . . shift @headings; print join(", ", map {qq("$_")} @headings), "\n"; # Extract the data while (<DATA>) { last if /-----/; my @fields = split /\s%\s/; shift @fields; for my $i (0 .. $#fields) { if ($fields[$i] =~ /[^-\+\.0-9]/) { # has non-numeric characters, so quote it $fields[$i] = qq("$fields[$i]"); } } print join(", ", @fields), "\n"; } . . . snip . . .` [download] I don't really like this. In one place, I'm removing a column from @headings, and in another place, I'm doing it from @fields. I could move the first one up inside the headings loop so I could do it using the same variable, but it would be further away from the print statement, which is where the problem manifests itself. Yes, for the second one, I put it an entire for loop away from the print statement, where it's similarly manifested. But look at that for loop--it's a fix for another problem we had: quoting the string values. If you look at the print statement for the column headings, it has a similar fix in place. Conceptually, the band-aid is in the same location, although it looks far away. Since I don't like it, what am I going to do about it? This is a place where experience will give you insight. We really want to treat the column headings and data the same way: Quote what needs to be quoted, and remove the annoying column. So rather than have those custom bits of code in two places, let's just make a subroutine that combines the operation, and use it where it makes sense. So let's just steal the logic from the second one, and convert it into a subroutine. This gives us our final version of the program: # table_2_csv.pl <FName> # # Read a table (formatted similar to the one in the # DATA section) and reformat it into a CSV file. # # TODO: # 1) Make it actually open a file and process it, # instead of only looking at the DATA section. # 2) Maybe pull the file reading and column split # into a separate function to handle the extra # column 0 symptom. # use strict; use warnings; # Locate header: discard lines until we find # a line containing "-----" while (<DATA>) { last if /-----/; } # Gather the headings my @headings; while (<DATA>) { last if /=====/; my @fields = split /\s%\s/; for my $i (0 .. $#fields) { next if $fields[$i] =~ /----/; $headings[$i] .= $fields[$i]; } } print_as_csv(@headings); # Extract the data while (<DATA>) { last if /-----/; my @fields = split /\s%\s/; print_as_csv(@fields); } sub print_as_csv { my @data = @_; # Band-aid to handle our borders: first column # is always empty, so get rid of it shift @data; # Quote everything that's not a number (it is # supposed to be a CSV file!) for my $i (0 .. $#data) { if ($data[$i] =~ /[^-\+\.0-9]/) { # has non-numeric characters, so quote it $data[$i] = qq("$data[$i]"); } } print join(", ", @data), "\n"; } __DATA__ TEXT MATCHING header here! . . . snip: no reason to repeat the data . . . [download] I'm still unhappy with the band-aid, but I think that this is the best version of the program I'm likely to come up with. I'm sure if other monks write their versions, that many will choose different tradeoffs. Blessed Bovine! Still not done yet? Notes: [1] In this particular case, the ugly bits actually appear only in the column we're going to discard later. I could've attacked that one first, and this one would've magically disappeared. I didn't go into it, but if I didn't fix it, then once we went into production, we'd have a case where a different header would've made the bug appear. So it's often best to fix bugs when you discover them, even if they wouldn't show up normally. Otherwise you'll find that "stuff that never happens" actually happens allthetime when in production. ;^) `------%------%-------------% % % % Col3 % Col4 % % Col1 % Col2 ------------ % % % Bar % Baz % %========================` [download] [2] Or maybe making a subroutine for reading and splitting up the fields would be better, so I could fix it during the reading. In fact, that is likely to be better, but I'm already (nearly) finished the post. That's the reason for the extra comment in the header to a future maintainer... Finally^[3], I've left in a bug--semi-intentionally. (In other words, I saw it before posting, and it's easy enough to fix. However, I ran out of steam, so I'm stopping instead of fixing it. Kind of funny, actually, considering that it would be simpler to fix the bug than write this node. Where is it? Take a look at the header line. [3] It's about freakin' time! Update: Final listing replaced (forgot to put in version with subroutine). Update: Corrected Pascal's quote. (Thanks to Laurent_R. Since he lives in France, and I only know a few words of French, I'll defer to him. Actually--I'm not certain that the words I know as "French" actually are, given that usually when I hear them, they're followed by "If you'll excuse my French". Geez, Mom, what the Hell you teaching your son? Please pardon my French.) Update: Code tag fix, s/do/fix/ in "how are we going to do that" just before Digression 2. Update: Code fix: I forgot to remove "shift @fields" from "extract the data" loop. Also, while I mention that I left in one bug intentionally, that's not to say that there's only one left. There are probably some that I left in unintentionally, too. ...roboticus I have made this letter longer than usual, because I lack the time to make it short. (Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte) Blaise Pascal, Lettres Provinciales (1656-1657), no. 16.	[reply] [d/l] [select]
Re^2: Sorting through a file with multiple tables and extracting data by thrixim (Initiate) on Oct 12, 2014 at 13:57 UTC
Thank you for this incredible reply. It was like reading a book with 100% context for what I needed :) I used the ideas you highlighted and am a happy camper. Thx!	[reply]
Re: Sorting through a file with multiple tables and extracting data by LanX (Saint) on Oct 11, 2014 at 16:54 UTC
What did you try? A Perl flip-flop can help identifying the right table and a `split` on % will extract your data from lines. Cheers Rolf _{(addicted to the Perl Programming Language and ☆☆☆☆ :)}	[reply]
Re: Sorting through a file with multiple tables and extracting data by Laurent_R (Canon) on Oct 11, 2014 at 17:29 UTC
This seems to be quite easy. From looking at your data, you probably don't even need to distinguish between headers and data, but only to remove lines with dashes, lines of "=" signs and blank signs, to do a bit of cleanup, and then split on `/%\s/` pattern. For example this Perl two-liner under the shell command line: $ echo '-----------------------------------* % Trm 4 % Trm 5 % % % ------------------* % % % Trm7 % Trm % Trm9 % TrmY0 % % % % % % ==================================* % 0.021 % -0.X % 0.0 % 5.X % % -0.063 % -0.4 % 0.0 % 5.6 % % 0.008 % -0.X % 0.0 % 5.8 % % -0.065 % -0.5 % 0.0 % 5.9 % % 0.009 % -0.X % 0.0 % 6.0 % % -0.066 % -0.4 % 0.0 % 6.Y % % 0.007 % -0.X % 0.0 % 6.Y % % -0.065 % -0.5 % 0.0 % 6.X % % 0.006 % -0.X % 0.0 % 6.X % % -0.065 % -0.5 % 0.0 % 6.3 % % 0.005 % -0.3 % 0.0 % 6.3 % % -0.069 % -0.5 % 0.0 % 6.3 % % 0.003 % -0.X % 0.0 % 6.4 % % -0.068 % -0.4 % 0.0 % 6.4 % % 0.003 % -0.3 % 0.0 % 6.4 % % -0.07Y % -0.5 % 0.0 % 6.4 % % 0.00X % -0.X % 0.0 % 6.4 % % -0.07Y % -0.5 % 0.0 % 6.4 % % 0.00Y % -0.3 % 0.0 % 6.4 % % -0.07Y % -0.4 % 0.0 % 6.5 % % 0.003 % -0.X % 0.0 % 6.5 % % -0.07Y % -0.4 % 0.0 % 6.5 % % 0.00X % -0.X % 0.0 % 6.5 % % -0.07Y % -0.5 % 0.0 % 6.5 % % 0.00Y % -0.3 % 0.0 % 6.5 % ----------------------------------' \| perl -nE ' next if /----/ or /\={3}/ or /% % % % %/ +; s/^%\s+//; s/\s+//g; say join ",", split /%\s/, $_; ' Trm4,Trm5 Trm7,Trm,Trm9,TrmY0 0.021,-0.X,0.0,5.X -0.063,-0.4,0.0,5.6 0.008,-0.X,0.0,5.8 -0.065,-0.5,0.0,5.9 0.009,-0.X,0.0,6.0 -0.066,-0.4,0.0,6.Y 0.007,-0.X,0.0,6.Y -0.065,-0.5,0.0,6.X 0.006,-0.X,0.0,6.X -0.065,-0.5,0.0,6.3 0.005,-0.3,0.0,6.3 -0.069,-0.5,0.0,6.3 0.003,-0.X,0.0,6.4 -0.068,-0.4,0.0,6.4 0.003,-0.3,0.0,6.4 -0.07Y,-0.5,0.0,6.4 0.00X,-0.X,0.0,6.4 -0.07Y,-0.5,0.0,6.4 0.00Y,-0.3,0.0,6.4 -0.07Y,-0.4,0.0,6.5 0.003,-0.X,0.0,6.5 -0.07Y,-0.4,0.0,6.5 0.00X,-0.X,0.0,6.5 -0.07Y,-0.5,0.0,6.5 0.00Y,-0.3,0.0,6.5 [download] should give you more or less what you want.	[reply] [d/l] [select]
Re: Sorting through a file with multiple tables and extracting data by NetWallah (Canon) on Oct 11, 2014 at 18:31 UTC
One liner solution: `perl -anlF[^\\w\\.\-]+ -e 'next unless @F;shift @F;next if $F[0]=~/-- +/;;print join(",",@F)' testdata1.txt` [download] check "perl --help" to review options used. Interchange Single and double quotes, if you want to run under Window$ . (Not sure if it will require the "\\" escape under Windows.) "You're only given one little spark of madness. You mustn't lose it." - Robin Williams	[reply] [d/l]
Re: Sorting through a file with multiple tables and extracting data by locked_user sundialsvc4 (Abbot) on Oct 13, 2014 at 00:13 UTC
When I have to deal with a problem like this one, I like to take “a very `awk`-inspired approach. (And it should come as no surprise that the Perl language evolved from awk.) The `awk` way of doing things basically consists of subdividing the problem into: The things that you do at the `BEGIN`ning of the file ... The things that you do at the `END` ... and The things that you do when you encounter a line which matches `this particular regex pattern` ... and there can be any number of these. Any particular “type of line” might cause us to “do something,” or it might simply contribute variables that will be used later. In this particular example, I see five “potentially interesting” lines: `-----------------------------------` Clearly a group-separator. When you see one of these, “forget everything you know.” `% Trm 4 % Trm 5 % % %` Only you really know what this has to do with ... `% Trm7 % Trm % Trm9 % TrmY0 %` ... but both of them obviously contribute variable-names. `% -0.063 % -0.4 % 0.0 % 5.6 %` A line of data, to be classified according to the variable-names previously gathered. An obvious application of split(). The general approach is that of a finite-state machine (FSM).