I find it quite hard to get my head around how this bit of code is working. It seems to run backwards to me!
In a way, I guess it is, but another way of looking at it is to see it as starting with what you want the program to produce, the print statement, and then going through how to get that from your input, the data.
Just as some people and programming languages, start with the main() function at the top of the file, and then the functions called from main next, and any that they call lower down still.
The way I evolved this piece of code was to look for what I wanted to keep from the input--the lines with two columns containing the individual age stats. I also noted that if I split these into two after the first column, the second half of the lines became the same as the first, except where I didn't want to keep the remainder of the line anyway, which was convenient. So I started by getting the regex to match and split those lines correct:
print for map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>; ## Outputs stuff like 0 622.7 319.1 303.6 50 530.6 265.9 264.6 1 618.0 316.8 301.2 51 548.1 274.9 273.2 ... 42 530.6 267.3 263.2 Under 16 10,490.0 5,380.2 5,109.7 44 569.0 286.1 282.9 Under 18 12,130.5 6,223.6 5,906.9 46 558.8 281.7 277.0 45-64/59* 9,614.5 5,450.5 4,164.0 48 542.6 273.2 269.5 65/60** 8,993.0 2,947.2 6,045.7 49 526.8 264.7 262.2 and over
I then saw that I needed to remove any lines not fitting the pattern so I added the grep:
print for grep{ m[^\d+\s] } map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>;
That left me with the right lines, but they needed sorting, so I added the sort
print for sort { local $^W; $a <=> $b } grep{ m[^\d+\s] } map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>;
And that left me with the unwanted column, so a final s/// to get shot of that and (with the addition of a few comments), I ended up with what I posted. You could of course do it in individual stages rather than as a pipeline:
#! perl -slw use strict; my @lines = <DATA>; ## Split lines that fit the two column pattern into two my @data = map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } @lines; ## Remove any lines that don't fit the pattern my @filtered = grep{ m[^\d+\s] } @data; ## Sort the lines by age my @sorted = sort { local $^W; ## Ignoring non-numeric warnings $a <=> $b } @filtered; ## And strip out the combined total. @trimmed = map{ s[(^\d+\s+)[\d\.]+\s+][$1]; $_; } @sorted; print "Age\tMale\tFemale"; printf for @trimmed;
And if you prefer that, I don't think you would be alone.
To address the additions you want.
The easiest way to get the 90+ data is to extract it from the rest before you do anything else:
my( $age90line ) = grep{ m[90\s+and\s+over] } @lines; $age90line = join "\t", $age90line =~ m[(90\s+and\s+over)\s+\S+\s+(\S+ +)\s+(\S+)];
Adding the extra "Age end" column can be done at the same time you are removing the total:
## And strip out the combined total and add "Age end" my @trimmed = map{ s[(^\d+\s+)[\d\.]+\s+][$1\t$1]; $_; } @sorted;
Put that together with a modified header:
print "Age start\tAge end\tMale\tFemale"; print for @trimmed, $age90line;
Produces:
P:\test>501023-2 Age start Age end Male Female 0 0 319.1 303.6 1 1 316.8 301.2 2 2 321.6 306.8 ... 86 86 24.0 72.6 87 87 19.2 61.6 88 88 15.2 50.9 89 89 11.6 40.7 90 and over 31.8 131.3
Cleaning up the formatting is left as an exercise, but I would probably do something like:
print "Age start\tAge end\tMale\tFemale"; printf "%s\t\t%s\t\t%s\t\t%s\n", m[(\S+)\s*(\S+)\s+(\S+)\s+(\S+) for @trimmed, $age90line;
which won't be quite right, but tweak it from there.
In reply to Re^3: Regex problem
by BrowserUk
in thread Regex problem
by Win
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |