comment on

I find it quite hard to get my head around how this bit of code is working. It seems to run backwards to me!

In a way, I guess it is, but another way of looking at it is to see it as starting with what you want the program to produce, the print statement, and then going through how to get that from your input, the data.

Just as some people and programming languages, start with the main() function at the top of the file, and then the functions called from main next, and any that they call lower down still.

The way I evolved this piece of code was to look for what I wanted to keep from the input--the lines with two columns containing the individual age stats. I also noted that if I split these into two after the first column, the second half of the lines became the same as the first, except where I didn't want to keep the remainder of the line anyway, which was convenient. So I started by getting the regex to match and split those lines correct:

print for map{ 
    m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)]
} <DATA>;
## Outputs stuff like

0    622.7    319.1    303.6
50    530.6    265.9    264.6
1    618.0    316.8    301.2
51    548.1    274.9    273.2
...
42    530.6    267.3    263.2
Under 16    10,490.0    5,380.2    5,109.7
44    569.0    286.1    282.9
Under 18    12,130.5    6,223.6    5,906.9
46    558.8    281.7    277.0
45-64/59*    9,614.5    5,450.5    4,164.0
48    542.6    273.2    269.5
65/60**    8,993.0    2,947.2    6,045.7
49    526.8    264.7    262.2
and over
[download]

I then saw that I needed to remove any lines not fitting the pattern so I added the grep:

print for grep{
    m[^\d+\s]
} map{ 
    m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)]
} <DATA>;
[download]

That left me with the right lines, but they needed sorting, so I added the sort

print for sort { 
    local $^W; 
    $a <=> $b
}  grep{ 
    m[^\d+\s]
} map{ 
    m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)]
} <DATA>;
[download]

And that left me with the unwanted column, so a final s/// to get shot of that and (with the addition of a few comments), I ended up with what I posted. You could of course do it in individual stages rather than as a pipeline:

#! perl -slw
use strict;

my @lines = <DATA>;
## Split lines that fit the two column pattern into two
my @data = map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } @lines;

## Remove any lines that don't fit the pattern
my @filtered = grep{ m[^\d+\s] } @data;

## Sort the lines by age
my @sorted = sort { 
    local $^W; ## Ignoring non-numeric warnings
    $a <=> $b
} @filtered;

## And strip out the combined total.
@trimmed = map{ 
    s[(^\d+\s+)[\d\.]+\s+][$1];
    $_;
} @sorted;

print "Age\tMale\tFemale";
printf for @trimmed;
[download]

And if you prefer that, I don't think you would be alone.

To address the additions you want.

The easiest way to get the 90+ data is to extract it from the rest before you do anything else:

my( $age90line ) = grep{ m[90\s+and\s+over] } @lines;
$age90line = join "\t", $age90line =~ m[(90\s+and\s+over)\s+\S+\s+(\S+
+)\s+(\S+)];
[download]

Adding the extra "Age end" column can be done at the same time you are removing the total:

## And strip out the combined total and add "Age end"
my @trimmed = map{
    s[(^\d+\s+)[\d\.]+\s+][$1\t$1];
    $_;
} @sorted;
[download]

Put that together with a modified header:

print "Age start\tAge end\tMale\tFemale";
print for @trimmed, $age90line;
[download]

Produces:

P:\test>501023-2
Age start       Age end Male    Female
0       0    319.1    303.6
1       1    316.8    301.2
2       2    321.6    306.8
...
86      86    24.0    72.6
87      87    19.2    61.6
88      88    15.2    50.9
89      89    11.6    40.7
90 and over     31.8    131.3
[download]

Cleaning up the formatting is left as an exercise, but I would probably do something like:

print "Age start\tAge end\tMale\tFemale";
printf "%s\t\t%s\t\t%s\t\t%s\n", 
    m[(\S+)\s*(\S+)\s+(\S+)\s+(\S+) 
    for @trimmed, $age90line;
[download]

which won't be quite right, but tweak it from there.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

In reply to Re^3: Regex problem by BrowserUk
in thread Regex problem by Win

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.