in reply to Regex problem

Is this clearer to your eye's?

#! perl -slw use strict; print "Age\tMale\tFemale"; print for map{ ## And strip out the combined total. s[(^\d+\s+)[\d\.]+\s+][$1]; $_; } sort { ## Sort the lines by age local $^W; ## Ignoring non-numeric warnings $a <=> $b } grep{ ## Remove any lines that don't fit the pattern m[^\d+\s] } map{ ## Split lines that fit the two column pattern into two m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>; __DATA__ Your data here

Produces

P:\test>501023 Age Male Female 0 319.1 303.6 1 316.8 301.2 2 321.6 306.8 ... 86 24.0 72.6 87 19.2 61.6 88 15.2 50.9 89 11.6 40.7

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

Replies are listed 'Best First'.
Re^2: Regex problem
by Win (Novice) on Oct 19, 2005 at 09:44 UTC
    This works very well. I have to admit that I find it quite hard to get my head around how this bit of code is working. It seems to run backwards to me! I want to set up a ‘Age_start’ and a ‘Age_end’ column and have the age start and age ends being the same year for ages 0 to 89. In addition, I want to make a small change that will allow the '90 and over' to be substituted for '90\t114' in the output. I think that to do this I need to make a change to the grep section and to make the first change I need to change the 'print for map' section. At the moment the 90 and overs are lost from the output. Please enlighten me.
      I find it quite hard to get my head around how this bit of code is working. It seems to run backwards to me!

      In a way, I guess it is, but another way of looking at it is to see it as starting with what you want the program to produce, the print statement, and then going through how to get that from your input, the data.

      Just as some people and programming languages, start with the main() function at the top of the file, and then the functions called from main next, and any that they call lower down still.

      The way I evolved this piece of code was to look for what I wanted to keep from the input--the lines with two columns containing the individual age stats. I also noted that if I split these into two after the first column, the second half of the lines became the same as the first, except where I didn't want to keep the remainder of the line anyway, which was convenient. So I started by getting the regex to match and split those lines correct:

      print for map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>; ## Outputs stuff like 0 622.7 319.1 303.6 50 530.6 265.9 264.6 1 618.0 316.8 301.2 51 548.1 274.9 273.2 ... 42 530.6 267.3 263.2 Under 16 10,490.0 5,380.2 5,109.7 44 569.0 286.1 282.9 Under 18 12,130.5 6,223.6 5,906.9 46 558.8 281.7 277.0 45-64/59* 9,614.5 5,450.5 4,164.0 48 542.6 273.2 269.5 65/60** 8,993.0 2,947.2 6,045.7 49 526.8 264.7 262.2 and over

      I then saw that I needed to remove any lines not fitting the pattern so I added the grep:

      print for grep{ m[^\d+\s] } map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>;

      That left me with the right lines, but they needed sorting, so I added the sort

      print for sort { local $^W; $a <=> $b } grep{ m[^\d+\s] } map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>;

      And that left me with the unwanted column, so a final s/// to get shot of that and (with the addition of a few comments), I ended up with what I posted. You could of course do it in individual stages rather than as a pipeline:

      #! perl -slw use strict; my @lines = <DATA>; ## Split lines that fit the two column pattern into two my @data = map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } @lines; ## Remove any lines that don't fit the pattern my @filtered = grep{ m[^\d+\s] } @data; ## Sort the lines by age my @sorted = sort { local $^W; ## Ignoring non-numeric warnings $a <=> $b } @filtered; ## And strip out the combined total. @trimmed = map{ s[(^\d+\s+)[\d\.]+\s+][$1]; $_; } @sorted; print "Age\tMale\tFemale"; printf for @trimmed;

      And if you prefer that, I don't think you would be alone.

      To address the additions you want.

      The easiest way to get the 90+ data is to extract it from the rest before you do anything else:

      my( $age90line ) = grep{ m[90\s+and\s+over] } @lines; $age90line = join "\t", $age90line =~ m[(90\s+and\s+over)\s+\S+\s+(\S+ +)\s+(\S+)];

      Adding the extra "Age end" column can be done at the same time you are removing the total:

      ## And strip out the combined total and add "Age end" my @trimmed = map{ s[(^\d+\s+)[\d\.]+\s+][$1\t$1]; $_; } @sorted;

      Put that together with a modified header:

      print "Age start\tAge end\tMale\tFemale"; print for @trimmed, $age90line;

      Produces:

      P:\test>501023-2 Age start Age end Male Female 0 0 319.1 303.6 1 1 316.8 301.2 2 2 321.6 306.8 ... 86 86 24.0 72.6 87 87 19.2 61.6 88 88 15.2 50.9 89 89 11.6 40.7 90 and over 31.8 131.3

      Cleaning up the formatting is left as an exercise, but I would probably do something like:

      print "Age start\tAge end\tMale\tFemale"; printf "%s\t\t%s\t\t%s\t\t%s\n", m[(\S+)\s*(\S+)\s+(\S+)\s+(\S+) for @trimmed, $age90line;

      which won't be quite right, but tweak it from there.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.