Regex problem

Win has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a flat file that takes the basic form of: <background>

                                
Table 3    Mid-1983 Population Estimates: England and Wales; single ye
+ar of age and sex; estimated resident population,                    
+         
    revised in light of the results of the 2001 Census1               
+              
                                
England and Wales                                Thousands
                                
Age    Persons    Males    Females            Persons    Males    Fema
+les
                                
All ages    49,617.0    24,133.3    25,483.7                    
                                
0-4    3,116.4    1,597.3    1,519.1        50-54    2,748.0    1,367.
+5    1,380.5
0    622.7    319.1    303.6        50    530.6    265.9    264.6
1    618.0    316.8    301.2        51    548.1    274.9    273.2
2    628.4    321.6    306.8        52    558.8    278.5    280.3
3    634.9    325.3    309.6        53    559.3    276.0    283.3
4    612.4    314.4    298.0        54    551.2    272.2    279.1
                                
5-9    2,921.0    1,499.9    1,421.2        55-59    2,783.0    1,363.
+8    1,419.2
5    563.2    289.3    273.9        55    546.8    269.9    276.9
6    553.7    284.6    269.1        56    554.3    273.4    280.8
7    577.9    296.6    281.3        57    561.9    276.2    285.7
8    602.2    309.1    293.1        58    558.4    272.5    286.0
9    624.1    320.3    303.8        59    561.6    271.8    289.8
                                
10-14    3,670.5    1,882.5    1,787.9        60-64    2,823.6    1,33
+5.9    1,487.7
10    671.3    345.2    326.1        60    563.6    270.4    293.2
11    713.6    366.6    347.0        61    590.2    282.0    308.2
12    759.4    388.9    370.5        62    611.9    289.2    322.7
13    749.5    384.5    365.0        63    634.3    297.7    336.6
14    776.7    397.3    379.4        64    423.6    196.6    227.1
                                
15-19    4,120.8    2,111.9    2,008.9        65-69    2,283.7    1,03
+0.8    1,252.8
15    782.0    400.5    381.5        65    401.3    184.1    217.2
16    811.2    417.5    393.7        66    442.2    202.2    240.0
17    829.3    425.8    403.5        67    467.6    212.7    254.9
18    849.1    435.1    414.0        68    489.5    218.7    270.7
19    849.2    433.0    416.1        69    483.2    213.2    270.0
                                
20-24    3,922.3    1,976.0    1,946.3        70-74    2,136.5    903.
+8    1,232.7
20    829.9    419.4    410.5        70    465.5    203.5    262.0
21    812.0    408.8    403.2        71    442.5    190.8    251.7
22    790.4    398.3    392.2        72    423.8    179.3    244.5
23    756.1    381.0    375.1        73    411.4    171.0    240.4
24    733.9    368.6    365.3        74    393.3    159.2    234.1
                                
25-29    3,412.7    1,718.1    1,694.6        75-79    1,580.0    592.
+1    987.9
25    716.0    359.4    356.6        75    370.0    146.5    223.4
26    694.5    348.2    346.4        76    340.0    131.3    208.7
27    676.6    341.3    335.3        77    314.3    117.0    197.4
28    657.0    331.7    325.2        78    292.2    105.2    187.0
29    668.7    337.5    331.2        79    263.4    92.1    171.3
                                
30-34    3,403.3    1,710.6    1,692.7        80-84    933.7    289.4 
+   644.2
30    666.9    336.3    330.6        80    239.5    80.7    158.8
31    656.1    329.8    326.3        81    211.2    68.2    143.0
32    669.9    337.3    332.6        82    186.6    57.2    129.4
33    692.6    347.5    345.1        83    163.1    47.1    116.0
34    717.8    359.7    358.1        84    133.2    36.2    97.0
                                
35-39    3,580.0    1,796.0    1,784.0        85-89    408.3    99.3  
+  309.0
35    770.2    384.8    385.4        85    112.6    29.3    83.3
36    842.3    422.4    419.9        86    96.6    24.0    72.6
37    660.4    331.2    329.2        87    80.7    19.2    61.6
38    651.7    327.6    324.1        88    66.1    15.2    50.9
39    655.3    329.9    325.3        89    52.3    11.6    40.7
                                
40-44    2,862.5    1,443.2    1,419.3        90 and over    163.2    
+31.8    131.3
40    625.6    315.8    309.8                    
41    572.3    288.7    283.6                    
42    530.6    267.3    263.2        Under 16    10,490.0    5,380.2  
+  5,109.7
43    565.1    285.2    279.8                    
44    569.0    286.1    282.9        Under 18    12,130.5    6,223.6  
+  5,906.9
                                
45-49    2,747.7    1,383.3    1,364.3        16-44    20,519.6    10,
+355.4    10,164.2
45    567.8    286.3    281.5                    
46    558.8    281.7    277.0        45-64/59*    9,614.5    5,450.5  
+  4,164.0
47    551.6    277.5    274.2                    
48    542.6    273.2    269.5        65/60**    8,993.0    2,947.2    
+6,045.7
49    526.8    264.7    262.2        and over
[download]

I want to take the number of females and males for each age. The code that I have produced to do this is like:



my @files;
my $line;
my $file;


while (<*.txt>) {
   
  next if $_ eq 'out.txt';

  $file = $_;
  
  
  
  open (FILE, "$file");


  my $line_number = 0;
  my $run;

 LINE: while (<FILE>){

    my $line = $_;
    $line_number = $line_number +1;

    $run = 1;

    if ($line_number == 1){
      
    
      $run = 1 if $line !~ m/^Table\s/; # The function of this line wa
+s eliminated in order to allow the first line to be blanks.  ie now a
+ll text files are run
      last LINE;
    }
    else {

      $run = 1;
    }
    last LINE;
  }
  
  if ($run == 1){
    
    push (@files, $file);
  }
  else {
    #Do nothing
  }
}


print "@files";


my $line_B;
my $FH;
my $batch;

my @age;
my @number_of_males;
my @number_of_females;

my $out = "out.txt";
open (OUT, "+>$out");

my $year;

foreach my $filename ( @files ) {

  if ($filename =~ /^(\d{4})/){
    $year = $1;
  }

  open ( FILE, "<", $filename )
      or die( "Couldn't open $filename: $!" ); 

 
  my $batch = "\n";

 INNER: while ((<FILE>)) {
 
      s/^$//;
    s/\d{1,3}-\d{1,3}\b.{1,300}$//;
    s/Under\b.{1,500}//;
    s/revised in light of the results of the 2001 Census1//;
    s/England\band\bWales//;
    s/Thousands//;
    s/Age\bPersons\bMales\bFemales\bPersons\bMales\bFemales//;

    s/^\D.*$//; 
    s/(A-Z|a-z)//;
    s/^\D.*$//; 
   
    s/These//;
    s/\sare//;
    s/\sfinal//;
    s/\srevised//;
    s/\sestimates.//;
    s/\sThey//;
    s/\sreplace//;
    s/\sthe//;
    s/\sinterim//;
    s/\srevised//;
    s/\spopulation//;
    s/\sestimates//;
    s/\sthat//;
    s/\swere//;
    s/\spublished//;
    s/\son//;
    s/\s10//;
    s/\sOctober\s2002\sat\snational\slevel\sfor\sEngland\sand\sWales\.
+//;
    s/and\sover//;
    

     s/Table\b3\bMid\-\d\d\d\d\bPopulation\bEstimates\:\bEngland\band\
+bWales;\bsingle\byear\bof\bage\band\bsex\;\bestimated\bresident\bpopu
+lation\,//;
     s/revised\bin\blight\bof\bthe\bresults\bof\bthe\b2001\bCensus1//;
     s/Thousands//;
     s/Age\bPersons\bMales\bFemales\bPersons\bMales\bFemales//;
     s/^$//;
     s/(\d{1,4}\.\d)(\d\.\d{1,4})/$1\t$2/;
     s/^1\s*$//;
     s/65\/60.*//;
    

    if ($_ =~ /([\d|\.]*)\t([\d|\.]*)\t([\d|\.]*)\t([\d|\.]*)\t([\d|\.
+]*)\t([\d|\.]*)\t([\d|\.]*)\t([\d|\.]*)\t([\d|\.]*)/){  # I think tha
+t the main problem is here
      
      push (@age, $1);
      push (@number_of_males, $3);
      push (@number_of_females, $4);
      push (@age, $6);
      push (@number_of_males, $8);
      push (@number_of_females, $9);

     
    }
    
    $batch .= $_;
    
    last INNER if eof; ##|| m/^\s*go\s*$/i;
    
    
  }

  print "\n@age\n\n";
  print "\n@number_of_males\n\n";
  print "\n@number_of_females\n\n";


  my $counter;

  for($counter = 0; $counter <= @age; $counter++) {

    if ($age[$counter]) {

      print OUT $year."\t";
      print OUT $age[$counter];
      print OUT "\t1\t";
      print OUT $number_of_males[$counter];
      print OUT "\n";

      print OUT $year."\t";
      print OUT $age[$counter];
      print OUT "\t2\t";
      print OUT $number_of_females[$counter];
      print OUT "\n";

    }


  }

  undef(@age);
  undef(@number_of_males);
  undef(@number_of_females);
  
  
  print $batch; 
  sleep 2;

}
[download]

The problem is that I don't have figures returned for the ages 0, 42, 44, 46, 48 and > 90. Can anyone please suggest changes that will rectify this or create easier to follow code.

Comment on Regex problem Select or Download Code

Replies are listed 'Best First'.
Re: Regex problem by BrowserUk (Patriarch) on Oct 18, 2005 at 16:14 UTC
Is this clearer to your eye's? `#! perl -slw use strict; print "Age\tMale\tFemale"; print for map{ ## And strip out the combined total. s[(^\d+\s+)[\d\.]+\s+][$1]; $_; } sort { ## Sort the lines by age local $^W; ## Ignoring non-numeric warnings $a <=> $b } grep{ ## Remove any lines that don't fit the pattern m[^\d+\s] } map{ ## Split lines that fit the two column pattern into two m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>; __DATA__ Your data here` [download] Produces `P:\test>501023 Age Male Female 0 319.1 303.6 1 316.8 301.2 2 321.6 306.8 ... 86 24.0 72.6 87 19.2 61.6 88 15.2 50.9 89 11.6 40.7` [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l] [select]
Re^2: Regex problem by Win (Acolyte) on Oct 19, 2005 at 09:44 UTC
This works very well. I have to admit that I find it quite hard to get my head around how this bit of code is working. It seems to run backwards to me! I want to set up a ‘Age_start’ and a ‘Age_end’ column and have the age start and age ends being the same year for ages 0 to 89. In addition, I want to make a small change that will allow the '90 and over' to be substituted for '90\t114' in the output. I think that to do this I need to make a change to the grep section and to make the first change I need to change the 'print for map' section. At the moment the 90 and overs are lost from the output. Please enlighten me.	[reply]
Re^3: Regex problem by BrowserUk (Patriarch) on Oct 19, 2005 at 10:59 UTC
I find it quite hard to get my head around how this bit of code is working. It seems to run backwards to me! In a way, I guess it is, but another way of looking at it is to see it as starting with what you want the program to produce, the print statement, and then going through how to get that from your input, the data. Just as some people and programming languages, start with the main() function at the top of the file, and then the functions called from main next, and any that they call lower down still. The way I evolved this piece of code was to look for what I wanted to keep from the input--the lines with two columns containing the individual age stats. I also noted that if I split these into two after the first column, the second half of the lines became the same as the first, except where I didn't want to keep the remainder of the line anyway, which was convenient. So I started by getting the regex to match and split those lines correct: `print for map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>; ## Outputs stuff like 0 622.7 319.1 303.6 50 530.6 265.9 264.6 1 618.0 316.8 301.2 51 548.1 274.9 273.2 ... 42 530.6 267.3 263.2 Under 16 10,490.0 5,380.2 5,109.7 44 569.0 286.1 282.9 Under 18 12,130.5 6,223.6 5,906.9 46 558.8 281.7 277.0 45-64/59* 9,614.5 5,450.5 4,164.0 48 542.6 273.2 269.5 65/60** 8,993.0 2,947.2 6,045.7 49 526.8 264.7 262.2 and over` [download] I then saw that I needed to remove any lines not fitting the pattern so I added the grep: `print for grep{ m[^\d+\s] } map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>;` [download] That left me with the right lines, but they needed sorting, so I added the sort `print for sort { local $^W; $a <=> $b } grep{ m[^\d+\s] } map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } <DATA>;` [download] And that left me with the unwanted column, so a final s/// to get shot of that and (with the addition of a few comments), I ended up with what I posted. You could of course do it in individual stages rather than as a pipeline: #! perl -slw use strict; my @lines = <DATA>; ## Split lines that fit the two column pattern into two my @data = map{ m[(^\d+\s+(?:[\d\.]+\s+){3})(.+$)] } @lines; ## Remove any lines that don't fit the pattern my @filtered = grep{ m[^\d+\s] } @data; ## Sort the lines by age my @sorted = sort { local $^W; ## Ignoring non-numeric warnings $a <=> $b } @filtered; ## And strip out the combined total. @trimmed = map{ s[(^\d+\s+)[\d\.]+\s+][$1]; $_; } @sorted; print "Age\tMale\tFemale"; printf for @trimmed; [download] And if you prefer that, I don't think you would be alone. To address the additions you want. The easiest way to get the 90+ data is to extract it from the rest before you do anything else: `my( $age90line ) = grep{ m[90\s+and\s+over] } @lines; $age90line = join "\t", $age90line =~ m[(90\s+and\s+over)\s+\S+\s+(\S+ +)\s+(\S+)];` [download] Adding the extra "Age end" column can be done at the same time you are removing the total: `## And strip out the combined total and add "Age end" my @trimmed = map{ s[(^\d+\s+)[\d\.]+\s+][$1\t$1]; $_; } @sorted;` [download] Put that together with a modified header: `print "Age start\tAge end\tMale\tFemale"; print for @trimmed, $age90line;` [download] Produces: `P:\test>501023-2 Age start Age end Male Female 0 0 319.1 303.6 1 1 316.8 301.2 2 2 321.6 306.8 ... 86 86 24.0 72.6 87 87 19.2 61.6 88 88 15.2 50.9 89 89 11.6 40.7 90 and over 31.8 131.3` [download] Cleaning up the formatting is left as an exercise, but I would probably do something like: `print "Age start\tAge end\tMale\tFemale"; printf "%s\t\t%s\t\t%s\t\t%s\n", m[(\S+)\s*(\S+)\s+(\S+)\s+(\S+) for @trimmed, $age90line;` [download] which won't be quite right, but tweak it from there. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l] [select]
Re: Regex problem by inman (Curate) on Oct 18, 2005 at 16:00 UTC
You only seem to be interested in the data that matches a single age followed by three decimal numbers. The regex in the following code looks for this pattern and extracts it into an array. Once you have it in this form you can manipulate it as you like. `while (<FILE>) { my @data; if (@data = /\b(\d+)\b\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)/) { print "@data\n"; } }` [download] Updated I was on the right track but I hadn't spent enough time on the problem. I have corrected the issue for more than one set of data per line and added a rather ugly bit to capture the ' and over' part in the data... `my %info; while (<>) { my @data; while (/\b(\d+)\b(?: and over)?\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\. +\d+)/gc) { $info{$1} = [$2,$3,$4]; } } foreach (sort {$a<=>$b} keys %info) { print "$_ ", join(' ', @{$info{$_}}), "\n"; }` [download]	[reply] [d/l] [select]
Re^2: Regex problem by Roy Johnson (Monsignor) on Oct 18, 2005 at 16:14 UTC
Except there are some tricks. The data is formatted to contain (often) two data sets on the same line. And he does seem to want "90 and over" to show up. I got pretty good results with this: `my (@age, @number_of_males, @number_of_females); while (<FILE>) { if (/\b\d\d?(?:\d+\.\d+\s){3}\s/) { s/under\s+/</gi; s/\s+and over/+/gi; my @ar = split; while (@ar > 3) { (my ($age, undef, $males, $females), @ar) = @ar; next if $age =~ /\d\D\d/; push @age, $age; push @number_of_males, $males; push @number_of_females, $females; } } } print "$age[$_], $number_of_males[$_], $number_of_females[$_]\n" for sort {$age[$a] <=> $age[$b] } 0..$#age;` [download] Caution:* Contents may have been coded under pressure.	[reply] [d/l]