barkingdoggy has asked for the wisdom of the Perl Monks concerning the following question:

I've got a regexp that I created which works in regexxer. It is supposed to extract matches for all 3-digit numbers from a large text document. It doesn't work in Perl. It does not handle repetitions. Here's the code that shows the regexp and demonstrates the problem. Suggestions, please.
#!/usr/bin/perl #use strict; use warnings; our $text= <<TEXT; Those APCs are APC 282, 376, 377 and 398. The APC assignments are +also shown in attachment K1. In the Final Rule, we indicated that cli +nical characteristics and expected resource use. Procedures are suff +iciently similar to those other procedures assigned to APC 282, 376, +377, and 398, and that we believe those APC assignments were appropri +ate. Specifically APCs 662 and APC 282. As shown in attachment K3 und +er option number 1, to be placed in APC 662. Our data analysis shows +that combining services currently assigned to APC 662 would result in + an APC median cost of about 302. The 6 CPT-Codes that would go into +APC 662 are: CPT-Codes 0145T through 0150T. The two other cardiac CT +codes, specifically 0144T and 0151T would be assigned to APC 282. The + inclusion of the two codes into APC 282 would result in... TEXT our @extracts; pos($text)=0; while (my @match = $text =~ m/(APC[s]?)\s(?:(\d{3})(?:\s|,\s|\.\s)) (?:(\d{3})(?:\s|,\s|\.\s)){0,} # (?:and\s([\d]{3})(?:\s|,\s|\.\s)){0,1}/xgc){ push @extracts, @match;} my $n=0; foreach my $extracts (@extracts){ print "Match $n= $extracts[$n] "; $n++; print "\n";}
Here's some of the output:
Match 0= APC Match 1= 282 Match 2= 377 Match 3= 398 Match 4= APC Match 5= 282 Match 6= 377 Match 7= 398 Match 8= APCs Match 9= 662 Use of uninitialized value in concatenation (.) or string at temp2.pl +line 32. Match 10= Use of uninitialized value in concatenation (.) or string at temp2.pl +line 32. Match 11= Match 12= APC Match 13= 282 Use of uninitialized value in concatenation (.) or string at temp2.pl +line 32. Match 14= Use of uninitialized value in concatenation (.) or string at temp2.pl +line 32. Match 15= Match 16= APC Match 17= 662

Replies are listed 'Best First'.
Re: regexp match repetition breaks in Perl
by ikegami (Patriarch) on Jul 11, 2007 at 12:50 UTC

    A match operation will always return exactly as many values as there are captures, so /(...)(?:(...))*(...)/ will always return exactly three results on a match. When using the g modifier, the match operation will always return an exact multiple of the number of captures. You need a parser.

    Here's a simple solution:

    while ($text =~ m/ (APC[s]?) ( \s \d{3} (?: (?: , \s \d{3} )* \s and \s \d{3} )? ) /xg) { my ($apc, $nums) = ($1, $2); my @nums = $nums =~ /(\d+)/g; push @extracts, $apc, @nums; }

    By the way, what's with pos($text) = 0 and the c switch? Removed!

    Update: Added a solution.

      Thanks to ikegami et al. for setting me on the path of righteousness!


      Re pos($text), I'm doing multiple passes through the document, picking out different things on each pass. So, I need to reset the pos before each pass. 3-digit APC codes are just one of the passes.

      I am/was extracting "APC" to help debug the perl/regex. In the final version, I just will extract the code/number.

      Thanks, again!

        So, I need to reset the pos before each pass. 3-digit APC codes are just one of the passes.

        Not quite.

        $_ = 'a1!b2.c3?'; print $1 while /([abc])/g; # abc print $1 while /([123])/g; # 123 print $1 while /([!.?])/g; # !.? print "\n";

        Note the lack of the c modifier. That does exactly the opposite of what you want. It's purpose is to prevent pos from getting reset.

Re: regexp match repetition breaks in Perl
by lima1 (Curate) on Jul 11, 2007 at 12:57 UTC
    So you want to extract all 3 digit numbers near the string "APCs?"?
    my %apc; while ($text =~ m{APC[s]? \s ((\s|\d{3}|,|and)+)}xmsg) { my $extract = $1; while ($extract =~ m{(\d{3})}xmsg) { $apc{$1} = 1; } } for my $apc (sort {$a <=> $b} keys %apc) { print "$apc\n"; }
Re: regexp match repetition breaks in Perl
by johngg (Canon) on Jul 11, 2007 at 13:44 UTC
    You need to capture exactly three digits that are preceded by a space and followed by a non-digit (or, possibly, end of string). As well as being preceded by the space, the digits are preceded by either 'APC', 'APCs', ',' (comma) or 'and' which you can specify as an alternation of look-behinds. A look-behind has to be of a fixed length which is why I use an alternation of four look-behinds rather than one look-behind containing four alternations.

    use strict; use warnings; my $text = <<'TEXT'; Those APCs are APC 282, 376, 377 and 398. The APC assignments are also + shown in attachment K1. In the Final Rule, we indicated that clinica +l characteristics and expected resource use. Procedures are sufficie +ntly similar to those other procedures assigned to APC 282, 376, 377, + and 398, and that we believe those APC assignments were appropriate. + Specifically APCs 662 and APC 282. As shown in attachment K3 under o +ption number 1, to be placed in APC 662. Our data analysis shows that + combining services currently assigned to APC 662 would result in an +APC median cost of about 302. The 6 CPT-Codes that would go into APC +662 are: CPT-Codes 0145T through 0150T. The two other cardiac CT code +s, specifically 0144T and 0151T would be assigned to APC 282. The inc +lusion of the two codes into APC 282 would result in... TEXT my $rxExtract = qr {(?x) (?: (?<=APC) | (?<=APCs) | (?<=,) | (?<=and) ) \s(\d{3})(?:\D|\z) }; my @extracts = $text =~ m{$rxExtract}g; print qq{Match $_: $extracts[$_]\n} for 0 .. $#extracts;

    The output is

    Match 0: 282 Match 1: 376 Match 2: 377 Match 3: 398 Match 4: 282 Match 5: 376 Match 6: 377 Match 7: 398 Match 8: 662 Match 9: 282 Match 10: 662 Match 11: 662 Match 12: 662 Match 13: 282 Match 14: 282

    I hope this is of use.

    Cheers,

    JohnGG

      That's looks too lax to me. /(?:,|and)\s\d{3}\D/ is too likely to exist. For example, your regexp would match "134" in "sections 3 and 134".
        Yes, although my way works for the data given it could easily break down as you point out. Your method is safer. I'm wondering why the OP captures the 'APC' and 'APCs' strings; they seem to have no bearing on how many sets of digits follow.

        Cheers,

        JohnGG

Re: regexp match repetition breaks in Perl
by wind (Priest) on Jul 11, 2007 at 14:42 UTC
    Hre is how I would suggest that you accomplish this matching.
    while ($text =~ m/(APCs?)((?:\s|,|and|\d{3})+)/xgc) { my ($apc, $nums) = ($1, $2); my @nums = $nums =~ m/(\d+)/g; push @extracts, $apc, @nums; } for my $n (0..$#extracts) { print "Match $n = $extracts[$n]\n"; }
    - Miller

    Update 14:59 UTC: my first method only captured last number.
    while (my @match = $text =~ m/(APCs?)(?:(?:\s|,|and)+(\d{3}))+/xgc) { push @extracts, @match; }
Re: regexp match repetition breaks in Perl
by Anonymous Monk on Jul 11, 2007 at 21:19 UTC
    my approach is similar to others, but more 'structured'.
    note that rules for accepting whitespace are more lax.

    use strict; use warnings; my $text= <<TEXT; Those APCs are APC 282, 376, 377 and 398. The APC assignments are also shown in attachment K1. In the Final Rule, we indicated that clinical characteristics and expected resource use. Procedures are sufficiently similar to those other procedures assigned to APC 282, 376, 377, and 398, and that we believe those APC assignments were appropriate. Specifically APCs 662 and APC 282. As shown in attachment K3 under option number 1, to be placed in APC 662. Our data analysis shows that combining services currently assigned to APC 662 would result in an APC median cost of about 302. The 6 CPT-Codes that would go into APC 662 are: CPT-Codes 0145T through 0150T. The two other cardiac CT codes, specifically 0144T and 0151T would be assigned to APC 282. The inclusion of the two codes into APC 282 would result in... and also APC 101,102or103, and not 666. But APC 6666 is not really an APC! How about APC 6666, 777? (Neither is parsed.) How about APCs 777, 6666? (Gets 777, ignores 6666; is this OK?) TEXT # define regex components # an APC number my $number = qr( \d{3} (?! \d ) )x; # 3 digits, not followed by a dig +it # required preamble to an APC number my $preamble = do { my $leadin = qr( APC s? )x; my $separator = qr( \s+ )x; qr( $leadin $separator )x; }; # additional APC numbers may follow after properly introduced number my $continuation = do { my $comma = qr( , )x; my $clause = qr( $comma? \s* (?: and | or ) )x; # \G means continue from point previous match ended qr( \G \s* (?: $comma | $clause ) \s* )x; }; # end regex component definitions # do test extraction my @extracts = $text =~ m{ (?: $preamble | $continuation) ($number) }xg; print "Extract $_ = $extracts[$_] \n" for 0 .. $#extracts;
    output:

    Extract 0 = 282 Extract 1 = 376 Extract 2 = 377 Extract 3 = 398 Extract 4 = 282 Extract 5 = 376 Extract 6 = 377 Extract 7 = 398 Extract 8 = 662 Extract 9 = 282 Extract 10 = 662 Extract 11 = 662 Extract 12 = 662 Extract 13 = 282 Extract 14 = 282 Extract 15 = 101 Extract 16 = 102 Extract 17 = 103 Extract 18 = 777
    hth -- bill
      Bill - It's beautiful! Thanks!
        you are very welcome. bill