in reply to regexp match repetition breaks in Perl

You need to capture exactly three digits that are preceded by a space and followed by a non-digit (or, possibly, end of string). That space must in turn be preceded by one of 'APC', 'APCs', ',' (comma) or 'and', which you can specify as an alternation of look-behinds. A look-behind has to be of fixed length, which is why I use an alternation of four look-behinds rather than one look-behind containing four alternatives.

use strict;
use warnings;

# Sample text from the original question.
my $text = <<'TEXT';
Those APCs are APC 282, 376, 377 and 398. The APC assignments are also
shown in attachment K1. In the Final Rule, we indicated that clinical
characteristics and expected resource use. Procedures are sufficiently
similar to those other procedures assigned to APC 282, 376, 377, and 398,
and that we believe those APC assignments were appropriate. Specifically
APCs 662 and APC 282. As shown in attachment K3 under option number 1, to
be placed in APC 662. Our data analysis shows that combining services
currently assigned to APC 662 would result in an APC median cost of about
302. The 6 CPT-Codes that would go into APC 662 are: CPT-Codes 0145T
through 0150T. The two other cardiac CT codes, specifically 0144T and
0151T would be assigned to APC 282. The inclusion of the two codes into
APC 282 would result in...
TEXT

# Four fixed-length look-behinds, then a space, exactly three digits,
# and either a non-digit or the end of the string.
my $rxExtract = qr{(?x)
    (?:
        (?<=APC)  |
        (?<=APCs) |
        (?<=,)    |
        (?<=and)
    )
    \s(\d{3})(?:\D|\z)
};

# A global match in list context returns every captured group.
my @extracts = $text =~ m{$rxExtract}g;

print qq{Match $_: $extracts[$_]\n}
    for 0 .. $#extracts;

The output is

Match 0: 282
Match 1: 376
Match 2: 377
Match 3: 398
Match 4: 282
Match 5: 376
Match 6: 377
Match 7: 398
Match 8: 662
Match 9: 282
Match 10: 662
Match 11: 662
Match 12: 662
Match 13: 282
Match 14: 282

I hope this is of use.

Cheers,

JohnGG

Re^2: regexp match repetition breaks in Perl
by ikegami (Patriarch) on Jul 11, 2007 at 14:10 UTC
    That looks too lax to me. /(?:,|and)\s\d{3}\D/ is too likely to occur in text that has nothing to do with APCs. For example, your regexp would match "134" in "sections 3 and 134".
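To make the false positive concrete, here is a minimal sketch that runs the regexp from the parent post against a made-up sentence. The sample string is an assumption built around the "sections 3 and 134" example and does not come from the OP's data.

```perl
use strict;
use warnings;

# The regexp from the parent post, unchanged in behaviour.
my $rxExtract = qr{(?x)
    (?: (?<=APC) | (?<=APCs) | (?<=,) | (?<=and) )
    \s(\d{3})(?:\D|\z)
};

# Hypothetical sentence with no APC reference in it at all.
my $sample = 'See sections 3 and 134 for details.';

my @extracts = $sample =~ m{$rxExtract}g;

# The (?<=and) branch accepts "134" even though it is not an APC number.
print "Matched: @extracts\n";    # prints "Matched: 134"
```

The "3" is skipped because it is not three digits long, but "134" gets through because the only context checked is the word "and".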
      Yes, although my way works for the data given, it could easily break down, as you point out. Your method is safer. I'm wondering why the OP captures the 'APC' and 'APCs' strings; they seem to have no bearing on how many sets of digits follow.

      Cheers,

      JohnGG

        I suspect it's just for debugging and that lima1 had the right idea when he/she put the numbers in a hash to filter out duplicates.
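lima1's code is not reproduced in this part of the thread, so the following is only a sketch of the general idea of using a hash to drop duplicates. The input list is copied from the output shown earlier, and the names %seen and @unique are illustrative assumptions.

```perl
use strict;
use warnings;

# Numbers as extracted earlier in the thread, duplicates included.
my @extracts = qw(282 376 377 398 282 376 377 398 662 282 662 662 662 282 282);

# %seen counts occurrences; grep keeps a number only the first time it appears,
# so the original order of first appearance is preserved.
my %seen;
my @unique = grep { !$seen{$_}++ } @extracts;

print "@unique\n";    # prints "282 376 377 398 662"
```

If order does not matter, taking the keys of a hash populated from the list gives the same set of numbers.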