lplo has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to find an efficient way to match words that can contain digits at any point, but not pure numbers. In a simple example I can do this as follows:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "foo 1foo foo2 3foo4 foo5bar 87";
    $string =~ s/\d*[a-z]\d*//g;
    print "$string\n";

The output is " 87", as desired.

The problem just gets ugly quickly, as I like to allow words to contain minus signs, underscores, umlauts, and so on. A more complex example would be:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = 'foo 1foo; foo_2 foo-bar() 87 - _ !@#$% ';
    my $optional  = qr/[\d_-]/;
    my $mandatory = qr/[a-zA-Z]/;
    $string =~ s/$optional*$mandatory$optional*//g;
    print "$string\n";

which results in " ; () 87 - _ !@#$%".
The words "foo", "1foo", "foo_2", and "foo-bar" are matched (replaced).

I am asking more generally:
Is there a way to create a regular expression character class that has some mandatory and some optional members? What would be your way to match (not necessarily replace) these "words"?
Further down the road the actual task is to find the position of the next "word".

Replies are listed 'Best First'.
Re: Regex matching words with numbers, but not numbers.
by Athanasius (Archbishop) on Jul 26, 2014 at 07:07 UTC

    Just use two regexen, and consider a match successful only if the first regex matches but the second doesn’t. For example, to match sequences containing only word characters (letters, digits, underscores) and hyphens, but not consisting entirely of digits:

        #! perl
        use strict;
        use warnings;

        my $re1 = qr{ ^ [\w-]+ $ }x;
        my $re2 = qr{ ^ \d+ $ }x;

        while (<DATA>)
        {
            chomp;
            printf "%s : %s\n", $_, /$re1/ && !/$re2/ ? 'yes' : 'no';
        }

        __DATA__
        foo
        1foo
        foo2
        3foo4
        foo5bar
        87
        foo-bar
        foo42-baz
        foo17@12

    Output:

        17:03 >perl 950_SoPW.pl
        foo : yes
        1foo : yes
        foo2 : yes
        3foo4 : yes
        foo5bar : yes
        87 : no
        foo-bar : yes
        foo42-baz : yes
        foo17@12 : no
        17:04 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Regex matching words with numbers, but not numbers.
by Anonymous Monk on Jul 26, 2014 at 08:06 UTC
    That depends on what you mean by 'pure numbers'. 87? 99.00? 0.5? Let's assume all of those are numbers...
    Is there a way to create a regular expression character class that has some mandatory and optional members?
    Yes, but you should probably use the function "looks_like_number" from Scalar::Util.
    What would be your way to match (not necessarily replace) these "words"?
        use 5.020;
        use warnings;

        # for umlauts and stuff... not really necessary
        # but a good idea regardless
        use utf8;
        use open qw{ :encoding(utf-8) :std };

        use Scalar::Util 'looks_like_number';

        my $string1 = 'foo 1foo; foo_2 foo-bar() 87 - _ !@#$% ';
        my $string2 = 'F? 1_1 99.00 .5 \\x87 14 fourteen !@#99$% 000';
        my $test_string = $string1 . $string2;

        while ( $test_string =~ m/ (\S+) /gx ) {    # or whatever is a "word"
            my ( $word, $start, $end ) = ( $1, $-[0], $+[0] );
            next if $word !~ m/ \d+ /x or looks_like_number($word);
            say qq{"$word" has numbers, but doesn't look like number. Start: $start, end: $end};
        }
    Output:
        "1foo;" has numbers, but doesn't look like number. Start: 4, end: 9
        "foo_2" has numbers, but doesn't look like number. Start: 12, end: 17
        "1_1" has numbers, but doesn't look like number. Start: 48, end: 51
        "\x87" has numbers, but doesn't look like number. Start: 61, end: 65
        "!@#99$%" has numbers, but doesn't look like number. Start: 78, end: 85
    Further down the road the actual task is to find the position of the next "word".
    The positions are stored in magic arrays @- and @+
        @LAST_MATCH_START
        @-
            $-[0] is the offset of the start of the last successful match ...

        @LAST_MATCH_END
        @+
            This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope.
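
    As a rough sketch of the "position of the next word" part: the snippet below combines pos() with $-[0] and $+[0] to walk through a string one word at a time. The helper name next_word_span and the word pattern (letters, digits, "_" and "-", with at least one ASCII letter) are assumptions made for illustration, not something taken from the thread.

```perl
use strict;
use warnings;

my $string = 'foo 1foo; foo_2 foo-bar() 87 - _ !@#$% ';

# Assumed definition of a "word": a run of letters, digits, "_" and "-"
# that contains at least one ASCII letter.
my $word = qr/[\d_-]*[a-zA-Z][a-zA-Z\d_-]*/;

# Hypothetical helper: return [start, end] of the next word at or after
# offset $from, or nothing if there is no further word.
sub next_word_span {
    my ($text, $from) = @_;
    pos($text) = $from;                  # start searching at this offset
    return unless $text =~ /$word/g;
    return [ $-[0], $+[0] ];             # offsets of the whole match
}

my @spans;
my $from = 0;
while ( my $span = next_word_span($string, $from) ) {
    push @spans, $span;
    $from = $span->[1];                  # continue right after this word
}

printf "%-7s start %2d, end %2d\n",
    substr($string, $_->[0], $_->[1] - $_->[0]), @$_ for @spans;
```

    With this string the loop should report four words (foo, 1foo, foo_2, foo-bar) and skip "87", "-" and "_", since none of those contains a letter.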

Re: Regex matching words with numbers, but not numbers.
by AppleFritter (Vicar) on Jul 26, 2014 at 09:22 UTC

    The problem just gets ugly quickly, as I like to allow words to contain minus signs, underscores, umlauts, and so on.

    "and so on" is a rather nebulous specification, but I'd suggest using Unicode character classes, e.g. like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use feature qw/say/;

        use utf8;
        use open IO => ':encoding(UTF-8)', ':std';

        my $wordchars = qr/[\pL\pP\pS]/;
        my $regex = qr/\p{Nd}*$wordchars+\p{Nd}*/;

        while(<DATA>) {
            chomp;
            (my $string = $_) =~ s/$regex//g;
            $string = join " ", split " ", $string; # just to make the output more readable
            say "'$_' became '$string'";
        }

        __DATA__
        foo 1foo foo2 3foo4 foo5bar 87
        foo 1foo; foo_2 foo-bar() 87 - _ !@#$%
        augu mín sáu þig 12345

    See perluniprops for more on Unicode properties. Also see Unicode::Tussle for a bunch of useful scripts for Unicode wrangling, by Tom Christiansen; uniprops is particularly useful.
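
    One possible way to combine Unicode properties with the "at least one letter" requirement is to put the requirement in the sequence rather than in a single class: optional filler, one required letter, then more allowed characters. The sketch below assumes a word may contain letters, decimal digits, "_" and "-"; that definition and the sample string are assumptions for illustration only.

```perl
use strict;
use warnings;
use utf8;                                # this source contains non-ASCII literals

# Assumed word definition (not from the thread): any mix of letters,
# decimal digits, "_" and "-", with at least one letter somewhere.
my $filler = qr/[\p{Nd}_-]/;             # allowed in a word, but not sufficient alone
my $word   = qr/$filler*\p{L}[\p{L}\p{Nd}_-]*/;

my $string = 'foo 1foo; foo_2 für-bär() 87 - _ þig2 12345';
my @words  = $string =~ /($word)/g;      # list context: every captured word

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @words), "\n";
```

    This should print foo|1foo|foo_2|für-bär|þig2, leaving out "87", "-", "_" and "12345". Widening or narrowing the word definition only means editing the two character classes.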

Re: Regex matching words with numbers, but not numbers.
by Anonymous Monk on Jul 26, 2014 at 06:50 UTC

    character class that has some mandatory and optional members

    No, character classes are sets.

    What would be your way to match ...

    Tokenize, naturally: match the things you know how to match, then match what is left. Instead of looking for what you don't want, match what you do want, like pure numbers.

    You can refine your search once you have a bunch of tokens.

    See Re: Help required in find command. (read parse file tokenize m//gc)
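
    A minimal sketch of that tokenizing idea, using \G with m//gc so that each alternative is tried at the current position and a failed attempt does not reset pos(). The token names (NUMBER, WORD, OTHER) and the exact patterns are assumptions chosen for illustration; each token also records its start offset.

```perl
use strict;
use warnings;

my $string = 'foo 1foo; foo_2 foo-bar() 87 - _ !@#$% ';

my @tokens;    # each entry: [ TYPE, text, start offset ]
for ($string) {                      # alias $_ so every match shares one pos()
    while (1) {
        if ( /\G\s+/gc ) {           # skip whitespace
            next;
        }
        elsif ( /\G(\d+)(?![\w-])/gc ) {            # digits only: a pure number
            push @tokens, [ NUMBER => $1, $-[1] ];
        }
        elsif ( /\G([\w-]*[a-zA-Z][\w-]*)/gc ) {    # contains at least one letter
            push @tokens, [ WORD => $1, $-[1] ];
        }
        elsif ( /\G(.)/gcs ) {                      # anything else, one character
            push @tokens, [ OTHER => $1, $-[1] ];
        }
        else {
            last;                                   # nothing left to match
        }
    }
}

printf "%-6s %-8s at %d\n", @$_ for @tokens;
```

    Filtering @tokens for WORD entries then gives the words together with their positions, and the NUMBER rule can be swapped for something stricter or looser without touching the rest.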

Re: Regex matching words with numbers, but not numbers.
by AnomalousMonk (Archbishop) on Jul 27, 2014 at 00:25 UTC

    If you have access to Perl version 5.18+ (I have not ATM), you may find the experimental Extended Bracketed Character Classes provide some food for thought. I'm not sure just how they would apply to your problem as originally stated (and as I say, I couldn't test any application if I could see one), but they look really interesting...

      Hmm, even the extended character classes are just that: sets. They cannot have mandatory and optional parts, just as 1 matches 1 and can't sometimes optionally match 4; it's 1 :) The only options are "in the set" or "not in the set" ... the extended classes just let you define your sets more easily.
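
      To make that concrete, here is a small sketch using the extended class syntax to define "a word character that is not a digit" by set difference. The class itself still matches exactly one character; the "at least one non-digit" requirement comes from the pattern around it. The sample string is an assumption, and the warning handling assumes the feature stopped warning as experimental in Perl 5.36.

```perl
use strict;
use warnings;
# Assumption: the feature warns as experimental before Perl 5.36, so the
# warning is silenced only on older versions.
no if $] < 5.036, warnings => 'experimental::regex_sets';

# (?[ \w - \d ]) is a set difference: any \w character that is not a digit.
# Optional \w's, one required non-digit word character, more optional \w's.
my $word = qr/\b\w*(?[ \w - \d ])\w*/;

my $string = 'foo 1foo foo2 3foo4 foo5bar 87 foo_2';
my @words  = $string =~ /($word)/g;
print "@words\n";
```

      This should print every token except "87". Note that "_" counts as a non-digit word character here, so a lone underscore would also match.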