emav has asked for the wisdom of the Perl Monks concerning the following question:

This is driving me nuts! Here's the code:
my $test = "blah <span class='small'>13a</span> blah blah";
if ( $test =~ /([^>]\d+[a-z]*[^<])/ ) {
    print "No tag found: $1\n";
}
The outcome of this little code snippet is:

No tag found: 13a

Is this a bug or am I missing something? I'm running ActivePerl 5.8.7 Build 815 on Windows XP.

Re: Angle brackets and regex problem
by merlyn (Sage) on Jan 05, 2007 at 17:28 UTC
    makes sense to me:
    [^>]   - matches 1
    \d+    - matches 3
    [a-z]* - matches empty string
    [^<]   - matches a
    And it's the first place that you have digits. What are you thinking it should have matched?

    Also, generally, you don't want to use regex against angle-bracket stuff. Use something based on HTML::Parser, XML::Parser, or XML::LibXML, please.

Re: Angle brackets and regex problem
by Fletch (Bishop) on Jan 05, 2007 at 17:29 UTC

    You've told it to match one character not an > (which 1 is), one or more digits (which 3 is), zero or more lower case letters (which nothing is), and one character that's not < (which a is). Wherein lies the problem?
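
To see that breakdown for yourself, here is a minimal sketch that wraps each piece of the pattern in its own capture group. The sub name match_parts is made up for illustration; the pattern and the test string are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: same pattern as in the question, but with one
# capture group per piece so each piece's share of the match is visible.
sub match_parts {
    my ($text) = @_;
    if ( $text =~ /([^>])(\d+)([a-z]*)([^<])/ ) {
        return ( $1, $2, $3, $4 );
    }
    return;    # empty list when nothing matches
}

my @parts = match_parts("blah <span class='small'>13a</span> blah blah");
print join( ' ', map { "'$_'" } @parts ), "\n";    # prints: '1' '3' '' 'a'
```

The [a-z]* first grabs the a, but then [^<] has nothing left to match except the <, so the engine backtracks: [a-z]* gives the a back and settles for the empty string.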

    And of course there's the standard jibe that in general you want to use something like HTML::TreeBuilder or the like to parse HTML, not regexen.

    Update: Curse you, Red Baron merlyn. :)

    And a hopefully useful pointer: YAPE::Regex::Explain can be helpful for obtaining a prose explanation of just what your regex means. Using it on your regex produces this:

    perl -MYAPE::Regex::Explain -le 'print YAPE::Regex::Explain->new( qr/([^>]\d+[a-z]*[^<])/ )->explain'

    The regular expression:

    (?-imsx:([^>]\d+[a-z]*[^<]))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        [^>]                     any character except: '>'
    ----------------------------------------------------------------------
        \d+                      digits (0-9) (1 or more times (matching
                                 the most amount possible))
    ----------------------------------------------------------------------
        [a-z]*                   any character of: 'a' to 'z' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        [^<]                     any character except: '<'
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

      It's official! I'm blind! :(

      I expected it not to find a match. Oh, dear! So embarrassing!

        If you ever wonder what something matched, you can use

        my $at = $-[0];
        my $matched = substr($test, $at, $+[0] - $at);
        warn("Matched \"$matched\" at $at\n");

        Alternatively, you could pass the argument -Mre=debug to perl (e.g. perl -Mre=debug script.pl). It produces a detailed (i.e. long) report of all regexp activities.

        References:

        • @- and @+ are documented in perlvar.
        • -Mre=debug adds use re 'debug'; to your program. See re.
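
A minimal sketch of the @- and @+ trick as a reusable sub, in case it helps. The name match_span is hypothetical; the only things relied on are $-[0] (start offset of the whole match) and $+[0] (offset just past its end), both documented in perlvar.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched text and its start offset,
# or an empty list when the pattern does not match.
sub match_span {
    my ( $text, $re ) = @_;
    return unless $text =~ $re;
    my $start = $-[0];    # offset where the whole match begins
    my $end   = $+[0];    # offset just past the end of the match
    return ( substr( $text, $start, $end - $start ), $start );
}

my $test = "blah <span class='small'>13a</span> blah blah";
my ( $matched, $at ) = match_span( $test, qr/[^>]\d+[a-z]*[^<]/ );
print qq{Matched "$matched" at $at\n};
```

The offsets are read inside the sub, right after the match, because @- and @+ always describe the last successful match in the current scope.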
Re: Angle brackets and regex problem
by ferreira (Chaplain) on Jan 05, 2007 at 17:59 UTC

    If what you want (as hinted by Re^2: Angle brackets and regex problem) is the longest match of /\d+[a-z]*/ that is not surrounded by > and <, you may say:

    my $test = "blah <span class='small'>13a</span> blah blah";
    if ( $test =~ /(.)(\d+[a-z]*)(.)/ && $1 ne '>' && $3 ne '<' ) {
        print "No tag found: $2\n";
    }
    else {
        print "No match";
    }
    and it is going to say "No match", while using my $test = "blah 13a blah blah"; would result in
    No tag found: 13a
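
Note that this check only looks at the first place the pattern can match. One possible alternative, not taken from the replies here, is to move the neighbour tests into the pattern itself as lookbehind and lookahead assertions. In this sketch the sub name find_untagged is hypothetical, and digits and lowercase letters are added to the excluded neighbours (an assumption about what counts as a whole token) so the engine cannot shrink the match to dodge the brackets, which is what bit the original pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first run of digits plus optional
# lowercase letters that is neither preceded by '>' nor followed by '<',
# or undef if there is none.
sub find_untagged {
    my ($text) = @_;
    # (?<![>0-9])    : not preceded by '>' or by another digit
    # (?![<a-z0-9])  : not followed by '<' or by more token characters
    return $1 if $text =~ /(?<![>0-9])(\d+[a-z]*)(?![<a-z0-9])/;
    return;
}

for my $s ( "blah <span class='small'>13a</span> blah blah",
            "blah 13a blah blah" ) {
    my $hit = find_untagged($s);
    print defined $hit ? "No tag found: $hit\n" : "No match\n";
}
```

Unlike the (.) version, the assertions consume no characters, so a token at the very start or end of the string can still match.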

    But if you want something more complex than this, like finding the content of XML tags that contain only text matching /\d+[a-z]*/, listen to what merlyn said and use an HTML parser.

      Thank you all so much for the insightful replies. Problem solved.

      I may be blind but, thanks be to our wonderful brethren, I am not e-deaf. :)