Angle brackets and regex problem

emav has asked for the wisdom of the Perl Monks concerning the following question:

Re: Angle brackets and regex problem by merlyn (Sage) on Jan 05, 2007 at 17:28 UTC
Makes sense to me:

    [^>]    - matches 1
    \d+     - matches 2
    [a-z]*  - matches empty string
    [^<]    - matches a

And it's the first place that you have digits. What are you thinking it should have matched?

Also, generally, you don't want to use regex against angle-bracket stuff. Use something based on HTML::Parser, XML::Parser, or XML::LibXML, please.

-- Randal L. Schwartz, Perl hacker
Re: Angle brackets and regex problem by Fletch (Bishop) on Jan 05, 2007 at 17:29 UTC
You've told it to match one character not an > (which `1` is), one or more digits (which `3` is), zero or more lower case letters (which nothing is), and one character that's not < (which `a` is). Wherein lies the problem?

And of course there's the standard jibe that in general you want to use something like HTML::TreeBuilder or the like to parse HTML, not regexen.

Update: Curse you, ~~Red Baron~~ merlyn. :) And a hopefully useful pointer: YAPE::Regex::Explain can be helpful for obtaining a prose explanation of just what your regex means. Using it on your regex produces this:

    perl -MYAPE::Regex::Explain -le 'print YAPE::Regex::Explain->new( qr/([^>]\d+[a-z]*[^<])/ )->explain'

    The regular expression:

    (?-imsx:([^>]\d+[a-z]*[^<]))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        [^>]                     any character except: '>'
    ----------------------------------------------------------------------
        \d+                      digits (0-9) (1 or more times (matching
                                 the most amount possible))
    ----------------------------------------------------------------------
        [a-z]*                   any character of: 'a' to 'z' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
        [^<]                     any character except: '<'
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
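To see that piece-by-piece breakdown from Perl itself, here is a minimal sketch. It assumes the input is the sample string quoted elsewhere in this thread (the original question's exact input is not shown here), and it wraps each part of the pattern in its own capture group purely for illustration.

```perl
use strict;
use warnings;

# Assumed input: the sample string quoted elsewhere in this thread.
my $test = "blah <span class='small'>13a</span> blah blah";

# Same pattern as discussed, but with one capture group per piece
# (the extra groups are an addition for illustration only).
my @labels = ('[^>]', '\d+', '[a-z]*', '[^<]');
my @pieces;

if ( $test =~ /([^>])(\d+)([a-z]*)([^<])/ ) {
    @pieces = ($1, $2, $3, $4);
    for my $i (0 .. $#labels) {
        printf "%-7s matched '%s'\n", $labels[$i], $pieces[$i];
    }
}
else {
    print "No match\n";
}
```

The `[a-z]*` group ends up empty because the engine first lets it take the `a`, finds that `[^<]` cannot match the following `<`, and backtracks so that `[^<]` can consume the `a` instead.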
Re^2: Angle brackets and regex problem by emav (Pilgrim) on Jan 05, 2007 at 17:36 UTC
It's official! I'm blind! :( I expected it not to find a match. Oh, dear! So embarrassing!
Re^3: Angle brackets and regex problem by ikegami (Patriarch) on Jan 05, 2007 at 17:40 UTC
If you ever wonder what something matched, you can use

    my $at = $-[0];
    my $matched = substr($test, $at, $+[0]-$at);
    warn("Matched \"$matched\" at $at\n");

Alternatively, you could pass the argument `-Mre=debug` to `perl` (e.g. `perl -Mre=debug script.pl`). It produces a detailed (i.e. long) report of all regexp activities.

References: `@-` and `@+` are documented in perlvar. `-Mre=debug` adds `use re 'debug';` to your program. See re.
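For anyone who wants to try the `@-` / `@+` trick end to end, here is a small self-contained sketch. The input string and the pattern are assumptions taken from the rest of this thread. `$-[0]` and `$+[0]` are the start and end offsets of the whole match, as documented in perlvar.

```perl
use strict;
use warnings;

# Assumed input and pattern, taken from elsewhere in this thread.
my $test = "blah <span class='small'>13a</span> blah blah";

my ($at, $matched);
if ( $test =~ /[^>]\d+[a-z]*[^<]/ ) {
    $at      = $-[0];                             # offset where the match starts
    $matched = substr($test, $at, $+[0] - $at);   # $+[0] is the offset just past the match
    print qq{Matched "$matched" at $at\n};
}
else {
    print "No match\n";
}
```

Since there are no capture groups here, the same text is also available as `$&`, but the offsets are handy when you need to know where in the string the match landed.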
Re: Angle brackets and regex problem by ferreira (Chaplain) on Jan 05, 2007 at 17:59 UTC
If what you want (as hinted by Re^2: Angle brackets and regex problem) is to match the longest match to `/\d+[a-z]/` which is not surrounded by > and <, you may say:

    my $test = "blah <span class='small'>13a</span> blah blah";
    if ( $test =~ /(.)(\d+[a-z])(.)/ && $1 ne '>' && $3 ne '<') {
        print "No tag found: $2\n";
    } else {
        print "No match"
    }

and it is going to say "No match". While using

    my $test = "blah 13a blah blah";

would print

    No tag found: 13a

But if you want something more complex than this, like finding the content of XML tags which just contain text which matches `/\d+[a-z]*/`, listen to what merlyn said and use an HTML parser.
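The neighbour check described above can be wrapped up so both inputs are tried in one run. This is only a sketch: `untagged_token` is a hypothetical helper name, and it compares the character captured before the token (`$1`) and the one captured after it (`$3`).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the digits-plus-letter token when the
# character before it is not '>' and the character after it is not '<';
# returns undef (in scalar context) otherwise.
sub untagged_token {
    my ($text) = @_;
    if ( $text =~ /(.)(\d+[a-z])(.)/ && $1 ne '>' && $3 ne '<' ) {
        return $2;
    }
    return;
}

for my $text (
    "blah <span class='small'>13a</span> blah blah",
    "blah 13a blah blah",
) {
    my $token = untagged_token($text);
    print defined $token ? "No tag found: $token\n" : "No match\n";
}
```

Note the limits of this approach: only the leftmost candidate in the string is inspected, and the token needs a character on both sides, so a token at the very start or end of the string will not be found.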
Re^2: Angle brackets and regex problem by emav (Pilgrim) on Jan 05, 2007 at 18:08 UTC
Thank you all so much for the insightful replies. Problem solved. I may be blind but, thanks be to our wonderful brethren, I am not e-deaf. :)