Escaping Regex Expressions

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Escaping Regex Expressions by davido (Cardinal) on Aug 21, 2004 at 15:02 UTC
If what you mean by "doesn't match" is "spews lots of errors", read on...

    use strict;
    use warnings;

    $_ = 'Page&nbsp;1&nbsp;of&nbsp;5';

    if(Page\&nbsp\;1\&nbsp\;of\&nbsp\;(/d+))
    {
        print "Number of Pages = ".$1;
    }

    __OUTPUT__
    Backslash found where operator expected at test.pl line 9, near "Page\"
    Backslash found where operator expected at test.pl line 9, near "&nbsp\"
            (Missing operator before \?)
    Backslash found where operator expected at test.pl line 9, near "1\"
            (Missing operator before \?)
    Backslash found where operator expected at test.pl line 9, near "&nbsp\"
            (Missing operator before \?)
    Backslash found where operator expected at test.pl line 9, near "of\"
    Backslash found where operator expected at test.pl line 9, near "&nbsp\"
            (Missing operator before \?)
    syntax error at test.pl line 9, near "Page\"
    Search pattern not terminated at test.pl line 9.

Ok, let's solve this one step at a time. First, the regexp operator is m//, which can be abbreviated as // most of the time. I don't see any regexp operator in your code. So we'll correct that part...

    use strict;
    use warnings;

    $_ = 'Page&nbsp;1&nbsp;of&nbsp;5';

    if(/Page\&nbsp\;1\&nbsp\;of\&nbsp\;(/d+)/)
    {
        print "Number of Pages = ".$1;
    }

    __OUTPUT__
    Unmatched ( in regex; marked by <-- HERE in m/Page&nbsp;1&nbsp;of&nbsp;( <-- HERE / at test.pl line 9.

Hmmm, what's this unmatched ( in regexp business? Oh, I see. You've got (/d+)/. Perl thinks that the '/' in /d+ is the end of the regexp, which leaves the opening parenthesis unclosed. You probably really meant the \d+ metacharacter and quantifier. So we'll fix that...

    use strict;
    use warnings;

    $_ = 'Page&nbsp;1&nbsp;of&nbsp;5';

    if(/Page\&nbsp\;1\&nbsp\;of\&nbsp\;(\d+)/)
    {
        print "Number of Pages = ".$1;
    }

    __OUTPUT__
    Number of Pages = 5

Voilà, it works! Of course this makes the assumption that you're testing your regexp against a string held in $_. If instead you're testing against a string held in some other scalar variable, such as $string, you'll need to use the binding operator also. The binding operator is '=~', and is used like this:

    $string =~ m/regexp goes here/

See perlretut and perlrequick for an introduction to Perl's regular expressions. For additional reading, you can dive into perlre and perlop.

Update: I see I've wasted my time, because your original question wasn't really the question you wanted to ask. It is foolish to retype your code when inserting it here. Cut and paste it, or boil it down to a tiny script that replicates the behavior and cut and paste that. Retyping it obviously introduced numerous other errors and led us down the wrong path toward correcting them. Your real problem, assuming you've now typed it correctly, is probably that your input text is not what you think it is.

Dave
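As a minimal sketch of the binding operator described above: the variable names `$string` and `$pages` and the sample text are assumptions made for illustration, not code from the original question. The pattern is the corrected one from this reply, bound to a named scalar instead of `$_`.

```perl
use strict;
use warnings;

# Hypothetical input; $string stands in for whatever scalar holds the text.
my $string = 'Page&nbsp;1&nbsp;of&nbsp;5';

# =~ binds the match to $string rather than the default $_.
my $pages;
if ($string =~ m/Page\&nbsp\;1\&nbsp\;of\&nbsp\;(\d+)/) {
    $pages = $1;    # copy the capture before another match can overwrite $1
    print "Number of Pages = $pages\n";
}
```

Copying `$1` into a lexical straight away is a small defensive habit, since any later successful match replaces the capture variables.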
Re: Escaping Regex Expressions by Eimi Metamorphoumai (Deacon) on Aug 21, 2004 at 14:43 UTC
Your code, as written, won't compile. You don't have any `//` delimiting the regexp, and you have `/d` instead of `\d`. You don't actually have to escape either `&` or `;`, though there isn't any actual harm in doing so. Here's code that works.

    $_ = "Page&nbsp;1&nbsp;of&nbsp;47";
    if (/Page&nbsp;1&nbsp;of&nbsp;(\d+)/){
        print "Number of Pages = $1";
    }
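The point about escaping can be checked directly. The sketch below uses a made-up sample string and hypothetical variable names; it runs the escaped and unescaped patterns against the same text, then shows `\Q...\E` as one way to escape a literal without deciding by hand which characters need a backslash.

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the asker's page.
my $text = 'Page&nbsp;1&nbsp;of&nbsp;47';

# & and ; are not regex metacharacters, so both patterns mean the same thing.
my ($escaped)   = $text =~ /Page\&nbsp\;1\&nbsp\;of\&nbsp\;(\d+)/;
my ($unescaped) = $text =~ /Page&nbsp;1&nbsp;of&nbsp;(\d+)/;
print "escaped: $escaped, unescaped: $unescaped\n";

# \Q...\E applies quotemeta to an interpolated literal.
my $literal = 'Page&nbsp;1&nbsp;of&nbsp;';
my ($quoted) = $text =~ /\Q$literal\E(\d+)/;
print "quoted: $quoted\n";
```

All three captures come out the same here, which is why the extra backslashes in the original pattern were harmless rather than the cause of the failure.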
Re^2: Escaping Regex Expressions by Anonymous Monk on Aug 21, 2004 at 14:52 UTC
Sorry, typo. This is what I'm using:

    while (my $token = $p->get_tag("font"))
    {
        my $text = $p->get_trimmed_text("/font");
        print $text."\n\n";
        if($text =~ /Page&nbsp;1&nbsp;of&nbsp;(\d+)/)
        {
            print "\n\nNumber of Pages: $areaCode$1\n\n";
        }
    }

The print text line above is giving:

    Pageá1áofá3

and the regex won't match this, despite the HTML being:

    Page&nbsp;1&nbsp;of&nbsp;3
Re^3: Escaping Regex Expressions by tachyon (Chancellor) on Aug 21, 2004 at 15:09 UTC
"The print text line above is giving - Pageá1áofá3 and the regex won't match this despite the HTML being - `Page&nbsp;1&nbsp;of&nbsp;3`"

Does not compute. HTML of `Page&nbsp;1&nbsp;of&nbsp;3` looks like this: Page 1 of 3

You probably have a locale issue. Evidently `&nbsp;` is being transliterated to 'á'. To make it simple, if you are screen scraping for a given fixed expression, why not just use a straight regex, sans toke parser? Of course, if you are into hack kludge solutions, it will match /Pageá1áofá(\d+)/ and work as you desire. Failing that, fix your locale issue.

cheers

tachyon
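Before settling on a cause, it can help to see which characters the string really contains rather than how the terminal draws them. The following is only a sketch: the sample value assumes, without confirmation from the thread, that each `&nbsp;` arrived as the single character 0xA0, and `$text` and `$visible` are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the value returned by the parser
# (assumption: each &nbsp; has become the character 0xA0).
my $text = "Page\xA01\xA0of\xA03";

# Show every character outside printable ASCII as a \x{..} escape,
# so the output does not depend on the console's code page.
(my $visible = $text) =~ s/([^\x20-\x7E])/sprintf('\\x{%02X}', ord $1)/ge;
print "$visible\n";
```

Running the same substitution on the real `$text` inside the loop would show whether the separators are 0xA0, a literal `&nbsp;`, or something else entirely.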
Re^3: Escaping Regex Expressions by Eimi Metamorphoumai (Deacon) on Aug 21, 2004 at 15:13 UTC
If the print is showing that, then that's what the variable is holding, not the `&nbsp;`. I'm not sure how to help you, not knowing what module you're using for tokenizing that's mangling your text like that. About the best I can think of would be to make your regexp a lot more accepting.

    if($text =~ /Page (?:&nbsp;|\W)+ 1 (?:&nbsp;|\W)+ of (?:&nbsp;|\W)+ (\d+)/x)
    {
        print "\n\nNumber of Pages: $areaCode$1\n\n";
    }

This should work (it accepts the literal `&nbsp;`, or anything else that isn't a word character). The problem doesn't appear to be with your regexp, but with whatever's mangling the HTML before it gets to your regexp.
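One explanation that fits the printed output, though nothing in the thread confirms it, is that the tokenizer decodes `&nbsp;` into the no-break space character (0xA0), which DOS code pages 437 and 850 happen to display as 'á'. Under that assumption, a narrower alternative to the accepting pattern above is to match 0xA0 explicitly. The names `$gap` and `$pages` and the sample string are hypothetical.

```perl
use strict;
use warnings;

# Assumption: the tokenizer has already turned each &nbsp; into chr(0xA0).
my $text = "Page\xA01\xA0of\xA03";

# Accept a decoded no-break space or ordinary whitespace between the words.
my $gap = qr/[\xA0\s]+/;

my ($pages) = $text =~ /Page${gap}1${gap}of${gap}(\d+)/;
print "Number of Pages: $pages\n" if defined $pages;
```

Listing `\xA0` alongside `\s` avoids relying on whether `\s` matches the no-break space, which can depend on the Perl version and on how the string is encoded.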