Re^2: Escaping Regex Expressions

Sorry, typo.

This is what I'm using:

while (my $token = $p->get_tag("font"))
{
    my $text = $p->get_trimmed_text("/font");
    print $text."\n\n";

    if($text =~ /Page&nbsp;1&nbsp;of&nbsp;(\d+)/)
    {
        print "\n\nNumber of Pages: $areaCode$1\n\n";
    }
}

The print line above outputs:

Pageá1áofá3

and the regex won't match this, despite the HTML being:

Page&nbsp;1&nbsp;of&nbsp;3


Re^3: Escaping Regex Expressions by tachyon (Chancellor) on Aug 21, 2004 at 15:09 UTC
The print line above is giving Pageá1áofá3, and the regex won't match this despite the HTML being `Page&nbsp;1&nbsp;of&nbsp;3`

Does not compute. HTML of `Page&nbsp;1&nbsp;of&nbsp;3` looks like this when rendered: Page 1 of 3

You probably have a locale issue. Evidently `&nbsp;` is being transliterated to 'á'. To make it simple: if you are screen scraping for a given fixed expression, why not just use a straight regex, sans toke parser? Of course, if you are into hack kludge solutions, /Pageá1áofá(\d+)/ will match and work as you desire. Failing that, fix your locale issue.

cheers

tachyon
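A minimal sketch of the straight-regex approach suggested above, assuming the raw page source is already in a string. The helper name `page_count_from_html` and the sample markup are made up for illustration. The entity is never decoded here, so the pattern can match the literal text `&nbsp;`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pull the page count straight
# out of the raw, undecoded HTML, where the entity is still the literal
# text "&nbsp;".
sub page_count_from_html {
    my ($html) = @_;
    return $html =~ /Page&nbsp;1&nbsp;of&nbsp;(\d+)/ ? $1 : undef;
}

# Made-up sample markup, for illustration only.
my $html  = '<font size="2">Page&nbsp;1&nbsp;of&nbsp;3</font>';
my $pages = page_count_from_html($html);
print "Number of Pages: $pages\n" if defined $pages;
```

This only works while the markup keeps that exact form. If the site switches to plain spaces, the pattern silently stops matching.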
Re^3: Escaping Regex Expressions by Eimi Metamorphoumai (Deacon) on Aug 21, 2004 at 15:13 UTC
If the print is showing that, then that's what the variable is holding, not the `&nbsp;`. I'm not sure how to help you, not knowing what module you're using for tokenizing that's mangling your text like that. About the best I can think of would be to make your regexp a lot more accepting:

if($text =~ /Page (?:&nbsp;|\W)+ 1 (?:&nbsp;|\W)+ of (?:&nbsp;|\W)+ (\d+)/x)
{
    print "\n\nNumber of Pages: $areaCode$1\n\n";
}

This should work (it accepts the literal '&nbsp;', or anything else that isn't a word character). The problem doesn't appear to be with your regexp, but with whatever's mangling the HTML before it gets to your regexp.
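A narrower variant of the same idea, offered as a sketch rather than a diagnosis. The `get_tag` / `get_trimmed_text` calls look like HTML::TokeParser, though the thread never confirms it. That module decodes entities in the text it returns, so `&nbsp;` would arrive as the single character 0xA0, which a console using code page 437 or 850 displays as 'á'. Under that assumption, the hypothetical helper `page_count` below accepts a literal `&nbsp;`, a decoded non-breaking space, or ordinary whitespace between the words.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): tolerate the separator in any
# of three forms: the literal entity text, the decoded character 0xA0,
# or ordinary whitespace.
sub page_count {
    my ($text) = @_;
    my $gap = qr/(?:&nbsp;|\x{A0}|\s)+/;
    return $text =~ /Page${gap}1${gap}of${gap}(\d+)/ ? $1 : undef;
}

# Illustrative inputs: undecoded entity, decoded entity, plain spaces.
my @samples = (
    "Page&nbsp;1&nbsp;of&nbsp;3",
    "Page\x{A0}1\x{A0}of\x{A0}7",
    "Page 1 of 12",
);

for my $sample (@samples) {
    my $pages = page_count($sample);
    print defined $pages ? "Number of Pages: $pages\n" : "no match\n";
}
```

Unlike `\W+`, this gap pattern rejects arbitrary punctuation between the words, which makes false matches less likely but also means any other separator has to be added explicitly.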