in reply to Re: Escaping Regex Expressions
in thread Escaping Regex Expressions

Sorry, typo

This is what I'm using -

while (my $token = $p->get_tag("font")) {
    my $text = $p->get_trimmed_text("/font");
    print $text . "\n\n";
    if ($text =~ /Page 1 of (\d+)/) {
        print "\n\nNumber of Pages: $areaCode$1\n\n";
    }
}
The print text line above is giving -
Pageá1áofá3

and the regex won't match this despite the HTML being -
Page 1 of 3
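
One way to see what the variable really holds is to dump its character codes instead of relying on how the console draws them. The sketch below assumes the separators are no-break spaces (character 0xA0), which DOS-style code pages 437 and 850 display as 'á'. The sample string and the helper name hex_dump are illustrative, not taken from the original script.

```perl
use strict;
use warnings;

# Assumption: stand-in for the $text value printed in the loop above,
# with each separator being a no-break space (0xA0).
my $text = "Page\xA01\xA0of\xA03";

# Hypothetical helper: return the string as space-separated hex codes.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $str);
}

print hex_dump($text), "\n";   # prints: 50 61 67 65 A0 31 A0 6F 66 A0 33
```

If A0 appears where 20 (an ordinary space) was expected, a regex written with literal spaces cannot match.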

Re^3: Escaping Regex Expressions
by tachyon (Chancellor) on Aug 21, 2004 at 15:09 UTC

    The print text line above is giving - Pageá1áofá3
    and the regex won't match this despite the HTML being - Page 1 of 3

    Does not compute. HTML of Page 1 of 3 looks like this:

    Page 1 of 3

    You probably have a locale issue. Evidently &nbsp; is being transliterated to 'á'. To make it simple, if you are screen scraping for a given fixed expression, why not just use a straight regex, sans toke parser?

    Of course, if you are into hack kludge solutions, it will match /Pageá1áofá(\d+)/ and work as you desire. Failing that, fix your locale issue.

    cheers

    tachyon
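
A minimal sketch of the "straight regex" idea, applied to the raw page source before any parser has touched it. The sample markup in $html and the helper name page_count are assumptions for illustration. The thread does not show the real page, so the pattern would need adjusting to the actual HTML.

```perl
use strict;
use warnings;

# Assumption: a made-up fragment of the raw, undecoded page source.
my $html = '<font size="2">Page&nbsp;1&nbsp;of&nbsp;3</font>';

# Accept either the literal entity or ordinary whitespace between the words.
my $gap = qr/(?:&nbsp;|\s)+/;

# Hypothetical helper: return the page count, or undef if the text is absent.
sub page_count {
    my ($raw) = @_;
    return $raw =~ /Page${gap}1${gap}of${gap}(\d+)/ ? $1 : undef;
}

my $pages = page_count($html);
print "Number of Pages: $pages\n" if defined $pages;
```

Because the entity is matched as text, this does not depend on how the console or locale renders the decoded character.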

Re^3: Escaping Regex Expressions
by Eimi Metamorphoumai (Deacon) on Aug 21, 2004 at 15:13 UTC
    If the print is showing that, then that's what the variable is holding, not the &nbsp;. I'm not sure how to help you, not knowing what module you're using for tokenizing that's mangling your text like that. About the best I can think of would be to make your regexp a lot more accepting.
    if ($text =~ /Page (?:&nbsp;|\W)+ 1 (?:&nbsp;|\W)+ of (?:&nbsp;|\W)+ (\d+)/x) {
        print "\n\nNumber of Pages: $areaCode$1\n\n";
    }
    This should work (it accepts the literal '&nbsp;', or anything else that isn't a word character). The problem doesn't appear to be with your regexp, but with whatever's mangling the HTML before it gets to your regexp.
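
A narrower alternative, sketched under an assumption: the method names suggest $p is an HTML::TokeParser object, whose text methods decode entities, so each &nbsp; would arrive in $text as the single character 0xA0. If that is what is happening, the pattern only needs to treat that character like whitespace. The sample string and the helper name page_count are illustrative.

```perl
use strict;
use warnings;

# Assumption: what the parser hands back once &nbsp; has been decoded.
my $text = "Page\xA01\xA0of\xA03";

# Hypothetical helper: return the page count, or undef if there is no match.
sub page_count {
    my ($str) = @_;
    # [\s\xA0] names the no-break space explicitly, because \s does not
    # match 0xA0 in a plain byte string under default regex semantics.
    return $str =~ /Page[\s\xA0]+1[\s\xA0]+of[\s\xA0]+(\d+)/ ? $1 : undef;
}

my $pages = page_count($text);
print "Number of Pages: $pages\n" if defined $pages;
```

Replacing the character first, with `$text =~ tr/\xA0/ /;`, would let the original /Page 1 of (\d+)/ pattern stay unchanged.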