Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I'm trying to match the following segment of HTML, "Page 1 of 5", with this expression:
if(Page\&nbsp\;1\&nbsp\;of\&nbsp\;(/d+)) { print "Number of Pages = ".$1; }
but it doesn't match.
Why?

Re: Escaping Regex Expressions
by davido (Cardinal) on Aug 21, 2004 at 15:02 UTC

    If what you mean by "doesn't match" is "spews lots of errors", read on...

    use strict;
    use warnings;

    $_ = 'Page 1 of 5';
    if(Page\&nbsp\;1\&nbsp\;of\&nbsp\;(/d+)) {
        print "Number of Pages = ".$1;
    }

    __OUTPUT__
    Backslash found where operator expected at test.pl line 9, near "Page\"
    Backslash found where operator expected at test.pl line 9, near "&nbsp\"
            (Missing operator before \?)
    Backslash found where operator expected at test.pl line 9, near "1\"
            (Missing operator before \?)
    Backslash found where operator expected at test.pl line 9, near "&nbsp\"
            (Missing operator before \?)
    Backslash found where operator expected at test.pl line 9, near "of\"
    Backslash found where operator expected at test.pl line 9, near "&nbsp\"
            (Missing operator before \?)
    syntax error at test.pl line 9, near "Page\"
    Search pattern not terminated at test.pl line 9.

    Ok, let's solve this one step at a time. First, the regexp operator is m//, which can be abbreviated as // most of the time. I don't see any regexp operator in your code. So we'll correct that part...

    use strict;
    use warnings;

    $_ = 'Page&nbsp;1&nbsp;of&nbsp;5';
    if(/Page\&nbsp\;1\&nbsp\;of\&nbsp\;(/d+)/) {
        print "Number of Pages = ".$1;
    }

    __OUTPUT__
    Unmatched ( in regex; marked by <-- HERE in m/Page&nbsp;1&nbsp;of&nbsp;( <-- HERE / at test.pl line 9.

    Hmmm, what's this unmatched ( in regexp business? Oh, I see. You've got (/d+)/. The regexp thinks that the '/' in /d+ is the end of the regexp. You probably really meant the \d+ metacharacter and quantifier. So we'll fix that...

    use strict;
    use warnings;

    $_ = 'Page&nbsp;1&nbsp;of&nbsp;5';
    if(/Page\&nbsp\;1\&nbsp\;of\&nbsp\;(\d+)/) {
        print "Number of Pages = ".$1;
    }

    __OUTPUT__
    Number of Pages = 5

    Voilà, it works!

    Of course this makes the assumption that you're testing your regexp against a string held in $_. If instead you're testing against a string held in some other scalar variable, such as $string, you'll need to use the binding operator also. The binding operator is '=~', and is used like this:

    $string =~ m/regexp goes here/
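
    To make the binding operator concrete, here is a minimal sketch. The variable names $string and $pages are only placeholders for illustration, and the sample input is assumed to be the raw HTML with the entities still in place.

```perl
use strict;
use warnings;

# $string is a hypothetical name; use whatever scalar holds your HTML.
my $string = 'Page&nbsp;1&nbsp;of&nbsp;5';

# Without a binding operator the match would be tried against $_.
# With =~ it is tried against $string instead.
my $pages;
if ($string =~ m/Page&nbsp;1&nbsp;of&nbsp;(\d+)/) {
    $pages = $1;    # copy the capture before another match overwrites $1
    print "Number of Pages = $pages\n";
}
```

    Copying $1 into a named variable right away is a habit worth having, since the next successful match will replace it.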

    See perlretut and perlrequick for an introduction to Perl's regular expressions. For additional reading, you can dive into perlre and perlop.

    Update: I see I've wasted my time, because your original question wasn't really the question you wanted to ask. It is foolish to retype your code when inserting it here. Cut and paste it, or boil it down to a tiny script that replicates the behavior and cut and paste that. Retyping it obviously introduced numerous other errors and led us down the wrong path toward correcting them. Your real problem, assuming you've now typed it correctly, is probably that your input text is not what you think it is.


    Dave

Re: Escaping Regex Expressions
by Eimi Metamorphoumai (Deacon) on Aug 21, 2004 at 14:43 UTC
    Your code, as written, won't compile. You don't have any // delimiting the regexp, and you have /d instead of \d. You don't actually have to escape either & or ;, though there isn't any actual harm in doing so. Here's code that works.
    $_ = "Page&nbsp;1&nbsp;of&nbsp;47";
    if (/Page&nbsp;1&nbsp;of&nbsp;(\d+)/) {
        print "Number of Pages = $1";
    }
      Sorry, typo

      This is what I'm using:

      while (my $token = $p->get_tag("font")) {
          my $text = $p->get_trimmed_text("/font");
          print $text."\n\n";
          if($text =~ /Page&nbsp;1&nbsp;of&nbsp;(\d+)/) {
              print "\n\nNumber of Pages: $areaCode$1\n\n";
          }
      }

      The print $text line above is giving:
      Pageá1áofá3

      and the regex won't match this, despite the HTML being:
      Page&nbsp;1&nbsp;of&nbsp;3

        The print $text line above is giving: Pageá1áofá3
        and the regex won't match this, despite the HTML being: Page&nbsp;1&nbsp;of&nbsp;3

        Does not compute. HTML of Page&nbsp;1&nbsp;of&nbsp;3 looks like this:

        Page 1 of 3

        You probably have a locale issue. Evidently &nbsp; is being transliterated to 'á'. To make it simple, if you are screen scraping for a given fixed expression, why not just use a straight regex, sans TokeParser?

        Of course, if you are into hack kludge solutions, /Pageá1áofá(\d+)/ will match and work as you desire. Failing that, fix your locale issue.

        cheers

        tachyon

        If the print is showing that, then that's what the variable is holding, not the &nbsp;. I'm not sure how to help you, not knowing what module you're using for tokenizing that's mangling your text like that. About the best I can think of would be to make your regexp a lot more accepting.
        if($text =~ /Page (?:&nbsp;|\W)+ 1 (?:&nbsp;|\W)+ of (?:&nbsp;|\W)+ (\d+)/x) {
            print "\n\nNumber of Pages: $areaCode$1\n\n";
        }
        That should work (it accepts the literal '&nbsp;', or anything else that isn't a word character). The problem doesn't appear to be with your regexp, but with whatever's mangling the HTML before it gets to your regexp.
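
        One plausible explanation, not confirmed anywhere in this thread: get_tag and get_trimmed_text look like HTML::TokeParser methods, and that module decodes entities in the text it returns. If so, each &nbsp; arrives as the single character 0xA0, which an old DOS-style console code page happens to display as 'á'. Under that assumption the text is not really mangled, and matching \xA0 directly is tidier than matching any non-word character. The sketch below is self-contained and only simulates the decoded text; page_count and $gap are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the page count, or undef if nothing matches.
sub page_count {
    my ($text) = @_;
    # Accept either the literal entity or the decoded character 0xA0.
    my $gap = qr/(?:&nbsp;|\xA0)+/;
    return $text =~ /Page${gap}1${gap}of${gap}(\d+)/ ? $1 : undef;
}

# Stand-in for what the tokenizer may have returned (assumption: 0xA0
# where each &nbsp; used to be), plus the raw HTML for comparison.
my $decoded = "Page\xA01\xA0of\xA03";
my $raw     = 'Page&nbsp;1&nbsp;of&nbsp;3';

print "decoded: ", page_count($decoded), "\n";
print "raw:     ", page_count($raw), "\n";
```

        To check the assumption on the real data, printing join(' ', map { sprintf '%02X', ord } split //, $text) shows exactly which characters sit between the words.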