agynr has asked for the wisdom of the Perl Monks concerning the following question:

The code given below searches for something in an HTML file:

my $pattern='(\n\s*[0-9]{1,3}\s*\n\s*<page>\s*\n|\n\s*-[0-9]{1,3}-\s*\n\s*<page>\s*\n|\n\s*[A-Za-z]-[0-9]{1,3}\s*\n\s*<page>\s*\n|\n\s*page\s*[0-9]{1,3}\s*of\s*[0-9]{1,3}\s*\n)';

Can you please tell me what the above string searches for in an HTML file?

Re: what the pattern searches for
by ysth (Canon) on Dec 17, 2004 at 10:17 UTC
    Sure. To get an annotated description:
    use YAPE::Regex::Explain;
    my $pattern='(\n\s*[0-9]{1,3}\s*\n\s*<page>\s*\n|\n\s*-[0-9]{1,3}-\s*\n\s*<page>\s*\n|\n\s*[A-Za-z]-[0-9]{1,3}\s*\n\s*<page>\s*\n|\n\s*page\s*[0-9]{1,3}\s*of\s*[0-9]{1,3}\s*\n)';
    print YAPE::Regex::Explain->new($pattern)->explain;
    which produces:
    The regular expression:

    (?-imsx:(\n\s*[0-9]{1,3}\s*\n\s*<page>\s*\n|\n\s*-[0-9]{1,3}-\s*\n\s*<page>\s*\n|\n\s*[A-Za-z]-[0-9]{1,3}\s*\n\s*<page>\s*\n|\n\s*page\s*[0-9]{1,3}\s*of\s*[0-9]{1,3}\s*\n))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
    (                        group and capture to \1:
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    <page>                   '<page>'
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    |                        OR
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    -                        '-'
    ----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    -                        '-'
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    <page>                   '<page>'
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    |                        OR
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    [A-Za-z]                 any character of: 'A' to 'Z', 'a' to 'z'
    ----------------------------------------------------------------------
    -                        '-'
    ----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    <page>                   '<page>'
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    |                        OR
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    page                     'page'
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    of                       'of'
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    [0-9]{1,3}               any character of: '0' to '9' (between 1
                             and 3 times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                             more times (matching the most amount
                             possible))
    ----------------------------------------------------------------------
    \n                       '\n' (newline)
    ----------------------------------------------------------------------
    )                        end of \1
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
    Now, if you would tell us something about the context in which this is used (other than just that it is used on html), maybe a higher-level description would be possible.
Re: what the pattern searches for
by BrowserUk (Patriarch) on Dec 17, 2004 at 10:24 UTC

    It appears to be looking for and capturing lines containing page numbers in one of 4 forms. Roughly:

    ' nnn <page> '
    ' - nnn - <page> '
    ' A-nnn <page> '
    ' page nnn of mmm '

    This becomes fairly clear if you break up and expand the regex a little using /x.

    my $pattern = qr[
        (
            \n\s* [0-9]{1,3} \s*\n\s* <page> \s*\n
        |
            \n\s* - [0-9]{1,3} - \s*\n\s* <page> \s*\n
        |
            \n\s* [A-Za-z]-[0-9]{1,3} \s*\n\s* <page> \s*\n
        |
            \n\s* page \s* [0-9]{1,3} \s* of \s* [0-9]{1,3} \s*\n
        )
    ]x;

    There are many ways the regex could be improved, but that isn't what you asked :)
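
    To check the reading above, here is a minimal sketch that runs the same pattern against a few made-up strings. The sample texts and the find_marker helper are hypothetical, invented purely for illustration; only the pattern itself comes from the question. Each alternative starts and ends with a newline, so the pattern has to be applied to text held in a single string. With Perl's default line-by-line reading it would never match.

```perl
use strict;
use warnings;

# The pattern from the question, laid out with /x.
my $pattern = qr[
    (
        \n\s* [0-9]{1,3}          \s*\n\s* <page> \s*\n
    |
        \n\s* - [0-9]{1,3} -      \s*\n\s* <page> \s*\n
    |
        \n\s* [A-Za-z]-[0-9]{1,3} \s*\n\s* <page> \s*\n
    |
        \n\s* page \s* [0-9]{1,3} \s* of \s* [0-9]{1,3} \s*\n
    )
]x;

# Hypothetical helper: returns the captured text ($1), or undef if nothing matches.
sub find_marker {
    my ($text) = @_;
    return $text =~ $pattern ? $1 : undef;
}

# Made-up sample texts, one per form, plus two that should not match.
my @samples = (
    [ 'bare number'    => "text\n 12 \n<page>\nmore" ],
    [ 'dashed number'  => "text\n-7-\n  <page>\nmore" ],
    [ 'letter-number'  => "text\nA-103\n<page>\nmore" ],
    [ 'page n of m'    => "text\npage 3 of 10\nmore" ],
    [ 'no <page> line' => "text\n12\nmore" ],
    [ 'capital Page'   => "text\nPage 3 of 10\nmore" ],   # the pattern is case-sensitive
);

for my $sample (@samples) {
    my ($name, $text) = @$sample;
    my $hit = find_marker($text);
    printf "%-15s %s\n", $name, defined $hit ? 'matches' : 'no match';
}
```

    The first four samples should report a match and the last two should not: a number with no following <page> line is ignored, and 'Page' with a capital P fails because the pattern has no /i.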

