Ignatius Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Brothers in code:
I'm trying to write a regular expression that matches the text between the A tag and the closing /A on an HTML page, when that text is a particular path (for example /path/). This works perfectly:

/\">$d<\/A>/
But I can't find the regexp that matches not only the path itself but everything that begins with it (that is, the files and subdirectories of that directory), for example /path/subpath/ or /path/subpath/file.html. Every attempt I have made fails, for example:
/\">$d(.*)<\/A>/
Any suggestions on how to do it? I'm going crazy, and it must be a silly question :(
Best Regards
Ignatius Monk
The Ciberlibrarian Brother of the Perl Order

Re: regexp problem
by Beatnik (Parson) on Jul 06, 2001 at 14:51 UTC
    Check out HTML::LinkExtor so you won't have to parse HTML tags manually (which can be a pain in the proverbial ass).

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: regexp problem
by davorg (Chancellor) on Jul 06, 2001 at 14:54 UTC

    Parsing HTML using regular expressions is a really bad idea. It will almost always end up with over-complex regexes which deal with all sorts of unlikely scenarios.

    Much better to use the right tool for the job - HTML::Parser (or, in this case it looks like HTML::LinkExtor might be a better bet).
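
    For illustration, here is a minimal sketch of the HTML::LinkExtor route. It assumes that the href of each link carries the same path as the link text (as in a typical directory listing), since HTML::LinkExtor reports link attributes rather than the text between the tags. The sample page and the value of $d are hypothetical.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page and directory prefix, for illustration only.
my $html = <<'HTML';
<a href="/path/">/path/</a>
<a href="/path/subpath/file.html">/path/subpath/file.html</a>
<a href="/other/">/other/</a>
HTML
my $d = '/path/';

# With no callback, the parser collects the links for retrieval via links().
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

my @found;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;    # e.g. ('a', href => '/path/')
    next unless $tag eq 'a' && defined $attr{href};
    # \Q...\E matches $d literally; ^ anchors it at the start of the href.
    push @found, $attr{href} if $attr{href} =~ /^\Q$d\E/;
}
print "$_\n" for @found;
```

    If the link text can differ from the href, HTML::Parser with start, text and end handlers would be the more general tool.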

    --
    <http://www.dave.org.uk>

    Perl Training in the UK <http://www.iterative-software.com>

Re: regexp problem
by tachyon (Chancellor) on Jul 06, 2001 at 16:15 UTC

    First let me agree with the above advice that using a specialist HTML parser is the best option. That said, here is the solution you want:

    $_ = '<a href="foo">/path/sub/dir</a>';
    $d = '/path';
    $d = quotemeta $d;
    print $1 if /"\s*>.*?$d(.*?)<\/A>/i;
    # prints /sub/dir

    # here it is with the regex expanded
    $_ = '<a href="foo">/path/sub/dir</a>';
    $d = '/path';
    $d = quotemeta $d;
    print $1 if m   # match regex
    /               # opening delim
      "             # literal "
      \s*           # plus 0 or more spaces
      >             # end of <a href tag
      .*?           # some leading stuff (minimum)
      $d            # our path
      (.*?)         # our subdirs, captured into group 1
      <\/A>         # the closing tag
    /ix;            # /i => case insensitive for </a> tag
                    # /x => allow comments

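    You need the quotemeta to make $d safe to interpolate into the regex: it backslash-escapes every regex metacharacter in the path (for example the . in file.html), so the path is matched literally. The / characters in $d are not a problem in themselves, because Perl finds the closing regex delimiter before it interpolates the variable. After interpolating the quoted $d the regex is effectively /"\s*>.*?\/path(.*?)<\/A>/i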
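
    As a small illustration of why the escaping matters, here is a sketch that compares an unquoted and a quoted interpolation. The path and the sample link are hypothetical, and \Q...\E is used as the inline equivalent of calling quotemeta.

```perl
use strict;
use warnings;

# Hypothetical path containing a regex metacharacter (the dot).
my $d    = '/path/file.html';
# Hypothetical link text that differs from $d: 'X' where the dot should be.
my $html = '<a href="x">/path/fileXhtml</a>';

# Unquoted: the . in $d acts as "any character", so it matches the X.
my $raw    = ($html =~ /">$d<\/a>/i)      ? 1 : 0;

# Quoted: \Q...\E escapes the dot, so only a literal . is accepted.
my $quoted = ($html =~ /">\Q$d\E<\/a>/i) ? 1 : 0;

print "raw=$raw quoted=$quoted\n";    # prints raw=1 quoted=0
```

    Note that the slashes inside $d do not end the regex in either case; only the unescaped dot changes what is matched.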

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Dear bro:
      Thanks, your solution was exactly what I needed :)
      Thanks to all the brothers that answered me too.
      Ignatius Monk
      The Ciberlibrarian Brother of the Perl Order