regexp problem

Ignatius Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Brothers in code:
I'm trying to write an expression that matches the text between the A anchor and the /A on an HTML page when that text is a particular path (for example /path/). It works perfectly with:

/\">$d<\/A>/

But I can't find the regexp that matches not only the path itself but also everything that begins with it (that is, files and subdirectories of that directory), for example /path/subpath/ or /path/subpath/file.html. It fails on every attempt I have made, for example:

/\">$d(.*)<\/A>/

Any suggestions on how to do it? I'm going crazy, and it must be a silly question :(
Best Regards
Ignatius Monk
The Ciberlibrarian Brother of the Perl Order


Replies are listed 'Best First'.
Re: regexp problem by Beatnik (Parson) on Jul 06, 2001 at 14:51 UTC
Check HTML::LinkExtor so you won't have to parse HTML tags manually (which can be a pain in the proverbial ass).
Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
Re: regexp problem by davorg (Chancellor) on Jul 06, 2001 at 14:54 UTC
Parsing HTML using regular expressions is a really bad idea. It will almost always end up with over-complex regexes which deal with all sorts of unlikely scenarios. Much better to use the right tool for the job - HTML::Parser (or, in this case, it looks like HTML::LinkExtor might be a better bet).
--
<http://www.dave.org.uk>
Perl Training in the UK <http://www.iterative-software.com>
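For readers who want to try the module suggested in the two replies above, here is a minimal sketch of how HTML::LinkExtor could be used for this. It rests on assumptions that are not in the original question: the sample page is made up, the helper name links_under is hypothetical, and the href of each link is assumed to hold the same path that is shown as the link text (HTML::LinkExtor returns link attributes, not the text between the tags).

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample page; each href is assumed to repeat the path
# that is displayed as the link text.
my $html = <<'HTML';
<a href="/path/">/path/</a>
<a href="/path/subpath/">/path/subpath/</a>
<a href="/path/subpath/file.html">/path/subpath/file.html</a>
<a href="/other/">/other/</a>
HTML

# Return every <a href> value in $html that begins with $prefix.
sub links_under {
    my ($html, $prefix) = @_;
    my $parser = HTML::LinkExtor->new;   # no callback: links are collected
    $parser->parse($html);
    $parser->eof;
    my @found;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;       # e.g. ('a', href => '/path/')
        next unless $tag eq 'a' && defined $attr{href};
        # Plain string prefix test, so no regexp escaping is needed.
        push @found, $attr{href} if index($attr{href}, $prefix) == 0;
    }
    return @found;
}

print "$_\n" for links_under($html, '/path/');
```

If the link text itself is needed rather than the href, HTML::Parser with start, text and end handlers would be the more general route.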
Re: regexp problem by tachyon (Chancellor) on Jul 06, 2001 at 16:15 UTC
First, let me agree with the above advice that using a specialist HTML parser is the best option. That said, here is the solution you want:

$_ = '<a href="foo">/path/sub/dir</a>';
$d = '/path';
$d = quotemeta $d;
print $1 if /"\s*>.*?$d(.*?)<\/A>/i;   # prints /sub/dir

# here it is with the regex expanded
$_ = '<a href="foo">/path/sub/dir</a>';
$d = '/path';
$d = quotemeta $d;
print $1 if
m           # match regex
/           # opening delim
"           # literal "
\s*         # plus 0 or more spaces
>           # end of <a href tag
.*?         # some leading stuff (minimum)
$d          # our path
(.*?)       # our subdirs captured into $1
<\/A>       # the closing tag
/ix;        # /i => case insensitive for </a> tag
            # /x => allow comments

You need the quotemeta to make $d safe to interpolate into the regex: it backslash-escapes the / path delimiters and any regex metacharacters in the path (a . in a file name, for example), so that after interpolating $d the pattern is `"\s*>.*?\/path(.*?)<\/A>` and the path is matched literally.

cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
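As a small runnable variation on the reply above, the sketch below applies the same idea to several link lines at once. The sample lines and the helper name rest_of_path are hypothetical. It also departs from the original in two labeled ways: it uses \Q...\E, the inline form of quotemeta, and it drops the leading "some stuff" part of the pattern so that the link text has to begin with the directory.

```perl
use strict;
use warnings;

# Return the part of the link text that follows $dir, or undef when the
# link text does not begin with $dir. (Hypothetical helper.)
sub rest_of_path {
    my ($line, $dir) = @_;
    # m{...} delimiters avoid having to escape the / in </a>;
    # \Q$dir\E interpolates $dir with all metacharacters escaped.
    return $line =~ m{"\s*>\Q$dir\E(.*?)</a>}i ? $1 : undef;
}

# Hypothetical sample lines, one link per line.
my @lines = (
    '<a href="a">/path/</a>',
    '<a href="b">/path/subpath/</a>',
    '<a href="c">/path/subpath/file.html</a>',
    '<a href="d">/other/</a>',
);

for my $line (@lines) {
    my $rest = rest_of_path($line, '/path/');
    print defined $rest ? "match, rest='$rest'\n" : "no match\n";
}
```

The first three lines match (the first with an empty remainder) and the last one does not. This only holds for simple, one-link-per-line markup like the sample; for real pages the parser modules recommended earlier in the thread remain the safer choice.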
Re: Re: regexp problem by Ignatius Monk (Novice) on Jul 09, 2001 at 11:22 UTC
Dear bro: Thanks, your solution was exactly what I needed :) Thanks also to all the brothers who answered me.
Ignatius Monk
The Ciberlibrarian Brother of the Perl Order