Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I realise this is an incredibly basic question, but after reading the relevant man pages, chapters from Perl books, and hours of experimenting, I still can't work out the answer... I want to write a regex that will search a multi-line string -- starting with '<noframes>' and ending with '</noframes>' -- for hypertext links that do not include either 'netscape.com' or 'microsoft.com' anywhere in the URL portion. I then want to extract this URL using the standard $<integer_here> method. The match should be case-insensitive.

Sample input (value of $page) is:
===

<noframes> <a HREF="http://www.microsoft.com/browser"> <A href ="http://perlmonks.com/" </noframes>
===
My latest, yet clearly erroneous, attempt is:
===
$page =~ m|<noframes>(?:.*?)<a href(?:\s?)=(?:\s?)"(http://(?:.*?)[^netscape\.com|^microsoft\.com](?:.*?))"|is;
$url_containing_neither_value = $1;
===
I would appreciate any help you can provide.

Thank you. :)

Re: Question Regarding Regular Expressions and Negated Character Classes
by thpfft (Chaplain) on Jul 14, 2002 at 17:43 UTC

    Any time I see regexes and HTML in the same question I get a queasy feeling born of unpleasant experience. So instead of answering your question, and since Ovid doesn't seem to be here just now, how about (untested):

    use HTML::TokeParser::Simple;

    my $page = ...;
    my $alternative = find_alternate_page( $page );

    sub find_alternate_page {
        my $page = shift;
        return undef unless $page;
        my $p = HTML::TokeParser::Simple->new( \$page );
        my $looking = 0;
        while ( my $token = $p->get_token ) {
            $looking = 1 if $token->is_start_tag( 'noframes' );
            return undef if $token->is_end_tag( 'noframes' );
            if ( $looking && $token->is_start_tag( 'a' ) ) {
                return $token->return_attr->{href};
            }
        }
    }

    Decisions about which urls interest you and which don't are easy to make once the address is retrieved, and probably best done separately from the retrieval itself, since you'll be wanting to change that policy at some point.
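
    For instance, once the hrefs are extracted, the exclusion policy is a one-line grep. A quick sketch (the URLs are just the ones from the sample input above):

```perl
# Policy check done after retrieval: keep only URLs that mention
# neither host. The list is taken from the sample input in the question.
my @hrefs = ('http://www.microsoft.com/browser', 'http://perlmonks.com/');
my @keep  = grep { $_ !~ m#(?:netscape|microsoft)\.com#i } @hrefs;
print "@keep\n";   # http://perlmonks.com/
```

    Changing the policy later then means touching one line, not the extraction code.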

    update. tested after all. seems to work.

      Hi,
      Thank you for your reply. :) If at all possible, I was looking for something far shorter than that.

      I doubt that the strings to be excluded will need to be changed any time soon, so hardcoding them in isn't really a problem.

      Thanks again.

        It depends how important it is that you get every link, I suppose. Using the parser has the great advantage that you don't have to worry about anything to do with spacing, attribute order, or case, so you won't miss any links. It's very efficient, too, and most of all it lets you make decisions in Perl rather than in (?:^\*.^\)

        In your regex, for example, there's nothing to detect a </noframes> tag, and it's only convention that puts the noframes content at the end of the page. Adding the test won't be pretty.

        But if you want something quick and dirty that gets most of the links and mostly the right ones, then I suppose you could use regexes. Some pointers:

        [^netscape\.com|^microsoft\.com]

        should probably be done with zero-width negative lookaheads:

        (?!netscape.com)(?!microsoft.com)
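
        A quick illustration of how those behave (a sketch; the URL list is made up):

```perl
# Zero-width negative lookaheads: at the position right after "http://",
# refuse to proceed if either hostname comes next. Nothing is consumed.
my $re = qr{^http://(?!netscape\.com)(?!microsoft\.com)}i;
my @urls = ('http://netscape.com/', 'http://microsoft.com/ie', 'http://perlmonks.com/');
my @ok   = grep { $_ =~ $re } @urls;
# Note: 'http://www.microsoft.com/' would still pass these lookaheads,
# because "www." is what sits right after the scheme.
print "@ok\n";   # http://perlmonks.com/
```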

        And this:

        http://(?:.*?)

        apart from being rather inefficient (consider instead (?:\w+\.)?), is no respecter of levels: in the case of http://microsoft.com it eats the microsoft part before you check whether it's there or not. I'm not sure there's anything you can do about that inside the pattern, except to assume that subdomains have short names or perhaps repeat the test. This works, in a slapdash way:

        $page =~ m%<noframes>.*?<a href *= *"(http://(?!netscape.com)(?!microsoft.com)(?:\w+\.)(?!netscape.com)(?!microsoft.com)[^"]+)"%is;

        But I dread to think how much effort it puts in before discarding the first match, and it's not anything I'd want to try to read. I don't think I would try to pack all the logic into one line.
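
        For what it's worth, running that pattern against the sample input from the question does pull out the perlmonks link:

```perl
# The slapdash pattern above, applied to the OP's sample input.
my $page = '<noframes> <a HREF="http://www.microsoft.com/browser"> '
         . '<A href ="http://perlmonks.com/" </noframes>';
my ($url) = $page =~ m%<noframes>.*?<a href *= *"(http://(?!netscape.com)(?!microsoft.com)(?:\w+\.)(?!netscape.com)(?!microsoft.com)[^"]+)"%is;
print "$url\n";   # http://perlmonks.com/
```

        The microsoft link is tried first and rejected by the repeated lookaheads, so the engine backtracks to the second anchor.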

        And I'd still urge you to try the parser. That code can be made more compact, if you prefer, and it'll save you a lot of grief.

Re: Question Regarding Regular Expressions and Negated Character Classes
by flocto (Pilgrim) on Jul 15, 2002 at 07:24 UTC

    I don't think a single regular expression is the best way to solve this problem, since it either matches or it doesn't: you can't print/log/whatever any information as to why there is no result. The shortest solution I would want to use still uses three regexen:

    if ($html =~ m#<noframes>(.+?)</noframes>#is) {
        @urls = grep { $_ !~ m#microsoft\.com|netscape\.com# }
                $& =~ m#<a href="([^"]+)"#gi;
    }

    However, there are still issues with this: it assumes that "href" directly follows "<a", which is by no means necessary. So a slightly longer, but more readable and clearer solution is the following:

    if ($html =~ m#<noframes>(.+?)</noframes>#is) {
        $noframes = $1;
    }
    else {
        die "Couldn't find noframes tags";
    }

    while ($noframes =~ m#<a[^>]+>#gis) {
        my $link = $&;
        my ($url) = $link =~ m#href\s*=\s*"([^"]+)"#i;
        if ($url and $url !~ m#microsoft\.com|netscape\.com#i) {
            push (@urls, $url);
        }
    }
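
    To see why the "href directly follows <a" assumption matters, here is a toy tag (hypothetical) that a one-step regex misses but the two-step version catches:

```perl
# An anchor with another attribute before href: the naive pattern fails,
# while matching href anywhere inside the tag succeeds.
my $link = '<a class="external" href="http://perlmonks.com/">';
my ($naive) = $link =~ m#<a href="([^"]+)"#i;       # undef: class is in the way
my ($flex)  = $link =~ m#href\s*=\s*"([^"]+)"#i;    # http://perlmonks.com/
print defined $naive ? "naive: $naive\n" : "naive: miss\n";
print "flex: $flex\n";
```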

    I didn't come up with a regex-only solution, mainly due to lack of time, but the only reason I would write a solution to this problem entirely as a regex would be to develop regex skills ;) Hope this helps :)

    Regards,
    -octo