in reply to Question Regarding Regular Expressions and Negated Character Classes

Any time I see regexes and HTML in the same question I get a queasy feeling born of unpleasant experience. So instead of answering your question, and since Ovid doesn't seem to be here just now, how about this (untested):

use HTML::TokeParser::Simple;

my $page = ...;    # the HTML document, as a string
my $alternative = find_alternate_page( $page );

# Returns the href of the first <a> tag inside <noframes>...</noframes>,
# or undef if there isn't one.
sub find_alternate_page {
    my $page = shift;
    return undef unless $page;
    my $p       = HTML::TokeParser::Simple->new( \$page );
    my $looking = 0;
    while ( my $token = $p->get_token ) {
        $looking = 1 if $token->is_start_tag( 'noframes' );
        return undef if $token->is_end_tag( 'noframes' );
        if ( $looking && $token->is_start_tag( 'a' ) ) {
            return $token->return_attr->{href};
        }
    }
}

Decisions about which urls interest you and which don't are easy to make once the address is retrieved, and probably best done separately from the retrieval itself, since you'll be wanting to change that policy at some point.
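
Something like this keeps the policy in one obvious place (the excluded hosts here are only a guess from your question):

my @excluded = ( qr/\bnetscape\.com/i, qr/\bmicrosoft\.com/i );

my $href = find_alternate_page( $page );
if ( defined $href and not grep { $href =~ $_ } @excluded ) {
    print "keeping $href\n";
}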

Update: tested after all. It seems to work.


Re: Re: Question Regarding Regular Expressions and Negated Character Classes
by Anonymous Monk on Jul 14, 2002 at 17:50 UTC
    Hi,
    Thank you for your reply. :) If at all possible, I was looking for something far shorter than that.

    I doubt that the strings to be excluded will need to be changed any time soon, so hardcoding them in isn't really a problem.

    Thanks again.

      It depends how important it is that you get every link, I suppose. Using the parser has the great advantage that you don't have to worry about spacing, attribute order, or case, so you won't miss any links. It's very efficient, too, and most of all it lets you make decisions in Perl rather than in (?:^\*.^\)
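
      For instance, a start tag as untidy as this made-up one still gives up its href cleanly (a quick sketch; the URL is just an example):

      use HTML::TokeParser::Simple;

      my $messy = q{<A  HREF = 'http://example.com/'  TARGET="_top">};
      my $p     = HTML::TokeParser::Simple->new( \$messy );
      my $token = $p->get_token;
      print $token->return_attr->{href}, "\n"    # http://example.com/
          if $token->is_start_tag( 'a' );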

      In your regex, for example, there's nothing to detect a </noframes> tag, and it's only convention that puts the noframes content at the end of the page. Adding the test won't be pretty.
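
      (If you did want the test anyway, one rough sketch is to capture the noframes block first and then hunt for the link inside it, so nothing past </noframes> can match:)

      my $href;
      if ( my ($block) = $page =~ m{<noframes>(.*?)</noframes>}is ) {
          ($href) = $block =~ m{<a\s+href\s*=\s*"([^"]+)"}i;
      }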

      But if you want something quick and dirty that gets most of the links and mostly the right ones, then I suppose you could use regexes. Some pointers:

      [^netscape\.com|^microsoft\.com]

      should probably be done with zero-width negative lookaheads:

      (?!netscape\.com)(?!microsoft\.com)
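
      (A quick check of how those lookaheads behave, with example.com standing in for any host you'd keep:)

      for my $host (qw( netscape.com microsoft.com example.com )) {
          print "$host: ",
              $host =~ /^(?!netscape\.com)(?!microsoft\.com)/
                  ? "allowed\n" : "excluded\n";
      }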

      And this:

      http://(?:.*?)

      apart from being rather inefficient (consider instead (?:\w+\.)?), is no respecter of levels: in the case of http://microsoft.com it eats the microsoft part before you check whether it's there or not. I'm not sure there's anything you can do about that inside the pattern, except to assume that subdomains have short names, or perhaps to repeat the test. This works, in a slapdash way:

      $page =~ m%<noframes>.*?<a href *= *"(http://(?!netscape\.com)(?!microsoft\.com)(?:\w+\.)(?!netscape\.com)(?!microsoft\.com)[^"]+)"%is;

      But I dread to think how much effort it puts in before discarding the first match, and it's not anything I'd want to try to read. I don't think I would try to pack all the logic into one line.
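
      (If you do keep the regex, the /x modifier at least lets you spread the same pattern out; this is a sketch of the one-liner above, just readable:)

      my ($href) = $page =~ m{
          <noframes> .*?
          <a \s+ href \s* = \s* "
          (
              http://
              (?!netscape\.com) (?!microsoft\.com)
              (?:\w+\.)
              (?!netscape\.com) (?!microsoft\.com)
              [^"]+
          )
          "
      }six;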

      And I'd still urge you to try the parser. That code can be made more compact if you prefer, and it'll save you lots of grief.
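
      (For what it's worth, here's a sketch of how the sub might shrink, same logic as before:)

      sub find_alternate_page {
          my $page = shift or return;
          my $p    = HTML::TokeParser::Simple->new( \$page );
          my $in   = 0;
          while ( my $t = $p->get_token ) {
              return $t->return_attr->{href}
                  if $in and $t->is_start_tag( 'a' );
              $in = 1 if $t->is_start_tag( 'noframes' );
              return   if $t->is_end_tag( 'noframes' );
          }
          return;
      }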