in reply to Question Regarding Regular Expressions and Negated Character Classes
Any time I see regexes and HTML in the same question I get a queasy feeling born of unpleasant experience. So instead of answering your question, and since Ovid doesn't seem to be here just now, how about this (untested):
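    use HTML::TokeParser::Simple;

    my $page = ...;
    my $alternative = find_alternate_page( $page );

    # Return the href of the first link inside <noframes>, or undef if there is none.
    sub find_alternate_page {
        my $page = shift;
        return undef unless $page;
        my $p = HTML::TokeParser::Simple->new( \$page );
        my $looking = 0;
        while ( my $token = $p->get_token ) {
            # Only links that appear after the opening <noframes> tag count.
            $looking = 1 if $token->is_start_tag( 'noframes' );
            # Reaching </noframes> without having seen a link means there is nothing to return.
            return undef if $token->is_end_tag( 'noframes' );
            if ( $looking && $token->is_start_tag( 'a' ) ) {
                return $token->return_attr->{href};
            }
        }
        return undef;    # ran out of tokens: no <noframes> section at all
    }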
Decisions about which URLs interest you and which don't are easy to make once the address is retrieved, and are probably best made separately from the retrieval itself, since you'll want to change that policy at some point.
Update: tested after all. Seems to work.
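To illustrate keeping the policy apart from the retrieval, here is a minimal sketch. It assumes the retrieval step has been changed to hand back every candidate address rather than only the first; the names `is_interesting_url` and `first_interesting`, and the particular rules (skip mail links, javascript pseudo-links and bare fragments), are hypothetical and only there to show the shape.

```perl
use strict;
use warnings;

# Hypothetical policy: reject mail links, javascript pseudo-links and
# bare fragments. Swap in whatever rules suit you.
sub is_interesting_url {
    my ($url) = @_;
    return 0 unless defined $url && length $url;
    return 0 if $url =~ /^(?:mailto|javascript):/i;
    return 0 if $url =~ /^#/;
    return 1;
}

# The selection step knows nothing about the rules: it takes a policy
# (a code ref) and a list of candidate addresses, and returns the first
# address the policy accepts, or nothing if none is accepted.
sub first_interesting {
    my ($policy, @candidates) = @_;
    for my $url (@candidates) {
        return $url if $policy->($url);
    }
    return;
}

# Example candidates, standing in for addresses pulled out of <noframes>.
my @found = ('mailto:webmaster@example.com', '#top', 'noframes/index.html');
my $alternative = first_interesting(\&is_interesting_url, @found);
print defined $alternative ? "$alternative\n" : "no alternative page\n";
```

Changing the policy later then means passing a different code ref, without touching the parsing code.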
Re: Re: Question Regarding Regular Expressions and Negated Character Classes
by Anonymous Monk on Jul 14, 2002 at 17:50 UTC | |
by thpfft (Chaplain) on Jul 14, 2002 at 18:59 UTC |