Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh learned Monks, I'm trying to inject a string at the beginning of each link, so that a call to <a href="abc.com"> becomes <a href="fooabc.com">.
I think I'm on the right track, but my code so far has two failings:
1. It doesn't work if there is more than one href on a line.
2. I can't figure out how to inject the 'foo' back into the string.
Any tips or pointers in the right direction would be much appreciated:
use LWP::UserAgent;

my $browser  = LWP::UserAgent->new;
my $response = $browser->get( $in{url} );
if ( $response->status_line eq '200 OK' ) {
    my $html = $response->content;
    print "fetching $in{url}";
    while ( $html =~ m/<a href="\s*(.+)\s*"/g ) {
        print "links $1<br>\n";
    }
}

Replies are listed 'Best First'.
Re: regex question
by Corion (Patriarch) on Jul 10, 2007 at 07:51 UTC

    Your regex m/<a href="\s*(.+)\s*"/g is too greedy - the .+ gobbles up as much of the string as possible, so for the line

    <a href="link1">link 1</a>Some text<a href="link2">link 2</a>

    $1 will be

    link1">link 1</a>Some text<a href="link2

    instead of what you expected:

    link1

    What you want to do in the first step is to make the match non-greedy, or make it stop at the first double-quote it encounters:

    m/<a href="\s*(.+?)\s*"/g     # non-greedy due to .+?
    m/<a href="\s*([^"]+)\s*"/g   # stops at first "
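
    As a quick check of the difference, the following sketch runs the original pattern and both alternatives against the sample line above. The variable names are illustrative only.

    ```perl
    use strict;
    use warnings;

    # Sample line taken from the discussion above.
    my $line = '<a href="link1">link 1</a>Some text<a href="link2">link 2</a>';

    # In list context, m//g returns every capture found in the string.
    my @greedy     = $line =~ m/<a href="\s*(.+)\s*"/g;
    my @non_greedy = $line =~ m/<a href="\s*(.+?)\s*"/g;
    my @negated    = $line =~ m/<a href="\s*([^"]+)\s*"/g;

    print "greedy:     $_\n" for @greedy;
    print "non-greedy: $_\n" for @non_greedy;
    print "negated:    $_\n" for @negated;
    ```

    The greedy pattern yields a single capture that runs from link1 all the way into link2, while the other two each yield link1 and link2 separately.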

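    Neither of these solutions will work with arbitrary HTML (single-quoted attribute values, other attributes ahead of href, or different capitalisation will all defeat them), so if you want to rewrite HTML pages from more than one source, you will likely be better off with an HTML parser like HTML::TokeParser::Simple.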
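
    To see where the regular expressions fall short, here is a small sketch that tries the second pattern on a few variations of the same link. The sample tags and labels are made up for illustration.

    ```perl
    use strict;
    use warnings;

    my $re = qr/<a href="\s*([^"]+)\s*"/;

    # Hypothetical variations of one link, all valid HTML.
    my @samples = (
        [ 'double-quoted href'    => '<a href="abc.com">x</a>' ],
        [ 'single-quoted href'    => q{<a href='abc.com'>x</a>} ],
        [ 'attribute before href' => '<a class="nav" href="abc.com">x</a>' ],
        [ 'upper-case tag'        => '<A HREF="abc.com">x</A>' ],
    );

    my @results;
    for my $sample (@samples) {
        my ( $label, $html ) = @$sample;
        my $result = $html =~ $re ? "matched $1" : 'no match';
        push @results, $result;
        printf "%-22s %s\n", $label, $result;
    }
    ```

    Only the first sample matches; the other three are skipped, which is the kind of gap a real HTML parser closes.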

    The question of how to stuff the changes back into the string is best answered by printing out the tokens you get from HTML::TokeParser::Simple, but if you first want to stick with your regular expression approach, the s/// operator will do a search-and-replace operation:

    $html =~ s!(<a href="\s*)([^"]+)(\s*")!${1}foo$2$3!g;
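
    A minimal sketch of that substitution applied to the sample line from earlier, working on a copy so the original string is kept for comparison ($rewritten is just an illustrative name):

    ```perl
    use strict;
    use warnings;

    my $html = '<a href="link1">link 1</a>Some text<a href="link2">link 2</a>';

    # Copy $html first, then run the substitution on the copy.
    ( my $rewritten = $html ) =~ s!(<a href="\s*)([^"]+)(\s*")!${1}foo$2$3!g;

    print "$rewritten\n";
    ```

    This prints <a href="foolink1">link 1</a>Some text<a href="foolink2">link 2</a>, so both links on the line are rewritten in one pass.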
Re: regex question
by Ovid (Cardinal) on Jul 10, 2007 at 10:31 UTC

    You'll need to customize it, but ...

    use strict;
    use HTML::TokeParser::Simple 3.15;

    my $p = HTML::TokeParser::Simple->new( url => $in{url} );

    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('a') ) {
            my $href = $token->get_attr('href');
            # Skip anchors that have no href, such as <a name="...">
            $token->set_attr( 'href', "foo$href" ) if defined $href;
        }
        print $token->as_is;
    }

    Cheers,
    Ovid

Re: regex question
by Anonymous Monk on Jul 10, 2007 at 08:24 UTC
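    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url  = 'http://cpan.org/';
    my $mech = WWW::Mechanize->new();

    # get() returns a response object, which is always true,
    # so test success() instead of the return value
    $mech->get( $url );
    if ( $mech->success ) {
        for my $l ( $mech->links() ) {
            print "link ", $l->url, "\n";
        }
    }
    __END__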
      I've tested both methods, and the regex one has some limitations, so I think the Mechanize method looks more robust.
      Using the Mechanize method, how would you do the replacement using foo?