regex question

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh learned Monks, I'm trying to inject a string at the beginning of each link, so that a call to <a href="abc.com"> becomes <a href="fooabc.com">
I think I'm on the right track, but my code so far has two failings:
1. It doesn't work if there is more than one href on a line
2. I can't figure out how to inject the 'foo' back into the string.
Any tips or pointers in the right direction would be much appreciated:

  my $browser = LWP::UserAgent->new;
  my $response = $browser->get( $in{url} );

  if ( $response->status_line eq '200 OK' ) {
    my $html = $response->content;
    print "fetching $in{url}";
    while( $html=~ m/<a href="\s*(.+)\s*"/g ) {
      print "links $1<br>\n";
    }
  }


Replies are listed 'Best First'.
Re: regex question by Corion (Patriarch) on Jul 10, 2007 at 07:51 UTC
Your regex

  m/<a href="\s*(.+)\s*"/g

is too greedy: the `.+` gobbles up as much of the string as possible, so for the line

  <a href="link1">link 1</a>Some text<a href="link2">link 2</a>

$1 will be

  link1">link 1</a>Some text<a href="link2

instead of what you expected:

  link1

What you want to do in the first step is to make the match non-greedy, or make it stop at the first double quote it encounters:

  m/<a href="\s*(.+?)\s*"/g    # non-greedy due to the ?
  m/<a href="\s*([^"]+)\s*"/g  # stops at first "

Neither of these solutions will work with arbitrary HTML (for example, single-quoted attribute values or another attribute placed before href will be missed), so if you want to rewrite HTML pages from more than one source, you will likely be better off with an HTML parser like HTML::TokeParser::Simple.

The question of how to stuff the changes back into the string is best answered by printing out the tokens you get from HTML::TokeParser::Simple, but if you first want to stick with your regular expression approach, the `s///` operator will do a search-and-replace operation:

  $html =~ s!(<a href="\s*)([^"]+)(\s*")!${1}foo$2$3!g;
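To make the difference concrete, here is a minimal sketch that runs the three patterns against the two-link sample line and then applies the substitution. The variable names (`$line`, `@greedy`, `@non_greedy`, `@negated`, `$rewritten`) are illustrative only and are not part of the original poster's script.

```perl
use strict;
use warnings;

# Sample input: two links on a single line.
my $line = '<a href="link1">link 1</a>Some text<a href="link2">link 2</a>';

# Greedy: .+ runs on to the last double quote on the line.
my @greedy = $line =~ m/<a href="\s*(.+)\s*"/g;

# Non-greedy: .+? stops at the first double quote that lets the match succeed.
my @non_greedy = $line =~ m/<a href="\s*(.+?)\s*"/g;

# Negated character class: [^"]+ can never cross a double quote.
my @negated = $line =~ m/<a href="\s*([^"]+)\s*"/g;

print "greedy:     $_\n" for @greedy;
print "non-greedy: $_\n" for @non_greedy;
print "negated:    $_\n" for @negated;

# Copy the line, then inject 'foo' in front of every href value in the copy.
(my $rewritten = $line) =~ s!(<a href="\s*)([^"]+)(\s*")!${1}foo$2$3!g;
print "$rewritten\n";
```

The greedy pattern yields a single capture that spans both links, while the other two each yield `link1` and `link2`. The same caveat applies here as to the patterns themselves: this only handles double-quoted href values that directly follow `<a `.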
Re: regex question by Ovid (Cardinal) on Jul 10, 2007 at 10:31 UTC
You'll need to customize it, but ...

  use strict;
  use HTML::TokeParser::Simple 3.15;

  my $p = HTML::TokeParser::Simple->new( url => $in{url} );

  while ( my $token = $p->get_token ) {
      if ( $token->is_start_tag('a') ) {
          my $href = $token->get_attr('href');
          # skip anchors without an href, such as <a name="...">
          $token->set_attr( 'href', "foo$href" ) if defined $href;
      }
      print $token->as_is;
  }

Cheers,
Ovid
Re: regex question by Anonymous Monk on Jul 10, 2007 at 08:24 UTC
  use strict;
  use warnings;
  use WWW::Mechanize;

  my $url  = 'http://cpan.org/';
  my $mech = WWW::Mechanize->new();

  if ( $mech->get( $url ) ) {
      for my $l ( $mech->links() ) {
          # one link per line
          print "link ", $l->url, "\n";
      }
  }
  __END__
Re^2: regex question by Anonymous Monk on Jul 10, 2007 at 11:11 UTC
I've tested both methods, and the regex one has some limitations, so I think the Mechanize method looks more robust. Using the Mechanize method, how would you do the replacement using foo?