in reply to Re^2: Continue reading regex to next line
in thread Continue reading regex to next line

If only ... It only works "perfectly" if you are certain that you will never have two lines like this in a row:
more junk http://www.foo.com
http://www.fiddle.com

The line $string =~ s/\n//g; removes the newline after each URL. After removal, there is no whitespace after either of the above URLs, so the regex (which requires whitespace after the URL) doesn't match either of them and neither gets printed. Here's a demo:

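use strict;
use warnings;

undef $/;                 # slurp mode: read all of DATA at once
my $string = <DATA>;
$string =~ s/\n//g;       # remove every newline
while ($string =~ /(http:\S*)\s/g) {
    print $1, "\n";
}

__DATA__
http://www.baz.com xxx
more junk http://www.foo.com
http://www.fiddle.com

outputs only http://www.baz.com, the one URL still followed by whitespace; neither http://www.foo.com nor http://www.fiddle.com is printed.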

Both this example and JavaFan's post underscore the fact that one needs to define the difference between a newline that ends a URL and a newline that should be ignored because the URL continues onto the next line.
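To make that concrete, here is a minimal sketch of one possible rule. The rule itself is an assumption made for illustration, not something the OP specified: a newline ends a URL when the next line begins with "http:", and any other newline is treated as a line wrap and removed. The sub name extract_urls is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rule, assumed for illustration only: a newline ends a URL
# when the next line starts with "http:"; every other newline is treated
# as a line wrap and removed.
sub extract_urls {
    my ($text) = @_;
    $text =~ s/\n(?=http:)/ /g;      # newline before a new URL: keep a separator
    $text =~ s/\n//g;                # remaining newlines: join the lines
    return $text =~ /(http:\S+)/g;   # \S+ stops at whitespace or end of string
}

my $data = "http://www.baz.com xxx\n"
         . "more junk http://www.foo.com\n"
         . "http://www.fiddle.com\n";
print "$_\n" for extract_urls($data);
```

Under this rule all three URLs in the demo data are found, and a URL wrapped in the middle (such as "get" on one line and "me.html" on the next) is rejoined. It still guesses wrong when a URL ends a line and ordinary text starts the next one, since that text gets glued onto the URL, which is exactly why the rule has to come from knowledge of the real data.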

Best, beth

Re^4: Continue reading regex to next line
by kennethk (Abbot) on Mar 02, 2009 at 23:15 UTC
    You are correct that my code fails on your test case; however, that is out of spec for the OP. In particular, note that no URI either starts or ends the file, and they are all separated from surrounding text by non-newline whitespace. I designed my regex to match identically to the original, since I don't know the real source of the data for comparison. I agree that the final \s is potentially problematic (and updated accordingly), but the OP has it in there. Was it because he didn't realize \S* is greedy? Probably, but how am I to know that? The provided file is highly unlikely to produce the desired result, since http://www.website1.com/getme.html is not present anywhere in the file. Without knowing the actual data source/file format, any answer given here can fail. What if the OP also wants https: to match? ftp:? The best solution I've seen is in CountZero's comment, since Regexp::Common should give some resilience, but it fails your test as well.
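A minimal sketch of the point about the trailing \s, assuming the goal is simply "a scheme followed by a run of non-whitespace". Because the quantifier is greedy, it already runs up to the next whitespace character or the end of the string, so no terminating \s is needed. The scheme list (http, https, ftp) and the sub name find_uris are assumptions for illustration; this does not attempt to rejoin URLs wrapped across lines.

```perl
use strict;
use warnings;

# Illustrative only: the set of schemes is an assumption, not the OP's spec.
sub find_uris {
    my ($text) = @_;
    # \S+ is greedy, so it consumes the whole run of non-whitespace and
    # also matches a URI that sits at the very end of the input.
    return $text =~ m{\b((?:https?|ftp)://\S+)}g;
}

my $text = "more junk http://www.foo.com\nhttps://www.fiddle.com";
print "$_\n" for find_uris($text);
```

With the newlines left in place, both URIs in the sample text are returned, including the last one, which has nothing after it.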