in reply to Continue reading regex to next line

You cannot get your regex to match across newlines because your input is being split on them. One solution is to change the input record separator by setting $/, so the whole file is read in one go. You could then get your desired result with code like:

#!/usr/bin/perl
use strict;
use warnings;

undef $/;    # slurp mode: read the entire file as one string
open my $fh, "<", '/tmp/sub_url_test/' or die "Cannot open: $!";
my $string = <$fh>;
$string =~ s/\n//g;    # remove newlines so a URL split across lines rejoins
while ($string =~ /(http:\S*)\s/g) {
    print $1, "\n";
}

Update: Note, as per ELISHEVA's comment, the trailing \s might bite you. I copied the regex from the original post and for that reason am leaving it as is, but beware that it might not do what you want if your data set differs significantly from what you posted.

Replies are listed 'Best First'.
Re^2: Continue reading regex to next line
by learningperl01 (Beadle) on Mar 02, 2009 at 22:11 UTC
    Great thanks, your suggestions worked perfectly. Thanks again!!
      If only ... It only works "perfectly" if you are certain that you will never have two lines like this in a row:
      more junk http://www.foo.com http://www.fiddle.com

      The line $string =~ s/\n//g; removes the newline after each URL. After removal, there is no whitespace after either of the above URLs, so the regex (which terminates the URL with whitespace) doesn't match and nothing gets printed out. Here's a demo:
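A minimal sketch of that failure (a reconstruction, since the original demo code did not survive; the two-line input is the example quoted above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two consecutive lines that each end in a URL
my $string = "more junk http://www.foo.com\nhttp://www.fiddle.com\n";

$string =~ s/\n//g;
# $string is now "more junk http://www.foo.comhttp://www.fiddle.com":
# the two URLs have run together, and no whitespace follows either one.

while ($string =~ /(http:\S*)\s/g) {
    print $1, "\n";    # never reached: the /\s/ after the URL cannot match
}
```

Running this prints nothing, even though the input contained two URLs.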

      Both this example and JavaFan's post underscore the fact that one needs to define the difference between a new line that ends a URL and a new line that should be ignored because the URL continues onto the next line.

      Best, beth

        You are correct that my code fails on your test case; however, that is out of spec for the OP. In particular, note that no URI either starts or ends the file, and they are all separated from surrounding text by non-newline whitespace. I designed my regex to match identically to the original, since I don't know the real source of the data for comparison.

        I agree that the final \s is potentially problematic (and have updated accordingly), but the OP has it in there. Was it because he didn't realize \S* is greedy? Probably, but how am I to know that? The provided file is highly unlikely to produce the desired result, since http://www.website1.com/getme.html is not present anywhere in the file.

        Without knowing the actual data source/file format, any answer given here can fail. What if the OP also wants https: to match? ftp:? The best solution I've seen is in CountZero's comment, since Regexp::Common should give some resilience, but it fails your test as well.
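A sketch of the Regexp::Common approach CountZero suggested, assuming that CPAN module is installed (the sample text here is my own, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common qw(URI);    # exports $RE{URI}{HTTP} and friends

my $string = "junk http://www.foo.com/page.html more junk";

# $RE{URI}{HTTP} matches a syntactically valid http URI, so the URL ends
# where the URI grammar says it ends rather than at the next whitespace.
while ($string =~ /($RE{URI}{HTTP})/g) {
    print $1, "\n";
}
```

Note that this still cannot recover two URLs that have been run together by newline removal, which is why it also fails the test case above.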