Continue reading regex to next line

learningperl01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, I'm hoping someone can point me in the right direction.
I have the following code, which finds files ending in .txt and then tries to match each line against a regex. The script runs, but when a URL reaches the end of a line with no whitespace after it, how can I make the code keep reading past the carriage return up to the next whitespace? Or is there a better way to do this?
The main issues are:

* The script prints the entire line, not just the URL.
* If the URL continues onto the next line, the URL/link is cut short.

I guess I could check whether the TLD appears in the current line before printing it and, if not, continue reading past the carriage return (<CR>) until the next whitespace. I suspect I am making this harder than it needs to be. Thanks for the help, everyone!

#!/usr/bin/perl

use File::Find;
find(\&url_find, "/tmp/sub_url_test/");

sub url_find {
    if ( -f && /\.txt$/ ) {    # Find files ending in .txt
        open(LOG, "< $File::Find::name") or return(0);
        while ( my $LINE = <LOG> ) {
            if ( $LINE =~ m/(http:\S*\s)/ ) {
                print $LINE;
            }
        }
        close(LOG);
    }
}

==============
FILE CONTENTS
==============
This is a test this is a test test test http://www.website.
com/getme.html this is test this is a test
This is a test this is a test test test http://www.website2.com/
getme.html this is test this is a test
This is a test this is a test test test http://www.website3.com/getme
.html this is test this is a test
This is a test this is a test test test http://www.website4.com/getme.h
tml this is test this is a test

==============
CURRENT OUTPUT
==============
This is a test this is a test test test http://www.website.
This is a test this is a test test test http://www.website2.com/
This is a test this is a test test test http://www.website3.com/getme
This is a test this is a test test test http://www.website4.com/getme.h

==============================
OUTPUT THAT I AM HOPING TO SEE
==============================
http://www.website.com/getme.html
http://www.website1.com/getme.html
http://www.website2.com/getme.html
http://www.website3.com/getme.html
http://www.website4.com/getme.html

Replies are listed 'Best First'.
Re: Continue reading regex to next line by ELISHEVA (Prior) on Mar 02, 2009 at 21:34 UTC
`if ( $LINE =~ m/(http:\S*\s)/ ) { print $LINE` should be something like `if ( $LINE =~ m/(http:\S+)/ ) { print $1`.

That takes care of grabbing more than the URL, but only if there is one URL per line and whatever follows `http:` is always a URL. The problem with your original regex is that it "captured" the whitespace after the URL. It also treated "http:" all on its lonesome as a valid URL, which is somewhat improbable. You were also printing the whole line rather than the part you "captured".

Solving the problem of URLs that span lines is a bit more complicated. You would need to (a) cache each line where the start of a URL is found and (b) have a mechanism to tell the difference between a URL terminated by the end of a line and a URL that continues until a run of spaces on the following line. You seem to want to use spaces between the last letter of the URL and the newline as your test, but I'm not sure that would be reliable: couldn't a URL simply end where the line ends?

Best, beth

Update: further explanations of the issues, including the need to print the captured portion rather than the whole line.
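
A minimal sketch of the single-line fix described above, assuming at most one URL per line. The helper name `first_url` and the sample lines are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first URL-like token on a line, or undef.
sub first_url {
    my ($line) = @_;
    # \S+ needs at least one non-space character after "http:" and stops
    # before any whitespace, so no trailing blank ends up in the capture.
    return $line =~ m/(http:\S+)/ ? $1 : undef;
}

# Made-up sample input, not the poster's file.
my @lines = (
    "This is a test http://www.website2.com/getme.html this is a test\n",
    "no link on this line\n",
);

for my $line (@lines) {
    my $url = first_url($line);
    print "$url\n" if defined $url;    # print the capture, not the whole line
}
```

This only addresses the "prints the entire line" half of the question; a URL that wraps onto the next line is still cut off at the line break.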
Re^2: Continue reading regex to next line by learningperl01 (Beadle) on Mar 02, 2009 at 21:40 UTC
Thanks for the quick reply. I've updated the regex but the output is exactly the same as in the original post.
Re^3: Continue reading regex to next line by ikegami (Patriarch) on Mar 02, 2009 at 22:01 UTC
You missed the change of `print $LINE` to `print $1`.
Re: Continue reading regex to next line by kennethk (Abbot) on Mar 02, 2009 at 21:46 UTC
You cannot get your regex to match across newlines because your input is being read one line at a time. One solution is to change the input record separator by setting `$/`. You could then get your desired result with code like this:

#!/usr/bin/perl
use strict;
use warnings;

undef $/;    # slurp mode: the next read returns the whole file
open my $fh, "<", '/tmp/sub_url_test/' or die "Cannot open: $!";
my $string = <$fh>;
$string =~ s/\n//g;    # join the wrapped lines

while ($string =~ /(http:\S*)\s/g) {
    print $1, "\n";
}

Update: Note, as per ELISHEVA's comment, the trailing `\s` might bite you. I've copied the regex from the original post and for that reason am leaving it as is, but beware that this might not do what you want if your data set differs significantly from what you posted.
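
As a hedged variation on the `$/` idea, the change can be limited to one subroutine with `local`, so the rest of a larger script keeps reading line by line. The `slurp` helper and the in-memory sample data are assumptions for illustration, and the sketch uses `\S+` rather than the trailing `\s` from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;    # undef only inside this sub; restored automatically on return
    return scalar <$fh>;
}

# Made-up sample standing in for one of the .txt files.
my $data = "test http://www.website.\ncom/getme.html this is a test\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $string = slurp($fh);
close $fh;

$string =~ s/\n//g;                     # join the wrapped lines
my @urls = $string =~ /(http:\S+)/g;    # collect every http: token
print "$_\n" for @urls;                 # prints "http://www.website.com/getme.html"
```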
Re^2: Continue reading regex to next line by learningperl01 (Beadle) on Mar 02, 2009 at 22:11 UTC
Great, thanks! Your suggestions worked perfectly. Thanks again!!
Re^3: Continue reading regex to next line by ELISHEVA (Prior) on Mar 02, 2009 at 22:32 UTC
If only ... It works "perfectly" only if you are certain that you will never have two lines like this in a row:

more junk http://www.foo.com
http://www.fiddle.com

The line `$string =~ s/\n//g;` removes the newline after each URL. After removal, there is no whitespace after either of the above URLs, so the regex (which terminates the URL with whitespace) doesn't match and nothing gets printed.

Both this example and JavaFan's post underscore the fact that one needs to define the difference between a newline that ends a URL and a newline that should be ignored because the URL continues onto the next line.

Best, beth
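
The failure case described above can be reproduced with a short sketch. This is not the demo from the original reply; the helper name `urls_after_joining` is made up, and the input is the two-line sample from this reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper reproducing the approach under discussion:
# delete every newline, then collect "http:" runs that are followed
# by whitespace.
sub urls_after_joining {
    my ($text) = @_;
    $text =~ s/\n//g;
    my @found;
    while ( $text =~ /(http:\S*)\s/g ) {
        push @found, $1;
    }
    return @found;
}

my $sample = "more junk http://www.foo.com\nhttp://www.fiddle.com\n";
my @urls   = urls_after_joining($sample);
print scalar(@urls), " URL(s) found\n";    # prints "0 URL(s) found"
```

Once the newlines are gone, the text ends in `http://www.foo.comhttp://www.fiddle.com` with no whitespace after it, so the required `\s` can never match.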
Re^4: Continue reading regex to next line by kennethk (Abbot) on Mar 02, 2009 at 23:15 UTC
Re: Continue reading regex to next line by CountZero (Bishop) on Mar 02, 2009 at 22:06 UTC
This works for me:

use strict;
use Regexp::Common qw /URI/;

my $big_string;
while (<DATA>) {
    chomp;
    $big_string .= $_;
}
my @websites = $big_string =~ m/($RE{URI}{HTTP})/g;
print join "\n", @websites;

__DATA__
This is a test this is a test test test http://www.website.
com/getme.html this is test this is a test
This is a test this is a test test test http://www.website2.com/
getme.html this is test this is a test
This is a test this is a test test test http://www.website3.com/getme
.html this is test this is a test
This is a test this is a test test test http://www.website4.com/getme.h
tml this is test this is a test

Its output is:

http://www.website.com/getme.html
http://www.website2.com/getme.html
http://www.website3.com/getme.html
http://www.website4.com/getme.html

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re: Continue reading regex to next line by JavaFan (Canon) on Mar 02, 2009 at 22:03 UTC
So, what should the output of `This is a test this is a test test test http://www.website5.com/getme. +html this is test` be? Note that `http://www.website5.com/getme.htmltest` is a perfectly valid URL.
Re^2: Continue reading regex to next line by CountZero (Bishop) on Mar 02, 2009 at 22:13 UTC
Since it is a perfectly valid URL, there is no way of discarding the "junk" that is added.

CountZero
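
A small sketch of the ambiguity discussed in this sub-thread, under the assumption that a URL ends exactly at a line break and the next line begins with an ordinary word. The input string and the helper name `urls_from_wrapped_text` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: treat every newline as a soft wrap, then pull out
# the http: tokens. Call it in list context to get the captures.
sub urls_from_wrapped_text {
    my ($text) = @_;
    $text =~ s/\n//g;
    return $text =~ /(http:\S+)/g;
}

# Here the URL really does end at the line break.
my $wrapped = "a link http://www.website5.com/getme.html\ntest this is test\n";
my @urls    = urls_from_wrapped_text($wrapped);
print "$_\n" for @urls;    # prints "http://www.website5.com/getme.htmltest"
```

Nothing in the joined text says where the URL stops, so any fix has to rely on a rule about the data (for example, a known file extension) rather than on the regex alone.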
Re: Continue reading regex to next line by bichonfrise74 (Vicar) on Mar 02, 2009 at 23:22 UTC
Try this...

#!/usr/bin/perl
use strict;

while (<DATA>) {
    my $line;
    ($line) = $_ =~ m|\w+.*\s(http://\w+.*\.html\s)\w+.*|;
    print "$line\n";
}

__DATA__
This is a test this is a test test test http://www.website.com/getme.html this is test this is a test
This is a test this is a test test test http://www.website2.com/getme.html this is test this is a test
This is a test this is a test test test http://www.website3.com/getme.html this is test this is a test
This is a test this is a test test test http://www.website4.com/getme.html this is test this is a test
Re: Continue reading regex to next line by boby (Initiate) on Mar 03, 2009 at 09:11 UTC
Hi friend, try this one:

open(FILE, "out1.txt") or die $!;
while (<FILE>) {
    $_ =~ s/\s*//g;
    push(@array, $_);
}
$array = join('', @array);
@array = split(/http|html/, $array);
foreach (@array) {
    if ($_ =~ /www/) { print "http$_" . "html\n"; }
}