learningperl01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, hoping someone can point me in the right direction.
I have the following code, which finds files that end in .txt and then tries to match on a regex. The script works fine; the only problem is this: if no space is found, how can I get the code to continue reading after the carriage return until the next whitespace? Or is there a better way to do this?
The main issues are:

*The script prints the entire line, not just the URL.
*If the URL continues onto the next line, the URL/link is cut short.

I guess I can check whether the TLDs exist in the current line before printing it and, if not, continue reading past the <CR> carriage return until a whitespace? I'm guessing that I am making this harder than it needs to be. Thanks for the help everyone!

    #!/usr/bin/perl
    use File::Find;

    find(\&url_find, "/tmp/sub_url_test/");

    sub url_find() {
        if ( -f && /\.txt$/ ) {    #Find files ending in .txt
            open(LOG, "< $File::Find::name") or return(0);
            while ( my $LINE = <LOG> ) {
                if ( $LINE =~ m/(http:\S*\s)/ ) {
                    print $LINE;
                }
            }
        }
    }

    ==============
    FILE CONTENTS
    ==============
    This is a test this is a test test test http://www.website.
    com/getme.html this is test this is a test
    This is a test this is a test test test http://www.website2.com/
    getme.html this is test this is a test
    This is a test this is a test test test http://www.website3.com/getme
    .html this is test this is a test
    This is a test this is a test test test http://www.website4.com/getme.h
    tml this is test this is a test

    ==============
    CURRENT OUTPUT
    ==============
    This is a test this is a test test test http://www.website.
    This is a test this is a test test test http://www.website2.com/
    This is a test this is a test test test http://www.website3.com/getme
    This is a test this is a test test test http://www.website4.com/getme.h

    ==============================
    OUTPUT THAT I AM HOPING TO SEE
    ==============================
    http://www.website.com/getme.html
    http://www.website1.com/getme.html
    http://www.website2.com/getme.html
    http://www.website3.com/getme.html
    http://www.website4.com/getme.html

Re: Continue reading regex to next line
by ELISHEVA (Prior) on Mar 02, 2009 at 21:34 UTC

        if ( $LINE =~ m/(http:\S*\s)/ ) { print $LINE

    should be something like:

        if ( $LINE =~ m/(http:\S+)/ ) { print $1

    That will take care of the problem of grabbing more than the URL, but only if there is one URL per line and what follows the http: is always a URL. The problem with your original regex is that it "captured" the spaces after the URL. It also considered "http:" all on its lonesome a valid URL - somewhat improbable. You also were printing out the line, rather than the part you "captured".
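    The difference between the two patterns can be checked on a single line. The following is a minimal sketch; the sample line and the variable names are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample line, with the whole URL on one line.
my $line = "test test http://www.website.com/getme.html this is a test\n";

# Original pattern: \S* may match nothing, and \s pulls the trailing space into $1.
my ($with_space) = $line =~ m/(http:\S*\s)/;

# Revised pattern: \S+ needs at least one character after "http:" and stops before whitespace.
my ($url_only) = $line =~ m/(http:\S+)/;

# Brackets make any captured trailing space visible.
print "[$with_space]\n";
print "[$url_only]\n";
```

    Printing the capture rather than $line is what drops the surrounding text; the change from \S*\s to \S+ is what drops the trailing space.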

    Solving the problem of URLs split across lines is a bit more complicated. You would need to (a) cache each line where a URL start is found and (b) have a mechanism to determine the difference between a URL terminated by an end of line and a URL terminated by a run of spaces on the following line. You seem to want to use spaces between the last letter of the URL and the new line as your test, but I'm not sure that would be reliable - couldn't a URL just end when the line ended?
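    One way to picture the caching idea is the sketch below. It assumes, purely for illustration, that a URL running right up to the end of a line is continued by the first run of non-whitespace on the next line. That is exactly the test questioned above, so a URL that really does end at the line break gets glued to the next word. The sub name extract_urls and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Sketch only: a URL that touches the end of a line is cached and completed
# with the leading non-whitespace run of the following line.
sub extract_urls {
    my @lines = @_;
    my @urls;
    my $pending;    # URL fragment cached from the previous line, if any

    for my $line (@lines) {
        chomp $line;
        if ( defined $pending ) {
            # Complete the cached fragment with the start of this line.
            my ($rest) = $line =~ m/^(\S*)/;
            push @urls, $pending . $rest;
            undef $pending;
        }
        while ( $line =~ m/(http:\S+)/g ) {
            my $url = $1;
            if ( pos($line) == length($line) ) {
                $pending = $url;    # URL reaches end of line: cache it
            }
            else {
                push @urls, $url;   # URL ended at whitespace: complete
            }
        }
    }
    push @urls, $pending if defined $pending;
    return @urls;
}

my @sample = (
    "test test http://www.website.\n",
    "com/getme.html this is test\n",
    "test http://www.website3.com/getme\n",
    ".html this is a test\n",
);
print "$_\n" for extract_urls(@sample);
```

    With input such as "see http://www.foo.com" followed by "next line", this sketch returns http://www.foo.comnext, which shows why the rule for ignoring a line break has to be defined before any code can be trusted.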

    Best, beth

    Update: further explanations of issues, including need to print out captured portion rather than whole line.

      Thanks for the quick reply. I've updated the regex but the output is exactly the same as in the original post.
        You missed the change of print $LINE to print $1
Re: Continue reading regex to next line
by kennethk (Abbot) on Mar 02, 2009 at 21:46 UTC

    You cannot get your regex to match across newlines, since your input is read one line at a time and so is broken on them. One solution would be to change your input record separator by setting $/. You could then get your desired result with code like:

        #!/usr/bin/perl
        use strict;
        use warnings;

        undef $/;    # slurp mode: read the whole file at once
        # Note: the path must name a file, not a directory.
        open my $fh, "<", '/tmp/sub_url_test/' or die "Cannot open: $!";
        my $string = <$fh>;
        $string =~ s/\n//g;    # join all lines into one
        while ($string =~ /(http:\S*)\s/g) {
            print $1, "\n";
        }

    Update: Note, as per ELISHEVA comment, the trailing \s might bite you. I've copied the regex in the original post and for this reason am leaving it as is, but beware that this might not do what you want if your data set differs significantly from what you posted.

      Great thanks, your suggestions worked perfectly. Thanks again!!
        If only ... It only works "perfectly" if you are certain that you will never have two lines like this in a row:
            more junk http://www.foo.com
            http://www.fiddle.com

        The line $string =~ s/\n//g; removes the new line after each URL. After removal, there is no white space after either of the above URLs so the regex (which terminates the URL with whitespace) doesn't match and nothing gets printed out. Here's a demo:
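        The original demo is not reproduced here. A minimal sketch of the failure, assuming the two lines above as input and using the same substitution and regex as the parent post, could look like this:

```perl
use strict;
use warnings;

# Hypothetical input: two consecutive lines that each end with a URL.
my $string = "more junk http://www.foo.com\nhttp://www.fiddle.com\n";

$string =~ s/\n//g;    # same newline removal as in the parent post

my @found;
while ( $string =~ /(http:\S*)\s/g ) {
    push @found, $1;
}

# No whitespace is left after either URL, so the pattern never matches.
print "joined: $string\n";
print "matches: ", scalar(@found), "\n";
```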

        Both this example and JavaFan's post underscore the fact that one needs to define the difference between a new line that ends a URL and a new line that should be ignored because the URL continues onto the next line.

        Best, beth

Re: Continue reading regex to next line
by CountZero (Bishop) on Mar 02, 2009 at 22:06 UTC
    This works for me:
        use strict;
        use Regexp::Common qw /URI/;

        my $big_string;
        while (<DATA>) {
            chomp;
            $big_string .= $_;
        }
        my @websites = $big_string =~ m/($RE{URI}{HTTP})/g;
        print join "\n", @websites;

        __DATA__
        This is a test this is a test test test http://www.website.
        com/getme.html this is test this is a test
        This is a test this is a test test test http://www.website2.com/
        getme.html this is test this is a test
        This is a test this is a test test test http://www.website3.com/getme
        .html this is test this is a test
        This is a test this is a test test test http://www.website4.com/getme.h
        tml this is test this is a test
    Its output is:
        http://www.website.com/getme.html
        http://www.website2.com/getme.html
        http://www.website3.com/getme.html
        http://www.website4.com/getme.html

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Continue reading regex to next line
by JavaFan (Canon) on Mar 02, 2009 at 22:03 UTC
    So, what should the output of
    This is a test this is a test test test http://www.website5.com/getme. +html this is test
    be? Note that http://www.website5.com/getme.htmltest is a perfectly valid URL.
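    The ambiguity can be reproduced with a short sketch. The two-line input below is hypothetical (a URL that ends exactly at the line break, followed by a line starting with "test"), and the plain \S+ pattern stands in for whatever URL regex is used after joining the lines.

```perl
use strict;
use warnings;

# Hypothetical two-line input: the URL ends exactly at the line break.
my @lines = (
    "test test http://www.website5.com/getme.html\n",
    "test this is test\n",
);

# Join the lines the way the chomp-and-concatenate approaches do.
my $joined = join '', map { my $l = $_; chomp $l; $l } @lines;

# The first word of the second line has become part of the match.
my ($url) = $joined =~ m/(http:\S+)/;
print "$url\n";
```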
      Since it is a perfectly valid URL there is no way of discarding the "junk" that is added.

      CountZero


Re: Continue reading regex to next line
by bichonfrise74 (Vicar) on Mar 02, 2009 at 23:22 UTC
    Try this...
        #!/usr/bin/perl
        use strict;

        while (<DATA>) {
            my $line;
            ($line) = $_ =~ m|\w+.*\s(http://\w+.*\.html\s)\w+.*|;
            print "$line\n";
        }

        __DATA__
        This is a test this is a test test test http://www.website.com/getme.html this is test this is a test
        This is a test this is a test test test http://www.website2.com/getme.html this is test this is a test
        This is a test this is a test test test http://www.website3.com/getme.html this is test this is a test
        This is a test this is a test test test http://www.website4.com/getme.html this is test this is a test
Re: Continue reading regex to next line
by boby (Initiate) on Mar 03, 2009 at 09:11 UTC
    hi friend, try out this one
        open(FILE,"out1.txt") or die $!;
        while(<FILE>) {
            $_=~s/\s*//g;
            push(@array,$_);
        }
        $array=join('',@array);
        @array=split(/http|html/,$array);
        foreach (@array){
            if($_=~/www/){print "http$_"."html\n";}
        }