in reply to Regex to detect and remove a linebreak in an URL

Maybe
/http:\S*[^\w\s]/g and s/\G\n//;
would be good enough? If a URL can cross more than one line boundary, you'd need to keep a flag, like:
/http:\S*[^\w\s]/g and s/\G\n// and $http=1; $http and /^\S*[\w\s]/g and s/\G\n// or $http=0;
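Spelled out as a loop, the flag idea might look roughly like the sketch below. The sub name join_wrapped_urls and the sample lines are made up for illustration, and it assumes the continuation test is meant to use the same punctuation class, [^\w\s], as the first match.

```perl
use strict;
use warnings;

# Hypothetical helper: joins lines while a URL appears to continue.
# A break is only removed when it comes right after a punctuation
# character, which is the same heuristic as the one-liner above.
sub join_wrapped_urls {
    my @lines  = @_;
    my $in_url = 0;    # true while the previous line ended inside a URL
    my @out;
    for my $line (@lines) {
        if ($in_url) {
            # Continuation line: a leading run of non-space characters
            # ending in punctuation, immediately followed by the newline.
            # (Assumption: [^\w\s] here, matching the first pattern.)
            $in_url = ($line =~ /^\S*[^\w\s]/g and $line =~ s/\G\n//) ? 1 : 0;
        }
        elsif ($line =~ /http:\S*[^\w\s]/g and $line =~ s/\G\n//) {
            $in_url = 1;
        }
        push @out, $line;
    }
    return join '', @out;
}

# Made-up sample input, with the URL wrapped twice.
print join_wrapped_urls(
    "see http://example.com/a/\n",
    "very/long/\n",
    "path.html for details\n",
    "bye\n",
);
```

The flag is cleared as soon as a line fails the continuation test, so ordinary text after the URL is left alone.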
But a lot of URLs end in slashes, and it would be perfectly reasonable to see (in mail):
Check out http://homestarrunner.com/sbemail/
and let me know what you think
and the processor would join the "and" line onto the end of the URL.
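A small sketch of that false positive, using the mail text above as input. The sub name join_url_break is just for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner: drop the newline when the
# line ends in a URL whose last character is punctuation.
sub join_url_break {
    my ($line) = @_;
    # The /g match in scalar context leaves pos($line) at the end of the
    # URL, so \G in the substitution anchors right there.
    $line =~ /http:\S*[^\w\s]/g and $line =~ s/\G\n//;
    return $line;
}

my @mail = (
    "Check out http://homestarrunner.com/sbemail/\n",
    "and let me know what you think\n",
);

# The URL is complete, but its trailing slash still triggers the join.
print join_url_break($_) for @mail;
```

This prints the two lines as one, with "and" glued directly onto the trailing slash of the URL.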

Update: Oops, forgot the /g modifier on the pattern matches, so the \G didn't work.
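To illustrate why the /g matters, here is a minimal comparison. The example.com string and the variable names are made up.

```perl
use strict;
use warnings;

my $with_g    = "see http://example.com/\nnext";
my $without_g = $with_g;

# With /g, the scalar-context match records where it ended in pos(),
# so \G\n can match the newline that directly follows the URL.
$with_g =~ /http:\S*[^\w\s]/g and $with_g =~ s/\G\n//;

# Without /g, pos() is never set, \G anchors at the start of the
# string, and there is no newline there to remove.
$without_g =~ /http:\S*[^\w\s]/ and $without_g =~ s/\G\n//;

print "with /g:    [$with_g]\n";
print "without /g: [$without_g]\n";
```

Only the first string loses its line break; the second comes out unchanged.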


The PerlMonk tr/// Advocate

Re: Regex to detect and remove a linebreak in an URL
by Abigail-II (Bishop) on May 19, 2004 at 19:43 UTC
    [^\w\s] matches all characters that are not both a word character and a whitespace character. That is, it matches *any* character. Which means that from the string <URL:http://www.example.com/> your first regex is going to match http://www.example.com/>. Your second regex is only going to match if the string contains http: followed by a sequence of non-whitespace characters, followed by a whitespace character, followed by a newline. Given the string http://www.example.com/\npath/here, your solution will not set $http to 1.

    Abigail

      Astonishingly, you are wrong about the character class. [^\w\s] matches characters that are neither word characters nor whitespace; in practice, punctuation:
      $_='o.ne<a>tw/o<b>'; @punks = /[^\w\s]/g; print "<@punks>";
      yields
      <. < > / < >>
      The problem with my example was that I left the /g modifier off the pattern match. I've updated it, and tested it:
      while(<DATA>){
          /http:\S*[^\w\s]/g and s/\G\n//;
          print;
      }
      __DATA__
      there is an http://whatever.com/address/
      crossing/line/boundaries.html right in the middle of this nice string.

      The PerlMonk tr/// Advocate

        Hello Roy and others,

        What is the X in:

        s/\G\n/X/

        ?