Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Well Monks, by a remarkable coincidence my question is closely related to another posted here today. I'm not a Perl programmer or any other kind of programmer, but I can usually figure out the simple things. However, this one has me stumped.

I have a short Perl script that filters incoming email in preparation for archiving. Some email clients wrap email when sending it -- either by default or because they have been set up that way. So recipients sometimes get broken URLs and there is nothing they can do about it.

What I need to do is match a broken URL that looks like this:

http://www012.upp.so-net.ne.jp/sculpture/gallery/backnumber/g_s_maeda/
g_maeda_sakuhin2.html

The linebreak may appear anywhere, but the URL is always split on a boundary such as a slash or dot.

Does anyone here have any idea how to construct a regular expression to match an URL broken in this way, with a linebreak at an arbitrary position?

I guess it would be something like this:

  1. If a line contains something that could be an URL which terminates at the end of the line, save the URL.
  2. If the next line starts with something that looks like an URL fragment, join it to the saved URL.
  3. Validate the syntax of the resulting URL -
    • if invalid, continue
    • if valid, join the two lines

The difficult bit (for me) is detecting an URL fragment.
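
The three steps above can be sketched roughly as follows. This is only an illustration under loose assumptions: the sub name rejoin_urls and the three patterns are made up for this example, the patterns are far looser than a real URL grammar, and only breaks where the line ends on a slash or dot are handled.

```perl
use strict;
use warnings;

# Assumed, deliberately loose patterns (not a full URL grammar).
my $URL_AT_EOL = qr{(https?://\S*[/.])\z};      # step 1: URL ending the line on / or .
my $FRAGMENT   = qr{^([\w\-.~%/?&=+#]+)};       # step 2: leading run of URL-safe characters
my $VALID_URL  = qr{^https?://[\w\-.]+(?::\d+)?(?:/[\w\-.~%/?&=+#]*)?\z};   # step 3

# Hypothetical helper: takes lines without trailing newlines and
# returns them with broken URLs joined back together.
sub rejoin_urls {
    my @lines = @_;
    my @out;
    while (@lines) {
        my $line = shift @lines;
        if (@lines && $line =~ $URL_AT_EOL) {
            my $url = $1;
            if ($lines[0] =~ $FRAGMENT && "$url$1" =~ $VALID_URL) {
                # Join the two lines and re-examine the result, so that
                # an URL broken more than once is also repaired.
                unshift @lines, $line . shift @lines;
                next;
            }
        }
        push @out, $line;
    }
    return @out;
}

print "$_\n" for rejoin_urls(
    'http://www012.upp.so-net.ne.jp/sculpture/gallery/backnumber/g_s_maeda/',
    'g_maeda_sakuhin2.html',
);
```

Syntax validation alone cannot tell a real fragment from an ordinary word, so this sketch would also join a line ending in http://mydomain.org/ to a following line that starts with "which".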

I'm grateful already, as I have found a load of other useful stuff on this miraculous website.

  • Comment on Regex to detect and remove a linebreak in an URL

Re: Regex to detect and remove a linebreak in an URL
by saintmike (Vicar) on May 19, 2004 at 15:45 UTC
    How about you read in the mail body, throw out all newlines that are preceded or followed by URI-like punctuation (like /:?&) and then throw URI::Find at it?

    Something like this:

    use warnings;
    use strict;
    use URI::Find;

    my $data = join '', <DATA>;

    $data =~ s/(?<=[.:\/?&])\n//g;
    $data =~ s/\n(?=[.:\/?&])//g;

    my $finder = URI::Find->new( sub { print "Found: $_[0]\n"; });
    $finder->find(\$data);

    __DATA__
    There's a URL http://here.com and a URL http://
    there.com. What else? A really long one http://abc
    .def.com?foo=bar&foo=bar&foo=bar&foo=bar&foo=bar&foo=bar&
    bar=foo. And this is a regular line break.
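
One thing to keep in mind with this pre-pass: the two substitutions remove every newline next to that punctuation, not only those inside URLs. A small sketch of the effect, using core Perl only (the sub name strip_breaks_near_punctuation is made up for this example):

```perl
use strict;
use warnings;

# The same two substitutions as above, wrapped in a hypothetical sub.
sub strip_breaks_near_punctuation {
    my ($text) = @_;
    $text =~ s/(?<=[.:\/?&])\n//g;   # newline directly after . : / ? &
    $text =~ s/\n(?=[.:\/?&])//g;    # newline directly before . : / ? &
    return $text;
}

# A broken URL is made whole again ...
my $fixed = strip_breaks_near_punctuation("see http://\nthere.com today\n");

# ... but ordinary sentences that end a line are run together as well.
my $prose = strip_breaks_near_punctuation("First sentence.\nSecond sentence.\n");

print "[$fixed]\n[$prose]\n";
```

If only the list of found URLs is needed, that probably does not matter. If the modified text itself is what gets archived, it may be safer to run the substitutions on a copy.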
Re: Regex to detect and remove a linebreak in an URL
by Abigail-II (Bishop) on May 19, 2004 at 19:48 UTC
    Does anyone here have any idea how to construct a regular expression to match an URL broken in this way, with a linebreak at an arbitrary position?
    Yes, I do. But I'm not going to. Because it will be one beast of a regular expression. For comparison, let me give you a regex for just matching HTTP URLs (that's just forgetting all the other possible schemes). Just imagine if you wanted to adapt that to include newlines...
    (?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])
    [.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0
    -9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-
    zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-
    zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:
    (?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:
    (?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))
    (?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F
    0-9]))*)))?))?)
    Newlines added for readability only. They are not part of the regex.

    Abigail

Re: Regex to detect and remove a linebreak in an URL
by Roy Johnson (Monsignor) on May 19, 2004 at 18:22 UTC
    Maybe
    /http:\S*[^\w\s]/g and s/\G\n//;
    would be good enough? If they cross multiple line boundaries, you'd need to keep a flag, like:
    /http:\S*[^\w\s]/g and s/\G\n// and $http=1;
    $http and /^\S*[\w\s]/g and s/\G\n// or $http=0;
    But a lot of URLs end in slashes, and it would be perfectly reasonable to see (in mail):
    Check out http://homestarrunner.com/sbemail/ and let me know what you think
    and the processor would join the "and" line onto the end of the URL.

    Update: Oops, forgot the /g modifier on the pattern matches, so the \G didn't work.


    The PerlMonk tr/// Advocate
      [^\w\s] matches all characters that are not both a word character and a whitespace character. That is, it matches *any* character. Which means that from the string <URL:http://www.example.com/> your first regex is going to match http://www.example.com/>. Your second regex is only going to match if the string contains http: followed by a sequence of non-whitespace characters, followed by a whitespace character, followed by a newline. Given the string http://www.example.com/\npath/here, your solution will not set $http to 1.

      Abigail

        Astonishingly, you are wrong about the character class. [^\w\s] matches characters that are neither word characters nor whitespace; i.e., punctuation:
        $_='o.ne<a>tw/o<b>'; @punks = /[^\w\s]/g; print "<@punks>";
        yields
        <. < > / < >>
        The problem with my example was that I left the /g modifier off the pattern match. I've updated it, and tested it:
        while(<DATA>){
            /http:\S*[^\w\s]/g and s/\G\n//;
            print;
        }
        __DATA__
        there is an http://whatever.com/address/
        crossing/line/boundaries.html right in the middle of this nice string.

        The PerlMonk tr/// Advocate
Re: Regex to detect and remove a linebreak in an URL
by Hagbone (Monk) on May 20, 2004 at 01:03 UTC
    I don't pretend to know the solution, but I do agree with Abigail that the nuances involved in a comprehensive regex are daunting.

    For example, none of the suggested regexes allows for a secure server URL ... https

    Just the tip of the iceberg, it seems ;)

    Hagbone

Re: Regex to detect and remove a linebreak in an URL
by Anonymous Monk on May 20, 2004 at 11:05 UTC

    Thanks everyone -- including the sceptics -- for your help with this one.

    I think the problem may not be amenable to any real solution. Although we would be happy with a solution that matched most broken URIs most of the time (or even some of the time), we can't accept the risk of creating new broken URLs -- and unfortunately, that could easily happen.

    Take this simple example:

    Here is a valid link http://mydomain.org/
    which is not broken

    I can't see any way of detecting whether http://mydomain.org/ is correct or whether it should really be http://mydomain.org/which.

    It would work if we could be certain that the first token on the next line -- the one to be appended to the possibly-broken URL -- could only exist as an end-fragment of a syntactically-correct URL. I suspect that even if theoretically possible, it would be impractical.
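
A weaker form of that test can at least be sketched: only accept the first token on the next line when it contains something an ordinary word would rarely contain. The sub name looks_like_url_tail and the two rules below are assumptions made for this illustration, not a proven method. It leaves some URLs broken (for example when the tail is a plain word such as "gallery"), and it can still be fooled by tokens like "3.5".

```perl
use strict;
use warnings;

# Hypothetical predicate: does the first token of the next line look like
# something that would rarely begin an ordinary sentence?
sub looks_like_url_tail {
    my ($next_line) = @_;
    my ($token) = $next_line =~ /^(\S+)/ or return 0;   # no token at all
    return 1 if $token =~ m{[/?=&%#]};    # path or query punctuation
    return 1 if $token =~ /\w\.\w+\z/;    # embedded dot, e.g. "page.html"
    return 0;
}

for my $next ('g_maeda_sakuhin2.html', 'which is not broken') {
    printf "%-25s %s\n", $next, looks_like_url_tail($next) ? 'join' : 'leave alone';
}
```

The trade-off is deliberate: the check prefers leaving a broken URL broken over creating a new wrong one, which is the risk described above.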