Well Monks, by a remarkable coincidence my question is closely related to another posted here today. I'm not a Perl programmer or any other kind of programmer, but I can usually figure out the simple things. However, this one has me stumped.

I have a short Perl script that filters incoming email in preparation for archiving. Some email clients wrap email when sending it -- either by default or because they have been set up that way. So recipients sometimes get broken URLs and there is nothing they can do about it.

What I need to do is match a broken URL that looks like this:

http://www012.upp.so-net.ne.jp/sculpture/gallery/backnumber/g_s_maeda/
g_maeda_sakuhin2.html

The linebreak may appear anywhere, but the URL is always split on a boundary such as a slash or dot.

Does anyone here have any idea how to construct a regular expression to match an URL broken in this way, with a linebreak at an arbitrary position?

I guess it would be something like this:

  1. If a line contains something that could be an URL which terminates at the end of the line, save the URL.
  2. If the next line starts with something that looks like an URL fragment, join it to the saved URL.
  3. Validate the syntax of the resulting URL:
    • if invalid, leave both lines as they are and continue
    • if valid, join the two lines

The difficult bit (for me) is detecting an URL fragment.
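
A minimal sketch of the three steps above, assuming the text has already been split into lines without their newlines. The names rejoin_wrapped_urls, looks_like_fragment and looks_like_url are made up for illustration, and the fragment test is only a guess: the break must sit next to a slash or dot, and the next line's first word must contain something path-like (a slash, dot, underscore, digit and so on) that is not its last character. A continuation that is one plain word, such as "gallery", would be missed by this heuristic.

```perl
use strict;
use warnings;

# Characters accepted in a URL body for this sketch (an assumption,
# narrower than what RFC 3986 allows).
my $URL_CHARS = qr{[\w~:/?#%&=+.,;-]};

# Step 3: a deliberately loose syntax check on the rejoined URL.
sub looks_like_url {
    my ($url) = @_;
    return $url =~ m{^https?://[\w.-]+(?::\d+)?(?:/$URL_CHARS*)?$} ? 1 : 0;
}

# Step 2: does $tail look like the continuation of $head?
sub looks_like_fragment {
    my ($head, $tail) = @_;
    # The break must sit on a boundary: '/' or '.' on either side of it.
    return 0 unless $head =~ m{[/.]$} || $tail =~ m{^[/.]};
    # The candidate may contain URL characters only.
    return 0 unless $tail =~ m{^$URL_CHARS+$};
    # It needs a path-like character that is not its last character,
    # so a sentence word such as "Thanks." is not taken for a fragment.
    return $tail =~ m{[/._=?&\d-].} ? 1 : 0;
}

# Takes lines without trailing newlines, returns the repaired lines.
sub rejoin_wrapped_urls {
    my @lines = @_;
    my @out;
    while (@lines) {
        my $line = shift @lines;
        # Step 1: an URL that runs to the end of this line.
        if (@lines && $line =~ m{(https?://\S+)$}) {
            my $head = $1;
            # First whitespace-delimited word of the next line.
            my ($tail) = $lines[0] =~ m{^(\S+)};
            if (defined $tail
                && looks_like_fragment($head, $tail)
                && looks_like_url($head . $tail)) {
                # Join the two lines and look at the result again,
                # in case the URL was wrapped more than once.
                my $next = shift @lines;
                unshift @lines, $line . $next;
                next;
            }
        }
        push @out, $line;
    }
    return @out;
}

my @fixed = rejoin_wrapped_urls(
    'http://www012.upp.so-net.ne.jp/sculpture/gallery/backnumber/g_s_maeda/',
    'g_maeda_sakuhin2.html',
);
print "$_\n" for @fixed;
```

With the broken URL from the example this prints a single line holding the whole URL. A line such as "See http://example.com/" followed by "Thanks for reading" is left alone, because "Thanks" has nothing path-like in it.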

I'm grateful already, as I have found a load of other useful stuff on this miraculous website.


In reply to Regex to detect and remove a linebreak in an URL by Anonymous Monk
