htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I'm building a newsletter and trying to strip URLs from a field, because Google recognizes them as attachments and other mail clients present them as links. This is what I'm currently doing, which isn't working:

$description =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$shortdesc = substr ($description, 0, 270) . "...";
print "$shortdesc";

I'd like to recognize the entire link and just not print it, but I'm not having much luck. I'm trying to strip any URL, but here's one that doesn't work:

https://drive.google.com/file/d/1ZhXQYI-4fgx5hredv7Z0Tl2sszvN92oV/view?usp=sharing...

Replies are listed 'Best First'.
Re: Stripping links from field
by GrandFather (Saint) on Oct 24, 2022 at 22:49 UTC

    Sounds easy, but in practice it's tricky to catch just the link text. Here's a partial solution that will catch most links starting with https?://

    use warnings;
    use strict;

    my $link = 'https://drive.google.com/file/d/1ZhXQYI-4fgx5hredv7Z0Tl2sszvN92oV/view?usp=sharing';
    my $text = <<TEXT;
    Some sample text to strip $link from.
    Sometimes the sentence ends with the $link. We don't want to remove the period if that happens, or other punctuation in similar situations.
    TEXT

    $text =~ s~\bhttps?://\S+([.)?!,]\s?|$|\b)~...$1~g;
    print $text;

    Prints:

    Some sample text to strip ... from.
    Sometimes the sentence ends with the .... We don't want to remove the period if that happens, or other punctuation in similar situations.
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      Thanks so much. It works! I had resigned myself to simply not printing the $shortdesc if there was a URL included. This is much better.
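A minimal sketch of how the substitution above might be combined with the truncation from the original question. The helper name `short_description` and the sample sentence are assumptions made for illustration; only the regex and the 270-character limit come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: replace http(s) URLs with '...', then truncate.
sub short_description {
    my ($description, $max_len) = @_;
    $max_len //= 270;    # limit used in the original question

    # GrandFather's substitution, unchanged.
    $description =~ s~\bhttps?://\S+([.)?!,]\s?|$|\b)~...$1~g;

    # Append an ellipsis only when text was actually cut off.
    return length($description) > $max_len
        ? substr($description, 0, $max_len) . '...'
        : $description;
}

# Made-up sample sentence around the URL from the question.
my $description = 'Slides: https://drive.google.com/file/d/1ZhXQYI-4fgx5hredv7Z0Tl2sszvN92oV/view?usp=sharing are up.';
print short_description($description), "\n";    # Slides: ... are up.
```

Stripping before truncating means a URL can never be cut in half by `substr`, which would otherwise leave a fragment that some mail clients may still turn into a link.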
Re: Stripping links from field
by haukex (Archbishop) on Oct 25, 2022 at 06:35 UTC

    You haven't provided any sample input and the expected output for that input, but based on your regex you appear to want to strip HTML tags. Do not use regular expressions to process HTML. You may simply be looking for HTML::Strip, or I provided some code to produce an abstract of a piece of text in the presence of HTML at Re: Creating an abstract (updated), which can be adapted to remove specific tags like <a>. My node Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks and the replies also show a lot of examples of parsing links from HTML, which can be adapted to remove them as well.

    If instead you really do have simple plain text and you want to remove substrings that look like https?:// URLs, then of course there's a module for that too: Regexp::Common::URI::http.
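To illustrate the distinction drawn above, here is a small sketch that runs the substitution from the original question against one HTML input and one plain-text input. Both sample strings and the `example.com` URL are assumptions made for illustration.

```perl
use strict;
use warnings;

# The pattern from the original question: it matches <...> tags,
# allowing quoted attribute values inside them.
my $tag_re = qr/<(?:[^>'"]*|(['"]).*?\1)*>/s;

# Made-up inputs: the same URL inside an HTML tag and as bare text.
my $html  = 'See <a href="https://example.com/?a=1">the file</a> here.';
my $plain = 'See https://example.com/?a=1 here.';

(my $html_out  = $html)  =~ s/$tag_re//g;
(my $plain_out = $plain) =~ s/$tag_re//g;

print "$html_out\n";     # See the file here.
print "$plain_out\n";    # See https://example.com/?a=1 here.
```

The bare URL survives because it contains no `<` for the pattern to start on, which would explain why the Google Drive link was not removed.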

Re: Stripping links from field
by Your Mother (Archbishop) on Oct 25, 2022 at 10:50 UTC
Re: Stripping links from field
by choroba (Cardinal) on Oct 24, 2022 at 21:39 UTC
    Can you please update the post by including the link? It's not clear what you're trying to remove.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Sorry. I've updated my post. I'm trying to strip any URL but I posted a Google Drive URL as an example.
        Your regex doesn't seem to match links; it matches HTML tags with attributes. How do you define a "link"?

        Do you want to remove anything starting with http:// or https:// up to whitespace?

        s{https?://\S+}{}g

        Something else?

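A short sketch of what the simpler `s{https?://\S+}{}g` does in practice, assuming a made-up sentence and an `example.com` URL. The whitespace cleanup on the second line is an addition for illustration and was not part of the reply.

```perl
use strict;
use warnings;

# Made-up sample: the URL is followed directly by a period.
my $text = 'The file is at https://example.com/view?usp=sharing. Enjoy!';

# Remove everything from the scheme up to the next whitespace.
(my $stripped = $text) =~ s{https?://\S+}{}g;

# Assumed extra step: collapse the gap the removed URL leaves behind.
$stripped =~ s/[ \t]{2,}/ /g;

print "$stripped\n";    # The file is at Enjoy!
```

Because `\S+` runs up to the next whitespace, the sentence-ending period is removed along with the URL. That is the case GrandFather's pattern handles by capturing trailing punctuation and putting it back.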