Stripping links from field

htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I'm building a newsletter and trying to strip URLs from a field, because Google recognizes a URL as an attachment and other mail clients present it as a link. This is what I'm currently doing, and it isn't working:

$description =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$shortdesc = substr ($description, 0, 270) . "...";

print "$shortdesc";

I'd like to recognize the entire link and simply not print it, but I'm not having much luck. I'm trying to strip any URL, but here's one that doesn't work:

https://drive.google.com/file/d/1ZhXQYI-4fgx5hredv7Z0Tl2sszvN92oV/view?usp=sharing...


Replies are listed 'Best First'.
Re: Stripping links from field by GrandFather (Saint) on Oct 24, 2022 at 22:49 UTC
Sounds easy, but in practice it's tricky to catch just the link text. Here's a partial solution that will catch most links starting with `https?://`:

use warnings;
use strict;

my $link = 'https://drive.google.com/file/d/1ZhXQYI-4fgx5hredv7Z0Tl2sszvN92oV/view?usp=sharing';
my $text = <<TEXT;
Some sample text to strip $link from.
Sometimes the sentence ends with the $link. We don't want to remove the period if that happens, or other punctuation in similar situations.
TEXT

$text =~ s~\bhttps?://\S+([.)?!,]\s?|$|\b)~...$1~g;
print $text;

Prints:

Some sample text to strip ... from.
Sometimes the sentence ends with the .... We don't want to remove the period if that happens, or other punctuation in similar situations.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
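For the newsletter case in the question, the substitution above can be combined with the truncation step. This is only a minimal sketch, assuming the field is plain text: the helper name `short_desc` and the `example.com` URL are made up for illustration, and the 270-character default is taken from the `substr` call in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strip http(s) URLs from
# plain text, then truncate the result for the newsletter.
sub short_desc {
    my ($text, $limit) = @_;
    $limit //= 270;    # default taken from the question's substr call

    # Same substitution as in the reply above: replace each URL with
    # "..." and put back any trailing punctuation captured in $1.
    $text =~ s~\bhttps?://\S+([.)?!,]\s?|$|\b)~...$1~g;

    # Append an ellipsis only when something was actually cut off.
    return length($text) > $limit
        ? substr($text, 0, $limit) . '...'
        : $text;
}

my $description = 'See the file at https://example.com/view?usp=sharing. More details follow.';
print short_desc($description), "\n";
```

Stripping before truncating matters here: if the text is cut first, a URL can be chopped in half and its remaining fragment may still be picked up by a mail client.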
Re^2: Stripping links from field by htmanning (Friar) on Oct 25, 2022 at 00:07 UTC
Thanks so much. It works! I had resigned myself to simply not printing the $shortdesc if there was a URL included. This is much better.
Re: Stripping links from field by haukex (Archbishop) on Oct 25, 2022 at 06:35 UTC
You haven't provided any sample input and the expected output for that input, but based on your regex you appear to want to strip HTML tags. Do not use regular expressions to process HTML. You may simply be looking for HTML::Strip, or I provided some code to produce an abstract of a piece of text in the presence of HTML at Re: Creating an abstract (updated), which can be adapted to remove specific tags like `<a>`. My node Why a regex really isn't good enough for HTML and XML, even for "simple" tasks and the replies also show a lot of examples of parsing links from HTML, which can be adapted to remove them as well. If instead you really do have simple plain text and you want to remove substrings that look like `https?://` URLs, then of course there's a module for that too: Regexp::Common::URI::http.
Re: Stripping links from field by Your Mother (Archbishop) on Oct 25, 2022 at 10:50 UTC
It’s been a long time, but I used these in the past: URI::Find, URI::Find::Schemeless.
Re: Stripping links from field by choroba (Cardinal) on Oct 24, 2022 at 21:39 UTC
Can you please update the post by including the link? It's not clear what you're trying to remove. `map{substr$_->[0],$_->[1]||0,1}[\||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`
Re^2: Stripping links from field by htmanning (Friar) on Oct 24, 2022 at 22:21 UTC
Sorry. I've updated my post. I'm trying to strip any URL but I posted a Google Drive URL as an example.
Re^3: Stripping links from field by choroba (Cardinal) on Oct 24, 2022 at 22:35 UTC
Your regex doesn't seem to catch links; it catches HTML tags with attributes. How do you define a "link"? Do you want to remove anything starting with `http://` or `https://` up to the next whitespace?

s{https?://\S+}{}g

Something else?
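The difference between the two patterns can be seen on a small sample. This is a sketch with an invented input string (the `example.com` URL and the variable names are assumptions, not from the thread); it applies the regex from the question and the substitution from this reply to the same text.

```perl
use strict;
use warnings;

# Invented sample: one HTML tag pair and one bare URL in plain text.
my $sample = 'A <b>bold</b> note: https://example.com/b?usp=sharing today.';

# The regex from the question: removes anything that looks like an
# HTML tag, but leaves the bare URL untouched.
(my $no_tags = $sample) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# The substitution from this reply: removes everything from http(s)://
# up to the next whitespace, and leaves the tags alone.
(my $no_urls = $sample) =~ s{https?://\S+}{}g;

print "$no_tags\n";
print "$no_urls\n";
```

Note that `\S+` also swallows any punctuation directly attached to the end of a URL, which is the case the longer regex elsewhere in this thread tries to handle.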
Re^4: Stripping links from field by htmanning (Friar) on Oct 24, 2022 at 22:45 UTC