How to strip everything in a string except HTML Link

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to strip everything in a string except HTML Link by choroba (Cardinal) on May 15, 2015 at 06:58 UTC
If your HTML is well-formed, you can use XML::LibXML:
`#! /usr/bin/perl`
`use warnings;`
`use strict;`
`use feature qw{ say };`
`use XML::LibXML;`
`my $string = q~...~;`
`my $xml = 'XML::LibXML'->load_html(string => $string);`
`say for $xml->findnodes('//a');`
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: How to strip everything in a string except HTML Link by Corion (Patriarch) on May 15, 2015 at 06:58 UTC
So you want to keep everything that looks like a link? I would use something like HTML::TreeBuilder::XPath and the appropriate XPath query (`//a`). Another candidate is XML::Twig. Personally, I would use App::scrape, which is a tiny command-line wrapper around HTML::TreeBuilder::XPath.
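A minimal sketch of the HTML::TreeBuilder::XPath approach with the `//a` query, assuming the module is installed from CPAN. The sample string is the one the original poster gives later in the thread; the `@links` structure and its `href`/`text` keys are only illustrative names.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample input taken from later in the thread.
my $string = q~Link: <a href="http://example.com">and Anchor</a>~;

# Parse the fragment and select every <a> element.
my $tree = HTML::TreeBuilder::XPath->new_from_content($string);

# Collect the href attribute and the anchor text of each link.
my @links = map { +{ href => $_->attr('href'), text => $_->as_text } }
            $tree->findnodes('//a');

print "$_->{href}\t$_->{text}\n" for @links;

# Free the parse tree once the data has been copied out.
$tree->delete;
```

Because the parser handles attribute order and extra attributes such as `target="_blank"`, nothing here depends on how the link happens to be written.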
Re^2: How to strip everything in a string except HTML Link by Anonymous Monk on May 15, 2015 at 07:19 UTC
More specifically, the links are all just news affiliate websites, like newsok.com, etc. I have no idea which news affiliate websites they will be, but there are hundreds of them. He changes some of them regularly, so I am building this to check on them once per day, to pick up any new ones and add them to our database. My friend who does that said to check it daily, so I am just writing a script to do that. The part I'm having a problem with is getting the full HTML link. I've been using strip and stripping down every part, but that is just too much. I know there is an expr that will work; I just cannot recall how to write it. Rich
Re^3: How to strip everything in a string except HTML Link by Corion (Patriarch) on May 15, 2015 at 07:33 UTC
Have you looked at the modules I linked? They will all happily extract the links. Alternatively, you might want to (re)read perlre, but I would use an existing HTML parser instead of trying to write my own.
Re^4: How to strip everything in a string except HTML Link by Anonymous Monk on May 15, 2015 at 07:41 UTC
Re^3: How to strip everything in a string except HTML Link by Anonymous Monk on May 15, 2015 at 07:46 UTC
I did not mean expr, I meant a regex...
Re^2: How to strip everything in a string except HTML Link by Anonymous Monk on May 15, 2015 at 07:08 UTC
The URLs will always be different and I won't know what they are; it is based upon unique links. A friend of mine always changes them, and he said I could always get them. I am building a script that will check them for me, to see if I already have them, because I don't want to check manually every day. Thanks, Rich
Re^2: How to strip everything in a string except HTML Link by Anonymous Monk on May 15, 2015 at 08:03 UTC
A regex like this: `<((?!a[ ]).|\n)*?>` except one that leaves the trailing </a> in. Can you suggest one like that which works?
Re: How to strip everything in a string except HTML Link by Discipulus (Canon) on May 15, 2015 at 07:22 UTC
Hello, I think HTML::LinkExtor will be a useful tool in your case, and this old node too. If you want to keep a list of unique links up to date, you can store them somehow (plain text, database, Storable file...). First load this cache into the program, building up a hash (keys are unique, so it helps). Then extract the links and add each one to the hash only if its key does not already exist. On success, write the new copy of the storage. L* There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
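The load, update, and write-back cycle described above could be sketched roughly as follows, using only core Perl and a plain-text cache with one URL per line. The file name `links.txt`, the sub names, and the sample URLs in `@found` are all hypothetical; in a real script `@found` would come from whichever link extractor you settle on.

```perl
use strict;
use warnings;

# Read the cache file (one URL per line) into a hash; an absent file means an empty cache.
sub load_cache {
    my ($file) = @_;
    my %seen;
    return \%seen unless -e $file;
    open my $in, '<', $file or die "Cannot read $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        $seen{$line} = 1 if length $line;
    }
    close $in;
    return \%seen;
}

# Record each link in the hash and return only those that were not there before.
sub add_new_links {
    my ($seen, @links) = @_;
    return grep { !$seen->{$_}++ } @links;
}

# Write the whole set of known links back to the cache file.
sub save_cache {
    my ($file, $seen) = @_;
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} "$_\n" for sort keys %$seen;
    close $out or die "Cannot close $file: $!";
}

my $cache_file = 'links.txt';    # hypothetical cache location
my $seen       = load_cache($cache_file);

# Hypothetical extractor output, including a duplicate.
my @found = ('http://example.com/', 'http://newsok.com/', 'http://example.com/');

my @new = add_new_links($seen, @found);
save_cache($cache_file, $seen) if @new;
print "new: $_\n" for @new;
```

The hash does the uniqueness check, so duplicates within one run and links already seen on earlier runs are both skipped.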
Re^2: How to strip everything in a string except HTML Link by Anonymous Monk on May 15, 2015 at 07:32 UTC
But in there, you know the base: `my $base = 'http://perlmonks.org/';` I will have no idea what they are; they could be any news affiliate website in the world. I just want to remove the other stuff and leave what is in the HTML link: `Link: <a href="http://example.com">and Anchor</a>` If that were the string, it would remove "Link:" and leave the rest.
`my $string = q~Link: <a href="http://example.com">and Anchor</a>~;`
`$string =~ s/<[a href.... # I cannot remember this string. There was one that worked perfectly, even if the link had target="_blank"; it did not matter what else it had... but I cannot find it in any of my files or remember how to write it.`
Also, at this point I've already downloaded the one page they are all on and parsed it down to just one table cell, which has other data in it. I've gotten the information I need out of that table cell; all that is left is the remnants, including the HTML link with its anchor. So I want to use that string and remove everything that is left except the link and anchor.
Re: How to strip everything in a string except HTML Link by aaron_baugher (Curate) on May 15, 2015 at 08:58 UTC
For production code that will be used regularly or by other people, I would use one of the HTML-parsing modules mentioned earlier. For a one-time grab, a regex may be good enough to get the job done. If there's only one A link in the block of text: `$text =~ s|^.*(<a .+?</a>).*$|$1|s;` Aaron B. Available for small or large Perl jobs and *nix system administration; see my home node.
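If the block might hold more than one link, a global match in list context is one way to collect them all. This is only a sketch of the quick-and-dirty regex route, not a replacement for a real parser: the sample text is made up, and the pattern assumes that no attribute value contains a `>` and that links are not nested.

```perl
use strict;
use warnings;

# Made-up sample with two links; the second carries an extra attribute.
my $text = q~Link: <a href="http://example.com">and Anchor</a> and <a target="_blank" href="http://newsok.com">NewsOK</a> trailing~;

# /g in list context returns every capture, /i ignores tag case,
# /s lets . cross newlines inside the anchor text.
# \s after "<a" keeps tags such as <abbr> from matching.
my @links = $text =~ m{(<a\s[^>]*>.*?</a>)}gis;

print "$_\n" for @links;
```

Each element of `@links` is a complete `<a ...>...</a>` string, whatever other attributes the tag carries.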