Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How do I use an EXPR to strip everything out of a long string that is html, except a link, like this:

my $string = qq~<table>
<tr>
<td>I love my Football Teams</td>
<td>My favorite Teams are:</td>
</tr>
<tr>
<td>Oklahoma Sooners</td>
<td><a href="http://example.com">Sooner Nation</a></td>
</tr>
<tr>
<td>SF 49ers</td>
<td>I miss Edward J. DeBartolo, Jr.!!!</td>
</tr>
</table>
~;

In that string, I want to only leave this: <a href="http://example.com">Sooner Nation</a>

I know there is a way to do it, but I cannot remember how. It has been a LONG time since I've written Perl scripts, so I've lost most of my ability to do it...

I would greatly appreciate pointers.

Rich

Replies are listed 'Best First'.
Re: How to strip everything in a string except HTML Link
by choroba (Cardinal) on May 15, 2015 at 06:58 UTC
    If your HTML is well-formed, you can use XML::LibXML:
    #! /usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use XML::LibXML;

    my $string = q~...~;
    my $xml = 'XML::LibXML'->load_html(string => $string);
    say for $xml->findnodes('//a');
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: How to strip everything in a string except HTML Link
by Corion (Patriarch) on May 15, 2015 at 06:58 UTC
      More specifically, the links are all just news affiliate websites, like newsok.com, etc. I have no idea which news affiliate websites they will be, but there are hundreds of them. He changes some of them regularly, so I am building this to check on them once per day, to pick up any new ones and add them to our database.

      My friend who does that said to check it daily, so I am just writing a script to do that. The part I'm having a problem with is getting the full HTML link. I've been using strip and stripping down every part, but that is just too much. I know there is an expr that will work; I just cannot recall how to write it.

      Rich

        Have you looked at the modules I linked? They will all happily extract the links.

        Alternatively, you might want to (re)read perlre, but I would use an existing HTML parser instead of trying my own.

        I did not mean expr, I meant a regex...
      The URLs will always be different and I won't know what they are; it is based upon unique links. A friend of mine always changes them, and he said I could always get them. I am building a script that will check them for me, to see if I already have them, because I don't want to check manually every day.

      thanks,
      Rich
      a regex like this: <((?!a[ ]).|\n)*?>

      Except one that leaves the trailing </a> in it.

      Can you find one like that which works?
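A minimal sketch of one way to write such a substitution, using core Perl only. The negative lookahead skips both `<a ...>` and `</a>`, so the closing tag survives. The sub name `strip_tags_except_links` is made up for illustration, and the pattern assumes that no attribute value contains a `>` and that the input has no comments or script blocks. Note that it removes tags only; text between the other tags (such as a neighbouring table cell's content) is left in place.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every tag except opening <a ...> and closing </a>.
# Assumption: no '>' appears inside attribute values.
sub strip_tags_except_links {
    my ($html) = @_;
    # (?!/?a[\s>]) refuses to match when the tag name is exactly "a",
    # with or without a leading slash; tags such as <abbr> are still removed.
    $html =~ s{<(?!/?a[\s>])[^>]*>}{}gi;
    return $html;
}

my $string = q~<td>Oklahoma Sooners</td> <td><a href="http://example.com" target="_blank">Sooner Nation</a></td>~;
print strip_tags_except_links($string), "\n";
```

Extra attributes such as target="_blank" do not matter here, because the pattern never looks inside the anchor tag at all.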
Re: How to strip everything in a string except HTML Link
by Discipulus (Canon) on May 15, 2015 at 07:22 UTC
    Hello,
    I think HTML::LinkExtor will be a useful tool in your case, and this old node too.

    If you want to update a list of unique links, you can store them somehow (plain text, database, Storable file...). First load this cache in the program, building up a hash (keys are unique, so it helps). Then extract the links and update the hash only if the key does not exist. On success, write the new copy of the storage.
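The caching idea above could be sketched roughly as follows, assuming a plain text file with one link per line. The sub names (`load_cache`, `add_new_links`, `save_cache`) and the file name `links.txt` are made up for illustration; a database or Storable file would follow the same load, compare, write pattern.

```perl
use strict;
use warnings;

# Load previously seen links (one per line) into a hash.
# A missing file is treated as an empty cache.
sub load_cache {
    my ($file) = @_;
    my %seen;
    if (open my $in, '<', $file) {
        while (my $line = <$in>) {
            chomp $line;
            $seen{$line} = 1 if length $line;
        }
        close $in;
    }
    return \%seen;
}

# Record links that are not in the cache yet and return only the new ones.
sub add_new_links {
    my ($seen, @links) = @_;
    return grep { !$seen->{$_}++ } @links;
}

# Write the full set of known links back to the file.
sub save_cache {
    my ($file, $seen) = @_;
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} "$_\n" for sort keys %$seen;
    close $out or die "Cannot close $file: $!";
}

# In-memory demonstration; this hash stands in for load_cache('links.txt').
my $seen = { 'http://example.com' => 1 };
my @new  = add_new_links($seen, 'http://example.com', 'http://example.org');
print "New link: $_\n" for @new;
```

In a daily script the flow would be load_cache, then add_new_links with whatever was extracted that day, then save_cache only if something new turned up.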
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      But in there, you know the base:

      my $base = 'http://perlmonks.org/';
      I will have no idea what they are, could be any news affiliate website in the world.

      I just want to remove the other stuff and leave what is in the html link:

      Link: <a href="http://example.com">and Anchor</a>
      If that above were the string, it would remove Link: and leave the rest.

      my $string = q~Link: <a href="http://example.com">and Anchor</a>~;
      $string =~ s/<[a href....
      # I cannot remember this string. There was one that worked perfectly;
      # even if the link had target="_blank", it did not matter what else it had...
      # but I cannot find it in any of my files or remember how to write it.


      Also, at this point I've already downloaded the one page they are all on and parsed it down to just one table cell, which has other data in it. I've gotten the information I need out of that table cell; all that is left is the remnants, including the HTML link with its anchor text. So I want to use that string and remove everything left except the link and anchor.
Re: How to strip everything in a string except HTML Link
by aaron_baugher (Curate) on May 15, 2015 at 08:58 UTC

    For production code that will be used regularly or by other people, I would use one of the HTML-parsing modules mentioned earlier. For a one-time grab, a regex may be good enough to get the job done. If there's only one A link in the block of text:

    $text =~ s|^.*(<a .+?</a>).*$|$1|s;
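If the block might contain more than one link, the greedy leading `.*` in the substitution above would keep only the last one. A variant that collects every link with a global match in list context might look like this sketch; the sub name `extract_links` is made up for illustration, and it assumes anchors are not nested and that no attribute value contains a `>`.

```perl
use strict;
use warnings;

# Hypothetical helper: return every <a ...>...</a> element found in the text.
sub extract_links {
    my ($text) = @_;
    # /g collects all matches, /i ignores tag case, /s lets . cross newlines.
    my @links = $text =~ m{(<a\s[^>]*>.*?</a>)}gis;
    return @links;
}

my $text = q~Link: <a href="http://example.com" target="_blank">and Anchor</a> trailing text~;
print "$_\n" for extract_links($text);
```

As with the one-liner, this is only suitable for quick jobs on markup you have already looked at; an HTML parser remains the safer choice for anything long-lived.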

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.