in reply to How to strip everything in a string except HTML Link

So you want to keep everything that looks like a link?

I would use something like HTML::TreeBuilder::XPath with the appropriate XPath query (//a). Another candidate is XML::Twig.

Personally, I would use App::scrape, which is a tiny command line wrapper around HTML::TreeBuilder::XPath.
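
For what it's worth, here is a minimal sketch of the HTML::TreeBuilder::XPath approach. The sample HTML and the variable names are made up for illustration, and I narrowed the query to //a[@href] so that anchors without a link target are skipped. Treat it as a starting point, not a finished script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input; the real pages are not shown in this thread.
my $html = '<p>See <a href="http://newsok.com/">NewsOK</a> and <b>bold</b> '
         . '<a href="http://example.com/">Example</a></p>';

# Parse the string into a tree that understands XPath queries.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select every <a> element that has an href and collect the targets.
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');

print "$_\n" for @hrefs;

# Free the parse tree once we are done with it.
$tree->delete;
```

If you need the whole element rather than just the URL, calling as_HTML on each node instead of attr('href') should give you the serialized tag.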

Re^2: How to strip everything in a string except HTML Link
by Anonymous Monk on May 15, 2015 at 07:19 UTC
    More specifically, the links are all just news affiliate websites, like newsok.com, etc. I have no idea which news affiliate websites they will be, but there are hundreds of them. My friend changes some of them regularly, so I am building this to check on them once per day, to pick up any new ones and add them to our database.

    My friend who maintains them said to check daily, so I am just writing a script to do that. The part I'm having a problem with is getting the full HTML link. I've been using strip and stripping down every part, but that is just too much. I know there is an expr that will work; I just cannot recall how to write it.

    Rich

      Have you looked at the modules I linked? They will all happily extract the links.

      Alternatively, you might want to (re)read perlre, but I would use an existing HTML parser instead of trying to write my own.

        Yeah, I looked over them all. They are very complicated for me, since I haven't written Perl scripts in over 3 years, so I've forgotten nearly everything except simple code...

      I did not mean expr, I meant a regex...
Re^2: How to strip everything in a string except HTML Link
by Anonymous Monk on May 15, 2015 at 07:08 UTC
    The URLs will always be different and I won't know what they are in advance, since they are unique links. A friend of mine changes them regularly, and he said I could always get them. I am building a script that will check them for me, to see if I already have them, because I don't want to check manually every day.

    thanks,
    Rich
Re^2: How to strip everything in a string except HTML Link
by Anonymous Monk on May 15, 2015 at 08:03 UTC
    a regex like this: <((?!a[ ]).|\n)*?>

    Except I need one that also leaves the trailing </a> in place.

    Can you suggest one like that which works?
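
One way to sidestep the "strip everything else" problem is to match the anchors you want to keep instead of the tags you want to remove. Below is a rough sketch using only core Perl. It assumes the anchors are not nested and that no attribute value contains a literal '>'. The sample HTML is hypothetical. A real HTML parser will be more robust on messy pages.

```perl
use strict;
use warnings;

# Hypothetical sample input; substitute the fetched page content.
my $html = <<'HTML';
<p>Top story at <a href="http://newsok.com/">NewsOK</a> today.</p>
<div><A HREF="http://example.com/news">Example
News</A> and <b>bold</b> text</div>
HTML

# Capture from an opening <a ...> tag through its closing </a>.
#   \b     keeps <a from also matching tags such as <abbr>
#   [^>]*  consumes the attributes up to the end of the opening tag
#   .*?    matches the link text non-greedily
#   /g finds every match, /i ignores case, /s lets . match newlines
my @links = $html =~ m{(<a\b[^>]*>.*?</a\s*>)}gis;

print "$_\n" for @links;
```

Everything outside the captured anchors is simply never collected, so the closing </a> stays attached to each link.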