yacoubean has asked for the wisdom of the Perl Monks concerning the following question:

Holy Monks,

I have a piece of code that's driving me crazy. In short, I am trying to extract the link targets from the <a href="..."> tags throughout my HTML page. I have extract_tagged working in other parts of my program, and as far as I can tell that code is exactly the same as this chunk that is misbehaving.

Bad boy:

if (/<a href\=\"/) {
    my @link = extract_tagged($_, '<a href="', '">', undef, undef);
    print " @link[4]\n";
}

This code works:

else {
    my @text = extract_tagged($_, '<li>', '</li>', undef, undef);
    print " *@text[4]*\n";
}

Here is some text that the code is parsing:
<li><a href="menuheader.html">menuheader.cfm</a></li>

I know that the condition for the if statement is firing, because I can print out some debugging text inside it. If I print a count of @link, it's 3, but it should be at least 5, if I understand things right. I've tried printing @link positions 0-5, and all come back empty. I've tried escaping the quotes and/or equal signs as well.

FYI, I am fairly new to Perl, so go easy on me. ;)

Re: extract_tagged
by ikegami (Patriarch) on Sep 29, 2004 at 15:24 UTC

    Is this extract_tagged from Text::Balanced? Text::Balanced is a set of tokenizing functions for parsers. Tokenizers extract from the current position in the string/stream, so these functions can't be used to match something that may occur later in the string. In other words, your string doesn't start with <a href=" (it starts with <li><a href="), so nothing is extracted.
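To make the anchoring concrete, here is a minimal sketch using the sample line from the question. The variable names are made up for illustration, and the third call relies on the optional prefix argument (a pattern that extract_tagged is allowed to skip before the opening tag), which is an assumption about how one might work around the problem rather than something tried in this thread. Each call gets its own copy of the string, because a successful list-context call moves pos() on the string it was given.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $html = '<li><a href="menuheader.html">menuheader.cfm</a></li>';

# 1. The opening tag is not at the start of the string, so the
#    extraction fails and the first returned element is undef.
my $text1 = $html;
my @fail  = extract_tagged($text1, '<a href="', '">');
print "anchored at start: ", (defined $fail[0] ? "matched" : "no match"), "\n";

# 2. The string does start with <li>, so this extraction succeeds.
#    Element 4 of the returned list is the text between the tags.
my $text2 = $html;
my @li    = extract_tagged($text2, '<li>', '</li>');
print "li contents: $li[4]\n";

# 3. Assumption: a prefix pattern that skips everything up to the
#    first <a href=" lets the same delimiters match further in.
my $text3 = $html;
my @link  = extract_tagged($text3, '<a href="', '">', qr/.*?(?=<a href=")/s);
print "href: $link[4]\n";
```

The default prefix only skips whitespace, which is why the first call finds nothing even though the tag is present later in the line.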

    I don't know what to suggest as a replacement, but I'm sure someone else will be suggesting a module better suited to what you are doing.

      You are the holiest Monk of them all. At least of those that responded. :) davido's and mifflin's suggestions probably would have worked, but it was much easier to just change my code to:

      if (/<a href\=\"/) {
          my @link = extract_tagged($_, '<li><a href="', '">', undef, undef);
          print " @link[4]\n";
      }

      That worked like a charm.

Re: extract_tagged
by mifflin (Curate) on Sep 29, 2004 at 15:30 UTC
    Try HTML::SimpleLinkExtor
    Here is an example...
    # cat testit
    use HTML::SimpleLinkExtor;
    use LWP::Simple;
    $content = get('http://www.perlmonks.com');
    $extor = HTML::SimpleLinkExtor->new();
    $extor->parse($content);
    for ($extor->links) { print "$_\n" if /http/ }

    # perl testit
    http://pair.com
    http://promote.pair.com/i/pair-banner-current.gif
    http://perlmonks.org/images/usermonkpics/BBQmonk.gif
    http://www.perldoc.com/perl5.8.0/pod/func/unpack.html
    http://www.perldoc.com/perl5.8.0/pod/func/vec.html
    http://search.cpan.org/search?mode=module&query=Tree%3A%3ASimple
    http://search.cpan.org/search?mode=module&query=Tree%3A%3ASimple%3A%3AVisitorFactory
    http://search.cpan.org/search?mode=module&query=Tree%3A%3ASimple%3A%3AVisitorFactory
    http://rio.pm.org/
    http://www.conisli.org.br/
    http://tinymicros.com/pm/index.php?goto=OverallStats
    http://www.cafepress.com/perlmonks,perlmonks_too,pm_more
    http://aegis.sourceforge.net/
    http://www.gnu.org/software/gnu-arch/
    http://www.bitmover.com/bitkeeper
    http://www.cvshome.org
    http://www.perforce.com
    http://msdn.microsoft.com/vstudio/previous/ssafe/
    http://subversion.tigris.org
    http://everydevel.com
    http://yetanother.org
    http://promote.pair.com/direct.pl?perlmonks.org

Re: extract_tagged
by davido (Cardinal) on Sep 29, 2004 at 15:15 UTC

    The easiest and most robust way is to use code that has already been tested and used by many others. HTML::LinkExtor does what you are trying to do, and is pretty easy to install.
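A rough sketch of what that could look like for the sample line in the question, assuming HTML::LinkExtor (part of the HTML-Parser distribution) is installed. The variable names are illustrative only. With no callback passed to new(), the parser collects the links and hands them back from links() as array references of the form [tag, attribute => value, ...].

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my $html = '<li><a href="menuheader.html">menuheader.cfm</a></li>';

# No callback given, so links are accumulated inside the parser.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ($parser->links) {
    # Each entry looks like [ 'a', href => 'menuheader.html' ].
    my ($tag, %attr) = @$link;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
}

print "$_\n" for @hrefs;
```

Because the module parses the HTML properly, it does not matter where on the line the tag appears.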


    Dave

Re: extract_tagged
by JediWizard (Deacon) on Sep 29, 2004 at 15:16 UTC

    Can you send us some sample data that is causing the error? Assuming extract_tagged is something you wrote, can you show us that function?

    May the Force be with you
Re: extract_tagged
by TedPride (Priest) on Sep 30, 2004 at 05:05 UTC
    You should probably use the suggested modules for this, but if you really want to do it otherwise, here's some code:
    while ($text =~ /(href|<frame .*?src)[ ="']+(.*?)["'>]/g) { print $2; }
    Considering some of the nasty ways people can arrange their links, this is about as good as you can get. If you want to eliminate any link that starts with a scheme other than http: (such as mailto:), you can modify the above as follows:
    while ($text =~ /(href|<frame .*?src)[ ="']+((http:)?[^:]*?)["'>]/g) { print $2; }
    If you find a link format that gets past this, feel free to post so I can update the regex.
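For anyone who wants to try the filtered regex, here is a small sketch that collects the links into an array instead of printing them. The sample HTML and the @links array are made up for illustration. As far as I can tell, a link with a non-http scheme (the mailto: line below) still matches but leaves $2 empty, so the sketch skips empty captures.

```perl
use strict;
use warnings;

# Hypothetical sample input, one tag per line.
my $text = join "\n",
    '<li><a href="menuheader.html">menuheader.cfm</a></li>',
    '<a href="http://www.perlmonks.org/">PerlMonks</a>',
    '<a href="mailto:someone@example.com">mail</a>',
    "<frame name=top src='top.html'>";

my @links;
while ($text =~ /(href|<frame .*?src)[ ="']+((http:)?[^:]*?)["'>]/g) {
    # Filtered-out schemes show up as an empty capture; skip those.
    next unless length $2;
    push @links, $2;
}

print "$_\n" for @links;
```

Note that the same filter also drops http links that contain a second colon, such as a URL with an explicit port number.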