extract_tagged

yacoubean has asked for the wisdom of the Perl Monks concerning the following question:

Holy Monks,

I have a piece of code that's driving me crazy. In short, I am trying to extract the link from the <a href="..."> tags throughout my HTML page. I have the extract_tagged code working in other parts of my program, and from what I can tell it is exactly the same as this chunk that is misbehaving.

Bad boy:

if (/<a href\=\"/) {
  my @link = extract_tagged($_, '<a href="', '">', undef, undef);
  print "  @link[4]\n";
}

This code works:

else {
  my @text = extract_tagged($_, '<li>', '</li>', undef, undef);
  print "  *@text[4]*\n";
}

Here is some text that the code is parsing:
<li><a href="menuheader.html">menuheader.cfm</a></li>

I know that the condition for the if statement is firing, because I can print out some debugging text inside it. If I return a count of @link, it's 3, but it should be at least 5, if I understand things right. I've tried printing @link positions 0-5, and all come back empty. I've tried escaping the quotes and/or equal signs as well.

FYI, I am fairly new to Perl, so go easy on me. ;)


Replies are listed 'Best First'.
Re: extract_tagged by ikegami (Patriarch) on Sep 29, 2004 at 15:24 UTC
Is this extract_tagged from Text::Balanced? Text::Balanced is a set of tokenizing functions for parsers. Tokenizers extract from the current position in the string/stream, so these functions can't be used to match something that may occur later in the string. In other words, your string doesn't start with `<a href="` (it starts with `<li><a href="`), so nothing is extracted. I don't know what to suggest as a replacement, but I'm sure someone else will suggest a module better suited to what you are doing.
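A minimal sketch of the point above, run against the sample line from the question. It assumes Text::Balanced's extract_tagged and uses its optional fourth argument (a prefix pattern to skip before the opening tag) as one possible way to reach a tag that is not at the start of the string. The prefix pattern itself is an illustration, not something from the original post.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Sample line from the question.
my $line = '<li><a href="menuheader.html">menuheader.cfm</a></li>';

# The string starts with '<li>', not '<a href="', so extraction fails
# and the first element of the returned list is undef.
my @miss = extract_tagged($line, '<a href="', '">');
print "no match at start of string\n" unless defined $miss[0];

# Assumption: skip everything up to the first '<a href="' by passing
# a prefix pattern as the fourth argument.
my @hit = extract_tagged($line, '<a href="', '">', '.*?(?=<a href=")');
print "link: $hit[4]\n" if defined $hit[0];
```

In list context the fifth element (index 4) holds the text between the opening and closing delimiters, which is why the question's code reads position 4.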
Re^2: extract_tagged by yacoubean (Scribe) on Sep 29, 2004 at 15:32 UTC
You are the holiest Monk of them all. At least of those that responded. :) davido's and mifflin's suggestions probably would have worked, but it was much easier to just change my code to:

if (/<a href\=\"/) {
  my @link = extract_tagged($_, '<li><a href="', '">', undef, undef);
  print "  @link[4]\n";
}

That worked like a charm.
Re: extract_tagged by mifflin (Curate) on Sep 29, 2004 at 15:30 UTC
Try HTML::SimpleLinkExtor. Here is an example:

# cat testit
use HTML::SimpleLinkExtor;
use LWP::Simple;
$content = get('http://www.perlmonks.com');
$extor = HTML::SimpleLinkExtor->new();
$extor->parse($content);
for ($extor->links) {
  print "$_\n" if /http/
}

# perl testit
http://pair.com
http://promote.pair.com/i/pair-banner-current.gif
http://perlmonks.org/images/usermonkpics/BBQmonk.gif
http://www.perldoc.com/perl5.8.0/pod/func/unpack.html
http://www.perldoc.com/perl5.8.0/pod/func/vec.html
http://search.cpan.org/search?mode=module&query=Tree%3A%3ASimple
http://search.cpan.org/search?mode=module&query=Tree%3A%3ASimple%3A%3AVisitorFactory
http://search.cpan.org/search?mode=module&query=Tree%3A%3ASimple%3A%3AVisitorFactory
http://rio.pm.org/
http://www.conisli.org.br/
http://tinymicros.com/pm/index.php?goto=OverallStats
http://www.cafepress.com/perlmonks,perlmonks_too,pm_more
http://aegis.sourceforge.net/
http://www.gnu.org/software/gnu-arch/
http://www.bitmover.com/bitkeeper
http://www.cvshome.org
http://www.perforce.com
http://msdn.microsoft.com/vstudio/previous/ssafe/
http://subversion.tigris.org
http://everydevel.com
http://yetanother.org
http://promote.pair.com/direct.pl?perlmonks.org
Re: extract_tagged by davido (Cardinal) on Sep 29, 2004 at 15:15 UTC
The easiest and most robust way is to start with a piece of code that has already been tested and used by many others. HTML::LinkExtor does what you are trying to do, and is pretty easy to install.

Dave
Re: extract_tagged by JediWizard (Deacon) on Sep 29, 2004 at 15:16 UTC
Can you send us some sample data that is causing the error? Assuming extract_tagged is something you wrote, can you show us that function?

May the Force be with you
Re: extract_tagged by TedPride (Priest) on Sep 30, 2004 at 05:05 UTC
You should probably use the suggested modules for this, but if you really want to do it otherwise, here's some code:

while ($text =~ /(href|<frame .*?src)[ ="']+(.*?)["'>]/g) {
  print $2;
}

Considering some of the nasty ways people can arrange their links, this is about as good as you can get. If you want to eliminate any link that starts with a scheme other than http: (such as mailto:), you can modify the above as follows:

while ($text =~ /(href|<frame .*?src)[ ="']+((http:)?[^:]*?)["'>]/g) {
  print $2;
}

If you find a link format that gets past this, feel free to post so I can update the regex.
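A quick way to try the two patterns is sketched below, written with `|` alternation and lazy `.*?` quantifiers. The sample input is hypothetical apart from its first line, which is the line from the question. The `length` check is an addition for this sketch: with the scheme-filtered pattern, a mailto: link can still match with an empty capture, so empty results are skipped.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the first line comes from the question.
my $text = join "\n",
    '<li><a href="menuheader.html">menuheader.cfm</a></li>',
    '<li><a href="http://example.com/page.html">page</a></li>',
    '<li><a href="mailto:someone@example.com">mail</a></li>';

# First pattern: capture every href/src value.
my @all;
while ($text =~ /(href|<frame .*?src)[ ="']+(.*?)["'>]/g) {
    push @all, $2;
}

# Second pattern: allow only an optional http: scheme.
# Skip empty captures (an addition for this sketch).
my @http;
while ($text =~ /(href|<frame .*?src)[ ="']+((http:)?[^:]*?)["'>]/g) {
    push @http, $2 if length $2;
}

print "all:  $_\n" for @all;
print "http: $_\n" for @http;
```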