Re: Re: Text::Balanced woes..

Having looked at the HTML:: modules, i really do think Text::Balanced does exactly what i want/need...

It works marvelously, except that when using extract_multiple with extract_tagged as the subroutine, there seems no (obvious:) way to access the 5th (#4) element of the array returned by extract_tagged....

Or is it that by calling it within extract_multiple it isn't in list context? But if that's the case, then it must be in scalar context, so what happens to the remainder string?

i guess the crux of my question is: "When using extract_multiple, how does one access the other members of the returned array, as it seems that item 0 is the only available?"
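
The context difference is easy to check on extract_tagged by itself. This is a minimal sketch, assuming a made-up link string (the variable names are only for illustration, and it says nothing about what extract_multiple does internally). In list context all six values come back and the input is left alone; in scalar context only the extracted substring is returned, and it is removed from the input variable.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Made-up sample text, not taken from the real page.
my $text = '<a href="http://example.com/one">One</a> trailing text';

# List context: (extracted, remainder, prefix, opening tag,
# text between the tags, closing tag). The input is not modified.
my $list_input = $text;
my @parts = extract_tagged($list_input, '<a href="http://', '">');
print "[$_] $parts[$_]\n" for 0 .. $#parts;

# Scalar context: only the extracted substring comes back, and it is
# cut out of the input variable, which is then left holding the remainder.
my $scalar_input = $text;
my $extracted = extract_tagged($scalar_input, '<a href="http://', '">');
print "extracted: $extracted\n";
print "left over: $scalar_input\n";
```

So the remainder is not lost in scalar context; it is what stays behind in the variable that was passed in.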

i've got some working code, but am reluctant to post it here (it is an anti-spambot tool, after all), but i'd be happy to share it via email.

update

i've worked it out with a for loop (i know, control structures are for wimps! guilty as charged!)..

# find all the URLs from the page contents, rejecting any from bianca
@data = extract_multiple( $response->content, 
                [ sub {extract_tagged($_[0], 
                '<a href="http://', '</a>', 
                undef, 
                {reject => ['bianca.com']} ) } ], 
                                    undef, 1);

# loop thru and strip the URL to its bare address, this is
# what's needed to insert into the database
for (my $i=0; $i<=$#data; $i++) {
    my @temp = extract_tagged($data[$i], '<a href="http://', '">', undef, undef);
    $data[$i] = $temp[4];
}
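
For anyone who wants to try the two passes without the real page, here is a minimal self-contained sketch. The sample HTML and the bare_addresses name are made up for illustration (the real input came from $response->content); the calls to extract_multiple and extract_tagged take the same arguments as above.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_tagged);

# Hypothetical sample page standing in for $response->content.
my $html = join "\n",
    '<p>Links:</p>',
    '<a href="http://example.com/one">One</a>',
    '<a href="http://www.bianca.com/spam">Spam</a>',
    '<a href="http://example.org/two">Two</a>';

# Hypothetical helper: returns the bare address of every link in $text,
# skipping any link that mentions bianca.com.
sub bare_addresses {
    my ($text) = @_;

    # Pass 1: collect each whole <a href="http://...">...</a> element.
    # The final 1 tells extract_multiple to drop the text between matches.
    my @links = extract_multiple(
        $text,
        [ sub { extract_tagged($_[0], '<a href="http://', '</a>',
                               undef, { reject => ['bianca.com'] }) } ],
        undef, 1,
    );

    # Pass 2: element 4 of the list-context result is the text between
    # the opening pattern and the closing '">'.
    my @addresses;
    for my $link (@links) {
        my @parts = extract_tagged($link, '<a href="http://', '">');
        push @addresses, $parts[4];
    }
    return @addresses;
}

print "$_\n" for bare_addresses($html);
```

Note that the reject entries are treated as patterns, so the dot in 'bianca.com' matches any character.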

Thanks again for everyone's help and comments!

Re: Text::Balanced woes.. by Smylers (Pilgrim) on May 28, 2002 at 10:29 UTC
That loop can be simplified. Don't bother doing the counting yourself when Perl will do it for you. You don't actually need the temporary array either, since you can grab a single element from a list.

This is untested, since I don't have sample data handy, but I reckon it does the same as your loop and is a little simpler:

foreach my $datum (@data) {
    $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
}

Smylers
Re: Text::Balanced woes.. by Smylers (Pilgrim) on May 28, 2002 at 10:37 UTC
"Having looked at the HTML:: modules, i really do think this Text::Balanced is exactly what i want/need it to do..."

I realize that your code is only a snippet, but it does look like it is possible to concoct valid HTML hyperlinks that don't get caught by it:

    upper-case letters: `<a HREF="...">`
    single quotes: `<a href='...'>`
    other attributes: `<a class="main" href="...">`

Whether these matter depends on your application and your users. But if they do, you are probably better off using a module written explicitly for parsing HTML rather than trying to think of all the possible valid variations.

Smylers
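
A quick way to check those three cases is to run each variation through the same extract_tagged call. This is only a sketch with made-up URLs; the %variants and %address names are hypothetical. Because the opening pattern is matched literally and case-sensitively, only the first form should yield an address.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Made-up links, one per variation.
my %variants = (
    'lower-case, double quotes' => '<a href="http://example.com/">x</a>',
    'upper-case attribute'      => '<a HREF="http://example.com/">x</a>',
    'single quotes'             => q{<a href='http://example.com/'>x</a>},
    'extra attribute first'     => '<a class="main" href="http://example.com/">x</a>',
);

my %address;
for my $name (sort keys %variants) {
    my $html = $variants{$name};

    # On failure the returned list has no element 4, so this is undef.
    $address{$name} = (extract_tagged($html, '<a href="http://', '">'))[4];

    printf "%-26s %s\n", $name,
        defined $address{$name} ? $address{$name} : '(not matched)';
}
```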
Re: Re: Text::Balanced woes.. by u914 (Pilgrim) on Jun 12, 2002 at 05:45 UTC
Actually, the HTML pages that it works on are generated by another script, so they're pretty consistent.

i like the loop simplification, though... i often forget about foreach, and the C-ishness is a fiendish habit to break!

Thanks for the tip! :-)