Having looked at the HTML:: modules, i really do think Text::Balanced does exactly what i want/need it to do...

It works marvelously, except that when using extract_multiple with extract_tagged as the subroutine, there seems no (obvious:) way to access the 5th (#4) element of the array returned by extract_tagged....

Or is it that by calling it within extract_multiple it isn't in list context? But if that's the case, then it must be in scalar context, so what happens to the remainder string?

i guess the crux of my question is: "When using extract_multiple, how does one access the other members of the returned array, as it seems that item 0 is the only one available?"
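One possible approach, sketched here under assumptions rather than taken from the thread: extract_multiple uses the first value an extractor sub returns as the field, so a small wrapper can capture the whole list from extract_tagged and hand back a different piece. The sample page and the host names are made up for illustration, and the reject option is left out to keep the sketch short.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_tagged);

# Hypothetical sample page; the hosts are placeholders.
my $html = 'x <a href="http://a.example/p">A</a> y '
         . '<a href="http://b.example/q">B</a> z';

my @urls = extract_multiple(
    $html,
    [ sub {
        # List assignment keeps every field that extract_tagged returns.
        my @link = extract_tagged($_[0], '<a href="http://', '</a>');
        return unless defined $link[0];    # no link here, let the scan move on

        # Second pass over the matched link: field 4 is the text between
        # the opening '<a href="http://' and the closing '">'.
        return (extract_tagged($link[0], '<a href="http://', '">'))[4];
    } ],
    undef, 1);

print "$_\n" for @urls;
```

Because the wrapper still calls extract_tagged on $_[0], the scan position in the page text advances past each link as usual; only the value stored in the result list changes.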

i've got some working code, but am reluctant to post it here (it is an anti-spambot tool, after all), but i'd be happy to share it via email.

update

i've worked it out with a for loop (i know, control structures are for wimps! guilty as charged!)..

# find all the URLs from the page contents, rejecting any from bianca
@data = extract_multiple(
    $response->content,
    [ sub { extract_tagged($_[0], '<a href="http://', '</a>', undef, {reject => ['bianca.com']}) } ],
    undef, 1);

# loop thru and strip the URL to its bare address, this is
# what's needed to insert into the database
for (my $i=0; $i<=$#data; $i++) {
    my @temp = extract_tagged($data[$i], '<a href="http://', '">', undef, undef);
    $data[$i] = $temp[4];
}

Thanks again for everyone's help and comments!

Re: Text::Balanced woes..
by Smylers (Pilgrim) on May 28, 2002 at 10:29 UTC

    That loop can be simplified:

    1. Don't bother doing the counting yourself when Perl will do it for you.
    2. You don't actually need the temporary array — you can grab a single element from a list.

    This is untested, since I don't have sample data handy, but I reckon it does the same as your loop and is a little simpler:

    foreach my $datum (@data) {
        $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
    }
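To see both simplifications at work, here is a minimal sketch that runs the foreach version against two made-up links. The contents of @data are assumptions standing in for what the earlier extract_multiple call would return.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical stand-ins for the links found by extract_multiple.
my @data = (
    '<a href="http://a.example/p">A</a>',
    '<a href="http://b.example/q">B</a>',
);

# foreach aliases $datum to each element, so assigning to it updates
# @data in place; the (...)[4] list slice picks the text between the
# tags without needing a temporary array.
foreach my $datum (@data) {
    $datum = (extract_tagged($datum, '<a href="http://', '">'))[4];
}

print "$_\n" for @data;
```

The list slice works because the parentheses put the call in list context, which is the context in which extract_tagged returns its full set of fields.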

    Smylers

Re: Text::Balanced woes..
by Smylers (Pilgrim) on May 28, 2002 at 10:37 UTC
    Having looked at the HTML:: modules, i really do think Text::Balanced does exactly what i want/need it to do...

    I realize that your code is only a snippet, but it does look like it is possible to concoct valid HTML hyperlinks that don't get caught by it:

    • upper-case letters: <a HREF="...">
    • single quotes: <a href='...'>
    • other attributes: <a class="main" href="...">

    Whether these matter depends on your application and your users. But if they do, you are probably better off using a module explicitly for parsing HTML rather than trying to think of all the possible valid variations.
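The three variations listed above can be checked with a short sketch. It feeds one hypothetical link per variation to extract_tagged with the same opening tag as the snippet, and reports which ones are caught; the sample links and labels are assumptions made up for this illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# One hypothetical link per variation, plus the form the snippet expects.
my @samples = (
    [ 'lower-case, double quotes' => '<a href="http://a.example/">a</a>' ],
    [ 'upper-case attribute'      => '<a HREF="http://a.example/">a</a>' ],
    [ 'single quotes'             => q{<a href='http://a.example/'>a</a>} ],
    [ 'extra attribute first'     => '<a class="main" href="http://a.example/">a</a>' ],
);

my @caught;
for my $sample (@samples) {
    my ($label, $html) = @$sample;

    # In list context the first field is undef when nothing is extracted.
    my ($match) = extract_tagged($html, '<a href="http://', '</a>');

    push @caught, $label if defined $match;
    printf "%-26s %s\n", $label, defined $match ? 'caught' : 'missed';
}
```

The opening tag is used as a case-sensitive pattern that must match right at the start of the link, which is why only the first form gets through.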

    Smylers

      Actually, the html pages that it works on are generated by another script, so they're pretty consistent.

      i like the loop simplification, though... i often forget about foreach, and the C-ishness is a fiendish habit to break!

      Thanks for the tip!

      :-)