Popcorn Dave has asked for the wisdom of the Perl Monks concerning the following question:

(Hopefully that title makes sense)

Fellow monks,

I've been beating my head over this one for the last few hours and am still completely stumped!

I'm trying to parse some HTML:

<tr align="left" valign="top">
<td align="left" valign="top">
<table CELLPADDING="0" CELLSPACING="0"><tr><td>
<a href="page.cfm?objectid=11933900&method=full&siteid=50144" CLASS="smallteaserpic">Costly false alarms</a><BR>
<font CLASS="headtypea">
A new policy aimed at tackling the huge waste of police time attending false security alarm calls is to be introduced this week
<a href="page.cfm?objectid=11933900&method=full&siteid=50144">more</a>
</font>
</td></tr></table>
<p>
<table CELLPADDING="0" CELLSPACING="0"><tr><td>
<a href="page.cfm?objectid=11933890&method=full&siteid=50144" CLASS="smallteaserpic">Mindless yobs terrorise OAP's</a><BR>

using the following code:

for (@list) {
    if ( $list[$count] =~ m!page.cfm!iog ) {
        $list[$count] =~ s/<img[^>]*>//iog;
        $list[$count] =~ s/\r\n//iog;
        $list[$count] =~ s!</?t(r|d|able)?[^>]*>!!iog;
        $list[$count] =~ m!(<a.+href.+>)(.+</a>)!iog;
        print "Count is: $count\n";
        &add_it( $1, $2 );
    }
    $count++;
}

The add subroutine checks if the url is already present in a hash table, and if not, it adds it.
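A minimal sketch of what such an add subroutine might look like — the original `add_it` wasn't posted, so the hash name and logic here are assumptions:

```perl
use strict;
use warnings;

# Hypothetical version of the add_it subroutine described above:
# store the headline text keyed by URL, skipping URLs already seen.
my %seen;

sub add_it {
    my ( $url, $text ) = @_;
    return if exists $seen{$url};    # already stored, ignore duplicate
    $seen{$url} = $text;
}

add_it( 'page.cfm?objectid=11933900', 'Costly false alarms' );
add_it( 'page.cfm?objectid=11933900', 'duplicate, ignored' );
```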

My problem is that I'm getting one long hash string here. It's not making multiple hash entries.

I know the technique *should* work as I've got it working on other HTML pages, but I'm absolutely stuck as to why this is not separating the data into separate hash keys/elements.

I have even tried to get rid of the linefeeds to see if that was perhaps causing the mischief, but to no avail.

Anyone have any ideas?

Thanks in advance!

Some people fall from grace. I prefer a running start...

Replies are listed 'Best First'.
(jeffa) Re: Problems splitting HTML in to hash table
by jeffa (Bishop) on Jun 11, 2002 at 06:53 UTC
    There are many problems here - first, why are you looping like that:
for (@list) {
    $list[$count];    # yadda yadda
    $count++;
}
    Either use the elements of the list or access the indexes like so:
for my $count (0 .. $#list) {
    $list[$count];    # yadda yadda
}
    Second - use a Parser! If HTML::LinkExtor won't do the job then try HTML::TokeParser or HTML::Parser. You did not specify what you are trying to accomplish with this code, so i can't really help you much more. Even though you have managed to get this technique to work on other pages, i still question its robustness. Trust me, use a parser - it might even be as simple as:
use strict;
use Data::Dumper;
use HTML::TokeParser;

my $data   = do { local $/; <DATA> };
my $parser = HTML::TokeParser->new(\$data);

my %hash;
while (my $tag = $parser->get_tag('a')) {
    $hash{ $tag->[1]->{href} }++;
}
print Dumper \%hash;

__DATA__
<tr align="left" valign="top"> <td align="left" valign="top"> <table CELLPADDING="0" CELLSPACING="0"><tr><td> <a href="page.cfm?objectid=11933900&method=full&siteid=50144" CLASS="smallteaserpic">Costly false alarms</a><BR>
<font CLASS="headtypea">
A new policy aimed at tackling the huge waste of police time attending false security alarm calls is to be introduced this week <a href="page.cfm?objectid=11933900&method=full&siteid=50144">more</a>
</font>
</td></tr></table> <p> <table CELLPADDING="0" CELLSPACING="0"><tr><td> <a href="page.cfm?objectid=11933890&method=full&siteid=50144" CLASS="smallteaserpic">Mindless yobs terrorise OAP's</a><BR>

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Firstly, the reason I'm looping through it is that this is test code to work out a rule for a certain page layout.

      I'm writing a program to pull headlines from non-RSS newspapers so I'm looking *only* for the headlines. What I have found is that there is some kind of designation, be it graphic or comment, in the HTML code that I can look for and then start my headline link search after that.

      As this is test code, I saved a copy of the HTML as a text file and was reading it into an array, then parsing from there.

      As for the parser, I'll give it a shot, but from what your output looks like I'd still have to search for all the href links as it's pulling all the <tag> stuff out. That's not what I'm after. I just want the href links and the text between them. That's why I was using:

      m!(<a href[^>]*>)(.+</a>)!io;
      thereby giving me my link, the text between and the closing tag.

      Then I am throwing $1 and $2 into a hash table to eliminate duplicate headlines.

      I will have a look at the parser though. It would be nice to make this easier. : )

      What I'm really stumped about though is why the code I posted was concatenating the values on the matches. Unless my PC was seriously overheated and something was going wrong, I can't see why those wouldn't be unique matches every time as you're sending it different data to check.

      Any ideas on that?

      Update: After much thought I have figured out where my thinking went wrong with my original question.

      When I was asking why m!(<a[^>]*>)(.+?)!iog was not matching $3, $4, etc... with the global, but merely $1 and $2, it finally occurred to me that all I'm *asking* it to match is $1 and $2.
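That realisation can be shown in a couple of lines: a match only ever fills as many capture variables as it has parenthesized groups, but a /g match in *list* context returns the captures from every match on the string. A small sketch (the sample HTML here is made up):

```perl
use strict;
use warnings;

# Two links on one line; the regex has only two capture groups,
# so a single scalar-context match sets only $1 and $2.
my $html = '<a href="a.cfm">one</a> <a href="b.cfm">two</a>';

# In list context with /g, every (tag, text) pair is collected:
my @pairs = $html =~ m!(<a[^>]*>)(.+?)</a>!ig;
# @pairs is ('<a href="a.cfm">', 'one', '<a href="b.cfm">', 'two')
```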

      Some people fall from grace. I prefer a running start...

        Sorry, but i didn't ask why you are looping, i asked why are you looping like that? But the point is mu. Read on. ;)

        "looks like I'd still have to search for all the href links as it's pulling all the <tag> stuff out..."

        That's much more trivial to do than you make it sound. Now, i don't know what a 'headline' is, so i am going to assume it is the text between the anchor tags. All you need to do is this:

# create the parser, etc.
my %hash;
while (my $tag = $parser->get_tag('a')) {
    $hash{ $parser->get_text } = $tag->[1]->{href};
}
for (keys %hash) {
    print qq|<a href="$hash{$_}">$_</a>\n|;
}
        Every time you add a key to the hash, non-unique keys will overwrite the ones that already exist - i see no good reason to encapsulate this in a subroutine call.
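That overwrite behaviour is worth seeing on its own — a sketch with a made-up URL:

```perl
use strict;
use warnings;

# Assigning to the same hash key twice keeps only the last value;
# the hash itself does the de-duplication, no helper sub required.
my %hash;
$hash{'page.cfm?objectid=1'} = 'first headline';
$hash{'page.cfm?objectid=1'} = 'same link seen again';
# %hash still holds exactly one entry for that URL
```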

        If you want unique URLs instead, simply switch  $parser->get_text with $tag->[1]->{href} (and the keys with the values in the for loop). If you want to parse the href links even further, then i suggest the URI module:

use URI;
# etc.
my @list;
while (my $tag = $parser->get_tag('a')) {
    my $uri = URI->new( $tag->[1]->{href} );
    push @list, {
        path  => $uri->path(),
        query => { $uri->query_form() },
        text  => $parser->get_text(),
    };
}
print Dumper \@list;
        There are soooo many cool modules out there to make your life easier. I personally have more fun writing 'glue code' than 'doing it all by hand'. Doing the latter is a good way to learn, but after that, i say it is better and faster to use the help of the CPAN (and all the wonderful folks who contribute).

        "What I'm really stumped about though is why the code I posted was concatenating the values on the matches ...Any ideas on that?"

        Nope, sorry. When i see someone doing it the wrong way, instead of trying to understand their logic i try to show them a more right way. It would take far too much energy to do the former and a liberal amount of PSI::ESP.

        I know this came off as grumpy - but i really do wish you the best in your endeavor. Good luck!

        jeffa

        
Re: Problems splitting HTML in to hash table
by Zaxo (Archbishop) on Jun 11, 2002 at 06:55 UTC

    Right up there with use CGI; is use HTML::Parser;. Hand-rolled parsing of HTML or XML gets you a long term of reflection on your sins.
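    For what such hand-rolled-free parsing looks like with HTML::Parser directly (TokeParser, used above, is the pull-style wrapper around it), a minimal event-driven sketch — the sample link is made up:

```perl
use strict;
use warnings;
use HTML::Parser;

# Count each href seen on an <a> tag via a start-tag handler.
my %links;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ( $tag, $attr ) = @_;
            $links{ $attr->{href} }++ if $tag eq 'a' && $attr->{href};
        },
        'tagname, attr',
    ],
);
$parser->parse('<a href="page.cfm?objectid=1">headline</a>');
$parser->eof;
```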

    After Compline,
    Zaxo