(jeffa) Re: Problems splitting HTML in to hash table

There are many problems here - first, why are you looping like that:

for (@list) {
   $list[$count]; #yadda yadda
   $count++;
}

Either use the elements of the list or access the indexes like so:

for my $count (0..$#list) {
   $list[$count]; #yadda yadda
}
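To make the two options concrete, here is a minimal sketch. The sample `@list` and the helper name `first_index_matching` are assumptions made up for illustration, not something from the original post. It also shows one way to find the index of the first line matching a marker without keeping a separate counter.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from a saved HTML file.
my @list = ('<html>', '<!-- headlines -->', '<a href="a.html">A</a>');

# Option 1: use the elements directly.
for my $line (@list) {
    print "element: $line\n";
}

# Option 2: walk the indexes when the position itself matters.
for my $i (0 .. $#list) {
    print "index $i: $list[$i]\n";
}

# Return the index of the first element matching $re, or -1 if none does.
sub first_index_matching {
    my ($re, @items) = @_;
    for my $i (0 .. $#items) {
        return $i if $items[$i] =~ $re;
    }
    return -1;
}

my $start = first_index_matching(qr/<!-- headlines -->/, @list);
print "marker found at index $start\n";
```

Either loop avoids the mismatch in the original, where `for (@list)` aliases each element to `$_` while the body ignores `$_` and maintains its own `$count`.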

Second - use a Parser! If HTML::LinkExtor won't do the job then try HTML::TokeParser or HTML::Parser. You did not specify what you are trying to accomplish with this code, so i can't really help you much more. Even though you have managed to get this technique to work on other pages, i still question its robustness. Trust me, use a parser - it might even be as simple as:

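use strict;
use warnings;
use Data::Dumper;
use HTML::TokeParser;

# slurp the sample HTML from the __DATA__ section
my $data   = do {local $/;<DATA>};
my $parser = HTML::TokeParser->new(\$data);

# count how many times each href appears
my %hash;
while (my $tag = $parser->get_tag('a')) {
   next unless defined $tag->[1]->{href};   # skip anchors without an href
   $hash{$tag->[1]->{href}}++;
}

print Dumper \%hash;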


__DATA__
<tr align="left" valign="top"> <td align="left" valign="top"> <table CELLPADDING="0" CELLSPACING="0"><tr><td> <a href="page.cfm?objectid=11933900&method=full&siteid=50144" CLASS="smallteaserpic">Costly false alarms</a><BR> <font CLASS="headtypea">
A new policy aimed at tackling the huge waste of police time attending false security alarm calls is to be introduced this week <a href="page.cfm?objectid=11933900&method=full&siteid=50144">more</a>
</font>
</td></tr></table> <p> <table CELLPADDING="0" CELLSPACING="0"><tr><td> <a href="page.cfm?objectid=11933890&method=full&siteid=50144" CLASS="smallteaserpic">Mindless yobs terrorise OAP's</a><BR>

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: (jeffa) Re: Problems splitting HTML in to hash table by Popcorn Dave (Abbot) on Jun 11, 2002 at 17:56 UTC
Firstly, the reason I'm looping through it is that this is test code to work out a rule for a certain page layout. I'm writing a program to pull headlines from non-RSS newspapers, so I'm looking only for the headlines. What I have found is that there is some kind of designation, be it graphic or comment, in the HTML code that I can look for and then start my headline link search after that. As this is test code, I saved a copy of the HTML as a text file and was reading it into an array, then parsing from there.

As for the parser, I'll give it a shot, but from what your output looks like I'd still have to search for all the href links as it's pulling all the <tag> stuff out. That's not what I'm after. I just want the href links and the text between them. That's why I was using: `m/(<a href[^>]>)(.+</a>)/io;` thereby giving me my link, the text between and the closing tag. Then I am throwing $1 and $2 into a hash table to eliminate duplicate headlines.

I will have a look at the parser though. It would be nice to make this easier. : )

What I'm really stumped about though is why the code I posted was concatenating the values on the matches. Unless my PC was seriously overheated and something was going wrong, I can't see why those wouldn't be unique matches every time, as you're sending it different data to check. Any ideas on that?

Update: After much thought I have figured out where my thinking went wrong with my original question. When I was asking why m!(<a^>])(.+?)!iog was not matching $3, $4, etc. with the global, but merely $1 and $2, it finally occurred to me that all I'm asking it to match is $1 and $2.

Some people fall from grace. I prefer a running start...
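The point in the update can be seen in a small sketch. The sample string and the helper name `link_pairs` are assumptions for illustration, and the pattern is a simplified stand-in rather than the one from the original code. A pattern with two capture groups only ever fills `$1` and `$2`; with `/g` they are refilled on every match instead of spilling into `$3`, `$4`, and so on.

```perl
use strict;
use warnings;

# Hypothetical one-line sample; real pages are messier, which is why a parser is safer.
my $html = '<a href="one.html">First</a> <a href="two.html">Second</a>';

# Collect [href, text] pairs from every link in $text.
sub link_pairs {
    my ($text) = @_;
    my @pairs;
    # Scalar-context //g: each pass resumes after the previous match
    # and overwrites $1 and $2 with the new captures.
    while ($text =~ m{<a\s+href="([^"]*)"[^>]*>(.+?)</a>}gi) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

print "$_->[0] => $_->[1]\n" for link_pairs($html);

# List-context //g: all captures from all matches, flattened into one list.
my @flat = $html =~ m{<a\s+href="([^"]*)"[^>]*>(.+?)</a>}gi;
print scalar(@flat), " captures in list context\n";
```

In list context the same match returns every capture from every match as one flat list, which is the closest thing to the "$3, $4" behaviour the question was expecting.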
(jeffa) 3Re: Problems splitting HTML in to hash table by jeffa (Bishop) on Jun 11, 2002 at 19:20 UTC
Sorry, but i didn't ask why you are looping, i asked why are you looping like that? But the point is mu. Read on. ;)

"looks like I'd still have to search for all the href links as it's pulling all the <tag> stuff out..."

That's much more trivial to do than you make it sound. Now, i don't know what a 'headline' is, so i am going to assume it is the text between the anchor tags. All you need to do is this:

# create the parser, etc.

my %hash;
while (my $tag = $parser->get_tag('a')) {
   $hash{$parser->get_text} = $tag->[1]->{href};
}

for (keys %hash) {
   print qq|<a href="$hash{$_}">$_</a>\n|;
}

Every time you add a key to a hash, non-unique keys will overwrite the ones that already exist - i see no good reason to encapsulate this in a subroutine call. If you want unique URLs instead, simply switch `$parser->get_text` with `$tag->[1]->{href}` (and the keys with the values in the for loop). If you want to parse the href links even further, then i suggest the URI module:

use URI;

# etc.

my @list;
while (my $tag = $parser->get_tag('a')) {
   my $uri = URI->new($tag->[1]->{href});
   push @list, {
      path  => $uri->path(),
      query => { $uri->query_form() },
      text  => $parser->get_text(),
   };
}

print Dumper \@list;

There are soooo many cool modules out there to make your life easier. I personally have more fun writing 'glue code' than 'doing it all by hand'. Doing the latter is a good way to learn, but after that, i say it is better and faster to use the help of the CPAN (and all the wonderful folks who contribute).

"What I'm really stumped about though is why the code I posted was concatenating the values on the matches ...Any ideas on that?"

Nope, sorry. When i see someone doing it the wrong way, instead of trying to understand their logic i try to show them a more right way. It would take far too much energy to do the former, and a liberal amount of PSI::ESP. I know this came off as grumpy - but i really do wish you the best in your endeavor. Good luck!
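As a small illustration of the overwrite behaviour described above, here is a sketch with made-up data. The `@links` pairs and the helper names `last_wins` and `first_wins` are hypothetical. Plain assignment keeps the last value stored under a repeated key; a guard with `exists` keeps the first one instead.

```perl
use strict;
use warnings;

# Hypothetical (headline, url) pairs; the first headline appears twice.
my @links = (
    [ 'Costly false alarms', 'page.cfm?objectid=1' ],
    [ 'more',                'page.cfm?objectid=1' ],
    [ 'Costly false alarms', 'page.cfm?objectid=2' ],
);

# Plain assignment: a repeated key silently replaces the earlier value.
sub last_wins {
    my %seen;
    $seen{ $_->[0] } = $_->[1] for @_;
    return \%seen;
}

# Guarded assignment: keep the value stored the first time a key is seen.
sub first_wins {
    my %seen;
    for my $pair (@_) {
        my ($text, $url) = @$pair;
        $seen{$text} = $url unless exists $seen{$text};
    }
    return \%seen;
}

printf "last wins:  %s\n", last_wins(@links)->{'Costly false alarms'};
printf "first wins: %s\n", first_wins(@links)->{'Costly false alarms'};
```

Which policy is right depends on whether the first or the most recent link for a headline is the one worth keeping.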
Re: (jeffa) 3Re: Problems splitting HTML in to hash table by Popcorn Dave (Abbot) on Jun 12, 2002 at 02:37 UTC
Thanks for all that!

Firstly, the reason I am looping like that is I'm reading a file into an array, indexing the count until I find my target text; then I know the index from which I need to count to find what I'm after. There may be a more efficient way to do it, but for now I want it to work. : )

As for my problem, I've at least found it. For some reason the author of this particular page had put all their news headlines, links and text, on one long line. Now that I know that, I think I can take it from there.

And you didn't sound grumpy at all. For now, I think I'm going to steer clear of the modules to practice my regexes, as I'm still a bit rusty on some of the finer points. However, once this thing is running, I will definitely look at the module aspect to see if I can shorten the code. At present I've got 79 newspaper websites that I want to look at, but I've managed to pare it down to 19 rules, so that isn't too bad, I don't think.

Oh, btw, is the ESP::PSI module in the ACME section of CPAN? I think I could really use that for some serious debugging... ; )
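One plausible reason a single long line would produce concatenated matches is greedy matching: `.+` runs to the last `</a>` on the line rather than the first. The sketch below is a guess at the mechanism, not a diagnosis of the original code; the sample line and the simplified patterns are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical page fragment with two headlines on a single line.
my $line = '<a href="1.html">Alpha</a><a href="2.html">Beta</a>';

# Greedy: (.+) extends to the LAST </a> on the line, swallowing both headlines.
my ($greedy) = $line =~ m{<a href="[^"]*">(.+)</a>}i;

# Non-greedy with //g: (.+?) stops at the FIRST </a>, once per link.
my @lazy = $line =~ m{<a href="[^"]*">(.+?)</a>}gi;

print "greedy: $greedy\n";
print "lazy:   @lazy\n";
```

When each link sits on its own line the greedy version happens to work, which would explain why the same rule behaved on other pages.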