in reply to Re: (jeffa) Re: Problems splitting HTML in to hash table
in thread Problems splitting HTML in to hash table

Sorry, but i didn't ask why you are looping, i asked why you are looping like that. But the point is mu. Read on. ;)

"looks like I'd still have to search for all the href links as it's pulling all the stuff out..."

That's much more trivial to do than you make it sound. Now, i don't know what a 'headline' is, so i am going to assume it is the text between the anchor tags. All you need to do is this:

    # create the parser, etc.
    my %hash;
    while (my $tag = $parser->get_tag('a')) {
        $hash{ $parser->get_text } = $tag->[1]->{href};
    }
    for (keys %hash) {
        print qq|<a href="$hash{$_}">$_</a>\n|;
    }
Every time you add a key to the hash, non-unique keys will overwrite the ones that already exist - i see no good reason to encapsulate this in a subroutine call.
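
To illustrate (a tiny stand-alone demonstration - the headline and URLs here are made up):

    my %hash;
    $hash{'Top Story'} = 'http://example.com/first';
    $hash{'Top Story'} = 'http://example.com/second';  # same key: the old value is replaced
    print $hash{'Top Story'}, "\n";                    # prints http://example.com/second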

If you want unique URLs instead, simply switch $parser->get_text with $tag->[1]->{href} (and the keys with the values in the for loop).
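
That variant might look like this (a sketch based on the code above, untested):

    # unique hrefs as keys, link text as values
    my %hash;
    while (my $tag = $parser->get_tag('a')) {
        $hash{ $tag->[1]->{href} } = $parser->get_text;
    }
    for (keys %hash) {
        print qq|<a href="$_">$hash{$_}</a>\n|;
    }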

If you want to parse the href links even further, then i suggest the URI module:

    use URI;
    use Data::Dumper;
    # etc.
    my @list;
    while (my $tag = $parser->get_tag('a')) {
        my $uri = URI->new($tag->[1]->{href});
        push @list, {
            path  => $uri->path(),
            query => { $uri->query_form() },
            text  => $parser->get_text(),
        };
    }
    print Dumper \@list;
There are soooo many cool modules out there to make your life easier. I personally have more fun writing 'glue code' than 'doing it all by hand'. Doing the latter is a good way to learn, but after that, i say it is better and faster to use the help of the CPAN (and all the wonderful folks who contribute).

"What I'm really stumped about though is why the code I posted was concatenating the values on the matches ...Any ideas on that?"

Nope, sorry. When i see someone doing it the wrong way, instead of trying to understand their logic i try to show them a more right way. It would take far too much energy to do the former, and a liberal amount of PSI::ESP.

I know this came off as grumpy - but i really do wish you the best in your endeavor. Good luck!

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re: (jeffa) 3Re: Problems splitting HTML in to hash table
by Popcorn Dave (Abbot) on Jun 12, 2002 at 02:37 UTC
    Thanks for all that!

    Firstly, the reason I am looping like that is that I'm reading a file into an array and incrementing an index until I find my target text; then I know the position from which I need to count to find what I'm after. There may be a more efficient way to do it, but for now I want it to work. : )
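
    A rough sketch of what I mean (the file name, marker text, and offset here are made up):

        # read the whole file into an array, then scan for a marker line;
        # 'news.html', the marker, and the offset of 3 are hypothetical
        open my $fh, '<', 'news.html' or die "can't open news.html: $!";
        my @lines = <$fh>;
        close $fh;

        my $i = 0;
        $i++ until $i > $#lines or $lines[$i] =~ /Top Stories/;

        # the headline is assumed to sit a fixed number of lines past the marker
        my $headline;
        $headline = $lines[ $i + 3 ] if $i + 3 <= $#lines;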

    As for my problem, I've at least found it. For some reason the author of this particular page had put all of their news headlines, links and text on one long line. Now that I know that, I *think* I can take it from there.
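
    One way I might handle that (a rough sketch, assuming the long line is in $line; the pattern is illustrative and won't cope with messier HTML):

        # pull every anchor off a single long line with a global match
        while ($line =~ m{<a\s+href="([^"]+)"[^>]*>(.*?)</a>}gi) {
            my ($href, $text) = ($1, $2);
            print "$text => $href\n";
        }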

    And you didn't sound grumpy at all. For now, I think I'm going to steer clear of the modules to practice my regexes, as I'm still a bit rusty on some of the finer points. However, once this thing is running, I will definitely look at the module aspect to see if I can shorten the code.

    At present I've got 79 newspaper websites that I want to look at, but I've managed to pare them down to 19 rules, so that isn't too bad, I don't think.

    Oh, btw, is that PSI::ESP module in the Acme:: section of CPAN? I think I could really use it for some *serious* debugging... ; )

    Some people fall from grace. I prefer a running start...