in reply to (jeffa) Re: Problems splitting HTML in to hash table
in thread Problems splitting HTML in to hash table
I'm writing a program to pull headlines from non-RSS newspapers so I'm looking *only* for the headlines. What I have found is that there is some kind of designation, be it graphic or comment, in the HTML code that I can look for and then start my headline link search after that.
As this is test code, I saved a copy of the HTML as a text file and was reading it into an array, then parsing from there.
As for the parser, I'll give it a shot, but from what your output looks like I'd still have to search for all the href links as it's pulling all the <tag> stuff out. That's not what I'm after. I just want the href links and the text between them. That's why I was using:
m/(<a href[^>]*>)(.+</a>)/io;
thereby giving me my link, the text between and the closing tag.
Then I am throwing $1 and $2 into a hash table to eliminate duplicate headlines.
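For anyone following along, here is a rough sketch of that approach, not the exact code from my script. The sample lines, the marker comment and the URLs are all made up for illustration. It also differs from my pattern in two ways: it uses m{...} so the slash in </a> needs no escaping, and it captures only the link text in $2 rather than the text plus the closing tag.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the saved page; the marker comment
# and the URLs are invented for this sketch.
my @lines = (
    '<a href="/about.html">About us</a>',
    '<!-- headlines start -->',
    '<a href="/story1.html">Council approves budget</a>',
    '<a href="/story2.html">Storm closes roads</a>',
    '<a href="/story1.html">Council approves budget</a>',
);

my %headlines;          # opening link tag => link text
my $in_headlines = 0;   # becomes true once the marker has been seen

for my $line (@lines) {
    # Skip every line up to and including the marker.
    if (!$in_headlines) {
        $in_headlines = 1 if $line =~ /<!-- headlines start -->/;
        next;
    }
    # .+? is non-greedy, so it stops at the first </a> on the line.
    if ($line =~ m{(<a href[^>]*>)(.+?)</a>}i) {
        $headlines{$1} = $2;   # a repeated link just overwrites its own key
    }
}

print "$_ => $headlines{$_}\n" for sort keys %headlines;
```

Because the opening tag is the hash key, the repeated story1 link collapses into a single entry, and the "About us" link before the marker never gets in.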
I will have a look at the parser though. It would be nice to make this easier. : )
What I'm really stumped about though is why the code I posted was concatenating the values on the matches. Unless my PC was seriously overheated and something was going wrong, I can't see why those wouldn't be unique matches every time as you're sending it different data to check.
Any ideas on that?
Update: After much thought I have figured out where my thinking went wrong with my original question.
When I was asking why m!(<a^>*])(.+?)!iog was not filling $3, $4, etc. with the global modifier, but merely $1 and $2, it finally occurred to me that all I'm *asking* it to capture is $1 and $2.
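A small sketch of what I mean, with a made-up one-line sample and a pattern that is only my best approximation of the one above. With two sets of parentheses there are only ever $1 and $2. The /g modifier does not create $3 and $4; it either lets a while loop step through the matches one at a time, or, in list context, returns every capture from every match as one flat list.

```perl
use strict;
use warnings;

# Hypothetical sample containing two links on one line.
my $html = '<a href="/a.html">First</a> <a href="/b.html">Second</a>';

# Scalar context: each pass resumes where the previous match ended,
# and $1 and $2 are set afresh for every match.
my @pairs;
while ($html =~ m{(<a[^>]*>)(.+?)</a>}ig) {
    push @pairs, [$1, $2];
}

# List context: the captures from all matches come back as one flat list.
my @flat = $html =~ m{(<a[^>]*>)(.+?)</a>}ig;

print scalar(@pairs), " matches, ", scalar(@flat), " captured strings\n";
```

So the second link's tag and text show up as the third and fourth elements of @flat, never as $3 and $4.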
Some people fall from grace. I prefer a running start...
(jeffa) 3Re: Problems splitting HTML in to hash table
by jeffa (Bishop) on Jun 11, 2002 at 19:20 UTC
by Popcorn Dave (Abbot) on Jun 12, 2002 at 02:37 UTC