in reply to How would you extract *content* from websites?
A few years ago I wrote a program to parse headlines from newspapers - this was pre-RSS - and I initially got it down to 9 rules for 25 papers. I was still learning Perl at the time, and I had to go through every web page and find the similarities, much as you're describing. What I ended up doing was building a config file with a start and end marker for every webpage I was looking at, and parsing my info from between those markers - something like the sketch below.
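
Just for illustration, a rough sketch of that marker-based config approach in Perl. The file name, the tab-separated layout, and the helper name are my own assumptions here, not what the original program actually used:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical config format: one line per site, tab-separated:
#   site_name<TAB>start_marker<TAB>end_marker
my %markers;
open my $cfg, '<', 'sites.conf' or die "Can't open sites.conf: $!";
while (<$cfg>) {
    chomp;
    next if /^\s*(#|$)/;                 # skip comments and blank lines
    my ($site, $start, $end) = split /\t/;
    $markers{$site} = [ $start, $end ];
}
close $cfg;

# Grab everything between that site's start and end markers
sub extract_block {
    my ($site, $html) = @_;
    my ($start, $end) = @{ $markers{$site} };
    return $1 if $html =~ /\Q$start\E(.*?)\Q$end\E/s;
    return;
}
```

The nice part is that when a paper redesigns its page, you only touch the config file, not the code.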
I've since gone back and reworked the program using HTML::TokeParser, which made the job a whole lot easier. I did have an advantage in that the papers I was looking at came from the same news organizations, just in different towns, so a lot of the layouts were identical. I still use the config file, though.
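
A minimal HTML::TokeParser sketch of that reworked approach, assuming the headlines are the text of anchor tags inside the block pulled out above - the original program's exact tag rules aren't shown here:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Walk the extracted chunk of HTML and collect the text of each link,
# which for most of these papers is the headline itself.
sub headlines_from_html {
    my ($html) = @_;
    my @headlines;
    my $p = HTML::TokeParser->new(\$html) or die "Can't create parser";
    while (my $tag = $p->get_tag('a')) {
        my $text = $p->get_trimmed_text('/a');
        push @headlines, $text if length $text;
    }
    return @headlines;
}
```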