If I understand your question correctly, you may be able to look for HTML comment tags and grab whatever sits between them, as in the sketch below.
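Something like this, a minimal sketch; the marker names ("content-start"/"content-end") are made up for illustration, so substitute whatever comments the target site actually wraps around its content:

#!/usr/bin/perl
use strict;
use warnings;

# Pull whatever sits between a pair of HTML comment markers.
my $html = do { local $/; <DATA> };

if ($html =~ /<!--\s*content-start\s*-->(.*?)<!--\s*content-end\s*-->/s) {
    print "$1\n";
}

__DATA__
<html><body>
<!-- content-start -->
<p>The article text lives here.</p>
<!-- content-end -->
</body></html>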
A few years ago, pre-RSS, I wrote a program to parse headlines from newspapers, and I initially got it down to 9 rules for 25 papers. I was still learning Perl at the time, but I had to go through every web page and find the similarities, much like you're describing. What I ended up doing was building a config file with a start and an end marker for each page I was scraping, and parsing my info from there.

I've since gone back and reworked the program using HTML::TokeParser, which made the job a whole lot easier. I did have an advantage, though: the papers I was following belonged to the same news organizations, just in different towns, so a lot of the layouts were identical. I still use the config file, though; a sketch of that setup follows below.
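Here's roughly what that looks like. The config file name, its pipe-separated format, and the marker strings are all illustrative, not my actual layout, but the idea is the same: slice out the region between each site's markers, then let HTML::TokeParser pull the headline links from that region alone.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# One site per line in papers.cfg: name, start marker, end marker, e.g.
#   gazette|<!-- begin headlines -->|<!-- end headlines -->
my %markers;
open my $cfg, '<', 'papers.cfg' or die "papers.cfg: $!";
while (my $line = <$cfg>) {
    chomp $line;
    my ($site, $start, $end) = split /\|/, $line, 3;
    $markers{$site} = [ $start, $end ];
}
close $cfg;

# Grab the text of every <a> tag between a site's markers.
sub headlines {
    my ($site, $html) = @_;
    my ($start, $end) = @{ $markers{$site} };
    return unless $html =~ /\Q$start\E(.*?)\Q$end\E/s;

    my $chunk = $1;
    my $p = HTML::TokeParser->new(\$chunk) or die "parse failed";
    my @found;
    while (my $tag = $p->get_tag('a')) {
        push @found, $p->get_trimmed_text('/a');
    }
    return @found;
}

The nice part is that when a paper redesigns its page, you usually only have to update two strings in the config file rather than touch the code.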
Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
In reply to Re: How would you extract *content* from websites?
by Popcorn Dave