in reply to How would you extract *content* from websites?

If I understand your question correctly, you may be able to look for comment tags and grab what's between those.
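For example, something like this sketch (the "STORY START"/"STORY END" markers are made up - you'd substitute whatever comments the site actually uses):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp the page and grab whatever sits between two known comments.
    # The marker text here is invented for illustration.
    my $html = do { local $/; <> };

    if ( $html =~ /<!--\s*STORY START\s*-->(.*?)<!--\s*STORY END\s*-->/s ) {
        print "$1\n";
    }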

A few years ago I wrote a program to parse headlines from newspapers - this was pre-RSS - and I initially got it down to 9 rules for 25 papers. I was still learning Perl at that point, so I had to go through every web page and find the similarities, much like you're talking about. What I ended up doing was building a config file with a start and end marker for every web page I was looking at, and I parsed my info from there.
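The config file itself doesn't need to be anything fancy. Something along these lines would work (the file name, format, and markers are all invented for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # papers.cfg (hypothetical): one tab-separated line per site,
    #   site_name <TAB> start_marker <TAB> end_marker
    open my $cfg, '<', 'papers.cfg' or die "papers.cfg: $!";

    while ( my $line = <$cfg> ) {
        chomp $line;
        my ( $site, $start, $end ) = split /\t/, $line;

        # assumes each front page has already been saved as "$site.html"
        open my $fh, '<', "$site.html" or next;
        my $html = do { local $/; <$fh> };

        # pull everything between that site's start and end markers
        if ( $html =~ /\Q$start\E(.*?)\Q$end\E/s ) {
            print "[$site]\n$1\n";
        }
    }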

I've since gone back and reworked the program using HTML::TokeParser, which made the job a whole lot easier. I did have an advantage, though: the papers I was looking at were from the same news organizations, just in different towns, so a lot of the layouts were the same. I still use the config file.
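If you haven't used it, HTML::TokeParser makes this kind of scraping pretty painless. A minimal sketch (the file name is a placeholder; adjust which tag you walk to match the site):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    # Walk the anchor tags on a saved page and print link text and URL -
    # roughly the headline-grabbing job described above.
    my $p = HTML::TokeParser->new('front_page.html')
        or die "Can't open front_page.html: $!";

    while ( my $tag = $p->get_tag('a') ) {
        my $href = $tag->[1]{href} or next;
        my $text = $p->get_trimmed_text('/a');
        print "$text => $href\n" if length $text;
    }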

Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.

Re^2: How would you extract *content* from websites?
by BUU (Prior) on Jun 18, 2005 at 03:10 UTC
    That sounds reasonable, but how do you programmatically determine the starting and ending comments?
      Boy, I wish I knew the answer to that. Like I said, I looked at the page layouts of the web sites I was after and built my config file with the comments to look for - starting and ending.

      Along the lines of what you're after, I suppose you could just parse for comments and build a list of comment tags to look for. You had mentioned doing a diff on the files you wanted to look at, so that may be the way to start.
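      Something like this would dump every comment in a page so you can eyeball which ones bracket the real content (the file name is just a placeholder):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use HTML::TokeParser;

          # List every comment token in the page; compare the output from
          # a couple of pages on the same site to spot the likely markers.
          my $p = HTML::TokeParser->new('page.html')
              or die "Can't open page.html: $!";

          while ( my $token = $p->get_token ) {
              print "$token->[1]\n" if $token->[0] eq 'C';
          }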
