Re^2: Scrape a blog: a statistical approach

I did a search and, if I understood correctly, what I am trying to do is remove the boilerplate from some HTML pages. Is that the right term?

So, I have lots of data (web pages from 2005 to 2014) from the same blog, and I only want the text of each post. That also means I have lots of examples of what I don't want: across all the HTML files, the post content (title, date, etc.) should be more or less the only thing that changes. So I am trying to figure out how to statistically identify the lines of markup that don't change.

So far I have processed all the pages and combined them into one big HTML file, which contains lots of identical lines. I want to count how frequently each line occurs, so that I can identify the frequent ones as boilerplate.
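To make the counting idea concrete, here is a minimal sketch in Python (not the code I posted earlier). It assumes the pages are kept as separate strings rather than one merged file, so that a line is counted once per page. The function names, the toy pages, and the 0.5 threshold are all made up for illustration.

```python
from collections import Counter

def line_document_frequency(pages):
    """For each stripped, non-empty line, count how many pages contain it."""
    counts = Counter()
    for page in pages:
        # A set makes each line count at most once per page.
        unique_lines = {line.strip() for line in page.splitlines() if line.strip()}
        counts.update(unique_lines)
    return counts

def strip_boilerplate(page, counts, n_pages, threshold=0.5):
    """Keep only lines that appear in at most `threshold` of all pages."""
    kept = []
    for line in page.splitlines():
        text = line.strip()
        if text and counts[text] / n_pages <= threshold:
            kept.append(text)
    return kept

# Hypothetical toy pages standing in for the real blog files.
pages = [
    "<div id='nav'>Home</div>\n<h1>First post</h1>\n<p>Hello</p>\n<div id='footer'>bye</div>",
    "<div id='nav'>Home</div>\n<h1>Second post</h1>\n<p>World</p>\n<div id='footer'>bye</div>",
    "<div id='nav'>Home</div>\n<h1>Third post</h1>\n<p>Again</p>\n<div id='footer'>bye</div>",
]

counts = line_document_frequency(pages)
cleaned = [strip_boilerplate(page, counts, len(pages)) for page in pages]
```

The threshold is a guess and would need tuning: a short line that legitimately appears in many posts (an empty paragraph tag, say) would be dropped along with the real boilerplate.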

With the (bad) code I posted, I have already been able to strip a lot of markup from the original HTML pages. Now I want to clean up the rest. Since I have lots of pages, I thought it would be a good idea to weight the candidate boilerplate lines somehow (maybe with mutual information?).
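One simple way to get such a weight is sketched below in Python. Note that this is not mutual information: it swaps in an inverse-document-frequency style score, log(N / df), where N is the number of pages and df is the number of pages containing the line. The function name and the toy pages are assumptions for illustration only.

```python
import math
from collections import Counter

def line_weights(pages):
    """Weight each line by log(N / df); lines present on every page get 0.0."""
    n_pages = len(pages)
    df = Counter()
    for page in pages:
        # Count each distinct line once per page.
        df.update({line.strip() for line in page.splitlines() if line.strip()})
    return {line: math.log(n_pages / count) for line, count in df.items()}

# Hypothetical toy pages standing in for the real blog files.
pages = [
    "<div id='nav'>Home</div>\n<h1>First post</h1>",
    "<div id='nav'>Home</div>\n<h1>Second post</h1>",
    "<div id='nav'>Home</div>\n<h1>Third post</h1>",
]

weights = line_weights(pages)
```

A low weight suggests boilerplate and a high weight suggests post content, so the lines can be ranked by weight instead of being cut at a hard frequency threshold.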

I'll post some examples of the code and the results I've got so far as soon as I can.

Re^3: Scrape a blog: a statistical approach by soonix (Chancellor) on Apr 13, 2014 at 22:18 UTC
You could take a diff between consecutive pages instead of counting lines. You'd have to experiment with different modules, e.g. HTML::Diff or Text::Diff, but this approach could also help with style/layout changes.
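A rough illustration of the diff idea, written in Python with the standard `difflib` module as a stand-in for Text::Diff (it does not use either Perl module). It assumes two consecutive pages share their template, so whatever differs in the second page is probably post content. The function name and the toy pages are hypothetical.

```python
import difflib

def lines_unique_to_second(page_a, page_b):
    """Return the lines of page_b that a line-based diff marks as changed or added."""
    a = [line.strip() for line in page_a.splitlines() if line.strip()]
    b = [line.strip() for line in page_b.splitlines() if line.strip()]
    # autojunk=False turns off difflib's heuristic that ignores very frequent lines.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(b[j1:j2])
    return changed

# Hypothetical toy pages standing in for two consecutive blog pages.
page_a = "<div id='nav'>Home</div>\n<h1>First post</h1>\n<p>Hello</p>\n<div id='footer'>bye</div>"
page_b = "<div id='nav'>Home</div>\n<h1>Second post</h1>\n<p>World</p>\n<div id='footer'>bye</div>"

content = lines_unique_to_second(page_a, page_b)
```

Because each page is compared only with its neighbour, a template change partway through the archive should affect just the one pair of pages that straddles the change.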