in reply to Re: Scrape a blog: a statistical approach
in thread Scrape a blog: a statistical approach
I did a search and, if I understood correctly, what I am trying to do is remove the boilerplate from some HTML pages. Is that the right term?
So, I have lots of data (web pages from 2005 to 2014) from the same blog, and I only want the text of each post. That also means I have lots of examples of what I don't want. In all the HTML files I have, the post content (title, date, text, etc.) should be, more or less, the only thing that changes. So I am trying to figure out how to statistically identify the lines of code that don't change.
So far I have processed all the pages and ended up with one big HTML file. It contains lots of identical lines. I want to count how frequently each line occurs, so that I can identify the frequent ones as boilerplate.
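A minimal sketch of that counting step, written in Python and assuming that boilerplate lines repeat character for character once leading and trailing whitespace is stripped. The function name `count_lines` and the sample lines are made up for illustration:

```python
from collections import Counter

def count_lines(lines):
    """Count how often each non-empty, whitespace-stripped line occurs."""
    counts = Counter()
    for line in lines:
        key = line.strip()
        if key:  # skip blank lines
            counts[key] += 1
    return counts

# Hypothetical sample standing in for the one big HTML file.
sample = [
    '<div id="header">My Blog</div>',
    '<h1>First post</h1>',
    '<div id="header">My Blog</div>',
    '<h1>Second post</h1>',
    '<div id="header">My Blog</div>',
]

counts = count_lines(sample)
# Show the most frequent lines first; these are the boilerplate candidates.
for line, n in counts.most_common(2):
    print(n, line)
```

Since a file handle yields lines when iterated, the same function should also work if it is passed the opened big file instead of a list.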
With the (bad) code I posted I have already been able to strip a lot of code from the original HTML pages. Now I want to clean up the rest. Since I have lots of pages, I thought it would be a good idea to weight the boilerplate lines somehow (maybe with mutual information?).
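One simple way to weight lines could be sketched as follows. This is not mutual information but a plainer stand-in: the fraction of pages in which a line appears (its document frequency). It is in Python, and the names `document_frequency` and `strip_boilerplate`, the threshold of 0.5, and the sample pages are all assumptions for illustration:

```python
from collections import Counter

def document_frequency(pages):
    """Return, for each distinct line, the fraction of pages containing it."""
    df = Counter()
    for page in pages:
        # Use a set so a line repeated within one page is counted once.
        df.update({line.strip() for line in page.splitlines() if line.strip()})
    return {line: n / len(pages) for line, n in df.items()}

def strip_boilerplate(page, weights, threshold=0.5):
    """Keep only lines that occur in at most `threshold` of the pages."""
    kept = []
    for line in page.splitlines():
        key = line.strip()
        if key and weights.get(key, 0.0) <= threshold:
            kept.append(key)
    return kept

# Hypothetical pages: the <nav> line is shared, the rest is post content.
pages = [
    "<nav>Home | Archive</nav>\n<h1>Post one</h1>\n<p>Hello</p>",
    "<nav>Home | Archive</nav>\n<h1>Post two</h1>\n<p>World</p>",
    "<nav>Home | Archive</nav>\n<h1>Post three</h1>\n<p>Again</p>",
]

weights = document_frequency(pages)
print(strip_boilerplate(pages[0], weights))
```

The threshold would need tuning on the real pages: too low and repeated phrases inside posts get dropped, too high and template lines that changed between 2005 and 2014 survive.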
I'll post some examples of the code and the results I've got so far as soon as I can.
Re^3: Scrape a blog: a statistical approach
by soonix (Chancellor) on Apr 13, 2014 at 22:18 UTC