I am trying to collect all the article links from different websites on a daily basis, and there will be thousands of such websites. First, I need to decide whether a particular link is an article link or not. I would like to do this with a training set of known article links: some form of regular expression could be built from this set and then used to extract future article links. Any ideas or useful modules would be appreciated.
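
To make the idea concrete, here is a minimal sketch of what "building regular expressions from a training set" might look like, using only the standard library (`re` and `urllib.parse`). The function names (`generalize_segment`, `url_to_pattern`, `learn_patterns`, `is_article`), the `example.com` URLs, and the generalization rules (digit runs and hyphenated slugs) are all assumptions made for illustration, not an established method.

```python
import re
from urllib.parse import urlparse

# Hypothetical rule: a hyphenated slug such as "some-article-title" or "my-post.html".
SLUG_SHAPE = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+(?:\.html?)?"


def generalize_segment(segment):
    """Turn one URL path segment into a regex fragment describing its shape."""
    if segment.isdigit():
        return r"\d+"              # years, months, numeric ids
    if re.fullmatch(SLUG_SHAPE, segment):
        return SLUG_SHAPE          # any hyphenated slug
    return re.escape(segment)      # keep fixed words like "news" literally


def url_to_pattern(url):
    """Build a regex string (host + generalized path) from one known article URL."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    path = "".join("/" + generalize_segment(s) for s in segments)
    return re.escape(parsed.netloc) + path + "/?"


def learn_patterns(article_urls):
    """Collect the distinct patterns seen in the training set and compile them."""
    return [re.compile(p) for p in sorted({url_to_pattern(u) for u in article_urls})]


def is_article(url, patterns):
    """Return True if the URL's host and path fully match any learned pattern."""
    parsed = urlparse(url)
    target = parsed.netloc + parsed.path
    return any(p.fullmatch(target) for p in patterns)


if __name__ == "__main__":
    # Made-up training URLs, purely for demonstration.
    training = [
        "https://example.com/news/2023/05/some-article-title",
        "https://example.com/news/2022/11/another-story-here",
    ]
    patterns = learn_patterns(training)
    for link in ["https://example.com/news/2024/01/fresh-new-post",
                 "https://example.com/about"]:
        print(link, is_article(link, patterns))
```

In this sketch the patterns are learned per host, so each of the thousands of sites would need at least a few labeled article links. If the hand-written generalization rules turn out to be too rigid, the same URL segments could instead be fed as features to a statistical classifier.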