I have been working on this project on and off for a couple of months and wanted to share the idea and get some feedback. I do a lot of screen scraping: parsing web pages into Perl data structures. So what I have is a simple framework that provides a unified interface to drivers that scrape news articles from websites, much like what WWW::Search does for search engines and databases.
Take some news website, say BBC News or The Hindu. I pass the URL to Khabar, which finds the appropriate parser, if one is available, and hands it the URL to parse. I then get back a data structure that holds basically the same information as the web page, but in a form I can use in my application.
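To make the dispatch step concrete, here is a minimal sketch of how a URL might be mapped to a driver. It is not the actual Khabar code: the `@DRIVERS` table, the `find_driver` function, the driver package names and the host patterns are all hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical registry: host-name pattern => driver package name.
# These names and patterns are assumptions, not part of the real framework.
my @DRIVERS = (
    [ qr/(?:^|\.)bbc\.co\.uk$/i   => 'Khabar::Driver::BBC'      ],
    [ qr/(?:^|\.)thehindu\.com$/i => 'Khabar::Driver::TheHindu' ],
);

# Return the driver name for a URL, or undef if no driver claims it.
sub find_driver {
    my ($url) = @_;

    # Pull the host out of an http(s) URL; give up on anything else.
    my ($host) = $url =~ m{^https?://([^/:?#]+)}i
        or return;

    # First matching pattern wins.
    for my $entry (@DRIVERS) {
        my ($pattern, $driver) = @$entry;
        return $driver if $host =~ $pattern;
    }
    return;
}

my $driver = find_driver('http://news.bbc.co.uk/2/hi/south_asia/');
print defined $driver ? "Using $driver\n" : "No driver available\n";
```

Keeping the table as an ordered list rather than a hash means a more specific pattern can be placed ahead of a general one, and adding a new site is a one-line change.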
The basic structure I'm using now has title, publisher, date, author, byline, content, category, related articles, related links, embedded images, ad banner URLs and links, etc. I also have a simple module that can output this as RSS 2.0.
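As a rough illustration of that structure, an article could be a plain hash that a small function turns into an RSS 2.0 `<item>`. This is only a sketch under assumptions: the key names, the sample values, and the `xml_escape` and `rss_item` helpers are made up for the example and are not the module mentioned above.

```perl
use strict;
use warnings;

# Escape the five characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',  '<' => '&lt;', '>' => '&gt;',
        '"' => '&quot;', "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Hypothetical article record; the key names are an assumption.
my %article = (
    title         => 'Monsoon arrives early & heavy',
    url           => 'http://news.example.com/story/1',
    date          => 'Tue, 10 Jun 2003 04:00:00 GMT',   # RFC 822 style date
    category      => 'Weather',
    content       => 'Rainfall was <b>well above</b> average.',
    related_links => [ 'http://news.example.com/story/2' ],
    images        => [],
);

# Render one article as an RSS 2.0 <item>, skipping missing fields.
sub rss_item {
    my ($art) = @_;

    # RSS element name => key in the article hash.
    my @map = (
        [ title       => 'title'    ],
        [ link        => 'url'      ],
        [ description => 'content'  ],
        [ category    => 'category' ],
        [ pubDate     => 'date'     ],
    );

    my @lines = ('<item>');
    for my $pair (@map) {
        my ($tag, $key) = @$pair;
        next unless defined $art->{$key};
        push @lines, "  <$tag>" . xml_escape($art->{$key}) . "</$tag>";
    }
    push @lines, '</item>';
    return join("\n", @lines) . "\n";
}

print rss_item(\%article);
```

Fields with no RSS 2.0 counterpart in this sketch (related links, images, ad banners) simply stay in the hash for the application to use.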
Do the wise monks have any suggestions for other projects to look at, design advice, or potential pitfalls to watch for? I could also use hints, tips and advice on better screen scraping and parsing. Hopefully, in time, every Monk can contribute a parser that reads their local news website, and we will no longer be dependent on the Googleopoly for news aggregation.
In reply to Framework for News Articles by smalhotra