I have been working on this project on and off for a couple of months and wanted to share the idea to get some feedback. I do a lot of screen scraping - parsing web pages into Perl data structures. What I have is a simple framework that provides a unified interface to drivers that scrape news articles from websites, much like what WWW::Search does for search engines, databases, etc.

Take some news website, say BBC News or The Hindu. I pass an article URL to Khabar, which finds the appropriate parser, if one is available, and hands it the URL to parse. I then get back a data structure that holds basically the same information as the web page, but in a form I can use in my application.
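To make the dispatch step concrete, here is a minimal sketch of how a URL could be matched to a parser. The registry, the find_parser function, the package names and the example URL are all made up for illustration; they are not necessarily how Khabar does it.

```perl
use strict;
use warnings;

# Hypothetical registry: hostname pattern followed by parser package name.
# The package names are illustrative, not actual Khabar modules.
my @PARSERS = (
    [ qr/(?:^|\.)bbc\.co\.uk$/i,   'Khabar::Parser::BBCNews'  ],
    [ qr/(?:^|\.)thehindu\.com$/i, 'Khabar::Parser::TheHindu' ],
);

# Return the parser package registered for a URL, or undef if none matches.
sub find_parser {
    my ($url) = @_;

    # Pull the hostname out of an http(s) URL; give up if there is none.
    my ($host) = $url =~ m{^https?://([^/:?#]+)}i
        or return;

    for my $entry (@PARSERS) {
        my ($pattern, $package) = @$entry;
        return $package if $host =~ $pattern;
    }
    return;
}

# Example URL is invented for the demonstration.
my $parser = find_parser('http://news.bbc.co.uk/2/hi/technology/example.stm');
print defined $parser ? "Using $parser\n" : "No parser available\n";
```

Keeping the registry as a plain list means a new driver only has to add one entry, which is roughly the WWW::Search-style plug-in model described above.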

The basic structure I'm using now is title, publisher, date, author, byline, content, category, related articles, related links, embedded images, ad banner URLs and links, etc. I also have a simple module that can output this as RSS 2.0.
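As a rough illustration of that structure and the RSS output, here is a sketch using a plain hash. The key names, the sample values, the extra url key and both helper functions are assumptions for the example, not the actual Khabar schema or RSS module.

```perl
use strict;
use warnings;

# Illustrative article record. Keys follow the field list above;
# 'url' is an extra key assumed here so the item can carry a link.
my %article = (
    title     => 'Example headline & subtitle',
    publisher => 'Example News',
    date      => 'Mon, 06 Sep 2004 12:00:00 GMT',   # RFC 822 style date
    author    => 'A. Reporter',
    url       => 'http://www.example.com/news/1',
    content   => 'First paragraph of the article.',
    category  => 'World',
);

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Render one article record as an RSS 2.0 <item> element.
sub article_to_rss_item {
    my ($art) = @_;

    # RSS 2.0 element name paired with the record key that fills it.
    my @fields = (
        [ title       => 'title'    ],
        [ link        => 'url'      ],
        [ description => 'content'  ],
        [ category    => 'category' ],
        [ pubDate     => 'date'     ],
    );

    my $xml = "<item>\n";
    for my $field (@fields) {
        my ($element, $key) = @$field;
        next unless defined $art->{$key};   # skip fields the parser could not fill
        $xml .= sprintf "  <%s>%s</%s>\n", $element, xml_escape($art->{$key}), $element;
    }
    return $xml . "</item>\n";
}

print article_to_rss_item(\%article);
```

Fields with no direct RSS 2.0 counterpart in this sketch (publisher, author, related links, images) are simply left out; a real module would have to decide how to carry them.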

Do the wise monks have any ideas for other projects to look at, design suggestions, potential pitfalls ...? I could also use hints, tips and advice on better screen scraping and parsing. Hopefully, in time, every Monk can contribute a parser that reads their local news website, and we will no longer depend on the Googleopoly for news aggregation.


In reply to Framework for News Articles by smalhotra
