perleager has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I'm not quite sure where to start with this. I'm planning to take news article links from sites such as the WallStreet, Reuters, etc. and place them on my website for visitors to click on.

Should I use the LWP::Simple module for this job? I did some research and found that there are existing scripts that do exactly this. I figure it will involve retrieving content with the LWP module and storing it in a file; the output part of the script would then read the headlines and links from those files. Would this be a lot of programming? I suspect it would. I'm under a time constraint to finish the webpage, so I don't know if I should try to build a script myself or use an existing one such as NewsClipper. (News Clipper.com)

Thanks,
Anthony

Replies are listed 'Best First'.
Re: News with LWP::Simple?
by Popcorn Dave (Abbot) on Feb 29, 2004 at 08:50 UTC
    Yes, you can do that quite easily. I did it for a Perl class final project. One thing I can suggest is that you use HTML::TokeParser, as it will make decoding the pages a lot easier for you.

    The one thing I did find was that I needed to grab a page from each news source I was going to display headlines from and see how the page was laid out. From there I wrote rules using regexes to parse out the relevant bits of information. My original version had 9 rules for 24 different web sites, while in the newer version I got it down to 3 rules for about 90 sites. Now it's up to 4 with the RSS feeds I plan to add.

    One thing to be aware of, though, is the possibility that you're violating copyrights by doing this. Make sure you check into that.

    Check my scratchpad for a quick and dirty HTML::TokeParser program that will spit out the layout of a page by tokens.
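
    For readers without access to the scratchpad, here is a minimal sketch of the idea, not the program mentioned above. It assumes HTML::TokeParser is installed and works on a made-up HTML snippet; the names dump_layout and extract_headlines are hypothetical. The first sub prints the page layout token by token, the second pulls out link text and URLs once you know where the headlines live.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample markup standing in for a downloaded news page.
my $sample_html = <<'HTML';
<html><body>
<h1>Top Stories</h1>
<ul>
  <li><a href="/story/1.html">Markets rally</a></li>
  <li><a href="/story/2.html">  Rates held   steady </a></li>
  <li><a name="anchor-only">No link here</a></li>
</ul>
</body></html>
HTML

# Return one line per token so the page layout can be eyeballed.
sub dump_layout {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my @lines;
    while (my $token = $p->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {                      # start tag
            push @lines, "START <$token->[1]>";
        }
        elsif ($type eq 'E') {                   # end tag
            push @lines, "END   </$token->[1]>";
        }
        elsif ($type eq 'T') {                   # text, skipped if only whitespace
            (my $text = $token->[1]) =~ s/^\s+|\s+$//g;
            push @lines, "TEXT  $text" if length $text;
        }
    }
    return @lines;
}

# Collect { url, title } pairs for every <a> tag that has an href.
sub extract_headlines {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my @headlines;
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href;
        push @headlines, { url => $href, title => $p->get_trimmed_text('/a') };
    }
    return @headlines;
}

print "$_\n" for dump_layout($sample_html);
printf "%s -> %s\n", $_->{title}, $_->{url} for extract_headlines($sample_html);
```

    On a real site you would probably narrow extract_headlines to the part of the page that holds the headlines, which is where the per-site rules come in.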

    Good luck!

    There is no emoticon for what I'm feeling now.

Re: News with LWP::Simple?
by matija (Priest) on Feb 29, 2004 at 08:21 UTC
    Yes, you could use LWP::Simple. However, since you will likely need to traverse multiple links, and possibly some forms, I suggest you also take a look at WWW::Mechanize.
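
    As a rough sketch of the fetch-and-store approach described in the question, assuming LWP::Simple is installed: the sub names, the cache file and the 15-minute maximum age are all made up for illustration. The page is only downloaded again when the copy on disk is older than the given age, and an old copy is kept if the download fails.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Download $url into $cache_file unless a copy younger than
# $max_age_secs is already on disk. Returns 1 if a cached copy
# (fresh or stale) exists afterwards, 0 otherwise.
sub refresh_cache {
    my ($url, $cache_file, $max_age_secs) = @_;
    if (-e $cache_file) {
        my $age = time - (stat($cache_file))[9];
        return 1 if $age < $max_age_secs;        # still fresh, skip the download
    }
    my $content = get($url);                     # undef on any failure
    unless (defined $content) {
        return (-e $cache_file) ? 1 : 0;         # fall back to a stale copy, if any
    }
    open my $out, '>:encoding(UTF-8)', $cache_file
        or die "Cannot write $cache_file: $!";
    print {$out} $content;
    close $out or die "Cannot close $cache_file: $!";
    return 1;
}

# Read the whole cached page back as one string.
sub read_cache {
    my ($cache_file) = @_;
    open my $in, '<:encoding(UTF-8)', $cache_file
        or die "Cannot read $cache_file: $!";
    local $/;                                    # slurp mode
    my $content = <$in>;
    close $in;
    return $content;
}

# Hypothetical usage: perl fetch.pl URL CACHE_FILE
if (@ARGV == 2) {
    my ($url, $cache_file) = @ARGV;
    refresh_cache($url, $cache_file, 15 * 60)
        or die "No copy of $url available\n";
    print read_cache($cache_file);
}
```

    Run from cron, something like this keeps the traffic to each news site low, and the page-building script only ever has to read local files.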
Re: News with LWP::Simple?
by tinita (Parson) on Feb 29, 2004 at 12:42 UTC
    You might also check whether the news site offers an interface, e.g. a file in XML or another format that you can download and process, as heise.de does (they offer an RDF file for download). I believe this will be much more convenient than parsing downloaded HTML.
    If you do download HTML, be sure to check whether the news site restricts this in any way.
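
    One machine-readable place to look for such restrictions is the site's robots.txt. Below is a small sketch using WWW::RobotRules, with a made-up robot name, host and robots.txt content; may_fetch is a hypothetical helper. It only covers robots.txt, so the site's terms of use and copyright notices still need to be read by a human.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robot name and robots.txt content, for illustration only.
# In a real script the content would be downloaded from $robots_url.
my $agent_name = 'HeadlineFetcher/0.1';
my $robots_url = 'http://news.example.com/robots.txt';
my $robots_txt = <<'ROBOTS';
User-agent: *
Disallow: /archive/
Disallow: /search
ROBOTS

my $rules = WWW::RobotRules->new($agent_name);
$rules->parse($robots_url, $robots_txt);

# allowed() returns 1 (allowed), 0 (disallowed) or -1 (no current
# robots.txt has been parsed for that host). Only 1 counts as a yes here.
sub may_fetch {
    my ($rules, $url) = @_;
    return $rules->allowed($url) > 0 ? 1 : 0;
}

for my $url (
    'http://news.example.com/headlines.html',
    'http://news.example.com/archive/2004/02.html',
) {
    printf "%-50s %s\n", $url, may_fetch($rules, $url) ? 'allowed' : 'disallowed';
}
```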