show has asked for the wisdom of the Perl Monks concerning the following question:

Hello there,

I am trying to write a script that retrieves RSS data and searches its contents. I would like to extract all records (items) that are currently available, but with the code below I get only the 10 or so most recent ones. I should be able to get many more: when I use Google Reader, I can go back to much older records. How can I do this?

    #!/usr/local/bin/perl
    use strict;
    use Encode;
    use LWP::Simple;
    use XML::RSS;

    my @RSS_URLs = ("http://rss.cnn.com/rss/cnn_us.rss");

    binmode (STDOUT);
    binmode (STDOUT, ":encoding(utf8)");

    for my $url (@RSS_URLs) {
        my $document = LWP::Simple::get($url) or die "cannot get content from $url";
        my $rss = XML::RSS->new;
        $rss->parse($document);
        for (@{$rss->{items}}) {
            print $_->{title} . "\n";
        }
    }

Re: How to retrieve all RSS data
by muba (Priest) on Oct 14, 2010 at 00:33 UTC

    Well, that RSS file only contains ten items, so you *are* extracting everything that RSS stream currently offers.

    You can't compare a single RSS fetch with what Google Reader (or any other news aggregator) does: they repeatedly check the RSS for new items and store them on their own servers, so that you can read them at your convenience. That's not exactly the same as downloading the current RSS just once and seeing what's in there.
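The polling-and-storing behaviour described above can be sketched roughly as follows. This is only a minimal illustration, not how Google Reader is actually implemented: the archive file name `rss_archive.stor`, the helpers `merge_items` and `poll_feed`, and the choice of the item link as the unique key are all assumptions made for the example. Run from cron (or any scheduler), each poll adds whatever is new in the feed to a local archive, so the archive grows beyond the ten or so items the feed offers at any one time.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple ();
use XML::RSS;
use Storable qw(retrieve nstore);

# Hypothetical archive location; any writable path would do.
my $ARCHIVE_FILE = 'rss_archive.stor';

# Add items not yet present in the archive (a hash ref).
# Assumption: the item link identifies an item; the title is a fallback.
# Returns the number of newly added items.
sub merge_items {
    my ($archive, $items) = @_;
    my $added = 0;
    for my $item (@$items) {
        my $key = $item->{link} // $item->{title};
        next unless defined $key;            # nothing to key on, skip
        next if exists $archive->{$key};     # seen in an earlier poll
        $archive->{$key} = {
            title      => $item->{title},
            link       => $item->{link},
            first_seen => time,
        };
        $added++;
    }
    return $added;
}

# Fetch the feed once, merge it into the stored archive, save the archive.
sub poll_feed {
    my ($url) = @_;
    my $archive  = -e $ARCHIVE_FILE ? retrieve($ARCHIVE_FILE) : {};
    my $document = LWP::Simple::get($url);
    die "cannot get content from $url\n" unless defined $document;

    my $rss = XML::RSS->new;
    $rss->parse($document);

    my $added = merge_items($archive, $rss->{items});
    nstore($archive, $ARCHIVE_FILE);
    return ($added, scalar keys %$archive);
}

# Poll once when a feed URL is given on the command line.
if (my $url = shift @ARGV) {
    my ($added, $total) = poll_feed($url);
    print "added $added new item(s); archive now holds $total\n";
}
```

Saved under a name of your choice and invoked with the feed URL as its only argument, the script performs one poll per run. The search the original post asks about would then run over the stored archive rather than over a single fetch.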

      As has been said, you need to compare apples to apples when it comes to coding.
      Look at the feed directly in the browser, without a reader, i.e. put the whatever.rss URL in the browser's address bar. Depending on the browser it might look a bit different, as most browsers try to apply some basic styling, but the important point is that all the feed items will be in the page, just as they are when your code downloads the feed and processes it.
      RSS reader applications come in two kinds: some are web services like Google Reader, and some (especially older ones) are standalone GUI applications. These readers do a whole lot more behind the scenes, and hence don't just present the feed as you would get it with a single GET request in code.
      the hardest line to type correctly is: stty erase ^H
        Thanks, guys, for your replies. I see what you are saying, but what still puzzles me is that when I add a brand new RDF link to Google Reader and it fetches the feed for the first time, it is able to retrieve old items. So there should be a way to do this.