Earlier in the evening, converter, who was working on his extremely handy perldoc IRC bot, had an idea: it would be pretty handy for the bot to have a searchable list of all CPAN modules with their descriptions. Well, CPAN doesn't exactly have a list of modules; it has a categorized list of modules, known as the Module Long List. There is also a text-only (but much harder to parse) version. With the wondrous HTML::TokeParser, a script was quickly written that extracts the module links and their corresponding descriptions. Note that this script may seem slow; that is because CPAN isn't the fastest of servers. Just be patient :)

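#!perl -w
use HTML::TokeParser;
use LWP::Simple;
use strict;

# Fetch the Module Long List and set up the parser
my $content = get("http://www.cpan.org/modules/00modlist.long.html");
die "Could not fetch the module list\n" unless defined $content;
my $parse = HTML::TokeParser->new(\$content);

my @module;
my @links;

# Collect every link that points at a search.cpan.org module page
while (my $token = $parse->get_tag("a")) {
    my $url  = $token->[1]{href} || "";
    my $text = $parse->get_trimmed_text("/a");
    if (($url =~ /module=/i) && $text) {
        # Keep a regex-safe copy of the link so its description can be located later
        push @links, quotemeta(qq(<a href="$url">$text</a>));
        # Strip the search URL prefix, leaving just the module name
        my $header = quotemeta(qq(http://search.cpan.org/search?module=));
        $url =~ s/^$header//i;
        push @module, $url;
    }
}

# The description is the text between each link and the next tag
my @descs;
foreach my $link (@links) {
    my ($desc) = $content =~ /$link(.*?)\Q<\E/im;
    push @descs, $desc;
}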
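The script stops once @module and @descs are filled, so here is a rough sketch of how those two parallel lists might be turned into the searchable list the bot needs. The sub names build_index and search_modules, and the sample module names and descriptions, are made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Pair up two parallel lists (names and descriptions) into one hash.
# A missing description is stored as an empty string.
sub build_index {
    my ($names, $descs) = @_;
    my %index;
    for my $i (0 .. $#$names) {
        $index{ $names->[$i] } = defined $descs->[$i] ? $descs->[$i] : '';
    }
    return \%index;
}

# Return the sorted module names whose name or description
# contains $term (case-insensitive, matched literally).
sub search_modules {
    my ($index, $term) = @_;
    my @hits = grep {
        $_ =~ /\Q$term\E/i || $index->{$_} =~ /\Q$term\E/i
    } keys %$index;
    return sort @hits;
}

# Hypothetical sample data standing in for @module and @descs.
my @module = ('Example::Parser', 'Example::Fetch');
my @descs  = ('Parses example files', 'Fetches example pages');

my $index = build_index(\@module, \@descs);
print "$_ - $index->{$_}\n" for search_modules($index, 'fetch');
```

A hash keyed by module name keeps lookups simple; a bot with many queries might prefer something faster than a linear grep, but for a list of this size it is probably fine.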

Re: Grab a list of all modules on CPAN + their descriptions
by merlyn (Sage) on Dec 29, 2001 at 14:23 UTC
    The data is already Data::Dumper-ized in 03modlist.data.gz! No need to parse anything, except toss down to the first blank line, then eval the rest of the file!

    -- Randal L. Schwartz, Perl hacker

      Hehe... I spent over a half hour looking for something exactly like this, to no avail. In fact, writing the script only took half as long as the search... you'd think that they would link to the 03modlist.data.gz on the module page, eh? :)
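For anyone who wants to try merlyn's suggestion, a minimal sketch follows. It assumes the file has already been gunzipped, and that the Perl code after the header defines a CPAN::Modulelist package whose data method returns a hash ref keyed by module name, with a description field in each entry; check the file itself before relying on that. The sub name load_modlist is made up for this example.

```perl
#!perl -w
use strict;

# Split off the header (everything up to the first blank line) and
# eval the remaining Perl code. Only do this with a file you trust:
# a string eval runs whatever code the file contains.
sub load_modlist {
    my ($text) = @_;
    my (undef, $body) = split /\n\s*\n/, $text, 2;
    die "No blank line found after the header\n" unless defined $body;
    eval $body;
    die "eval failed: $@" if $@;
    # Assumption: the evaluated code defines CPAN::Modulelist->data,
    # returning a hash ref keyed by module name.
    return CPAN::Modulelist->data;
}

if (@ARGV) {
    # Hypothetical usage: perl modlist.pl 03modlist.data
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    my $modules = load_modlist($text);
    for my $name (sort keys %$modules) {
        # Assumption: each entry carries a 'description' field.
        my $desc = $modules->{$name}{description};
        printf "%s - %s\n", $name, defined $desc ? $desc : '';
    }
}
```

Keeping the header skipping and the eval in one sub that takes a string makes it easy to feed it a small hand-written sample before pointing it at the real file.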