in reply to Re: Random phrases
in thread Random phrases

Unless I've missed something, all the Wikipedia dumps are in XML-tagged format, which would require a considerable amount of effort to remove the markup.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use every day'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^3: Random phrases
by afoken (Chancellor) on Jun 05, 2012 at 19:00 UTC

    http://en.wikipedia.org/wiki/Wikipedia:Computer_help_desk/ParseMediaWikiDump links to Parse::MediaWikiDump, a tool that can handle those dumps. The documentation for the Parse::MediaWikiDump::page class has an example that dumps the title and id of each page. Replace the print statement with print ${$page->text()} and you get all the article texts. Not much work for you, but perhaps for your machine. ;-)
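
    A minimal sketch of that approach, adapted from the module's documented title/id example (the dump filename passed on the command line is a placeholder for whatever pages-articles XML file you downloaded):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Parse::MediaWikiDump;

        my $file = shift(@ARGV)
            or die "must specify a MediaWiki dump of the current pages";
        my $pages = Parse::MediaWikiDump::Pages->new($file);

        # Iterate over every page in the dump; text() returns a reference
        # to a scalar holding the raw wikitext of the article, so the
        # output still contains wiki markup.
        while (defined(my $page = $pages->next)) {
            print ${ $page->text() };
        }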

    BTW:

    This software is being RETIRED - MediaWiki::DumpFile is the official successor to Parse::MediaWikiDump and includes a compatibility library called MediaWiki::DumpFile::Compat that is 100% API compatible and is a near perfect standin for this module. It is faster in all instances where it counts and is actively maintained. Any undocumented deviation of MediaWiki::DumpFile::Compat from Parse::MediaWikiDump is considered a bug and will be fixed.

    Looking at http://search.cpan.org/~triddle/MediaWiki-DumpFile-0.2.1/lib/MediaWiki/DumpFile/FastPages.pm, I see an example that should give you exactly what you want: plain text phrases from a Wikipedia dump, written to STDOUT or whatever is currently select()ed, and optimized for speed.
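
    For comparison, a sketch along the lines of that synopsis (assuming, as the linked documentation shows, that the constructor takes a dump filename and that next() returns title/text pairs):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use MediaWiki::DumpFile::FastPages;

        my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
        my $pages = MediaWiki::DumpFile::FastPages->new($file);

        # next() returns (title, text) pairs until the dump is exhausted;
        # the text is printed to STDOUT, or whatever handle is select()ed.
        while (my ($title, $text) = $pages->next) {
            print $text, "\n";
        }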

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)