Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I used to use WWW::Wikipedia to get text content from Wikipedia articles. My problem was (and still is) that the text delivered contains a lot of tags (and other information) that I find difficult to remove, as I can't find any recurring pattern. I've now discovered MediaWiki::API. I can already get some useful information, for example the titles in different languages (see the sketch below). What I really can't get is the "plain" text of a single article.
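Here is a minimal sketch of the kind of query that already works for me - fetching the interlanguage links for an article (the title 'Perl' and the English Wikipedia endpoint are just examples):

    use strict;
    use warnings;
    use MediaWiki::API;

    my $mw = MediaWiki::API->new();
    $mw->{config}->{api_url} = 'https://en.wikipedia.org/w/api.php';

    # query the interlanguage links for one article
    my $result = $mw->api( {
        action  => 'query',
        prop    => 'langlinks',
        titles  => 'Perl',
        lllimit => 'max',
    } ) || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

    # the 'pages' hash is keyed by page id; each language link
    # carries the language code and the foreign title in '*'
    for my $page ( values %{ $result->{query}->{pages} } ) {
        for my $ll ( @{ $page->{langlinks} || [] } ) {
            print "$ll->{lang}: $ll->{'*'}\n";
        }
    }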

Could anyone point me in the right direction?

Re: MediaWiki::API get text wikipedia
by tangent (Parson) on Dec 04, 2013 at 19:56 UTC
    From the docs:
        # get some page contents
        my $page = $mw->get_page( { title => 'Main Page' } );

        # print page contents
        print $page->{'*'};
    So you get at the page's content by accessing the '*' key of the hash reference returned by the MediaWiki::API object's get_page() method. That content is the raw wiki markup (wikitext) of the latest revision, not rendered HTML, so you then need to parse it to get at the 'plain' text. There are many ways to extract the text from the markup (see CPAN) depending on the structure you want to end up with - if you can tell us a little bit more we can point you further.
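    If all you are after is the readable text, you may be able to sidestep parsing entirely and ask the API for a plain-text extract. A minimal sketch - it assumes the wiki runs the TextExtracts extension (Wikipedia does), and the title 'Perl' and the endpoint are placeholders:

        use strict;
        use warnings;
        use MediaWiki::API;

        my $mw = MediaWiki::API->new();
        $mw->{config}->{api_url} = 'https://en.wikipedia.org/w/api.php';

        # prop=extracts with explaintext returns plain text
        # instead of wiki markup (needs TextExtracts)
        my $result = $mw->api( {
            action      => 'query',
            prop        => 'extracts',
            explaintext => 1,
            titles      => 'Perl',
        } ) || die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

        for my $page ( values %{ $result->{query}->{pages} } ) {
            print $page->{extract}, "\n";
        }

    The nice thing about this approach is that the stripping happens server-side, so you never have to fight the markup yourself.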