Strip wiki markup

Sixtease has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I need to strip (media)wiki markup from some text. Is there an easier way than just filtering out the individual syntactic constructs?

Actually, this is an XY problem: I need to gather a large amount of Vietnamese text, and I have some Wikipedia articles that I plan to rip.

use strict; use warnings; print "Just Another Perl Hacker\n";
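
For reference, the baseline the question wants to avoid (filtering out the individual constructs by hand) might look roughly like the sketch below. It uses core Perl only. `strip_wiki_markup` is a hypothetical helper, not part of any module, and the sample text is invented. It handles only a few common constructs; nested templates, tables, `<ref>` tags and image links are left untouched, which is a fair illustration of why a real parser is attractive.

```perl
use strict;
use warnings;
use utf8;    # this source contains literal UTF-8 sample text

# Hypothetical helper (not from any module): strips a handful of common
# MediaWiki constructs with regexes. Nested templates, tables, <ref> tags
# and multi-pipe links such as [[File:...|thumb|...]] are NOT handled.
sub strip_wiki_markup {
    my ($text) = @_;

    $text =~ s/\{\{[^{}]*\}\}//g;                          # non-nested {{templates}}
    $text =~ s/\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]/$1/g;   # [[target|label]] or [[target]]
    $text =~ s{\[https?://\S+\s+([^\]]+)\]}{$1}g;          # [http://url label]
    $text =~ s/'{2,5}//g;                                  # ''italic'' and '''bold'''
    $text =~ s/^(=+)[ \t]*(.*?)[ \t]*\1[ \t]*$/$2/mg;      # == Heading ==
    $text =~ s/^[*#:;]+[ \t]*//mg;                         # list and indent markers

    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';

# Invented sample input, for illustration only.
my $sample = "== Tiếng Việt ==\n"
           . "'''Tiếng Việt''' là [[ngôn ngữ]] của [[người Việt|người Kinh]].\n"
           . "* {{lang|vi}}mục\n";

print strip_wiki_markup($sample);
```

The order of the substitutions matters: templates are removed before list markers so that a line consisting of a bullet plus a template collapses cleanly.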


Replies are listed 'Best First'.
Re: Strip wiki markup by Joost (Canon) on Dec 03, 2007 at 22:03 UTC
Have you tried Text::MediawikiFormat? It seems to convert to standard HTML by default, but judging from the source, it should be fairly easy to extend it to produce plain text. "What should it profit a man, if he should win a flame war, yet lose his cool?"
Re: Strip wiki markup by moritz (Cardinal) on Dec 03, 2007 at 21:56 UTC
I think rindolf received a TPF grant to write a MediaWiki markup parser; maybe he already has some results that you could use. Other modules like Parse::MediaWikiDump look promising too. As for your X-Y problem: you could just parse Wikipedia's HTML output; there are a myriad of modules for that on CPAN.
Re: Strip wiki markup by Sixtease (Friar) on Dec 04, 2007 at 18:06 UTC
Thank you, guys. I managed to get it done during the night thanks to you. I did have to hack the Text::MediawikiFormat module, however, as it crashes on UTF-8 in headers. My best regards. `use strict; use warnings; print "Just Another Perl Hacker\n";`
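
The thread does not say what the hack was or why the module crashed, so the following is not that patch. It is only a sketch of a general precaution that often helps with UTF-8 in headings: decode the input bytes into Perl characters once, at the boundary, before any regex sees them. The sample bytes are an invented heading, and the heading regex is a simplified assumption rather than the module's own.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive from a file or an HTTP response:
# "== Tiếng Việt ==" encoded as UTF-8 (invented sample).
my $bytes = "== Ti\xE1\xBA\xBFng Vi\xE1\xBB\x87t ==\n";

# Decode once so that regexes and length() operate on characters
# rather than on the individual bytes of multi-byte sequences.
my $chars = decode('UTF-8', $bytes);

# Simplified heading match (an assumption, not the module's regex).
my ($heading) = $chars =~ /^=+\s*(.*?)\s*=+\s*$/m;

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encode again when writing the result back out.
print encode('UTF-8', "heading: $heading\n");
```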