Sixtease has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I need to strip (media)wiki markup from some text. Is there an easier way than just filtering out the individual syntactic constructs?

Actually, this is an XY problem: I need to gather a large amount of Vietnamese text, and I have some Wikipedia articles that I plan to rip it from.
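
For reference, the "filter out the individual syntactic constructs" route might look roughly like the sketch below. It is only an illustration under stated assumptions: strip_wiki_markup is a hypothetical helper written for this post (it is not part of any CPAN module), it uses core Perl regexes only, and it covers just a handful of common constructs. Tables, image links with nested links in their captions, and other nested markup are handled crudely or not at all.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not from any CPAN module): strips a few common
# MediaWiki constructs with regexes. Deeply nested or unusual markup is
# only roughly handled.
sub strip_wiki_markup {
    my ($text) = @_;

    # Drop HTML comments and <ref> footnotes (self-closing and paired).
    $text =~ s/<!--.*?-->//gs;
    $text =~ s/<ref\b[^>]*\/>//gi;
    $text =~ s/<ref\b[^>]*>.*?<\/ref>//gis;

    # Drop templates {{...}}; repeat so nested ones are peeled inside-out.
    1 while $text =~ s/\{\{[^{}]*\}\}//g;

    # Drop [[File:...]], [[Image:...]] and [[Category:...]] links entirely.
    $text =~ s/\[\[(?:File|Image|Category):[^\[\]]*\]\]//gi;

    # [[target|label]] -> label, [[target]] -> target.
    $text =~ s/\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]/$1/g;

    # [http://example.org label] -> label; a bare [http://example.org] is dropped.
    $text =~ s/\[https?:\/\/[^\s\]]+[ \t]+([^\]]+)\]/$1/g;
    $text =~ s/\[https?:\/\/[^\s\]]+\]//g;

    # '''bold''' and ''italic'' quote runs are removed.
    $text =~ s/'{2,5}//g;

    # == Heading == -> Heading
    $text =~ s/^=+[ \t]*(.*?)[ \t]*=+[ \t]*$/$1/mg;

    # Leading list and indent markers (*, #, :, ;).
    $text =~ s/^[*#:;]+[ \t]*//mg;

    # Any remaining HTML-like tags.
    $text =~ s/<[^>]+>//g;

    return $text;
}

# Usage example with a made-up sample line.
binmode STDOUT, ':encoding(UTF-8)';
my $sample = "== Lịch sử ==\n'''Hà Nội''' là [[thủ đô]] của [[Việt Nam|nước Việt Nam]].{{fact}}\n";
print strip_wiki_markup($sample);
```

Each substitution targets one construct, which is exactly why this gets tedious: every construct the sketch does not know about leaks through as noise in the output.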

use strict; use warnings; print "Just Another Perl Hacker\n";

Re: Strip wiki markup
by Joost (Canon) on Dec 03, 2007 at 22:03 UTC
Re: Strip wiki markup
by moritz (Cardinal) on Dec 03, 2007 at 21:56 UTC
    I think rindolf received a TPF grant to write a MediaWiki markup parser; maybe he already has some results that you could use.

    Other modules like Parse::MediaWikiDump look promising too.

    As for your XY problem: you could just parse Wikipedia's HTML output. There is a myriad of modules for that on CPAN.

Re: Strip wiki markup
by Sixtease (Friar) on Dec 04, 2007 at 18:06 UTC

    Thank you, guys. Thanks to you, I managed to get it done overnight. I did have to hack the Text::MediawikiFormat module, however, as it crashes on UTF-8 in headers.

    My best regards.
