Wikipedia ... meaningful content

Uhm, I'm not sure that is a problem that can actually be solved by using Perl... scnr.

Back on topic: since you have the "raw" articles, you have to do several things. First, you have to remove all the markup. That alone is not trivial, since the MediaWiki format is a big mess to begin with. It's messy enough that more and more editors quit, and the MediaWiki developers don't seem to be able to come up with a working visual editor.

A quick and dirty solution would be to use one of the MediaWiki-to-HTML converters like Text::Markup::Mediawiki and then scrape the text out of the resulting HTML with something like HTML::Extract.
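To give a feel for what "removing the markup" involves, here is a very crude core-Perl sketch that strips a few common wikitext constructs with regexes. This is not what the modules above do internally, and strip_wikitext() is a made-up helper name; it ignores tables, nested links, image captions and a lot more, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Crude, regex-only wikitext stripper (hypothetical helper, illustration only).
# Handles a handful of constructs; real articles need a proper converter.
sub strip_wikitext {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//gs;                           # HTML comments
    $text =~ s{<ref[^>]*/>}{}gi;                         # self-closing <ref .../>
    $text =~ s{<ref[^>]*>.*?</ref>}{}gis;                # <ref>...</ref> footnotes
    1 while $text =~ s/\{\{[^{}]*\}\}//g;                # {{templates}}, innermost first
    $text =~ s/\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]/$1/g;  # [[target|label]] -> label
    $text =~ s/\[https?:\S+\s+([^\]]*)\]/$1/g;           # [http://... label] -> label
    $text =~ s/'{2,}//g;                                 # ''italic'' and '''bold''' quotes
    $text =~ s/^=+\s*(.*?)\s*=+\s*$/$1/gm;               # == Heading == -> Heading
    $text =~ s/<[^>]+>//g;                               # any remaining tags
    return $text;
}

my $raw = q('''Perl''' is a [[programming language|language]] by [[Larry Wall]].{{citation needed}}<ref>x</ref>);
print strip_wikitext($raw), "\n";
```

Anything this misses ends up as junk "words" in the dictionary, which is one more reason to prefer a real converter if you can install one.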

Then you can split the resulting text on whitespace. For each word, increment its counter in a hash.
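A minimal sketch of that counting step, assuming the markup is already gone and the text is plain ASCII-ish prose. The name count_words() and the choices to lowercase everything and trim punctuation from the ends of each word are my own assumptions; adjust to taste.

```perl
use strict;
use warnings;

# Count how often each word occurs in a chunk of plain text.
# count_words() is a hypothetical helper, not part of any module.
sub count_words {
    my ($text) = @_;
    my %count;
    for my $word (split ' ', $text) {      # ' ' splits on runs of whitespace
        $word = lc $word;                  # fold case so "The" and "the" match
        $word =~ s/^[^[:alnum:]]+|[^[:alnum:]]+$//g;  # trim punctuation at both ends
        next if $word eq '';               # token was punctuation only
        $count{$word}++;
    }
    return \%count;
}

my $counts = count_words("The cat sat. The dog sat, too!");

# Print the most frequent words first, ties in alphabetical order.
for my $word (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$word\t$counts->{$word}\n";
}
```

For a full Wikipedia dump you would feed this one article at a time and keep adding to the same hash rather than slurping everything into memory.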

Since words are sometimes also used as double words (like "flying dutchman"), you might want to count those as well and see where it leads you. For this, consider 774421.
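Counting double words could look roughly like this: walk the word list and count every pair of neighbours. count_bigrams() is a hypothetical helper name, and this sketch is only my reading of the idea, not necessarily the approach from the linked node. Note that it happily pairs words across sentence boundaries unless you split into sentences first.

```perl
use strict;
use warnings;

# Count adjacent word pairs ("double words") in a list of words.
# count_bigrams() is a hypothetical helper, not part of any module.
sub count_bigrams {
    my (@words) = @_;
    my %count;
    # The range is empty for zero or one word, so nothing is counted then.
    for my $i (0 .. $#words - 1) {
        $count{"$words[$i] $words[$i + 1]"}++;
    }
    return \%count;
}

my @words   = map { lc } split ' ', "The flying dutchman met the flying dutchman";
my $bigrams = count_bigrams(@words);

# Only pairs seen more than once are likely to be interesting.
for my $pair (sort keys %$bigrams) {
    print "$pair\t$bigrams->{$pair}\n" if $bigrams->{$pair} > 1;
}
```

Pairs that show up much more often than you would expect from the single-word counts are the candidates for dictionary entries of their own.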

"I know what I'm doing! Look, what could possibly go wrong? All I have to do is pull this lever like so, and then press this button here like ArghhhhhaaAaAAAaaagraaaAAaa!!!"

In reply to Re: Create a dictionary from wikipedia by cavac
in thread Create a dictionary from wikipedia by vit
