There are ways to remove the markup using regexes. Try this:
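use strict;
use warnings;

my $page = "my ##Media Wiki [text|here]";
my %wordcount;

# Split on runs of whitespace and markup/punctuation characters.
# The special characters are escaped so they match literally.
my @words = split /[\s#\[\]|\@\$!.,]+/, $page;

# Count each token that contains at least one word character.
foreach my $word (@words) {
    $wordcount{$word}++ if $word =~ /\w/;
}

# Print each word with its count, tab-separated.
foreach my $word (keys %wordcount) {
    print "$word\t$wordcount{$word}\n";
}

I hope this helps.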
In reply to Re^2: Create a dictionary from wikipedia
by linuxkid
in thread Create a dictionary from wikipedia
by vit