http://qs1969.pair.com?node_id=517231

YanDas has asked for the wisdom of the Perl Monks concerning the following question:

I am an inexperienced programmer in Perl, so you might find many odd things in the following script. It is supposed to read text files piped in from stdin and create a dictionary out of the words found in them. The script actually works at a reasonable speed (only a bit slower than "tr -cs '\na-zA-Z' '\n'|tr A-Z a-z|sort -u|egrep -v '^.?$|^$|(\w)\1\1\1\1'"), with minimal memory requirements (~7MB), for most input texts. The problem is that when the input is an actual dictionary (a sorted word list with one word per line), the performance is truly awful: the program consumes huge amounts of RAM (maybe more than 100MB for a 10MB input file) and runs at about 1/10th of the initial speed, without actually having much work to do. I have tracked the performance penalty down to the addition of new elements to the hash, which on the other hand is no problem at all when working on normal text files. I have tried several tricks, but without success. Could this be a problem with the Perl garbage collector, or just poor programming on my side?
while ($line = <STDIN>) {
    $line = lc $line;
    $line =~ s/[^a-z ]+/ /g;
    foreach $word (split(/\s+/, $line)) {
        if ($word !~ m/(\w)\1\1\1\1/ && $word !~ m/^.?$/) {
            $hash{$word}++;
        }
    }
}
foreach $wo (sort (keys %hash)) {
    print "$wo\n";
}

Replies are listed 'Best First'.
Re: Creating Dictionaries
by Perl Mouse (Chaplain) on Dec 16, 2005 at 13:56 UTC
    The bottleneck in your program seems to be the sorting at the end. Sorting is super-linear (a comparison sort takes O(n log n) time), which means that if your input size doubles, the running time more than doubles.

    I'd write your program slightly differently, but that shouldn't make much of a difference:

    my (%hash, $line);
    while ($line = <STDIN>) {
        while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
            next if $word =~ /(.)\1\1\1\1/;
            $hash{$word}++;
        }
    }
    print "$_\n" for sort keys %hash;
    Now, if the words are in order, there's no need for the hash, or the sorting:
    my ($line, $prev);
    $prev = "";
    while ($line = <STDIN>) {
        while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
            next if $word =~ /(.)\1\1\1\1/;
            next if $word eq $prev;
            print "$word\n";
            $prev = $word;
        }
    }
    (Code fragments are untested)
    Perl --((8:>*
      This line has some problems:
      while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
      It won't compile under use strict; ... I think you meant to pattern match against lc $line, and there are no parens in the regex to capture anything ...

      Working off your general idea, I came up with the following (no clue how this rates performance-wise against the OP's code or my other solution below):
      my %hash;
      while (my $line = <STDIN>) {
          foreach my $word ( $line =~ m/\b([a-zA-Z]{2,4})\b/g ) {
              $hash{lc $word}++;
          }
      }
      print "$_\n" for sort keys %hash;
      Which can be rewritten as:
      while (my $line = <STDIN>) {
          $hash{lc $_}++ for $line =~ m/\b([a-zA-Z]{2,4})\b/g;
      }
      # or:
      do { $hash{lc $_}++ for m/\b([a-zA-Z]{2,4})\b/g } for <STDIN>;

      Update: Doh. Note I misread the /(\w)\1\1\1\1/ regex as 5+ letters instead of 5+ of the _same_ letter in a row. If such words don't happen very often, it might be better to just exclude them at the end:
      while (my $line = <STDIN>) {
          $hash{lc $_}++ for $line =~ m/([a-zA-Z]{2,})/g;
      }
      delete $hash{$_} for grep /(\w)\1{4}/, keys %hash;
        There's no need for parentheses if you use m//g in list context (as I've done). However, I shouldn't have used a while, but a for.

        Now, your solution only grabs words 2, 3 or 4 letters long, which is a restriction the OP didn't have: he eliminates words that have the same letter five times in a row (not five-letter words!)
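        For instance, a quick check of the difference (a made-up pair of test words):

        # /(\w)\1\1\1\1/ needs one letter followed by four more copies of itself
        print "bookkeeper" =~ /(\w)\1\1\1\1/ ? "match" : "no match";  # no match
        print "zzzzzap"    =~ /(\w)\1\1\1\1/ ? "match" : "no match";  # match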

        Perl --((8:>*
Re: Creating Dictionaries
by davidrw (Prior) on Dec 16, 2005 at 13:59 UTC
    Try this instead ...
    use strict;
    use warnings;

    my %hash;
    while (my $line = <STDIN>) {
        foreach my $word ( split(/[^a-zA-Z]+/, $line) ) {
            my $len = length($word);
            $hash{lc $word}++ if 2 <= $len && $len < 5;
        }
    }
    print $_ . "\n" for sort keys %hash;
    comments/explanations:
    • be sure to 'use strict' and 'use warnings' (thus the need for 'my $line' and 'my $word')
    • note that lowercasing isn't done until the last moment (only on the hash key, not the whole line).
    • replaced the m/^.?$/ regex with a length check (I assume that's what it was doing; length should be faster, and clearer, than a regex)
    • removed the !~m/(\w)\1\1\1\1/ and replaced it with a length check for speed and clarity. (don't want 5+ letter words, right?) Update: as PerlMouse pointed out, I misread the regex; it's actually excluding words with a letter repeated 5 or more times
    • removed the s/[^a-z ]+/ /g; and replaced it implicitly with the regex in the split()... it now splits on non-letters, which all go away, leaving just the all-letter words. Not having the substitution should help a lot speed-wise.
Re: Creating Dictionaries
by TedPride (Priest) on Dec 16, 2005 at 14:19 UTC
    Hashes take up a lot of space, and sorting requires making a copy of all the keys. If memory is a big issue, you can sort and process just the input, then do a line-by-line merge with the dictionary file, placing your output in a secondary file. When you finish, delete the original and rename the secondary. This way you never have to store more than twice the input in memory, which is far better than twice (dictionary + input). You're also working with arrays, not hashes, and you don't have to sort the dictionary, just the input.
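    A minimal sketch of that merge, assuming the input fits in memory and the existing dictionary file is already sorted one word per line (the file names 'dict.txt' and 'dict.tmp' are made up for illustration, and the OP's word filters are omitted):

    use strict;
    use warnings;

    # Extract, lowercase, dedupe and sort just the new words from STDIN.
    my %seen;
    my @new = sort grep { !$seen{$_}++ } map { lc } map { /([a-z]+)/gi } <STDIN>;

    # Line-by-line merge with the sorted dictionary into a secondary file.
    open my $dict, '<', 'dict.txt' or die "dict.txt: $!";
    open my $out,  '>', 'dict.tmp' or die "dict.tmp: $!";

    my $old = <$dict>;
    chomp $old if defined $old;
    for my $word (@new) {
        # Copy dictionary words that sort before the next new word.
        while (defined $old && $old lt $word) {
            print $out "$old\n";
            $old = <$dict>;
            chomp $old if defined $old;
        }
        # Skip words already present in the dictionary.
        next if defined $old && $old eq $word;
        print $out "$word\n";
    }
    # Copy whatever is left of the dictionary.
    while (defined $old) {
        print $out "$old\n";
        $old = <$dict>;
        chomp $old if defined $old;
    }
    close $dict;
    close $out;

    # Delete the original and rename the secondary, as described above.
    unlink 'dict.txt';
    rename 'dict.tmp', 'dict.txt';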
Re: Creating Dictionaries
by salva (Canon) on Dec 16, 2005 at 14:19 UTC
    IMHO, the problem is not the input being sorted but all the entries being unique, which makes the hash grow too much and eat all the memory. In common text files, most words are repetitions of words already found, so they don't make the hash grow.

    There are several ways to solve that problem; for instance, you can try using an on-disk tree with DB_File.
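    For example, a minimal sketch of the DB_File idea, using its BTREE format so the keys stay sorted on disk (the file name 'words.db' is made up, and the OP's word filters are left out for brevity):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie the hash to an on-disk B-tree instead of keeping it all in memory.
    my %hash;
    tie %hash, 'DB_File', 'words.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot open words.db: $!";

    while (my $line = <STDIN>) {
        $hash{lc $_}++ for $line =~ /([a-z]+)/gi;
    }

    # A tied BTREE hash hands back its keys in sorted order, so no sort pass
    # (and no giant in-memory key list) is needed.
    print "$_\n" for keys %hash;

    untie %hash;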

    Another way is to flush all the words found to temporary files on disk every time their number goes over some limit, and at the end perform a merge sort and eliminate the duplicates.
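    A rough sketch of that flush-and-merge approach (the 100,000-word limit and the simple word regex are arbitrary placeholders; the OP's filters are omitted):

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    my $LIMIT = 100_000;
    my (%hash, @spills);

    # Write the sorted hash keys to a temp file and empty the hash.
    sub spill {
        my ($fh, $name) = tempfile();
        print $fh "$_\n" for sort keys %hash;
        close $fh;
        push @spills, $name;
        %hash = ();
    }

    while (my $line = <STDIN>) {
        $hash{lc $_}++ for $line =~ /([a-z]+)/gi;
        spill() if keys(%hash) > $LIMIT;
    }
    spill() if keys %hash;

    # Merge the sorted spill files, printing each word once.
    my @fhs = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @spills;
    my @heads;
    for my $fh (@fhs) {
        my $w = <$fh>;
        chomp $w if defined $w;
        push @heads, $w;
    }
    my $prev;
    while (grep { defined } @heads) {
        # Pick the alphabetically smallest current word across the files.
        my ($i) = sort { $heads[$a] cmp $heads[$b] }
                  grep { defined $heads[$_] } 0 .. $#heads;
        my $word = $heads[$i];
        print "$word\n" unless defined $prev && $word eq $prev;
        $prev = $word;
        my $next = readline $fhs[$i];
        chomp $next if defined $next;
        $heads[$i] = $next;
    }
    unlink @spills;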

Re: Creating Dictionaries
by planetscape (Chancellor) on Dec 16, 2005 at 14:23 UTC
Re: Creating Dictionaries
by YanDas (Novice) on Dec 17, 2005 at 14:03 UTC
    Thanks to everyone for your help... Any other suggestions are always welcome, of course. I have tried the proposed solutions with little if any impact on speed. The main problem is still around, though. The thing that made me think about Perl's garbage collector is the huge increase in memory requests... Thanks again. Yannis
Re: Creating Dictionaries
by spiritway (Vicar) on Dec 18, 2005 at 05:49 UTC

    I'm not sure how Perl implements its sort function, but I recall that some sort routines (naive quicksort is the classic example) are incredibly bad if they are given input that is already in the desired order. Some implementations of that sort actually spend a moment randomizing the data, to ensure that it's not already sorted.