Re: Creating Dictionaries

in reply to Creating Dictionaries

The bottleneck in your program seems to be the sorting at the end. Sorting is super-linear (a comparison sort takes on the order of n log n steps), which means that if your input size doubles, the running time more than doubles.
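To make "more than doubles" concrete, here is a rough sketch that assumes a comparison sort needing about n * log2(n) comparisons and ignores constant factors. The helper name `nlogn` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: approximate comparison count for sorting $n items.
sub nlogn {
    my ($n) = @_;
    return $n * log($n) / log(2);   # log() is the natural log, so divide by log(2)
}

for my $n (1_000_000, 2_000_000) {
    printf "%9d items: about %.0f comparisons\n", $n, nlogn($n);
}
printf "growth factor when doubling: %.2f\n", nlogn(2_000_000) / nlogn(1_000_000);
```

The growth factor comes out a little above 2, and it is the number of distinct words (the hash keys) that counts as n here, not the raw input size.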

I'd write your program slightly differently, but that shouldn't make much of a difference:

my (%hash, $line);
while ($line = <STDIN>) {
    while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
        next if $word =~ /(.)\1\1\1\1/;
        $hash{$word}++;
    }
}
print "$_\n" for sort keys %hash;

Now, if the words in the input are already in sorted order, there's no need for the hash or the sorting, because duplicates are then adjacent:

my ($line, $prev);
$prev = "";
while ($line = <STDIN>) {
    while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
        next if $word =~ /(.)\1\1\1\1/;
        next if $word eq $prev;
        print "$word\n";
        $prev = $word;
    }
}

(Code fragments are untested)
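Since the fragments above are untested, here is a runnable sketch of both ideas. It assumes the match was meant to run against `lc $line`, and it iterates with `for`, because `m//g` in list context returns every match at once. The sub names `unique_words` and `unique_words_sorted_input` are made up for illustration, and the subs take a list of lines instead of reading STDIN so they are easy to try out.

```perl
use strict;
use warnings;

# Hash-based variant: collect every distinct word, then sort once at the end.
sub unique_words {
    my @lines = @_;
    my %hash;
    for my $line (@lines) {
        # m//g in list context returns all matches, so no capture group is needed.
        for my $word (lc($line) =~ /[a-z]{2,}/g) {
            next if $word =~ /(.)\1\1\1\1/;   # skip words with 5 identical letters in a row
            $hash{$word}++;
        }
    }
    return sort keys %hash;
}

# Variant for input whose words already arrive in sorted order:
# duplicates are adjacent, so remembering the previous word is enough.
sub unique_words_sorted_input {
    my @lines = @_;
    my ($prev, @out) = ("");
    for my $line (@lines) {
        for my $word (lc($line) =~ /[a-z]{2,}/g) {
            next if $word =~ /(.)\1\1\1\1/;
            next if $word eq $prev;
            push @out, $word;
            $prev = $word;
        }
    }
    return @out;
}

print "$_\n" for unique_words("The cat and the hat", "Zzzzzz a cat");
```

The second sub only removes duplicates that sit next to each other, so it is correct only under the sorted-input assumption stated in the post.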

Perl --((8:>*

Re^2: Creating Dictionaries by davidrw (Prior) on Dec 16, 2005 at 14:12 UTC
This line has some problems:

while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {

It won't compile under `use strict;`. I think you meant to pattern match against `lc $line`, and there are no parens in the regex to capture anything. Working off your general idea, I came up with the following (no clue how this rates performance-wise against OP or my other solution below):

my %hash;
while (my $line = <STDIN>) {
    foreach my $word ( $line =~ m/\b([a-zA-Z]{2,4})\b/g ) {
        $hash{lc $word}++;
    }
}
print "$_\n" for sort keys %hash;

Which can be rewritten as:

while (my $line = <STDIN>) {
    $hash{lc $_}++ for $line =~ m/\b([a-zA-Z]{2,4})\b/g;
}
# or
do { $hash{lc $_}++ for m/\b([a-zA-Z]{2,4})\b/g } for <STDIN>;

Update: Doh. Note I misread the `/(\w)\1\1\1\1/` regex as 5+ letters instead of 5+ of the _same_ letter in a row. If the 5+ letters don't happen very often, it might be better to just exclude them at the end:

while (my $line = <STDIN>) {
    $hash{lc $_}++ for $line =~ m/([a-zA-Z]{2,})/g;
}
delete $hash{$_} for grep /(\w)\1{4}/, keys %hash;
Re^3: Creating Dictionaries by Perl Mouse (Chaplain) on Dec 16, 2005 at 14:33 UTC
There's no need for parentheses if you use `m//g` in list context (as I've done). However, I shouldn't have used a while, but a for. Now, your solution only grabs words 2, 3 or 4 letters long, which is a restriction that the OP didn't have: he eliminates words that contain the same letter five times in a row (not five letters!).

Perl --((8:>*
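The points in this exchange can be checked with a small sketch. Without capture groups, `m//g` in list context returns each whole match; with a single group it returns the captured text, which is the same thing here. The `\b...{2,4}\b` pattern, on the other hand, drops every word longer than four letters. A `while` around a list-context match restarts the match on each pass, which is why `for` is the right loop. The sub names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# List-context m//g with no capture group: returns every whole match.
sub words_no_parens   { my ($s) = @_; return lc($s) =~ /[a-z]{2,}/g; }

# Same pattern with a capture group: returns the captured text of each match.
sub words_with_parens { my ($s) = @_; return lc($s) =~ /([a-z]{2,})/g; }

# Word-boundary pattern limited to 2..4 letters: longer words do not match at all.
sub words_2_to_4      { my ($s) = @_; return $s =~ /\b([a-zA-Z]{2,4})\b/g; }

my $sample = "a big dictionary of words";
print join(",", words_no_parens($sample)),   "\n";
print join(",", words_with_parens($sample)), "\n";
print join(",", words_2_to_4($sample)),      "\n";
```

The first two lines of output are identical, while the third is missing "dictionary" and "words".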

In Section Seekers of Perl Wisdom