in reply to Creating Dictionaries

The bottleneck in your program seems to be the sorting at the end. Sorting is super-linear, which means that if your input size doubles, the running time more than doubles.
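To put a rough number on "super-linear": Perl's built-in sort is a comparison sort, so its cost grows roughly like n log n. A minimal sketch, assuming that idealized cost model (the helper name nlogn is made up for illustration and measures nothing real):

```perl
use strict;
use warnings;

# Hypothetical cost model: a comparison sort needs on the order of
# n * log2(n) comparisons for n items.
sub nlogn {
    my ($n) = @_;
    return $n * log($n) / log(2);
}

# Compare the modeled cost for n words against the cost for 2n words.
for my $n (1_000, 1_000_000) {
    printf "%d -> %d words: modeled cost grows by a factor of %.2f\n",
        $n, 2 * $n, nlogn(2 * $n) / nlogn($n);
}
```

Under this model the factor is always a bit above 2 and creeps toward 2 as n grows, so the sort is only mildly super-linear.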

I'd write your program slightly differently, but that shouldn't make much of a difference:

    my (%hash, $line);
    while ($line = <STDIN>) {
        while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
            next if $word =~ /(.)\1\1\1\1/;
            $hash{$word}++;
        }
    }
    print "$_\n" for sort keys %hash;

Now, if the words are in order, there's no need for the hash, or the sorting:

    my ($line, $prev);
    $prev = "";
    while ($line = <STDIN>) {
        while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
            next if $word =~ /(.)\1\1\1\1/;
            next if $word eq $prev;
            print "$word\n";
            $prev = $word;
        }
    }

(Code fragments are untested)
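For comparison, here is a sketch of the sorted-input idea that runs under use strict. It assumes the match is meant to be against the line and walks the matches with for. The helper name unique_sorted_words and the in-memory demo data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from $fh, whose words are assumed to
# arrive in sorted order, and return each acceptable word once.
sub unique_sorted_words {
    my ($fh) = @_;
    my $prev = "";
    my @out;
    while (my $line = <$fh>) {
        # m//g in list context yields every match; "for" visits each once.
        for my $word (lc($line) =~ /[a-z]{2,}/g) {
            next if $word =~ /(.)\1\1\1\1/;   # same letter 5 times in a row
            next if $word eq $prev;           # duplicate of the previous word
            push @out, $word;
            $prev = $word;
        }
    }
    return @out;
}

# Demo on an in-memory file; for real input pass \*STDIN instead.
my $text = "apple\nApple\nbanana\nbanana\ncherry\n";
open my $in, '<', \$text or die "cannot open in-memory file: $!";
print "$_\n" for unique_sorted_words($in);
close $in;
```

This only removes duplicates that sit next to each other, so it depends entirely on the input already being sorted.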
Perl --((8:>*

Re^2: Creating Dictionaries
by davidrw (Prior) on Dec 16, 2005 at 14:12 UTC
    This line has some problems:
    while (my ($word) = (lc $word) =~ /[a-z]{2,}/g) {
    It won't compile under use strict; I think you meant to pattern match against lc $line, and there are no parens in the regex to capture anything ...

    Working off your general idea, I came up with this (no clue how it rates performance-wise against the OP's code or my other solution below):
        my %hash;
        while (my $line = <STDIN>) {
            foreach my $word ( $line =~ m/\b([a-zA-Z]{2,4})\b/g ) {
                $hash{lc $word}++;
            }
        }
        print "$_\n" for sort keys %hash;

    Which can be rewritten as:

        while (my $line = <STDIN>) {
            $hash{lc $_}++ for $line =~ m/\b([a-zA-Z]{2,4})\b/g;
        }
        # or
        do { $hash{lc $_}++ for m/\b([a-zA-Z]{2,4})\b/g } for <STDIN>;

    Update: Doh. I misread the /(\w)\1\1\1\1/ regex as 5+ letters instead of 5+ of the _same_ letter in a row. If such words don't come up very often, it might be better to just exclude them at the end:
        while (my $line = <STDIN>) {
            $hash{lc $_}++ for $line =~ m/([a-zA-Z]{2,})/g;
        }
        delete $hash{$_} for grep /(\w)\1{4}/, keys %hash;
      There's no need for parentheses if you use m//g in list context (as I've done). However, I should have used a for, not a while.
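A small sketch of the list-context behaviour being described, using a made-up sample line:

```perl
use strict;
use warnings;

my $line = "The quick brown fox";

# No parentheses: in list context m//g returns each whole match.
my @whole = lc($line) =~ /[a-z]{2,}/g;

# With parentheses: it returns the captured groups instead, which here
# is the same text.
my @captured = lc($line) =~ /([a-z]{2,})/g;

print "@whole\n";
print "@captured\n";

# "for" evaluates the match once and walks the resulting list.
# A "while (my ($w) = ... =~ /.../g)" would redo the whole list match on
# every pass and never terminate on a line that has a match.
my @seen;
for my $word (lc($line) =~ /[a-z]{2,}/g) {
    push @seen, $word;
}
print scalar(@seen), " words\n";
```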

      Now, your solution only grabs words that are 2, 3 or 4 letters long. That is a restriction the OP didn't have: he eliminates words that contain the same letter five times in a row (not words of five letters!)
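To make the difference between the two rules concrete, here is a rough sketch. The helper names and sample words are invented for illustration, and the length check is anchored with \A and \z because each sample is tested as a single word rather than searched for inside a line.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole word is 2 to 4 letters long.
sub is_2_to_4_letters {
    my ($word) = @_;
    return $word =~ /\A[a-zA-Z]{2,4}\z/ ? 1 : 0;
}

# Hypothetical helper: true if one character occurs 5 times in a row.
sub has_five_in_a_row {
    my ($word) = @_;
    return $word =~ /(.)\1{4}/ ? 1 : 0;
}

for my $word (qw(to tree dictionary zzzzzz)) {
    printf "%-10s  2-4 letters: %d  five in a row: %d\n",
        $word, is_2_to_4_letters($word), has_five_in_a_row($word);
}
```

A word like "dictionary" fails the length rule but has no five-letter run, so only the OP's rule keeps it.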

      Perl --((8:>*