It doesn't necessarily require two passes.
It seems reasonable to assume that a dictionary file is sorted case-insensitively (and simple to make it so if it isn't already). Then you can read the file building up a list of words beginning with letter 'a'. As soon as you come across the first word beginning with 'b', make your random selection of 100 words beginning with 'a'and then discard the word list and re-initialise it with the 'b' word just read. Repeat until you've gone through the alphabet.
That way the job is done with a single pass but you don't have to keep the entire dictionary in memory, just one letter's worth.
use strict;
use warnings;
my $howMany = shift || 100;
my $done = 0;
my $rxValidWord = qr{^([A-Za-z])[-a-z]+};
my $totalWordCt = 0;
my $validWordCt = 0;
my %letterCts = ();
my @letters = (q{a} .. q{z});
my $currentLetter = q{};
my $rxCurrent = getNextLetter();
my @wordCache = ();
my $dictFile = q{../Web2};
open my $dictFH, q{<}, $dictFile
or die qq{open: $dictFile: $!\n};
while(<$dictFH>)
{
$totalWordCt ++;
next if $done;
next unless /$rxValidWord/;
my $initLetter = $1;
$validWordCt ++;
chomp;
if($initLetter =~ $rxCurrent)
{
push @wordCache, $_;
}
else
{
reportForLetter();
}
}
reportForLetter() if $currentLetter;
close $dictFH
or die qq{close: $dictFile: $!\n};
print
qq{\n},
qq{Total words in dictionary - $totalWordCt\n},
qq{Valid words processed - $validWordCt\n};
sub generateSlice
{
my ($total, $howMany) = @_;
my $rlPool = [0 .. ($total - 1)];
return $rlPool if $total <= $howMany;
my $rlSlice = [];
for (1 .. $howMany)
{
push @$rlSlice, splice @$rlPool, int rand scalar @$rlPool, 1;
}
return [sort {$a <=> $b} @$rlSlice];
}
sub getNextLetter
{
if(@letters)
{
$currentLetter = shift @letters;
my $charClass =
q{[}
. $currentLetter
. uc $currentLetter
. q{]};
return qr{$charClass};
}
else
{
$currentLetter = q{};
$done ++;
return 0;
}
}
sub reportForLetter
{
my $savedWord = $_;
print
qq{\n},
qq{Found @{[scalar @wordCache]} words },
qq{for letter $currentLetter\n};
my $rlSlice = generateSlice(scalar @wordCache, $howMany);
my @randomWords = @wordCache[@$rlSlice];
print qq{$howMany words selected at random\n};
print qq{ $_\n} for @randomWords;
@wordCache = ($savedWord);
$rxCurrent = getNextLetter();
}
When run to look for 5 random word it produces this
Found 17059 words for letter a
5 words selected at random
accede
acrimoniously
Agade
amenable
antibasilican
Found 11025 words for letter b
5 words selected at random
banuyo
basichromatin
beclothe
Beethovenish
bilamellate
Found 19859 words for letter c
5 words selected at random
celebratedness
centrical
connectival
contraflexure
copepodan
Found 10841 words for letter d
5 words selected at random
deuteroscopy
didymous
digit
disinfector
dressership
Found 8714 words for letter e
5 words selected at random
electrometallurgy
eremic
erethistic
evident
exhortatively
Found 6852 words for letter f
5 words selected at random
fermorite
fetal
flinthearted
foder
forwarding
Found 6834 words for letter g
5 words selected at random
geranial
goldentop
gonimous
gude
guttiferous
Found 8982 words for letter h
5 words selected at random
haffle
hierophant
homeotypical
homologate
hydrocyanate
Found 8785 words for letter i
5 words selected at random
ignobleness
incommensurateness
inestimability
infundibular
isotonic
Found 1569 words for letter j
5 words selected at random
jacketwise
jeddock
Jesse
jiffle
jilter
Found 2218 words for letter k
5 words selected at random
kefir
kneelingly
kongoni
kotwalee
kyphosis
Found 6231 words for letter l
5 words selected at random
lablab
ladder
leeches
loreal
lyopomatous
Found 12529 words for letter m
5 words selected at random
marquisal
mnemonical
moribundity
murgeon
murine
Found 6746 words for letter n
5 words selected at random
nastika
nominee
nondefinition
nonemploying
Normanist
Found 7833 words for letter o
5 words selected at random
officious
oligarch
ostiate
overaccuracy
overlanguaged
Found 24409 words for letter p
5 words selected at random
palustral
Pherophatta
precausation
presubstitution
probaseball
Found 1150 words for letter q
5 words selected at random
quadrual
queery
quidditative
quinaldinic
quira
Found 9594 words for letter r
5 words selected at random
racemate
ringleaderless
risper
roadite
rotundness
Found 25029 words for letter s
5 words selected at random
scrapman
silvertop
stepmotherly
sticheron
subexcite
Found 12907 words for letter t
5 words selected at random
tailward
terzina
thrimble
throughganging
translatable
Found 16385 words for letter u
5 words selected at random
underfortify
underminister
uninjurious
uniunguiculate
upcloser
Found 3409 words for letter v
5 words selected at random
velveted
victual
vilely
vivaciously
volcanist
Found 3925 words for letter w
5 words selected at random
whelklike
winningness
wireworking
wiseacre
wusp
Found 383 words for letter x
5 words selected at random
xanthine
xanthopsin
Xenomi
xerotherm
xyst
Found 669 words for letter y
5 words selected at random
ya
yardkeep
yearnfulness
yentnite
ypsiliform
Found 947 words for letter z
5 words selected at random
Zalophus
zedoary
zelatrice
zincotype
Zoophagineae
Total words in dictionary - 234936
Valid words processed - 234884
Cheers,
JohnGG
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.