Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
It doesn't necessarily require two passes.

It seems reasonable to assume that a dictionary file is sorted case-insensitively (and simple to make it so if it isn't already). Then you can read the file building up a list of words beginning with letter 'a'. As soon as you come across the first word beginning with 'b', make your random selection of 100 words beginning with 'a'and then discard the word list and re-initialise it with the 'b' word just read. Repeat until you've gone through the alphabet.

That way the job is done with a single pass but you don't have to keep the entire dictionary in memory, just one letter's worth.

use strict; use warnings; my $howMany = shift || 100; my $done = 0; my $rxValidWord = qr{^([A-Za-z])[-a-z]+}; my $totalWordCt = 0; my $validWordCt = 0; my %letterCts = (); my @letters = (q{a} .. q{z}); my $currentLetter = q{}; my $rxCurrent = getNextLetter(); my @wordCache = (); my $dictFile = q{../Web2}; open my $dictFH, q{<}, $dictFile or die qq{open: $dictFile: $!\n}; while(<$dictFH>) { $totalWordCt ++; next if $done; next unless /$rxValidWord/; my $initLetter = $1; $validWordCt ++; chomp; if($initLetter =~ $rxCurrent) { push @wordCache, $_; } else { reportForLetter(); } } reportForLetter() if $currentLetter; close $dictFH or die qq{close: $dictFile: $!\n}; print qq{\n}, qq{Total words in dictionary - $totalWordCt\n}, qq{Valid words processed - $validWordCt\n}; sub generateSlice { my ($total, $howMany) = @_; my $rlPool = [0 .. ($total - 1)]; return $rlPool if $total <= $howMany; my $rlSlice = []; for (1 .. $howMany) { push @$rlSlice, splice @$rlPool, int rand scalar @$rlPool, 1; } return [sort {$a <=> $b} @$rlSlice]; } sub getNextLetter { if(@letters) { $currentLetter = shift @letters; my $charClass = q{[} . $currentLetter . uc $currentLetter . q{]}; return qr{$charClass}; } else { $currentLetter = q{}; $done ++; return 0; } } sub reportForLetter { my $savedWord = $_; print qq{\n}, qq{Found @{[scalar @wordCache]} words }, qq{for letter $currentLetter\n}; my $rlSlice = generateSlice(scalar @wordCache, $howMany); my @randomWords = @wordCache[@$rlSlice]; print qq{$howMany words selected at random\n}; print qq{ $_\n} for @randomWords; @wordCache = ($savedWord); $rxCurrent = getNextLetter(); }

When run to look for 5 random word it produces this

Found 17059 words for letter a 5 words selected at random accede acrimoniously Agade amenable antibasilican Found 11025 words for letter b 5 words selected at random banuyo basichromatin beclothe Beethovenish bilamellate Found 19859 words for letter c 5 words selected at random celebratedness centrical connectival contraflexure copepodan Found 10841 words for letter d 5 words selected at random deuteroscopy didymous digit disinfector dressership Found 8714 words for letter e 5 words selected at random electrometallurgy eremic erethistic evident exhortatively Found 6852 words for letter f 5 words selected at random fermorite fetal flinthearted foder forwarding Found 6834 words for letter g 5 words selected at random geranial goldentop gonimous gude guttiferous Found 8982 words for letter h 5 words selected at random haffle hierophant homeotypical homologate hydrocyanate Found 8785 words for letter i 5 words selected at random ignobleness incommensurateness inestimability infundibular isotonic Found 1569 words for letter j 5 words selected at random jacketwise jeddock Jesse jiffle jilter Found 2218 words for letter k 5 words selected at random kefir kneelingly kongoni kotwalee kyphosis Found 6231 words for letter l 5 words selected at random lablab ladder leeches loreal lyopomatous Found 12529 words for letter m 5 words selected at random marquisal mnemonical moribundity murgeon murine Found 6746 words for letter n 5 words selected at random nastika nominee nondefinition nonemploying Normanist Found 7833 words for letter o 5 words selected at random officious oligarch ostiate overaccuracy overlanguaged Found 24409 words for letter p 5 words selected at random palustral Pherophatta precausation presubstitution probaseball Found 1150 words for letter q 5 words selected at random quadrual queery quidditative quinaldinic quira Found 9594 words for letter r 5 words selected at random racemate ringleaderless risper roadite rotundness Found 25029 words for letter s 5 words selected at random scrapman silvertop stepmotherly sticheron subexcite Found 12907 words for letter t 5 words selected at random tailward terzina thrimble throughganging translatable Found 16385 words for letter u 5 words selected at random underfortify underminister uninjurious uniunguiculate upcloser Found 3409 words for letter v 5 words selected at random velveted victual vilely vivaciously volcanist Found 3925 words for letter w 5 words selected at random whelklike winningness wireworking wiseacre wusp Found 383 words for letter x 5 words selected at random xanthine xanthopsin Xenomi xerotherm xyst Found 669 words for letter y 5 words selected at random ya yardkeep yearnfulness yentnite ypsiliform Found 947 words for letter z 5 words selected at random Zalophus zedoary zelatrice zincotype Zoophagineae Total words in dictionary - 234936 Valid words processed - 234884

Cheers,

JohnGG


In reply to Re^2: improving the efficiency of a script by johngg
in thread improving the efficiency of a script by sulfericacid

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (3)
As of 2024-04-19 20:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found