Wordlist maker

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Wordlist maker by chromatic (Archbishop) on Sep 17, 2000 at 07:44 UTC
Instead of the join, try slurp mode. See local and $/ (the latter is documented in perlvar). Instead of using s///, try tr///; it's more efficient. Always check the return values of system calls, like open. An array in scalar context gives the number of elements. `my $file; my $out = 'wordlist.txt'; { local $/; $file = <>; } $file =~ tr/\n / /s; $file =~ tr/A-Za-z0-9 //dc; my %wordlist; $wordlist{$_}++ foreach (split ' ', $file); open(LIST, ">$out") or die "Can't open $out: $!"; print LIST join("\n", keys %wordlist); close LIST; print scalar(keys %wordlist), " words found. Saved in $out\n";` That's untested, but that's how I'd do it. (Minus any bugs, of course.) Update: Removed the problematic /d switch from the first tr/// statement, prompted by turnstep's defense of his more comprehensive post.
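A side note on two points from chromatic's post: a minimal sketch of slurp mode and of counting hash keys in scalar context. The in-memory filehandle and the sample text are assumptions made so the snippet runs on its own; the post itself reads from `<>`.

```perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle.
my $data = "alpha beta\ngamma alpha\n";
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";

# Localizing $/ leaves it undefined inside the block,
# so a single read returns the whole input.
my $file = do { local $/; <$fh> };
close $fh;

my %wordlist;
$wordlist{$_}++ foreach split ' ', $file;

# keys in scalar context yields the number of distinct words.
my $count = keys %wordlist;
print "$count words found\n";
```

Because `local` restores `$/` when the block ends, later line-by-line reads elsewhere in the program are unaffected.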
Re: Wordlist maker by merlyn (Sage) on Sep 17, 2000 at 12:41 UTC
`print LIST join("\n", keys %wordlist);` Hmm. That leaves the final newline off the file. Perhaps you wanted this: `print LIST "$_\n" for keys %wordlist;` or perhaps `print LIST "$_\n" while $_ = each %wordlist;` or, going the other direction in efficiency (worse {grin}): `print LIST map "$_\n", keys %wordlist;` -- Randal L. Schwartz, Perl hacker
RE: Re: Wordlist maker by runrig (Abbot) on Sep 17, 2000 at 20:12 UTC
There's something to be said for having only one print, efficiency-wise, though I haven't benchmarked, so: `print LIST join("\n", keys %wordlist, '');`
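To make the trailing-newline difference concrete, here is a small sketch. The three-word `%wordlist` is hypothetical, and the `sort` is only there so the result is reproducible; neither comes from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real %wordlist is built from the input file.
my %wordlist = (apple => 1, banana => 2, cherry => 1);

# Sorting is only here to make the output order deterministic.
my @words = sort keys %wordlist;

# Joining the keys alone leaves the last line unterminated.
my $without = join("\n", @words);

# An extra empty string at the end of the list makes join emit a final "\n".
my $with = join("\n", @words, '');

printf "without '': ends in newline? %s\n", ($without =~ /\n\z/ ? 'yes' : 'no');
printf "with '':    ends in newline? %s\n", ($with    =~ /\n\z/ ? 'yes' : 'no');
```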
(jcwren) RE: (3) Wordlist maker by jcwren (Prior) on Sep 17, 2000 at 20:20 UTC
Actually, that's an interesting question. With the one large string, you have the overhead of allocating memory to append to the string. I don't know any details of the internals of the memory management involved in that, but we know there is some overhead. On the other hand, multiple prints with newlines will cause the stdio routines to flush to the file or console, so you're invoking the overhead of the system I/O routines for each line, as opposed to waiting for the one big line. And if it's not flushing (that is, `$| = 1` is not set), then you still have the overhead of the buffer management within stdio. Anyone know any more details on that? Is it more efficient to let Perl do its memory management on a big string, or let stdio do its thing? --Chris
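One way to get numbers rather than guesses is the core Benchmark module. The following is only a sketch under assumptions: the 20,000 made-up words, the temporary file, and the names `write_with`, `join_once`, `per_word` and `map_list` are all hypothetical, and results will vary with the Perl build, the I/O layer and the filesystem.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Hypothetical data set: 20,000 made-up "words".
my %wordlist = map { ("word$_" => 1) } 1 .. 20_000;

# A scratch file that is removed when the program exits.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close $tmp_fh;

# Open the scratch file, let the given writer fill it, and close it again.
sub write_with {
    my ($writer) = @_;
    open(my $fh, '>', $tmp_name) or die "Can't open $tmp_name: $!";
    $writer->($fh);
    close($fh) or die "Can't close $tmp_name: $!";
}

# The three printing styles discussed in this thread.
my %variants = (
    join_once => sub { my $fh = shift; print {$fh} join("\n", keys %wordlist, '') },
    per_word  => sub { my $fh = shift; print {$fh} "$_\n" for keys %wordlist },
    map_list  => sub { my $fh = shift; print {$fh} map "$_\n", keys %wordlist },
);

# Wrap each variant so that opening and closing the file is timed as well.
my %timed = map {
    my $name = $_;
    ($name => sub { write_with($variants{$name}) });
} keys %variants;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, \%timed);
```

Timing the open and close together with the print is a deliberate choice here, since the buffer is only guaranteed to reach the file once the handle is closed.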
RE: RE: Re: Wordlist maker by merlyn (Sage) on Sep 17, 2000 at 20:14 UTC
Well, in that case, go with my slow one: `print LIST map "$_\n", keys %wordlist;` At least, I think that'll be slightly faster than having one big fat string. Update: Duh. Apparently not. So much for my gut-level feel. Don't trust me anymore, I guess. {grin} -- Randal L. Schwartz, Perl hacker
RE: RE: RE: Re: Wordlist maker by runrig (Abbot) on Sep 17, 2000 at 20:41 UTC
Re: Wordlist maker by turnstep (Parson) on Sep 17, 2000 at 17:30 UTC
A quick and simple way, especially if you don't want to read the whole file into memory first, would be: `s/([A-Z0-9]{5,})/$seenit{$1}++ or print "$1\n"/egi while <>;` Better yet, save the printing until the end, so you can sort the words alphabetically, or perhaps by the number of appearances: `s/([A-Z0-9]{5,})/$seenit{$1}++/egi while <>;` Sorted by name: `for (sort keys %seenit) { print "$_: $seenit{$_}\n"; }` Sorted by frequency, then by name: `for (sort {$seenit{$a} <=> $seenit{$b} or $a cmp $b} keys %seenit) { print "$_: $seenit{$_}\n"; }` As a final suggestion, you may want to disregard the case of the words, in which case you'd want to use `$seenit{lc $1}`. Probably best, as words at the start of a sentence tend to be capitalized.
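The case-insensitive variant can be sketched as a small function. Note that this uses a plain match loop rather than turnstep's s///e trick, and that `count_words` and the sample lines are hypothetical names and data introduced only for illustration.

```perl
use strict;
use warnings;

# Count words of five or more alphanumeric characters, ignoring case.
sub count_words {
    my (@lines) = @_;
    my %seenit;
    for my $line (@lines) {
        # lc folds "World" and "WORLD" into the same hash key.
        $seenit{lc $1}++ while $line =~ /([A-Za-z0-9]{5,})/g;
    }
    return \%seenit;
}

# Hypothetical sample input.
my $counts = count_words("Hello there, hello World!\n", "WORLD of words\n");

# Sorted by frequency, then by name, as in the post above.
for my $word (sort { $counts->{$a} <=> $counts->{$b} or $a cmp $b } keys %$counts) {
    print "$word: $counts->{$word}\n";
}
```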
Re: Wordlist maker by Anonymous Monk on Sep 17, 2000 at 20:02 UTC
Thanks for all your replies. It looks like people here at perlmonks.org really like to help beginners like me :) Well, I've benchmarked all the suggestions, and the fastest is chromatic's suggestion.
`four: 24 wallclock secs (20.96 usr + 2.30 sys = 23.26 CPU)`
`one: 30 wallclock secs (28.59 usr + 1.79 sys = 30.38 CPU)`
`three: 22 wallclock secs (19.01 usr + 2.29 sys = 21.30 CPU)`
`two: 15 wallclock secs (13.54 usr + 1.72 sys = 15.26 CPU)`
one: my original code; two: chromatic's code; three: turnstep's code; four: turnstep's code, using merlyn's way to print.
RE: Re: Wordlist maker by turnstep (Parson) on Sep 17, 2000 at 22:47 UTC
In my code's humble defense, I'd like to point out three things: (1) My code was written for large files, to avoid slurping every line into memory. (2) Chromatic's code as written will not work (there should not be a space after the \n in the first trans). (3) Chromatic's code does not check for words that are five or more letters.
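The second point can be illustrated with a short sketch. It assumes the statement originally had the same search and replacement lists plus the /d switch that chromatic's update mentions removing; the exact original is not shown in the thread, and the sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = "one two\nthree  four\n";

# With /d, search-list characters that have no counterpart in the
# replacement list are deleted: "\n" maps to " ", but " " is removed,
# so words on the same line run together.
(my $with_d = $text) =~ tr/\n / /sd;

# Without /d, the last replacement character is reused, so both "\n"
# and " " become " ", and /s squeezes each run down to a single space.
(my $without_d = $text) =~ tr/\n / /s;

print "with /d:    [$with_d]\n";
print "without /d: [$without_d]\n";
```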
RE: RE: Re: Wordlist maker by Anonymous Monk on Sep 18, 2000 at 09:00 UTC
I've already made the correction and added the length check before benchmarking. BTW, how come I can't register here at perlmonks.org? I've tried to register twice, and I didn't receive the email with my password either time...
No registration e-mail (Re: Wordlist maker) by tye (Sage) on Sep 18, 2000 at 09:22 UTC
RE: Wordlist maker by Zarathustra (Beadle) on Sep 18, 2000 at 03:13 UTC
Hello. How about: `open(LIST, ">wordlist.txt") or die "Can't open wordlist.txt: $!"; my $i = 0; while (<>) { s/(\W|[1-9])//g; length($_) >= 5 or next; $i++; print LIST "$_\n"; } close(LIST); print "$i words found. Saved in wordlist.txt\n";`
Re: Wordlist maker by shlomoy (Novice) on Sep 18, 2000 at 13:45 UTC
`$file =~ s/[^\w\s]//g; my @words = split(/\s+/, $file); my @good_words = (); foreach (@words) { push @good_words, $_ if length($_) >= 5; }` The substitution removes every character in the file that is neither a word character nor whitespace, the split puts all the words in @words, and the loop loses words shorter than 5 characters. Do with @good_words whatever you want.
Re: Wordlist maker by Anonymous Monk on Apr 03, 2020 at 11:38 UTC
`my $startingNum = 1_000_000_000_000; my $EndNum = 9_999_000_000_000_000; my $Bytes = ($EndNum - $startingNum) * 17; my $KiloBytes = $Bytes / 1024; my $MegaByte = $KiloBytes / 1024; my $GigaByte = $MegaByte / 1024; my $Terabyte = $GigaByte / 1024; print "The file will take up: ", $KiloBytes, "kb\n", $MegaByte, "mb\n", $GigaByte, "gb\n", $Terabyte, "TB\n"; while ($startingNum++ < $EndNum) { printf "%016d\n", $startingNum; }` (Each output line is 16 digits plus a newline, hence 17 bytes per number.) 2020-04-03 Athanasius added code tags.