Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm making a "wordlist maker" that takes input from a text file, strips all of the non-alphanumeric characters (!, ?, @, etc.), and then writes all of the words with 5 or more characters to a file. I wrote this code and would like to know how to improve it, since I'm new to Perl:
    $file = join('',<>);
    $file =~ s/\n/ /g;
    $file =~ s/ +/ /g;
    @file = split(' ',$file);
    foreach $word (@file) {
        $word =~ s/[\!\@\#\$\%\^\&\*\(\)\_\+\=\|\\\~\`\[\]\{\}\:\;\'\"\,\. +\<\>\/\?1-9]//g;
        (length($word)>=5)?$wordlist{lc($word)}=1:next;
    }
    open(LIST,">wordlist.txt");
    foreach $word (keys %wordlist) {
        $i++;
        print LIST "$word\n";
    }
    close(LIST);
    print "$i words found. Saved in wordlist.txt\n";

Re: Wordlist maker
by chromatic (Archbishop) on Sep 17, 2000 at 07:44 UTC
    Instead of the join, try slurp mode. See local and $/ (the latter is documented in perlvar).

    Instead of using s///, try tr///. It's more efficient.

    Always check the return values of system calls, like open. An array in scalar context gives the number of elements.

    my $file;
    my $out = 'wordlist.txt';
    {
        local $/;
        $file = <>;
    }
    $file =~ tr/\n / /s;
    $file =~ tr/A-Za-z0-9 //dc;
    my %wordlist;
    $wordlist{$_}++ foreach (split ' ', $file);
    open(LIST, ">$out") or die "Can't open $out: $!";
    print LIST join("\n", keys %wordlist);
    close LIST;
    print scalar(keys %wordlist), " words found. Saved in $out\n";
    That's untested, but that's how I'd do it. (Minus any bugs, of course.)

    Update: Removed the problematic /d switch from the first tr/// statement, prompted by turnstep's defense of his more comprehensive post.

Re: Wordlist maker
by merlyn (Sage) on Sep 17, 2000 at 12:41 UTC
    print LIST join("\n", keys %wordlist);
    Hmm. That leaves the final newline off the file. Perhaps you wanted this:
    print LIST "$_\n" for keys %wordlist;
    or perhaps
    print LIST "$_\n" while $_ = each %wordlist;
    or going the other direction in efficiency (worse {grin}):
    print LIST map "$_\n", keys %wordlist;

    -- Randal L. Schwartz, Perl hacker

      There's something to be said for having only one print, efficiency-wise, though I haven't benchmarked it, so:

      print LIST join("\n", keys %wordlist, '');
        Actually, that's an interesting question. With the one large string you have the overhead of allocating memory to append to the string. I don't know any details of the internals of the memory management involved in that, but we know there is some overhead.

        On the other hand, multiple prints with carriage returns will cause the stdio routines to flush to the file or console, so you're invoking the overhead of the system I/O routines for each line, as opposed to waiting for the one big line. And if it's not flushing ($| = 0), then you still have the overhead of the buffer management within stdio.

        Anyone know any more details on that? Is it more efficient to let Perl do its memory management on a big string, or let stdio do its thing?

        --Chris

        e-mail jcwren
        Well, in that case, go with my slow one:
        print LIST map "$_\n", keys %wordlist;
        At least, I think that'll be slightly faster than having one big fat string.
        Update: duh. apparently not. So much for my gut level feel. Don't trust me anymore, I guess. {grin}

        -- Randal L. Schwartz, Perl hacker
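
A rough way to settle the question above is to time the three printing styles with the core Benchmark module. This is only a sketch under stated assumptions: the 10,000 keys are made up, the sub names are hypothetical, and the output goes to an in-memory filehandle, so it does not exercise stdio or the operating system the way a real file would. Results on real files may differ.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up test data: 10,000 distinct keys (an assumption, not from the thread).
my %wordlist = map { ("word$_" => 1) } 1 .. 10_000;

# One big string, with a trailing '' so the last line also gets a newline.
sub with_join {
    open(my $fh, '>', \my $buf) or die "Can't open in-memory handle: $!";
    print {$fh} join("\n", keys %wordlist, '');
    close $fh;
    return $buf;
}

# One print call per key.
sub with_loop {
    open(my $fh, '>', \my $buf) or die "Can't open in-memory handle: $!";
    print {$fh} "$_\n" for keys %wordlist;
    close $fh;
    return $buf;
}

# One print call that receives a list built by map.
sub with_map {
    open(my $fh, '>', \my $buf) or die "Can't open in-memory handle: $!";
    print {$fh} map "$_\n", keys %wordlist;
    close $fh;
    return $buf;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    join => \&with_join,
    loop => \&with_loop,
    map  => \&with_map,
});
```

To measure the effect of real I/O, the in-memory handles could be swapped for a temporary file opened once per call.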

Re: Wordlist maker
by turnstep (Parson) on Sep 17, 2000 at 17:30 UTC

    A quick and simple way, especially if you don't want to read the whole file into memory first, would be:

    s/([A-Z0-9]{5,})/$seenit{$1}++ or print "$1\n"/egi while <>;

    Better yet, save the printing until the end, so you can sort the words alphabetically, or perhaps by the number of appearances:

    s/([A-Z0-9]{5,})/$seenit{$1}++/egi while <>;

    ## Sorted by name
    for (sort keys %seenit) {
        print "$_: $seenit{$_}\n";
    }

    ## Sorted by frequency, then by name:
    for (sort {$seenit{$a} <=> $seenit{$b} or $a cmp $b} keys %seenit) {
        print "$_: $seenit{$_}\n";
    }

    As a final suggestion, you may want to disregard the case of the words, in which case you'd want to use $seenit{lc $1}. Probably best, as words at the start of a sentence tend to be capitalized.
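
The case-insensitive, line-by-line counting described above might look roughly like this. It is a sketch, not turnstep's code: the helper name count_long_words and the sample text are made up for illustration, and an in-memory filehandle stands in for a real input file.

```perl
use strict;
use warnings;

# Hypothetical helper: reads $fh one line at a time and returns a hashref
# mapping each lowercased word of 5+ alphanumeric characters to its count.
sub count_long_words {
    my ($fh) = @_;
    my %seenit;
    while (my $line = <$fh>) {
        # Each successful /g match in scalar context sets $1 to the next word.
        $seenit{lc $1}++ while $line =~ /([A-Za-z0-9]{5,})/g;
    }
    return \%seenit;
}

# Sample input (made up) read through an in-memory filehandle.
my $sample = "Hello world, hello Perl monks.\nMonks like hashes.\n";
open(my $in, '<', \$sample) or die "Can't open in-memory handle: $!";
my $counts = count_long_words($in);
close $in;

# Print the words alphabetically with their counts.
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Because only one line is held at a time, memory use depends on the number of distinct words rather than on the size of the file.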

Re: Wordlist maker
by Anonymous Monk on Sep 17, 2000 at 20:02 UTC
    Thanks for all your replies. It looks like people here at perlmonks.org really like to help beginners like me :) Well, I've benchmarked all the suggestions, and the fastest is chromatic's.
    four: 24 wallclock secs (20.96 usr + 2.30 sys = 23.26 CPU)
    one: 30 wallclock secs (28.59 usr + 1.79 sys = 30.38 CPU)
    three: 22 wallclock secs (19.01 usr + 2.29 sys = 21.30 CPU)
    two: 15 wallclock secs (13.54 usr + 1.72 sys = 15.26 CPU)

    one: my original code
    two: chromatic's code
    three: turnstep's code
    four: turnstep's code, using merlyn's way to print

      In my code's humble defense, I'd like to point out three things:

      1. My code was written for large files, to avoid slurping every line into memory.
      2. Chromatic's code as written will not work (there should not be a space after the \n in the first trans.)
      3. Chromatic's code does not check for words that are five or more letters.
        I've already made the correction and added the length check before benchmarking. BTW, how come I can't register here at perlmonks.org? I've tried to register twice, and I didn't receive the email with my password either time.
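
The modified code that was benchmarked is not shown in the thread. As an assumption about what the tr/// approach with a length check added could look like, here is a minimal sketch; the helper name long_words and the sample sentence are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the unique words of 5+ characters found in
# $text, lowercased and sorted. Call it in list context.
sub long_words {
    my ($text) = @_;
    $text =~ tr/\n / /s;           # newlines become spaces; runs are squeezed
    $text =~ tr/A-Za-z0-9 //dc;    # delete everything except alphanumerics and spaces
    my %wordlist;
    $wordlist{lc $_}++ for grep { length($_) >= 5 } split ' ', $text;
    return sort keys %wordlist;
}

# Made-up sample text; a real script would slurp its input instead.
print "$_\n" for long_words("Hello, world! Hello again; perl5 rocks.");
```

Note that tabs are deleted rather than treated as separators here, so tab-separated words would run together; adding \t to the first tr/// would avoid that.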
RE: Wordlist maker
by Zarathustra (Beadle) on Sep 18, 2000 at 03:13 UTC

    Hello

    How about:

    open(LIST, ">wordlist.txt");
    while (<>) {
        length($_) >= 5 or next;
        s/(\W|[1-9])//g;
        $i++;
        print LIST "$_\n";
    }
    close(LIST);
    print "$i words found. Saved in wordlist.txt\n";
Re: Wordlist maker
by shlomoy (Novice) on Sep 18, 2000 at 13:45 UTC
    $file=~s/[^\w\s]//g; ## remove all non-alphanumeric characters (but keep whitespace) from the whole file.
    @words=split( /\s+/, $file); ## put all words in @words.
    my @good_words=();
    foreach (@words) {
    push @good_words, $_ if length $_ >= 5; ## lose words shorter than 5 characters
    }
    ## do with @good_words whatever you want
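
The filtering loop above can also be written with grep. The following is a sketch only; the helper name good_words and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strips punctuation, then keeps words of 5+ characters.
sub good_words {
    my ($text) = @_;
    $text =~ s/[^\w\s]//g;    # drop anything that is neither a word character nor whitespace
    # split ' ' skips leading whitespace, so no empty first field appears.
    return grep { length($_) >= 5 } split ' ', $text;
}

print join(', ', good_words("It's a small, simple example!")), "\n";
```

Unlike the hash-based versions elsewhere in the thread, this keeps duplicates and the original word order.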
Re: Wordlist maker
by Anonymous Monk on Apr 03, 2020 at 11:38 UTC
    #!/usr/bin/perl
    # No leading zeros on the literals: Perl would parse 0001000000000000 as octal.
    $startingNum = 1000000000000;
    $EndNum = 9999000000000000;
    # Each number is written as 16 digits plus a newline, i.e. 17 bytes per line.
    $KiloBytes = ($EndNum - $startingNum) * 17 / 1024;
    $MegaByte = $KiloBytes / 1024;
    $GigaByte = $MegaByte / 1024;
    $Terabyte = $GigaByte / 1024;
    print "The File will take up: " , $KiloBytes , "kb\n" , $MegaByte , "mb\n" , $GigaByte , "gb\n" , $Terabyte , "TB\n";
    while($startingNum++ < $EndNum) {
        #print "$startingNum\n";
        #print "Writing " . $startingNum . " to file";
        printf "%016d\n", $startingNum;
    }

    2020-04-03 Athanasius added code tags.