/Silver_Wolf has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I need to remove all of the non-8-letter words from the file /usr/share/dict/words and print the output to another file. The words are listed one per line. This is for a friend who has created an anagram game for IRC; if you want to check it out, go to irc.esper.net #countdown.
-Creating order out of chaos.

Replies are listed 'Best First'.
Re: Removing all non 8 letter words from the dict/words file
by NetWallah (Canon) on Apr 09, 2006 at 01:50 UTC
    Perhaps something like this one-liner :
    perl -ne "m/^\w{8}$/ and print" InputfileName
    You could redirect STDOUT, if you need to save the file.

         "For every complex problem, there is a simple answer ... and it is wrong." --H.L. Mencken

      My initial reaction was that this was going to be less efficient than the length check, but my second reaction is that this ensures that all 8 characters are perl word characters and not hyphens, periods, apostrophes, etc. -- many of which are in the typical dictionary file.

      Checking length == 8 && ! /\A\w{8}\z/ found 2400 entries on my /usr/share/dict/words file.

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Removing all non 8 letter words from the dict/words file
by jZed (Prior) on Apr 09, 2006 at 01:37 UTC
    I'm guessing by "non 8 letter words" you mean all words that don't have exactly 8 characters? If so, open an output file and loop through the input file doing this inside the loop: print OUTFILE $_ if length $_ != 8;

    Update: oh wait, you want to remove everything except the 8-letter words (that is, create a new file that contains only those). For that you'd do as above but change != to ==.
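    A minimal sketch of the loop described above, as a complete script. It assumes one word per line, and it strips the line ending before measuring, because length would otherwise count the trailing newline as a character. The sub name keep_eight and the two command-line file names are illustrative, not anything from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return only the entries that are exactly eight characters long,
# ignoring any trailing LF or CRLF line ending.
sub keep_eight {
    my @kept;
    for my $line (@_) {
        (my $word = $line) =~ s/\r?\n\z//;   # strip the line ending before measuring
        push @kept, $word if length($word) == 8;
    }
    return @kept;
}

# Hypothetical usage: perl eight.pl /usr/share/dict/words eight.txt
if (@ARGV == 2) {
    my ($in_name, $out_name) = @ARGV;
    open my $in,  '<', $in_name  or die "Cannot read $in_name: $!";
    open my $out, '>', $out_name or die "Cannot write $out_name: $!";
    print {$out} "$_\n" for keep_eight(<$in>);
    close $out or die "Cannot close $out_name: $!";
}
```

    Note that this keeps any 8-character entry, including ones with hyphens or digits; it checks length only.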

Re: Removing all non 8 letter words from the dict/words file
by Cody Pendant (Prior) on Apr 09, 2006 at 08:59 UTC
    Aren't all the length() based solutions forgetting to chomp()? Isn't Silver Wolf going to end up with seven-letter words that way, or maybe six if it's CRLF?


    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print

      Good point, but not with the -l switch from the command line:

      $ perl -nle 'print if length == 8 && ! /\A\w{8}\z/' /usr/share/dict/words
      10-point
      11-point
      12-point
      16-point
      18-point
      20-point
      48-point
      -ability
      Abu-Bekr
      acantho-
      [snip]

      -xdg


Re: Removing all non 8 letter words from the dict/words file
by salva (Canon) on Apr 09, 2006 at 09:17 UTC
    run from the shell...
    grep '^........$' /usr/share/dict/words
      egrep is faster.

      Oh, wait, you've probably got grep aliased to egrep just like I do!

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        actually, on my linux box, grep is GNU grep, and egrep is a shell wrapper for grep:
        #!/bin/sh
        exec grep -E ${1+"$@"}
Re: Removing all non 8 letter words from the dict/words file
by /Silver_Wolf (Novice) on Apr 09, 2006 at 16:34 UTC
    I got it to work but I had to change to this:
    print OUTFILE $_ if length $_ == 9;
    Otherwise it printed the seven-letter words instead of only the 8-letter words.
    Creating chaos out of order.
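    The reason 9 works is most likely that each line read from the file still carries its trailing newline, which length counts as a character. A small sketch to illustrate, assuming Unix (LF) line endings; the sample word is just an example:

```perl
use strict;
use warnings;

my $line = "aardvark\n";          # a line as read from the file, newline included
my $raw_length = length $line;    # counts the trailing newline too
chomp(my $word = $line);          # copy the line, then remove the trailing newline
my $word_length = length $word;

print "before chomp: $raw_length, after chomp: $word_length\n";
```

    Calling chomp first and comparing against 8 is probably the safer habit, since a file with CRLF line endings would need == 10 with the unchomped approach.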
Re: Removing all non 8 letter words from the dict/words file
by /Silver_Wolf (Novice) on Apr 09, 2006 at 23:00 UTC
    What would I need to print only words that do not include a capital letter, number, -, or period?
    Thanks for all of the help so far.
    Creating chaos out of order.
      Read perlre!!!

      Only words that do not include a capital, number, dash, or period:

      while (<>) { print if /^[^A-Z0-9.-]+$/; }
      However, it's usually much easier to list the things you can include.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        the only problem with that is that the lines that I want to include have only 1 thing different than the words I want:
        They are all lower case
        ...so I didn't think you could use lower case to sort the words.
        But by all means feel free to prove me wrong.
        Creating chaos out of order.
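        For what it's worth, here is a sketch of the "list what you can include" approach: a character class of lowercase letters does the selecting directly, so capitals, digits, hyphens, and periods never match. It assumes plain ASCII a-z is what's wanted and that exactly eight letters is still the goal; the sub name is_lower_eight is made up for illustration.

```perl
use strict;
use warnings;

# True only for entries made of exactly eight lowercase ASCII letters.
sub is_lower_eight {
    my ($word) = @_;
    return $word =~ /\A[a-z]{8}\z/ ? 1 : 0;
}

# Hypothetical usage: perl lower8.pl /usr/share/dict/words > eight.txt
if (@ARGV) {
    while (my $line = <>) {
        chomp $line;                          # \z does not allow a trailing newline
        print "$line\n" if is_lower_eight($line);
    }
}
```

        Accented letters would be rejected by [a-z]; if the word list contains any that should be kept, the character class would need widening.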