Also note, it would be nice if there were a way to be just as fast, but less memory intensive, as the hash may end up holding as many as 4 million items, with keys of length 1 to 32.

There is a two-pass solution to this that uses very little memory. Parse the words from each line and output them to a pipe "|sort|uniq -d". Read the results back from that pipe and you'll have a list of duplicate words to save in a hash.
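A minimal sketch of that first pass, assuming "words" are runs of \w characters, the input file is named on the command line (so it can be reread later), and the temporary file name dups.$$ is only for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift or die "usage: $0 file\n";

    # Feed every word to the external "sort | uniq -d"; whatever comes
    # back is a word that appears more than once.
    my $tmp = "dups.$$";
    open my $sort, '|-', "sort | uniq -d > $tmp"
        or die "cannot start sort|uniq: $!";
    open my $in, '<', $file or die "cannot read $file: $!";
    while (<$in>) {
        print {$sort} "$_\n" for grep { length } split /\W+/;
    }
    close $in;
    close $sort or die "sort | uniq -d failed";

    # Load the (relatively few) duplicate words into a hash.
    my %dup;
    open my $dups, '<', $tmp or die "cannot read $tmp: $!";
    while (<$dups>) {
        chomp;
        $dup{$_} = 0;
    }
    close $dups;
    unlink $tmp;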
The second time through your file you compare the words to that hash, something like:
    if (!exists $dup{$_} || $dup{$_}++ == 0) { print }

If you know that STDIN is seekable (i.e. a disk file, not a pipe, socket, or terminal), you can seek STDIN, 0, 0 to rewind it. Otherwise you'll have to write a copy of the data somewhere for your second pass.
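Continuing the sketch above, the second pass could look like the following; it rereads the same file (or you could seek STDIN, 0, 0 as described, if the data is on a seekable STDIN) and keeps only the first occurrence of each word:

    # Second pass: print each word only the first time it is seen,
    # preserving the original order.
    open my $in2, '<', $file or die "cannot reread $file: $!";
    while (<$in2>) {
        for my $word (grep { length } split /\W+/) {
            # Words that never repeat, plus the first occurrence of
            # words that do repeat.
            print "$word\n" if !exists $dup{$word} || $dup{$word}++ == 0;
        }
    }
    close $in2;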
If what you are really after is a list of the unique words in a file and you don't care about the order or line breaks, you can just parse the words out and pipe them to "sort -u".
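As a rough illustration of that, a one-liner under the same word-splitting assumption, with a hypothetical input file words.txt:

    perl -lne 'print for grep { length } split /\W+/' words.txt | sort -u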