in reply to Removing repeated words
Also note, it would be nice if there's a way to be just as fast, but less memory intensive, as the hash may end up holding as many as 4 million items, with keys of length 1 to 32.

There is a two-pass solution to this that uses very little memory. Parse the words from each line and output them to a pipe, "| sort | uniq -d". Read the results of that pipe back in and you'll have a list of the duplicate words to save in a hash.
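A rough sketch of that first pass, assuming whitespace-separated words on STDIN and a scratch file name (dups.txt) picked here only for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # First pass: stream every word through an external "sort | uniq -d",
    # which writes the words occurring more than once to a scratch file.
    # The scratch file name and the whitespace splitting are assumptions.
    open my $pipe, '|-', 'sort | uniq -d > dups.txt'
        or die "cannot start sort|uniq: $!";
    while (my $line = <STDIN>) {
        print {$pipe} "$_\n" for split ' ', $line;
    }
    close $pipe or die "sort|uniq failed: $?";

    # Read the duplicates back; only these words need to live in memory.
    my %dup;
    open my $in, '<', 'dups.txt' or die "cannot read dups.txt: $!";
    while (my $word = <$in>) {
        chomp $word;
        $dup{$word} = 0;
    }
    close $in;

Only the duplicated words end up in %dup, so the hash stays much smaller than one holding all 4 million items.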
The second time through your file you compare the words to that hash, something like:
    if (!exists $dup{$_} || $dup{$_}++ == 0) { print "$_ " }

If you know that STDIN is seekable (i.e. a disk file, not a pipe, socket, or terminal), you can seek STDIN, 0, 0 to rewind it. Otherwise you'll have to write a copy of the data somewhere for your second pass.
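Continuing the sketch above, the second pass could look something like this (again assuming STDIN is a seekable disk file and the same whitespace splitting):

    # Second pass: rewind the input and print each word unless it is a
    # duplicate that has already been printed once.
    seek STDIN, 0, 0 or die "STDIN is not seekable: $!";
    while (my $line = <STDIN>) {
        for my $word (split ' ', $line) {
            print "$word " if !exists $dup{$word} || $dup{$word}++ == 0;
        }
        print "\n";
    }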
If what you are really after is a list of the unique words in a file, and you don't care about order or line breaks, you can just parse the words out and pipe them to "sort -u".
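For example, a quick command-line sketch (input.txt stands in for your file, and the autosplit on plain whitespace is an assumption about what counts as a word):

    perl -lane 'print for @F' input.txt | sort -u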
Replies are listed 'Best First'.
Re^2: Removing repeated words
by Aristotle (Chancellor) on Sep 21, 2002 at 08:51 UTC
by Thelonius (Priest) on Oct 03, 2002 at 21:30 UTC |