in reply to Removing repeated lines from file
It seems like the two big obstacles are 1) the duplicate lines are not necessarily adjacent, and you cannot sort the file to make them so, and 2) there's too much data to hold it all "in place".
What if we could get around obstacle 2? Perhaps if we used some lossless compression on your input, we could reduce its storage requirement. If the compression is lossless (i.e., the original can be reconstructed with perfect fidelity from its compressed image), then if we compress two unique lines, their compressed results must also be unique.
Depending on how much compression you are able to get, you may very well be able to process your input "in memory".
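A minimal sketch of that idea in Perl, assuming the standard Compress::Zlib module (the module choice is just an illustration): each line's compressed image is used as a hash key, so only the first occurrence of each line is printed.

```perl
use strict;
use warnings;
use Compress::Zlib;

my %seen;
while ( my $line = <> ) {
    # Lossless compression guarantees that two distinct lines
    # produce distinct compressed images, so the keys stay unique.
    my $key = compress($line);
    print $line unless $seen{$key}++;
}
```

One caveat: zlib adds a few bytes of per-string overhead, so very short lines can "compress" to something larger than the original; the savings only show up on long, redundant lines.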
OK, I guess it doesn't really solve the storage problem per se, it just sidesteps it. It's possible that even with compression, your input stream is simply too big.