in reply to Filtering very large files using Tie::File
Besides using the techniques already mentioned, I'd reduce the data by extracting e.g. the MD4 checksum of the relevant columns. The byte offset and length of each line also go into this intermediate file.
The intermediate file is then processed the other way round, to list all but one of the duplicates in each group. The extracted information is used to overwrite the duplicate lines in the original 1 GB file, e.g. with spaces, using seek() for positioning. This is the computer analogue of crossing lines out on paper with a ruler and pen.
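A minimal sketch of that idea in Perl, assuming tab-separated input with Unix line endings, duplicates defined by the first two columns, and Digest::MD5 standing in for MD4; for brevity the checksums are kept in a hash instead of a separate intermediate file, and the file name big.txt is a placeholder:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $file = 'big.txt';            # placeholder file name

# Pass 1: record byte offset and length of every line whose key columns
# have already been seen.
open my $in, '<', $file or die "open $file: $!";
my %seen;                        # digest of key columns => seen flag
my @dups;                        # [offset, length] of lines to blank out
while (1) {
    my $offset = tell $in;
    my $line   = <$in>;
    last unless defined $line;
    my $len  = tell($in) - $offset;
    my @cols = split /\t/, $line;
    my $key  = md5( join "\t", @cols[ 0, 1 ] );   # columns 1-2 as the key
    push @dups, [ $offset, $len ] if $seen{$key}++;
}
close $in;

# Pass 2: overwrite each duplicate in place with spaces, keeping the
# newline so the byte layout of the file does not change.
open my $fh, '+<', $file or die "open $file: $!";
for my $d (@dups) {
    my ( $pos, $len ) = @$d;
    seek $fh, $pos, 0 or die "seek: $!";
    print {$fh} ' ' x ( $len - 1 ), "\n";
}
close $fh;
```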
A trivial pipe filter is then applied to skip the overwritten lines when feeding the result to the next application.
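That filter can be a one-liner; a sketch (next_application is a placeholder):

```perl
perl -ne 'print unless /^ +$/' big.txt | next_application
```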
The command-line approach with GNU tools is also worth checking out (using the shell's $'\t' to pass a literal tab): sort -u -t $'\t' -k1,2 -S 100M -o out.file