in reply to Filtering very large files using Tie::File

It seems you are carrying around a lot of extra data and performing extra I/O. An alternative, but destructive, approach could reduce disk I/O even further. Maybe you can even tolerate a very remote chance of dropping a unique line, say 1 in 1E30.

Besides using the techniques already mentioned, I'd reduce the data by extracting e.g. an MD4 checksum of the relevant columns into an intermediate file. The byte offset and length of each line also go into this intermediate file. (The remote chance of dropping a unique line is exactly the chance of two different keys colliding on the same checksum.)
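
A minimal sketch of that first pass, assuming tab-separated input, that the relevant columns are the first two, and made-up file names (big.txt, big.idx). I use the core Digest::MD5 module here; Digest::MD4 from CPAN would give you MD4 as suggested:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Digest::MD5 qw(md5_hex);

  # First pass: write checksum, byte offset and line length to an intermediate file.
  open my $fh,  '<', 'big.txt' or die "open big.txt: $!";
  open my $idx, '>', 'big.idx' or die "open big.idx: $!";

  while (1) {
      my $offset = tell $fh;                  # byte offset of the line about to be read
      my $line   = <$fh>;
      last unless defined $line;
      my $len = length $line;                 # includes the trailing newline
      chomp(my $rec = $line);
      my @f   = split /\t/, $rec, -1;
      my $key = md5_hex(join "\t", @f[0, 1]); # checksum of the relevant columns
      print {$idx} join("\t", $key, $offset, $len), "\n";
  }
  close $fh;
  close $idx;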

The intermediate file then gets the opposite treatment: list all but one of the duplicates in each group. The extracted offsets and lengths are used to overwrite those duplicate lines in the original 1 GB file, e.g. with spaces, using seek() for positioning. This is the computer analogue of crossing lines out on paper with a ruler and pen.
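
A sketch of that second pass under the same assumptions. Note that it really is destructive, so run it on a copy until you trust it:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Second pass: keep the first line of every checksum group, blank out the rest
  # of the group in place in the original file.
  my %seen;
  my @dupes;                                  # [offset, length] of lines to cross out

  open my $idx, '<', 'big.idx' or die "open big.idx: $!";
  while (<$idx>) {
      chomp;
      my ($key, $offset, $len) = split /\t/;
      push @dupes, [$offset, $len] if $seen{$key}++;   # all but the first of a group
  }
  close $idx;

  open my $fh, '+<', 'big.txt' or die "open big.txt: $!";
  for my $d (@dupes) {
      my ($offset, $len) = @$d;
      seek $fh, $offset, 0 or die "seek: $!";
      print {$fh} ' ' x ($len - 1);           # leave the trailing newline intact
  }
  close $fh;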

A trivial pipe filter is then applied to skip the overwritten lines when feeding the result to the next application.
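
Since the crossed-out lines now contain nothing but spaces, that filter can be a one-liner (next_application is just a placeholder):

  perl -ne 'print if /\S/' big.txt | next_application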

The pure command line approach with GNU tools is also worth checking out:  sort -u -t $'\t' -k1,2 -S 100M -o out.file  (sort wants a literal tab character as the field separator; $'\t' provides one in bash).

Re^2: Filtering very large files using Tie::File
by elef (Friar) on Nov 26, 2010 at 20:58 UTC
    Thanks, that sounds convincing.
    It's way over my head, though, and I don't think the advantages are worth the effort here.
    I need this to work on very large files "just in case", with 1GB being on the extreme upper end of what I expect it'll need to handle. Most of the time, it will be crunching much smaller files with well under 50,000 records.