pie-ence has asked for the wisdom of the Perl Monks concerning the following question:

Hello All, I am looking for a way to delete lines from a file once I have read them. I have several large files which I want to combine, deleting any non-unique lines. I am using a hash to combine them for uniqueness, but I do not have enough memory to put the whole thing into one hash (the files combined are over 100GB). The files are sorted, so I just want to take out, say, the first million lines of each and combine those. After getting the lines I want to delete them from the file and close it. What's the best way to do this? Thanks.

Replies are listed 'Best First'.
Re: Delete lines as I read them
by Corion (Patriarch) on Jan 20, 2010 at 21:08 UTC

    The best way is not to do that, and instead just to remember the offset up to which you read last time. Perl's tell will tell you where you are in a file, and seek will take you back there on the next run. Deleting lines has the problem that if your program does not work right, you have already removed part of the data your program should process.
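    A minimal sketch of that bookkeeping, assuming a hypothetical input file data.txt and a hypothetical companion file data.txt.offset that persists the position between runs:

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        my $file   = 'data.txt';        # hypothetical input file
        my $statef = "$file.offset";    # hypothetical file holding the last offset

        # Restore the offset saved by the previous run (0 on the first run).
        my $offset = 0;
        if (open my $sf, '<', $statef) {
            chomp($offset = <$sf> // 0);
            close $sf;
        }

        open my $in, '<', $file or die "Can't open $file: $!";
        seek $in, $offset, SEEK_SET or die "Can't seek to $offset in $file: $!";

        # Read the next chunk (e.g. one million lines) without modifying the file.
        my $count = 0;
        while ($count < 1_000_000) {
            my $line = <$in>;
            last unless defined $line;
            $count++;
            # ... process $line ...
        }

        # Remember where we stopped, for the next run.
        open my $sf, '>', $statef or die "Can't write $statef: $!";
        print {$sf} tell($in), "\n";
        close $sf;
        close $in;

    One such offset file per input keeps the bookkeeping separate for each of the large files.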

      Thanks for the advice. I did not know about tell; it could be very useful. Originally I did not want to have the program remember which line I read last time, because for each file that will be a different point. I see the problem with deleting as I go, but is that an issue if they are just intermediate files anyway? I was thinking (and tell me if this is just silly) that this way I could easily tell whether I had got all the lines from every file.
Re: Delete lines as I read them
by apl (Monsignor) on Jan 20, 2010 at 21:58 UTC
    Don't.

    Create a new file (name.new) containing whatever records should not be 'deleted'. Upon completion, rename the original name to name.old and, if that succeeds, rename name.new to name.

    In this way, you also have an audit trail...
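
    A minimal sketch of that rewrite-and-rename pattern, assuming the file is sorted and using an illustrative want_to_keep() filter (not part of the original suggestion) that drops consecutive duplicate lines:

        use strict;
        use warnings;

        my $name = 'data.txt';    # hypothetical file to be rewritten in place

        # Illustrative filter: the input is sorted, so a line is kept
        # only if it differs from the line before it (duplicates dropped).
        my $prev;
        sub want_to_keep {
            my ($line) = @_;
            my $keep = !defined($prev) || $line ne $prev;
            $prev = $line;
            return $keep;
        }

        open my $in,  '<', $name       or die "Can't read $name: $!";
        open my $out, '>', "$name.new" or die "Can't write $name.new: $!";

        while (defined(my $line = <$in>)) {
            print {$out} $line if want_to_keep($line);
        }

        close $in;
        close $out or die "Error closing $name.new: $!";

        # Keep the original as an audit trail, then promote the new file.
        rename $name,       "$name.old" or die "Can't rename $name: $!";
        rename "$name.new", $name       or die "Can't rename $name.new: $!";

    Nothing touches the original file until the filtered copy has been written and closed successfully.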

Re: Delete lines as I read them
by NiJo (Friar) on Jan 21, 2010 at 19:23 UTC
    Sometimes it is hard to beat several decades of development... The classical Unix command-line tools, written in C, should be quite fast both in execution and in development time:
    sort --merge --unique fileA fileB > output
    If disk space is your major concern, I'd instead compress the input and output files. Text files typically compress to 10% of their initial size. Your command line could then be
    zcat fileA.gz fileB.gz | sort --merge --unique --compress-program=gzip | gzip > output.gz
    Both commands need some disk space in /tmp or in the directory given by --temporary-directory. RAM usage can be fine-tuned with --buffer-size. You'll find out about the --key option by yourself.