pie-ence has asked for the wisdom of the Perl Monks concerning the following question:

Hello All, I am looking for a way to delete lines from a file once I have read them. I have several large files which I want to combine, deleting any non-unique lines. I am using a hash to combine them for uniqueness, but I do not have enough memory to put the whole thing into one hash (the files combined are over 100GB). The files are sorted, so I just want to take out, say, the first million lines of each and combine those. After getting the lines I want to delete them from the file and close it. What's the best way to do this? Thanks.

Replies are listed 'Best First'.
Re: Delete lines as I read them
by Corion (Patriarch) on Jan 20, 2010 at 21:08 UTC

    The best way is not to do that, and instead just to remember the offset up to which you read last time. Perl's tell will tell you where you are in a file, and seek will take you back there on the next run. Deleting lines has the problem that if your program does not work right, you have already removed part of the data your program should process.
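    A minimal sketch of that bookkeeping, assuming a hypothetical input file data.txt and a hypothetical companion file data.txt.offset that persists the position between runs:

        use strict;
        use warnings;
        use Fcntl qw(SEEK_SET);

        my $file   = 'data.txt';        # hypothetical input file
        my $statef = "$file.offset";    # hypothetical file holding the last offset

        # Restore the offset saved by the previous run (0 on the first run).
        my $offset = 0;
        if (open my $sf, '<', $statef) {
            chomp($offset = <$sf> // 0);
            close $sf;
        }

        open my $in, '<', $file or die "Can't open $file: $!";
        seek $in, $offset, SEEK_SET or die "Can't seek to $offset in $file: $!";

        # Read the next chunk (e.g. one million lines) without modifying the file.
        my $count = 0;
        while ($count < 1_000_000) {
            my $line = <$in>;
            last unless defined $line;
            $count++;
            # ... process $line ...
        }

        # Remember where we stopped, for the next run.
        open my $sf, '>', $statef or die "Can't write $statef: $!";
        print {$sf} tell($in), "\n";
        close $sf;
        close $in;

    One such offset file per input keeps the bookkeeping separate for each of the large files.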

      Thanks for the advice. I did not know about tell; it could be very useful. Originally I did not want to have the program remember which line I read last time, because for each file that will be a different point. I see the problem with deleting as I go, but is that an issue if they are just intermediate files anyway? I was thinking (and tell me if this is just silly) that this way I could easily tell whether I had got all the lines from every file.
Re: Delete lines as I read them
by apl (Monsignor) on Jan 20, 2010 at 21:58 UTC
    Don't.

    Create a new file (name.new) containing whatever records should not be 'deleted'. Upon completion, rename the original name to name.old and, if that succeeds, rename name.new to name.

    In this way, you also have an audit trail...
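
    A minimal sketch of that rewrite-and-rename pattern, assuming the file is sorted and using an illustrative want_to_keep() filter (not part of the original suggestion) that drops consecutive duplicate lines:

        use strict;
        use warnings;

        my $name = 'data.txt';    # hypothetical file to be rewritten in place

        # Illustrative filter: the input is sorted, so a line is kept
        # only if it differs from the line before it (duplicates dropped).
        my $prev;
        sub want_to_keep {
            my ($line) = @_;
            my $keep = !defined($prev) || $line ne $prev;
            $prev = $line;
            return $keep;
        }

        open my $in,  '<', $name       or die "Can't read $name: $!";
        open my $out, '>', "$name.new" or die "Can't write $name.new: $!";

        while (defined(my $line = <$in>)) {
            print {$out} $line if want_to_keep($line);
        }

        close $in;
        close $out or die "Error closing $name.new: $!";

        # Keep the original as an audit trail, then promote the new file.
        rename $name,       "$name.old" or die "Can't rename $name: $!";
        rename "$name.new", $name       or die "Can't rename $name.new: $!";

    Nothing touches the original file until the filtered copy has been written and closed successfully.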

Re: Delete lines as I read them
by NiJo (Friar) on Jan 21, 2010 at 19:23 UTC
    Sometimes it is hard to beat several decades of development... The classical Unix command-line tools, written in C, should be quite fast both in execution and in development time:
    sort --merge --unique fileA fileB > output
    If disk space is your major concern, I'd instead compress the input and output files. Text files typically compress to 10% of their initial size. Your command line could then be
    zcat fileA.gz fileB.gz | sort --merge --unique --compress-program=gzip | gzip > output.gz
    Both commands need some disk space in /tmp or in the directory given by --temporary-directory. RAM usage can be fine-tuned with --buffer-size. You'll find out about the --key option by yourself.