
The basic means of reading from a file (read, readline) do not read the entire file into memory, so that aspect of Tie::File is not special.

Well, I know that, and while (<FILE>) is what I use most of the time. I just thought that wouldn't work for this purpose without sorting the file alphabetically first. Corion's solution cleared that up, so the Tie::File idea went out the window after the first post in the thread.
Just for the record, here's the code I ended up with, including some reporting:
open (ALIGNED, "<:encoding(UTF-8)", "${filename}.txt") or die "Can't open aligned file for reading: $!";
open (ALIGNED_MOD, ">:encoding(UTF-8)", "${filename}_mod.txt") or die "Can't open file for writing: $!";
if ($delete_dupes eq "y") {
    my %seen;    # hash that contains unique records (hash lookups are faster than array lookups)
    while (<ALIGNED>) {
        # only watch the first two fields; fall back to the whole line if it has no tab
        my ($key) = /^([^\t]*\t[^\t]*)/;
        $key = $_ unless defined $key;
        chomp $key;    # drop the newline when the line has only two fields
        print ALIGNED_MOD $_ if (! $seen{ $key }++);    # add to hash, and if new, print to file
    }
    my $unfiltered_number = $.;
    my $filtered_number = keys %seen;
    print "\n\n-------------------------------------------------";
    print "\n\nSegment numbers before and after filtering out dupes: $unfiltered_number -> $filtered_number\n";
    print LOG "\nFiltered out dupes: $unfiltered_number -> $filtered_number";
    undef %seen;    # free up memory
    close ALIGNED;
    close ALIGNED_MOD;
    rename ("${filename}_mod.txt", "${filename}.txt") or die "Can't rename file: $!";
}

Re^3: Filtering very large files using Tie::File
by ikegami (Patriarch) on Nov 27, 2010 at 20:54 UTC

    I just thought that wouldn't work for this purpose without sorting the file alphabetically first.

    That makes no sense. If you can't do it by reading the file a line at a time using <>, what makes you think you can do it by reading the file a line at a time using Tie::File? Conversely, if you can do it by reading a line at a time using Tie::File, you can do it by reading a line at a time using <>.

      Because Tie::File doesn't go line by line, for example? Not that it has any importance at this point.

        Quite the contrary. I'm trying to explain that your logic for picking Tie::File was wrong. If you don't realise this, you're likely to make that mistake again.

        sort requires the whole array regardless of how you read the file, so changing the way you read the file is not the solution.
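
To make the point above concrete, here is a minimal sketch (not the code from the thread) in which the same first-occurrence filter is fed once by readline and once by Tie::File. The helper name filter_dupes, the sample records, and the temporary file are all assumptions made up for illustration. The filter needs only a %seen hash and a single pass, so no sort is involved with either source.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Tie::File;

# Keep the first line seen for each key (the first two tab-separated fields).
# $next_line is any code ref that returns one line per call and undef at the end.
sub filter_dupes {
    my ($next_line) = @_;
    my (%seen, @kept);
    while (defined(my $line = $next_line->())) {
        chomp $line;
        my @fields = split /\t/, $line, 3;
        my $key = join "\t", grep { defined } @fields[0, 1];
        push @kept, $line unless $seen{$key}++;
    }
    return @kept;
}

# Made-up sample data, deliberately unsorted.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "b\t2\tfirst\n", "a\t1\tsecond\n", "b\t2\tthird\n", "a\t9\tfourth\n";
close $tmp or die "Can't close $path: $!";

# Source 1: plain readline, one line in memory at a time.
open my $in, '<', $path or die "Can't open $path: $!";
my @via_readline = filter_dupes(sub { scalar <$in> });
close $in;

# Source 2: Tie::File, fetching one record per call by index.
tie my @records, 'Tie::File', $path or die "Can't tie $path: $!";
my $i = 0;
my @via_tie = filter_dupes(sub { $i < @records ? $records[$i++] : undef });
untie @records;

print "$_\n" for @via_readline;
```

Both sources keep the same three lines in their original order. In this sketch memory use grows with the number of distinct keys (plus the kept lines, which a real script would print straight to the output file instead of collecting), not with how the lines were read.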