Re^2: Filtering very large files using Tie::File

The basic means of reading from a file (read, readline) do not read the entire file into memory, so that aspect of Tie::File is not special.

Well, I know that, and while (<FILE>) is what I use most of the time. I just thought that wouldn't work for this purpose without sorting the file alphabetically first. Corion's solution cleared that up, so the Tie::File idea went out the window after the first post in the thread.
Just for the record, here's the code I ended up with, including some reporting:

open (ALIGNED, "<:encoding(UTF-8)", "${filename}.txt") or die "Can't open aligned file for reading: $!";
open (ALIGNED_MOD, ">:encoding(UTF-8)", "${filename}_mod.txt") or die "Can't open file for writing: $!";

if ($delete_dupes eq "y") {

    my %seen;        # hash that contains unique records (hash lookups are faster than array lookups)
    my $key;        # key to be put in hash
    while (<ALIGNED>) {
        # only watch first two fields
        ($key) = /^([^\t\n]*\t[^\t\n]*)/;
        $key = $_ unless defined $key;    # no tab on this line: key on the whole line
        print ALIGNED_MOD $_ if (! $seen{ $key }++); # add to hash, and if new, print to file
    }

    my $unfiltered_number = $.;
    my $filtered_number = keys %seen;
    print "\n\n-------------------------------------------------";
    print "\n\nSegment numbers before and after filtering out dupes: $unfiltered_number -> $filtered_number\n";
    print LOG "\nFiltered out dupes: $unfiltered_number -> $filtered_number";

    undef %seen; # free up memory

    close ALIGNED;
    close ALIGNED_MOD;
    rename ("${filename}_mod.txt", "${filename}.txt") or die "Can't rename file: $!";
}

Re^3: Filtering very large files using Tie::File by ikegami (Patriarch) on Nov 27, 2010 at 20:54 UTC
"I just thought that wouldn't work for this purpose without sorting the file alphabetically first."
That makes no sense. If you can't do it by reading the file a line at a time using `<>`, what makes you think you can do it by reading the file a line at a time using Tie::File? Conversely, if you can do it by reading a line at a time using Tie::File, you can do it by reading a line at a time using `<>`.
Re^4: Filtering very large files using Tie::File by elef (Friar) on Nov 27, 2010 at 22:18 UTC
Because Tie::File doesn't go line by line, for example? Not that it has any importance at this point.
Re^5: Filtering very large files using Tie::File by ikegami (Patriarch) on Nov 27, 2010 at 22:36 UTC
Quite the contrary. I'm trying to explain that your logic for picking Tie::File was wrong. If you don't realise this, you're likely to make that mistake again. `sort` requires the whole array regardless of how you read the file, so changing the way you read the file is not the solution.
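
To make the point concrete, here is a minimal sketch (not code from this thread) that runs the same "first two fields" filter over one file twice, once through `<>` and once through Tie::File. The sample records, the temp file, and the `dedupe` helper are all made up for illustration; the only assumption about Tie::File is its default behaviour of handing back records with the line ending already removed.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical sample data: three tab-separated records; the third repeats
# the first two fields of the first.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "a\tb\tfirst\n", "c\td\tsecond\n", "a\tb\tthird\n";
close $tmp_fh or die "Can't close temp file: $!";

# Keep a record only the first time its first two fields are seen.
# $next is any iterator returning one chomped line per call, undef at EOF.
sub dedupe {
    my ($next) = @_;
    my (%seen, @kept);
    while (defined(my $line = $next->())) {
        my ($key) = $line =~ /^([^\t]*\t[^\t]*)/;
        $key = $line unless defined $key;    # no tab: key on the whole line
        push @kept, $line unless $seen{$key}++;
    }
    return @kept;
}

# Source 1: plain readline, one line at a time.
open my $in, '<', $path or die "Can't open $path: $!";
my @via_readline = dedupe(sub {
    my $line = <$in>;
    chomp $line if defined $line;
    return $line;
});
close $in;

# Source 2: Tie::File, walked by index (it strips the newline itself).
tie my @lines, 'Tie::File', $path or die "Can't tie $path: $!";
my $i = 0;
my @via_tie = dedupe(sub { $i < @lines ? $lines[$i++] : undef });
untie @lines;

print "readline:  ", scalar(@via_readline), " records kept\n";
print "Tie::File: ", scalar(@via_tie), " records kept\n";
```

Both passes visit the records in file order and keep the same ones, so the filter never depended on how the lines were fetched. Only the `%seen` hash has to stay in memory, and nothing needs to be sorted.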