in reply to Filtering very large files using Tie::File

Now, as far as I can tell, the best way to do things like this in large files is Tie::File, which loads files into a pseudo-array and helps do operations like this without loading the entire file to memory.

The basic means of reading from a file (read, readline) do not read the entire file into memory, so that aspect of Tie::File is not special. What you are doing by using Tie::File is wasting time and memory for features you don't even need.
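For contrast, a minimal sketch of the two kinds of sequential read (the filename here is hypothetical). Plain readline keeps only the current line in memory; Tie::File gives the same one-line-at-a-time access, but through a tie layer that also has to track record offsets:

    use strict;
    use warnings;
    use Tie::File;

    # Plain streaming read: only the current line is in memory.
    open my $fh, '<', 'big_file.txt' or die "Can't open: $!";
    while ( my $line = <$fh> ) {
        # ... process $line ...
    }
    close $fh;

    # Tie::File: the same sequential access, but every element read
    # goes through FETCH and the module's offset bookkeeping --
    # extra work with no payoff for a single pass.
    tie my @lines, 'Tie::File', 'big_file.txt' or die "Can't tie: $!";
    for my $line (@lines) {
        # ... process $line ...
    }
    untie @lines;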

I'm not sure if the foreach my $list( sort( @lists ) ){ line means that I will get alphabetically sorted output at the end, but I suspect it does, which wouldn't be ideal.

sort @tied would cause all of the file to be loaded into memory so it can be passed to sort. Thankfully, you're not passing the tied array there, but that probably means you are doing remove_duplicate_from_array(@tied).
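That is, the flattening happens before sort even runs (a sketch; @tied stands for the tied array discussed above):

    # sort needs its whole argument list at once, so the tied array
    # is flattened -- every record FETCHed into memory -- before
    # sorting starts:
    my @sorted = sort @tied;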

remove_duplicate_from_array(@tied) would cause all of the file to be loaded into memory so it can be placed in @_. Then the first thing you do in remove_duplicate_from_array is to create a copy of @_.

ouch.
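If the deduplication genuinely had to stay in a subroutine, passing a reference would at least avoid those two throwaway copies. A hypothetical sketch (the thread never shows remove_duplicate_from_array itself, so this shape is assumed):

    # Hypothetical replacement: @_ now holds one scalar (the
    # reference), not a copy of every line.
    sub remove_duplicates_from_ref {
        my ($lines_ref) = @_;
        my %seen;
        return [ grep { !$seen{$_}++ } @$lines_ref ];
    }

Note this still walks every record and builds the full result list inside grep, which is why the streaming rewrite further down the thread is the better fix.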

Re^2: Filtering very large files using Tie::File
by elef (Friar) on Nov 27, 2010 at 12:15 UTC
    The basic means of reading from a file (read, readline) do not read the entire file into memory, so that aspect of Tie::File is not special.

    Well, I know that, and while (<FILE>) is what I use most of the time. I just thought that wouldn't work for this purpose without sorting the file alphabetically first. Corion's solution cleared that up, so the Tie::File idea went out the window after the first post in the thread.
    Just for the record, here's the code I ended up with, including some reporting:
    open (ALIGNED, "<:encoding(UTF-8)", "${filename}.txt") or die "Can't open aligned file for reading: $!";
    open (ALIGNED_MOD, ">:encoding(UTF-8)", "${filename}_mod.txt") or die "Can't open file for writing: $!";
    if ($delete_dupes eq "y") {
        my %seen;   # hash that contains unique records (hash lookups are faster than array lookups)
        my $key;    # key to be put in hash
        while (<ALIGNED>) {
            /^([^\t]*\t[^\t]*)/;                           # only watch first two fields
            chomp ($key = $1);                             # copy the key, stripping any trailing newline
            print ALIGNED_MOD $_ if (! $seen{ $key }++);   # add to hash, and if new, print to file
        }
        my $unfiltered_number = $.;
        my $filtered_number = keys %seen;
        print "\n\n-------------------------------------------------";
        print "\n\nSegment numbers before and after filtering out dupes: $unfiltered_number -> $filtered_number\n";
        print LOG "\nFiltered out dupes: $unfiltered_number -> $filtered_number";
        undef %seen;   # free up memory
        close ALIGNED;
        close ALIGNED_MOD;
        rename ("${filename}_mod.txt", "${filename}.txt") or die "Can't rename file: $!";
    }

      I just thought that wouldn't work for this purpose without sorting the file alphabetically first.

      That makes no sense. If you can't do it by reading the file a line at a time using <>, what makes you think you can do it by reading the file a line at a time using Tie::File? Conversely, if you can do it by reading a line at a time using Tie::File, you can do it by reading a line at a time using <>.

        Because Tie::File doesn't go line by line, for example? Not that it has any importance at this point.