in reply to Building a new file by filtering a randomized old file on two fields

G'day mbp,

Welcome to the monastery.

To get around potential memory issues (due to 4-5 Gb files), you can use Tie::File. This will not load the entire file into memory. In the following code, I just keep an index of the elements in an array. I'm making a very rough guess (as I don't know the record size) that this will need ~100 Mb.

You give an example of 500 locations for a subset. Assuming that's a realistic number, the memory required for the other data structures is minimal.

Here's my test script:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use Tie::File;

my $subset_size  = 4;
my $min_distance = 20;

my (%sample, @keep);

# Tie the file so records are read on demand rather than slurped into memory.
tie my @locations, 'Tie::File', './pm_1084445_locations.txt';

my $last_index = $#locations;
my @indexes    = 0 .. $last_index;

for (0 .. $last_index) {
    # Pick a random, not-yet-used record.
    my $rand_index = int rand @indexes;
    my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];
    if (is_new_location(\%sample, $min_distance, $chr, $pos)) {
        push @keep, $indexes[$rand_index];
        add_location(\%sample, $chr, $pos);
    }
    splice @indexes, $rand_index, 1;
    last if @keep == $subset_size;
}

if (@keep < $subset_size) {
    warn 'WARNING! Subset size [', scalar @keep,
        "] is less than the required size [$subset_size].\n";
}
else {
    print "$_\n" for @locations[sort { $a <=> $b } @keep];
}

untie @locations;

# Returns true if $pos is at least $min away from every stored position on $chr.
sub is_new_location {
    my ($sample, $min, $chr, $pos) = @_;

    return 1 unless $sample->{$chr};

    for (@{$sample->{$chr}}) {
        return 0 if abs($_ - $pos) < $min;
    }

    return 1;
}

# Inserts $pos into the sorted list of positions kept for $chr.
sub add_location {
    my ($sample, $chr, $pos) = @_;

    my $index = 0;

    for (@{$sample->{$chr}}) {
        last if $_ > $pos;
        ++$index;
    }

    splice @{$sample->{$chr}}, $index, 0, $pos;

    return;
}

The code is fairly straightforward; however, if there's something you don't understand, feel free to ask.

Obviously, adjust $subset_size, filename, etc. to suit. The test data I used and some sample runs are in the spoiler below.

Test data (I just doubled up what you had and changed the chromosome numbers):

$ cat pm_1084445_locations.txt
chromosome1 100 100 . G T several other columns
chromosome1 110 110 . A C several other columns
chromosome1 200 200 . C T several other columns
chromosome2 125 125 . C T several other columns
chromosome3 100 100 . G T several other columns
chromosome3 110 110 . A C several other columns
chromosome3 200 200 . C T several other columns
chromosome4 125 125 . C T several other columns

Sample runs:

$ pm_1084445_locations.pl
chromosome1 100 100 . G T several other columns
chromosome2 125 125 . C T several other columns
chromosome3 100 100 . G T several other columns
chromosome4 125 125 . C T several other columns

$ pm_1084445_locations.pl
chromosome1 110 110 . A C several other columns
chromosome1 200 200 . C T several other columns
chromosome2 125 125 . C T several other columns
chromosome3 100 100 . G T several other columns

$ pm_1084445_locations.pl
chromosome1 100 100 . G T several other columns
chromosome2 125 125 . C T several other columns
chromosome3 110 110 . A C several other columns
chromosome3 200 200 . C T several other columns

-- Ken

Re^2: Building a new file by filtering a randomized old file on two fields
by BrowserUk (Patriarch) on Apr 30, 2014 at 11:33 UTC
    To get around potential memory issues (due to 4-5 Gb files), you can use Tie::File. This will not load the entire file into memory.

    That code will be horribly slow.

    This single line:

    my $last_index = $#locations;

    Will cause the entire file to be read through a tiny buffer.

    And this line:

    my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];

    will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records.

    Many parts of the file will be read and re-read many times. Performance will be abysmal.

    This algorithm will fairly pick random lines from that file using a single pass.
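
    For illustration, one single-pass technique along those lines is reservoir sampling. The minimal sketch below is illustrative only (the subset size and filename are placeholders), and it does not handle the OP's minimum-distance requirement:

    use strict;
    use warnings;

    my $k    = 500;                        # placeholder subset size
    my $file = './pm_1084445_locations.txt';

    my @reservoir;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    while (my $line = <$fh>) {
        if (@reservoir < $k) {
            push @reservoir, $line;        # fill the reservoir first
        }
        elsif (rand($.) < $k) {            # keep line number $. with probability $k/$.
            $reservoir[int rand $k] = $line;
        }
    }
    close $fh;

    print @reservoir;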


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Update: Your "This algorithm" link target has changed (presumably following Consideration). The content it now links to is completely different from the content previously linked to. Ignore any comments I've made which no longer make any sense (I think I've struck out most, if not all, of them). An Update in your post would have been useful!

      "Will cause the entire file to be read through a tiny buffer."

      The default read cache size is 2MiB. Is this what you're referring to as "a tiny buffer"? It can be changed with the memory option: what size would you suggest?
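
      For reference, a tie call with an explicit cache size might look like this (the 200 MB figure below is purely illustrative, not a recommendation):

      use Tie::File;

      # The 'memory' option caps Tie::File's read cache in bytes (the default is ~2 MiB).
      tie my @locations, 'Tie::File', './pm_1084445_locations.txt',
          memory => 200 * 1024 * 1024
              or die "Cannot tie file: $!";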

      "... will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records."

      I suspect this is a reference to your "This algorithm" link: see the next point.

      "This algorithm [Link removed - see update above] will fairly pick random lines from using a single pass over that file."

      That links to "Perl Cookbook: Recipe 8.6. Picking a Random Line from a File". The OP wants to pick each line at random, not just one line at random.

      [By the way, the URL associated with that link (i.e. http://docstore.mik.ua/orelly/perl/cookbook/ch08_07.htm) is questionable: it's O'Reilly material provided by someone else. I've seen arguments for and against this specific one, so I'm just pointing it out.]

      I have a 10Gb test file (100,000,000 records of 100 bytes each) which I used to test my solution. Unsurprisingly, this data bore no resemblance to the OP's data, so I added some additional code to create dummy chromosome and position values:

      #my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];
      my ($chr, $pos) = (split '', $locations[$indexes[$rand_index]])[0, 1];
      $chr = 'chromosome' . ($_ % 4 + 1);
      $pos = $_ + 10;

      I also changed the subset size from my original test value of 4 to the OP's example of 500.

      With twice the expected file size and additional processing, this took almost exactly 40 minutes.

      I tried another solution (without Tie::File) using tell to create the index and seek to locate the wanted random records: beyond this, the rest of the code was unchanged. This solution also took almost exactly 40 minutes.
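
      A rough sketch of that tell/seek indexing (assuming one record per line, with the rest of the selection code unchanged) might look like:

      use strict;
      use warnings;
      use autodie;

      open my $fh, '<', './pm_1084445_locations.txt';

      # One pass to record the byte offset of each record.
      my @offsets = (0);
      while (<$fh>) {
          push @offsets, tell $fh;   # offset of the start of the next line
      }
      pop @offsets;                  # the final entry is EOF, not a record start

      # Fetch a record by index with a direct seek instead of a tied array lookup.
      sub record_at {
          my ($index) = @_;
          seek $fh, $offsets[$index], 0;
          my $line = <$fh>;
          chomp $line;
          return $line;
      }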

      While other solutions may be faster, I don't consider this to be "horribly slow" or exhibiting "abysmal" performance.

      -- Ken