chromosome1 100 100 . G T several other columns
chromosome1 110 110 . A C several other columns
chromosome1 200 200 . C T several other columns
chromosome2 125 125 . C T several other columns
####
use List::Util qw(shuffle);
use strict;
my $file = $ARGV[0];
open (VCF, '<', $file) or die "Cannot open $file: $!";
my @array=<VCF>; #read file into array
close (VCF);
my @newvcf;
for (my $i=0; $i<1000000; ++$i) { #giving script plenty of room to work...
my $randomline=$array[rand @array]; #randomize lines of file
if (scalar @newvcf<2) {
push (@newvcf, $randomline); #build new array/subset of lines
}
}
####
Either randomize the file first (for example with Unix 'shuf') and read the randomized file line by line, or slurp the entire file into an array and then shuffle it.
Compare the first random line with the second random line:
- IF the first field (chromosome) of the second line matches the first field of the first line AND the absolute difference between the second fields (positions) of the two lines is less than X, discard the second line
- ELSE keep both the first and second lines
Compare the third random line with the first random line as above and, if it survives that comparison, with the second random line as well (provided the second line was not discarded).
Continue until a new collection of a specified number of random lines has been generated, in which no two lines have positions on the same chromosome within X of one another.
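
The procedure described above could be sketched roughly as follows. This is only one possible reading of it, not a tested solution for real VCF data: the sub name pick_spaced_lines and its parameters ($want for the number of lines to keep, $min_dist for X) are made up for illustration, fields are assumed to be whitespace-separated with chromosome first and position second, and lines starting with '#' are assumed to be headers and are skipped.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper (name and parameters are assumptions, not from the post).
# Returns up to $want lines from @$lines, in random order, such that no two
# kept lines share a chromosome and have positions less than $min_dist apart.
sub pick_spaced_lines {
    my ($lines, $want, $min_dist) = @_;
    my (@kept, %kept_pos);    # %kept_pos: chromosome => [positions kept so far]

    for my $line (shuffle @$lines) {
        last if @kept >= $want;                 # enough lines collected
        next if $line =~ /^#/;                  # assumed: '#' lines are headers
        my ($chrom, $pos) = split ' ', $line;   # first two whitespace-separated fields
        next unless defined $pos;               # skip blank or malformed lines

        # Discard the candidate if any kept line on the same chromosome is too close.
        next if grep { abs($pos - $_) < $min_dist } @{ $kept_pos{$chrom} || [] };

        push @kept, $line;
        push @{ $kept_pos{$chrom} }, $pos;
    }
    return @kept;
}
```

With the @array read from the file as in the script above, a call such as my @newvcf = pick_spaced_lines(\@array, 2, 50); would ask for two lines at least 50 positions apart. Because each line is visited only once after shuffling, no line can be picked twice, and the result may be shorter than requested if too few lines satisfy the distance rule.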