in reply to Re: Get random unique lines from file
in thread Get random unique lines from file

radnorr:

I had a little more time on my hands, so I tested it a bit. It turns out that there's a slight bias for lines from the start of the file. I haven't done any of the math for it, so I don't know if it'll be easy to fix or not. (The bias is small enough that I'm thinking it might be something as simple as never overwriting the last entry in @sample or something. So it might get an early line and have it stick around until the end.)

I don't really have the time or inclination to run it down at the moment, but if I were going to do so, I'd do a couple things:

Anyway, to test it, I did 1000 runs of 25 samples against the million-line file generated in my previous post:

$ for J in {0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7, +8,9}; do ./pm_sample_lines_from_file.pl a_million_lines 25 >t.$J; don +e

While it was cooking, I quickly put together something to crunch the files and check:

#!/usr/bin/perl + # # pm_sample_histo_test.pl <FNames> # # Make a brief histogram to see if there's any obvious bias in the sam +pler. # Assumes you're sampling from a file containing a million lines, each + of # which contains just the line number. # use 5.10.1; use strict; use warnings; my %H; for my $FName (@ARGV) { print "-- $FName --\n"; open my $FH, '<', $FName or die "can't open $FName\n"; while (<$FH>) { next if /^random/; my $I = int $_/10000; $H{$I}++; } } my $max=0; for (keys %H) { $max = $H{$_} if $H{$_} > $max; } for my $I (0 .. 99) { my $bar = substr('*' x int(50 * ($H{$I}/$max)) . ' 'x50, 0, 50); if ($I > 1 and $I < 98) { my $avg; my $sum = 0; $sum += $H{$_} for $I-2 .. $I+2; $avg = $sum/5; substr($bar,int(50 * ($avg/$max)),1)='+'; } printf "% 3u: (% 3u) %s\n", $I, $H{$I} // 0, $bar; }

It accumulates the samples from the million-line file into 100 buckets (0..10000 for the first one, 10001..20000 for the second one, etc.). It then plots a histogram, and overlays it with a 5-sample moving average:

$ ./pm_sample_histo_test.pl t.* 0: (321) ************************************************** 1: (290) ********************************************* 2: (290) *********************************************+ 3: (291) *********************************************+ 4: (281) ******************************************* + 5: (307) *********************************************+* 6: (290) ********************************************+ 7: (283) ********************************************+ 8: (280) ******************************************* + 9: (283) ********************************************+ 10: (309) ********************************************+*** 11: (287) *******************************************+ 12: (267) ***************************************** + 13: (253) *************************************** + 14: (292) *****************************************+*** 15: (257) **************************************** + 16: (278) ******************************************+ 17: (254) *************************************** + 18: (286) ******************************************+* 19: (250) ************************************** + 20: (291) ******************************************+** 21: (277) ******************************************+ 22: (270) ******************************************+ 23: (262) **************************************** + 24: (261) ****************************************+ 25: (263) ***************************************+ 26: (246) ************************************** + 27: (237) ************************************ + 28: (261) ***************************************+ 29: (245) ************************************** + 30: (265) ***************************************+* 31: (267) ***************************************+* 32: (234) ************************************ + 33: (269) ****************************************+ 34: (252) *************************************** + 35: (272) ****************************************+* 36: (259) ****************************************+ 37: (243) ************************************* + 38: (266) ***************************************+* 39: (237) ************************************ + 40: (259) ***************************************+ 41: (256) ***************************************+ 42: (249) ************************************** + 43: (269) ****************************************+ 44: (274) *****************************************+ 45: (266) ****************************************+ 46: (267) ***************************************+* 47: (229) *********************************** + 48: (233) ************************************ + 49: (265) ***************************************+* 50: (247) ************************************** + 51: (281) ****************************************+** 52: (243) ************************************* + 53: (249) ************************************** + 54: (249) ************************************** + 55: (251) **************************************+ 56: (262) *************************************+** 57: (228) *********************************** + 58: (217) ********************************* + 59: (262) *************************************+** 60: (260) **************************************+* 61: (225) *********************************** + 62: (263) ****************************************+ 63: (297) ****************************************+***** 64: (271) ******************************************+ 65: (250) ************************************** + 66: (275) *****************************************+ 67: (267) ***************************************+* 68: (260) ***************************************+ 69: (224) ********************************** + 70: (256) **************************************+ 71: (241) ************************************* + 72: (257) ***************************************+ 73: (256) ***************************************+ 74: (257) ***************************************+ 75: (246) ************************************** + 76: (267) ***************************************+* 77: (267) **************************************+** 78: (232) ************************************ + 79: (239) ************************************* + 80: (252) **************************************+ 81: (269) **************************************+** 82: (234) ************************************ + 83: (239) ************************************* + 84: (240) ************************************* + 85: (259) ***************************************+ 86: (290) ****************************************+**** 87: (249) ************************************** + 88: (256) ***************************************+ 89: (235) ************************************ + 90: (242) ************************************* + 91: (261) **************************************+* 92: (245) **************************************+ 93: (245) ************************************** + 94: (248) **************************************+ 95: (274) ***************************************+** 96: (230) *********************************** + 97: (263) **************************************+* 98: (241) ************************************* 99: (236) ************************************

As you can see, at the bottom side there are 290-ish lines selected, and at the top end, there are 250-ish lines selected.

...roboticus

When your only tool is a hammer, all problems look like your thumb.