in reply to Re^8: Random sampling a variable length file.
in thread Random sampling a variable record-length file.
My other thoughts overnight had to do with the pathological case presented by bobf:
Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.
So even if you are hitting the big record 90% of the time, you ignore it after the first time, and then the other 10% of the hits select records as normal. Since any record at all can follow the 90%-length record, that's fair. And since the length of the last record has nothing to do with the length of the first, it has the same likelihood of being selected as any record.
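To make that concrete, here is a rough sketch in Python of how I read the scheme. It is not tested against a real file. The function name `sample_records`, the in-memory list of record lengths standing in for the file, and the wrap-around rule (a hit on the last record selects the first) are all my assumptions for illustration.

```python
import bisect
import random
from itertools import accumulate


def sample_records(lengths, k, rng=None):
    """Pick k distinct record indices by probing random byte offsets.

    Assumptions (not from the thread): a probe landing inside record i
    selects the record that follows it, wrapping from the last record to
    the first, and a record that was already hit is ignored on later probes.
    """
    rng = rng or random.Random()
    if not lengths or any(n <= 0 for n in lengths):
        raise ValueError("need at least one record, all with positive length")
    if not 0 <= k <= len(lengths):
        raise ValueError("k must be between 0 and the number of records")

    ends = list(accumulate(lengths))  # ends[i] = offset just past record i
    total = ends[-1]                  # total "file" size in bytes
    hit = set()                       # records we have already landed in
    picked = []
    while len(picked) < k:
        offset = rng.randrange(total)          # uniform byte offset
        i = bisect.bisect_right(ends, offset)  # record containing the offset
        if i in hit:
            continue                           # already hit: draw again
        hit.add(i)
        picked.append((i + 1) % len(lengths))  # select the following record
    return picked


if __name__ == "__main__":
    # One record holding 90% of the bytes, ten small ones sharing the rest.
    lengths = [900] + [10] * 10
    print(sample_records(lengths, 5))
```

Most early probes land in the 900-byte record, but it can only be counted once; every later draw that lands there is thrown away and the remaining picks come from probes into the small records.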
|
Re^10: Random sampling a variable length file.
by BrowserUk (Patriarch) on Dec 27, 2009 at 15:53 UTC |