in reply to Re^8: Random sampling a variable length file.
in thread Random sampling a variable record-length file.

Not sure. You'd probably need some way to estimate whether your sampling distribution is uniform with respect to the index of the sample. Also, you could see whether the average length of the records in your sample jibes with the average length of records in the entire population.
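Both checks can be sketched roughly as below, assuming you can afford one full pass over a test file to collect every record length. The function names, the bucket count and the inputs are illustrative assumptions, not anything from the thread.

```python
from statistics import mean

def length_bias(all_lengths, sampled_indices):
    """Ratio of mean sampled record length to mean population record length.

    A value near 1.0 suggests the sample is not skewed toward long or
    short records. `all_lengths` is assumed to come from a full scan.
    """
    population_mean = mean(all_lengths)
    sample_mean = mean([all_lengths[i] for i in sampled_indices])
    return sample_mean / population_mean

def index_bucket_counts(sampled_indices, n_records, n_buckets=10):
    """Count how many sampled record indices fall into each equal-width bucket.

    Roughly equal counts are what a sample that is uniform with respect
    to record index should produce.
    """
    counts = [0] * n_buckets
    for i in sampled_indices:
        counts[i * n_buckets // n_records] += 1
    return counts
```

A formal test (chi-squared on the bucket counts, say) would be the next step; this only gives numbers to eyeball.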

My other thoughts overnight had to do with the pathological case presented by bobf:

Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

So even if you are hitting the big record 90% of the time, you ignore it after the first time, and then the other 10% of the hits select records as normal. Since any record at all can follow the 90%-length record, that's fair. And since the length of the last record has nothing to do with the length of the first, it has the same likelihood of being selected as any other record.

Re^10: Random sampling a variable length file.
by BrowserUk (Patriarch) on Dec 27, 2009 at 15:53 UTC
    Taken together, these make even the extreme case just as amenable to this method as any other. If you remember which records you've hit and do not re-sample them, you're simply omitting a segment of the number line from a uniform distribution. The distributions on either side are still uniform, i.e., random.

    Thank you again! That makes a great deal of sense.

    My first reaction was that remembering whether I had already picked a record would be awkward, given that I only have the byte position and no knowledge of the record's length. Then it dawned on me that querying the file offset once I've read the partial record makes for a perfect signature.
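A minimal sketch of that idea, assuming newline-terminated records and a file opened in binary mode. The name `sample_records`, the `max_tries` guard and the wrap-around to the first record are my own assumptions, not something settled in the thread.

```python
import io
import random

def sample_records(fh, file_size, k, rng=random, max_tries=100_000):
    """Pick up to k distinct records by seeking to random byte offsets.

    The offset reported after discarding the partial record is the start
    of the next record, so it serves as that record's signature.
    """
    seen = set()    # start offsets of records already picked
    picked = []     # (start_offset, record_bytes) pairs
    for _ in range(max_tries):
        if len(picked) == k:
            break
        fh.seek(rng.randrange(file_size))
        fh.readline()            # discard the partial record we landed in
        start = fh.tell()        # signature of the record that follows
        if start >= file_size:   # assumption: landing in the last record wraps to the first
            start = 0
            fh.seek(0)
        if start in seen:        # already sampled: ignore this hit and draw again
            continue
        seen.add(start)
        picked.append((start, fh.readline()))
    return picked

# Hypothetical data: one record takes about 90% of the file.
data = b"a\n" + b"x" * 90 + b"\n" + b"bb\n" + b"ccc\n"
picked = sample_records(io.BytesIO(data), len(data), 4, rng=random.Random(0))
```

The `seen` set only ever holds one integer per picked record, so the bookkeeping stays cheap even when the records themselves are large.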


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.