in reply to Re^5: Random sampling a variable length file.
in thread Random sampling a variable record-length file.
Would it be possible to generate an index in parallel with the creation of the file?
No. The application is a generalised file utility aimed at text-based records. Think csv, tsv, log files etc.
If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?
No, because creating an offset index requires reading the entire file, which negates the purpose of taking a random sample.
I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.
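The quoted objection can be made concrete with a small calculation. The sketch below is illustrative only: `pick_probabilities` is a hypothetical helper, record lengths are assumed to include the delimiter, and a landing in the last record is assumed to wrap around to the first record. Under those assumptions, the chance of picking a record equals its predecessor's share of the file's bytes.

```python
from fractions import Fraction

def pick_probabilities(lengths):
    """Exact pick probability of each record under 'seek, then take the next record'.

    lengths[i] is the byte length of record i, delimiter included.
    Record i is picked whenever the random offset lands in record i-1,
    so its probability is len(i-1) / total. Index -1 wraps to the last
    record, which models the (assumed) wrap-around to the first record.
    """
    total = sum(lengths)
    return [Fraction(lengths[i - 1], total) for i in range(len(lengths))]

# Toy file: record 1 holds 90% of the bytes.
lengths = [2, 90, 2, 2, 2, 2]
probs = pick_probabilities(lengths)
for i, (n, p) in enumerate(zip(lengths, probs)):
    print(f"record {i}: {n:3d} bytes, picked with probability {float(p):.2f}")
```

Here the huge record itself is picked only 2% of the time, while the record following it is picked 90% of the time, which is the extreme case described above.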
Agreed. In extremis, it doesn't.
But in the general case, the application is for huge files (many GBs) with relatively short records (10s to 100s of bytes), and record-length variation of at most (say) 50% of the maximum length, typically less.
The (unsupported) notion is that, as the length of the next (picked) record is uncorrelated with the length of the record containing the picked position, the bias is reduced if not eliminated. I.e., the probability of a given record being picked is not correlated with its own length.
I realise that it is correlated with the length of its positional predecessor, but does that matter?
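For reference, the pick-the-next-record scheme under discussion might look like the following minimal sketch. It is not the utility's actual code: `sample_records` is a hypothetical name, records are assumed to be newline-delimited, sampling is with replacement, and wrapping to the first record when the offset lands in the last one is an assumption made here.

```python
import io
import os
import random

def sample_records(fh, k, rng=None):
    """Pick k records (with replacement) from a binary, newline-delimited file.

    For each pick: seek to a random byte offset, discard the remainder of
    the record the offset landed in, and return the record that follows.
    """
    rng = rng or random.Random()
    size = fh.seek(0, os.SEEK_END)      # seek() returns the new absolute position
    if size == 0:
        raise ValueError("cannot sample from an empty file")
    picks = []
    for _ in range(k):
        fh.seek(rng.randrange(size))
        fh.readline()                   # skip the rest of the record we landed in
        record = fh.readline()          # the next complete record
        if not record:                  # landed in the last record: wrap around
            fh.seek(0)
            record = fh.readline()
        picks.append(record.rstrip(b"\r\n"))
    return picks

# Small in-memory demonstration; a real file would be opened with open(path, "rb").
demo = io.BytesIO(b"alpha,1\nbeta,22\ngamma,333\n")
print(sample_records(demo, 5, random.Random(42)))
```

Each pick costs one seek and two short reads, so the cost of a 1000-record sample does not depend on the size of the file.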
If you have a 10GB file containing 100e6 records that average 100 bytes ± 50, then the maximum probability of a record being picked is 0.0000015%, and the minimum is 0.0000005%. Is that difference enough to invalidate using (say) 1000 random byte positions to choose a representative sample of the entire file?
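The figures above can be checked with a few lines of arithmetic. This is only a sketch of the calculation: it takes 10GB as 10e9 bytes (100e6 records times 100 bytes), and `landing_percent` is a hypothetical helper name.

```python
from fractions import Fraction

FILE_BYTES = 100_000_000 * 100   # 100e6 records averaging 100 bytes = 10e9 bytes
LONGEST, SHORTEST = 150, 50      # 100 bytes +/- 50

def landing_percent(record_bytes, file_bytes=FILE_BYTES):
    """Percent chance that one random byte offset lands inside a given record."""
    return Fraction(record_bytes * 100, file_bytes)

p_max = landing_percent(LONGEST)
p_min = landing_percent(SHORTEST)
print(f"max: {float(p_max):.7f}%")   # 0.0000015%
print(f"min: {float(p_min):.7f}%")   # 0.0000005%
print(f"ratio max/min: {p_max / p_min}")
```

So the absolute probabilities are tiny, but the most favoured record is still three times as likely to be hit as the least favoured one.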
The type of information being inferred (estimated) from the sample: