Would it be possible to generate an index in parallel with the creation of the file?

No. The application is a generalised file utility aimed at text-based records. Think csv, tsv, log files etc.

If not, would it be possible to scan the file for record delimiters as a pre-processing step to generate the index?

No. Because creating an offset index requires reading the entire file and negates the purpose of taking a random sample.

I do not see how the bias would be negated by reading the next record, ... For example, if one record was 90% of the entire file, then seeking to a random position in the file would result in landing in that record about 90% of the time and whatever record followed it would be chosen each time.

Agreed. In extremis, it doesn't.

But in the general case, the application is for huge files (many GBs), with relatively short records (10s to 100s of bytes), and record length variations of a (say) maximum of 50% of the maximum, typically less.

The (unsupported) notion is, that as the length of the next (picked) record is uncorrolated to the length of the record containing the picked position, the bias is reduced if not eliminated. Ie. The probability of a given record being picked is not corrolated to its length.

I realise that it is corrolated to the length of its positional predecessor, but does that matter?

If you have a 10GB file containing 100e6 records that average 100 bytes +- 50, then the maximum probability of a record being picked is 0.0000015%; and the minimum 0.0000005%. Is that difference enough to invalidate using (say) 1000 random (byte positions), to choose a representative sample of the entire file?

The type of information being inferred (estimated) from the sample:


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"I'd rather go naked than blow up my ass"

In reply to Re^6: Random sampling a variable length file. by BrowserUk
in thread Random sampling a variable record-length file. by BrowserUk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.