All,
I am looking to do frequency analysis sampling on a file - assume it is bigger than 10 MB. I want to have a command line argument that tells the program what percentage of the file to sample. I will be examining the frequency of character tuples of sizes 2, 3 and 4.
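
For context, the counting step itself might look roughly like the following Python sketch. The name `count_tuples` is hypothetical, and the sketch assumes the file is treated as raw bytes rather than decoded characters.

```python
from collections import Counter

def count_tuples(data: bytes, sizes=(2, 3, 4)):
    """Count every overlapping tuple of each requested size in one chunk.

    Assumption: 'characters' are raw bytes; decode first if that is wrong.
    """
    counts = {n: Counter() for n in sizes}
    for n in sizes:
        # Slide a window of width n across the chunk, one byte at a time.
        for i in range(len(data) - n + 1):
            counts[n][data[i:i + n]] += 1
    return counts

if __name__ == "__main__":
    result = count_tuples(b"abab")
    print(result[2])
```

When sampling, each chunk would be counted separately and the counters summed, so that no tuple spans two chunks that were not adjacent in the file.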

One way would be to simply read from the start of the file until enough bytes had been read to satisfy the request. If the rest of the file differed greatly from what was sampled, the frequency analysis would suffer.

Another way to do it would be to pick N evenly distributed positions in the file, seek to each position and read X bytes, such that N * X equals the desired percentage of the file's size. This is the approach I am looking at, but I can't seem to convince myself of a good way of determining N and X. I know that the larger the percentage requested from the command line, the smaller N can be. I know N needs to be at least 3 because I want to ensure I get representative data from the start, middle and end.
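
A rough Python sketch of the seek-and-read part, assuming N and X have already been chosen. The names `sample_offsets` and `read_samples` are hypothetical. It places the first read at the start of the file and the last read flush against the end, which is only one of several reasonable meanings of "evenly distributed".

```python
import io

def sample_offsets(file_size, n_reads, read_len):
    """Return n_reads start offsets spread evenly from start to end of file."""
    if n_reads < 1 or read_len < 1 or n_reads * read_len > file_size:
        raise ValueError("sample does not fit in the file")
    if n_reads == 1:
        return [0]
    last_start = file_size - read_len  # final read ends exactly at EOF
    # Integer arithmetic keeps offsets exact; reads cannot overlap because
    # the spacing is at least read_len whenever the sample fits in the file.
    return [i * last_start // (n_reads - 1) for i in range(n_reads)]

def read_samples(fh, offsets, read_len):
    """Seek to each offset in a binary file handle and read read_len bytes."""
    chunks = []
    for off in offsets:
        fh.seek(off)
        chunks.append(fh.read(read_len))
    return chunks

if __name__ == "__main__":
    fake_file = io.BytesIO(bytes(range(250)) * 4)  # 1000 bytes of test data
    offs = sample_offsets(1000, 4, 50)
    print(offs, [len(c) for c in read_samples(fake_file, offs, 50)])
```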

Given a hypothetical file of 1000 bytes with a desired 20% sampling there are a number of possibilities:

  • 4 evenly distributed reads of 50 bytes each
  • 5 evenly distributed reads of 40 bytes each
  • 8 evenly distributed reads of 25 bytes each
  • 10 evenly distributed reads of 20 bytes each
  • 20 evenly distributed reads of 10 bytes each
  • 25 evenly distributed reads of 8 bytes each
  • 40 evenly distributed reads of 5 bytes each
  • 50 evenly distributed reads of 4 bytes each

My gut says to pick the values of N and X that are closest together and, in the case of a tie (such as 10 reads of 20 bytes versus 20 reads of 10 bytes above), to choose the one with fewer reads of more bytes. What are your thoughts?
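
That heuristic could be sketched in Python as below. The name `choose_reads` is hypothetical, and the sketch assumes N * X must equal the sample size exactly, so only exact divisors are considered. A real file size will often have few convenient divisors, in which case rounding the sample size slightly may be preferable.

```python
def choose_reads(total_bytes, min_reads=3):
    """Pick (N, X) with N * X == total_bytes and N, X as close as possible.

    Ties are broken in favour of fewer reads (smaller N, larger X).
    Assumption: only exact factor pairs of total_bytes are acceptable.
    """
    best = None
    for n in range(min_reads, total_bytes + 1):
        if total_bytes % n:
            continue  # n does not divide the sample size evenly
        x = total_bytes // n
        key = (abs(n - x), n)  # closeness first, then fewer reads
        if best is None or key < best[0]:
            best = (key, n, x)
    if best is None:
        raise ValueError("sample too small for the minimum number of reads")
    return best[1], best[2]

if __name__ == "__main__":
    # 20% of a 1000-byte file is a 200-byte sample.
    print(choose_reads(200))
```

For the 1000-byte example at 20%, this selects 10 reads of 20 bytes, which matches the tie-break described above.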

Cheers - L~R


In reply to Frequency Analysis Of A Subset Of A File by Limbic~Region
