Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
One way would be to simply read from the start of the file until enough bytes had been read to satisfy the request. If the rest of the file differed greatly from what was sampled, the frequency analysis would suffer.
Another way to do it would be to pick N evenly distributed positions in the file, seek to each position and read X bytes such that N * X equals the desired number of bytes (the requested percentage of the file size). This is the approach I am looking at, but I can't seem to convince myself of a good way of determining N and X. I know that the larger the percentage requested from the command line, the smaller N can be. I know N needs to be at least 3 because I want to ensure I get representative data from the start, middle and end.
Given a hypothetical file of 1000 bytes with a desired 20% sampling there are a number of possibilities:
4 evenly distributed reads of 50 bytes each
5 evenly distributed reads of 40 bytes each
8 evenly distributed reads of 25 bytes each
10 evenly distributed reads of 20 bytes each
20 evenly distributed reads of 10 bytes each
25 evenly distributed reads of 8 bytes each
40 evenly distributed reads of 5 bytes each
50 evenly distributed reads of 4 bytes each
My gut says to pick the values of N and X that are closest together and, in the case of ties (such as above, where 10 reads of 20 bytes and 20 reads of 10 bytes are equally balanced), to choose the one with fewer reads of more bytes. What are your thoughts?
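A minimal sketch of the second approach (the helper names `pick_n_x` and `sample_freq` are hypothetical; the selection rule is the one proposed above: among factor pairs of the total sample size with N >= 3, take N and X closest together, breaking ties toward fewer reads of more bytes):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Choose N (number of reads) and X (bytes per read) for a given
# total sample size, per the rule described above.
sub pick_n_x {
    my ($total) = @_;
    my ($best_n, $best_x);
    for my $n (3 .. $total) {
        next if $total % $n;          # only exact factor pairs
        my $x = $total / $n;
        # Prefer the pair whose N and X are closest; iterating N
        # ascending means ties naturally keep the smaller N
        # (fewer reads of more bytes).
        if (!defined $best_n || abs($n - $x) < abs($best_n - $best_x)) {
            ($best_n, $best_x) = ($n, $x);
        }
    }
    return ($best_n, $best_x);
}

# Tally byte frequencies from N evenly distributed X-byte reads.
sub sample_freq {
    my ($file, $pct) = @_;
    my $size  = -s $file;
    my $total = int($size * $pct / 100);
    my ($n, $x) = pick_n_x($total);
    my %freq;
    open my $fh, '<:raw', $file or die "$file: $!";
    for my $i (0 .. $n - 1) {
        # Spread the reads so the first starts at byte 0 and the
        # last ends at the last byte of the file.
        my $pos = int($i * ($size - $x) / ($n - 1));
        seek $fh, $pos, 0 or die "seek: $!";
        read $fh, my $buf, $x or die "read: $!";
        $freq{$_}++ for split //, $buf;
    }
    close $fh;
    return \%freq;
}
```

For the hypothetical 1000-byte file at 20%, `pick_n_x(200)` settles on 10 reads of 20 bytes, which matches the gut call above.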
Cheers - L~R
Replies are listed 'Best First'.
Re: Frequency Analysis Of A Subset Of A File
by BrowserUk (Patriarch) on Apr 24, 2013 at 18:34 UTC
by Limbic~Region (Chancellor) on Apr 24, 2013 at 18:51 UTC
by BrowserUk (Patriarch) on Apr 24, 2013 at 20:45 UTC
Re: Frequency Analysis Of A Subset Of A File
by talexb (Chancellor) on Apr 24, 2013 at 19:48 UTC
by Limbic~Region (Chancellor) on Apr 24, 2013 at 19:57 UTC