All,
I am looking to do frequency analysis by sampling a file; assume it is larger than 10 MB. I want a command line argument that tells the program what percentage of the file to sample. I will be examining the frequency of character tuples of sizes 2, 3 and 4.
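For reference, here is a minimal sketch of the counting step itself, assuming "character tuples" means overlapping runs of 2, 3 and 4 consecutive bytes. The name `count_tuples` is hypothetical, and Python is used only for illustration.

```python
from collections import Counter

def count_tuples(data: bytes, sizes=(2, 3, 4)) -> dict:
    """Count every overlapping run of `size` bytes, for each size in `sizes`.

    Assumption: tuples overlap, so b"abab" contains b"ab" twice.
    """
    counts = {size: Counter() for size in sizes}
    for size in sizes:
        # range() is empty when data is shorter than the tuple size
        for i in range(len(data) - size + 1):
            counts[size][data[i:i + size]] += 1
    return counts
```

Applied to each sampled chunk separately, this avoids counting tuples that would span two chunks that are not adjacent in the file.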

One way would be to simply read from the start of the file until enough bytes had been read to satisfy the request. If the rest of the file differed greatly from what was sampled, the frequency analysis would suffer.

Another way to do it would be to pick N evenly distributed positions in the file, seek to each position and read X bytes, such that N * X equals the desired percentage of the file size. This is the approach I am looking at, but I can't seem to convince myself of a good way of determining N and X. I know that the larger the percentage requested from the command line, the smaller N can be. I know N needs to be at least 3 because I want to ensure I get representative data from the start, middle and end.
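A rough sketch of the seek-and-read approach, under one possible reading of "evenly distributed": the first chunk starts at byte 0, the last chunk ends at the end of the file, and the rest are spaced evenly between them. The names `chunk_offsets` and `read_sample` are hypothetical.

```python
import os

def chunk_offsets(file_size, n, x):
    """Return n start offsets for chunks of x bytes spread across the file.

    Assumption: the first chunk starts at 0 and the last ends at EOF.
    """
    if n < 2 or n * x > file_size:
        raise ValueError("need n >= 2 and n * x <= file_size")
    last_start = file_size - x
    # Integer arithmetic keeps offsets exact; chunks never overlap
    # because the spacing is at least x when n * x <= file_size.
    return [i * last_start // (n - 1) for i in range(n)]

def read_sample(path, n, x):
    """Seek to each offset and read x bytes, returning a list of chunks."""
    chunks = []
    with open(path, "rb") as fh:
        for offset in chunk_offsets(os.path.getsize(path), n, x):
            fh.seek(offset)
            chunks.append(fh.read(x))
    return chunks
```

For a 1000-byte file with N = 4 and X = 50, `chunk_offsets(1000, 4, 50)` gives offsets 0, 316, 633 and 950.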

Given a hypothetical file of 1000 bytes with a desired 20% sampling there are a number of possibilities:

4 evenly distributed reads of 50 bytes each
5 evenly distributed reads of 40 bytes each
8 evenly distributed reads of 25 bytes each
10 evenly distributed reads of 20 bytes each
20 evenly distributed reads of 10 bytes each
25 evenly distributed reads of 8 bytes each
40 evenly distributed reads of 5 bytes each
50 evenly distributed reads of 4 bytes each
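The possibilities listed above can be generated mechanically. This sketch assumes N * X must equal the target byte count exactly, that N is at least 3, and that X is at least 4 (an added assumption, on the grounds that a shorter chunk cannot contain a 4-character tuple). The name `read_plans` is hypothetical.

```python
def read_plans(file_size, percent, min_reads=3, min_chunk=4):
    """List (n, x) pairs where n reads of x bytes total exactly the target.

    Assumptions: n >= min_reads, x >= min_chunk, and n * x == target.
    """
    target = int(file_size * percent / 100)
    plans = []
    # n <= target // min_chunk guarantees x = target // n >= min_chunk
    for n in range(min_reads, target // min_chunk + 1):
        if target % n == 0:
            plans.append((n, target // n))
    return plans
```

With `read_plans(1000, 20)` this yields the same eight combinations, from 4 reads of 50 bytes to 50 reads of 4 bytes.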

My gut says to pick the values of N and X that are closest together and, in the case of ties (such as 10 x 20 versus 20 x 10 above), to choose the one with fewer reads of more bytes. What are your thoughts?
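That rule of thumb can be written as a sort key: smallest difference between the two values first, then fewest reads. This is only a sketch of the heuristic as stated, not a claim that it is the best choice; `choose_plan` is a hypothetical name and the candidate list is passed in as (reads, bytes) pairs.

```python
def choose_plan(plans):
    """Pick the (n, x) pair whose values are closest; ties go to fewer reads."""
    return min(plans, key=lambda plan: (abs(plan[0] - plan[1]), plan[0]))

# The candidates for a 1000-byte file sampled at 20%
candidates = [(4, 50), (5, 40), (8, 25), (10, 20),
              (20, 10), (25, 8), (40, 5), (50, 4)]
best = choose_plan(candidates)
```

For the example file, 10 x 20 and 20 x 10 tie on closeness, and the tie-break selects 10 reads of 20 bytes.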

Cheers - L~R

In reply to Frequency Analysis Of A Subset Of A File by Limbic~Region
