Frequency Analysis Of A Subset Of A File

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
I am looking to do frequency analysis sampling on a file - assume it is bigger than 10 MB. I want to have a command line argument that tells the program what percentage of the file to sample. I will be examining the frequency of character tuples of sizes 2, 3 and 4.
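A minimal sketch of the tuple counting described above, assuming the sampled data has already been read into a string. The helper name `count_tuples` and the nested hash layout are assumptions made for illustration; they are not part of the original question.

```perl
use strict;
use warnings;

# Count overlapping character tuples of the given sizes in one buffer.
# count_tuples() is a hypothetical helper name, not something from the thread.
sub count_tuples {
    my ($buffer, $counts, @sizes) = @_;
    for my $size (@sizes) {
        # Slide a window of $size characters one position at a time.
        # The range is empty when the buffer is shorter than $size.
        for my $pos (0 .. length($buffer) - $size) {
            $counts->{$size}{ substr($buffer, $pos, $size) }++;
        }
    }
    return $counts;
}

my %counts;
count_tuples('abcabc', \%counts, 2, 3, 4);
printf "%s => %d\n", $_, $counts{2}{$_} for sort keys %{ $counts{2} };
```

Calling it once per sampled chunk, with the same `%counts` hash, accumulates totals without creating tuples that span two unrelated parts of the file.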

One way would be to simply read from the start of the file until enough bytes had been read to satisfy the request. If the rest of the file differed greatly from what was sampled, the frequency analysis would suffer.

Another way to do it would be to pick N evenly distributed positions in the file, seek to each position and read X bytes such that N * X = the desired percentage of the file. This is the approach I am looking at but I can't seem to convince myself of a good way of determining N and X. I know that the larger the percentage requested from the command line, the smaller N can be. I know N needs to be at least 3 because I want to ensure I get representative data from the start, middle and end.
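One way the seek-and-read idea could look in Perl, as a sketch only. The names `sample_offsets` and `read_samples` are hypothetical, and the choice to start the first read at byte 0 and end the last read at the final byte is an assumption about what "evenly distributed" should mean here.

```perl
use strict;
use warnings;

# Start offsets for $n reads of $x bytes spread evenly over $size bytes.
# Assumption: the first read starts at 0 and the last ends at the final byte.
sub sample_offsets {
    my ($size, $n, $x) = @_;
    return (0) if $n == 1;          # degenerate case, a single read at the start
    my $last_start = $size - $x;
    return map { int($_ * $last_start / ($n - 1)) } 0 .. $n - 1;
}

# Read $n chunks of $x bytes from $path at evenly spaced offsets.
sub read_samples {
    my ($path, $n, $x) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size = -s $fh;
    my @chunks;
    for my $offset (sample_offsets($size, $n, $x)) {
        seek $fh, $offset, 0 or die "Cannot seek to $offset: $!";
        my $got = read $fh, my $chunk, $x;
        die "Read failed at $offset: $!" unless defined $got;
        push @chunks, $chunk;
    }
    close $fh;
    return @chunks;
}

print join(' ', sample_offsets(1000, 5, 40)), "\n";    # 0 240 480 720 960
```

Opening the file with `:raw` keeps the offsets in bytes, which matters if the data is not plain ASCII.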

Given a hypothetical file of 1000 bytes with a desired 20% sampling there are a number of possibilities:

4 evenly distributed reads of 50 bytes each
5 evenly distributed reads of 40 bytes each
8 evenly distributed reads of 25 bytes each
10 evenly distributed reads of 20 bytes each
20 evenly distributed reads of 10 bytes each
25 evenly distributed reads of 8 bytes each
40 evenly distributed reads of 5 bytes each
50 evenly distributed reads of 4 bytes each

My gut says to pick the values of N and X that are closest together and, in the case of ties (such as 10 x 20 and 20 x 10 above), to choose the one with fewer reads of more bytes. What are your thoughts?
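The heuristic in the previous paragraph can be sketched as a search over exact factor pairs of the sample size. The name `choose_reads` and the `$min_reads` parameter are hypothetical. The sketch also assumes the sample size factors usefully; a prime byte count would degenerate to one-byte reads, so real code would probably round the target first.

```perl
use strict;
use warnings;

# Pick the number of reads and the bytes per read for a sample of $total
# bytes: among exact factor pairs with at least $min_reads reads, take the
# pair whose two values are closest, preferring fewer, larger reads on a tie.
sub choose_reads {
    my ($total, $min_reads) = @_;
    my @best;
    for my $n ($min_reads .. $total) {
        next if $total % $n;            # keep only exact factor pairs
        my $x   = $total / $n;
        my $gap = abs($n - $x);
        # Strict "<" keeps the earlier pair (smaller $n) when gaps tie.
        @best = ($n, $x, $gap) if !@best || $gap < $best[2];
    }
    return @best[0, 1];
}

my $file_size = 1000;
my $percent   = 20;
my ($n, $x) = choose_reads(int($file_size * $percent / 100), 3);
print "$n reads of $x bytes\n";    # 10 reads of 20 bytes
```

For the 1000-byte example this selects 10 reads of 20 bytes over 20 reads of 10 bytes, matching the stated tie-break.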

Cheers - L~R

Re: Frequency Analysis Of A Subset Of A File by BrowserUk (Patriarch) on Apr 24, 2013 at 18:34 UTC
This will print a pretty good approximation to a randomly distributed 10% of the lines in any file, regardless of its size:

```
C:\test>wc -l 986831-01.dat
268 986831-01.dat

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
33

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
26

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
32

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
24
```

Once you have randomly selected X% of the lines in the file, you only need randomly select X% of the characters (pairs/triples) in each of those lines to satisfy your overall goal.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Frequency Analysis Of A Subset Of A File by Limbic~Region (Chancellor) on Apr 24, 2013 at 18:51 UTC
BrowserUk,
And if the file contains 0 newlines?

Update: Or, you want newline characters to be included in your tuples. In this approach, each read can result in at most one newline.

Cheers - L~R
Re^3: Frequency Analysis Of A Subset Of A File by BrowserUk (Patriarch) on Apr 24, 2013 at 20:45 UTC
"And if the file contains 0 newlines? Update: Or, you want newline characters to be included in your tuples."

Then read fixed sized blocks instead of lines:

```
C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length()" 986831-01.dat
1024
1024
1024
1024
1024

C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length()" 986831-01.dat
1024
1024
1024

C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length()" 986831-01.dat
1024
1024
1024
1024
```
Re: Frequency Analysis Of A Subset Of A File by talexb (Chancellor) on Apr 24, 2013 at 19:48 UTC
I'm wondering if you need to worry about two of your samples overlapping, and thus skewing the results. Just a thought -- I don't have code to offer, but it should be fairly trivial. :)

Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
Re^2: Frequency Analysis Of A Subset Of A File by Limbic~Region (Chancellor) on Apr 24, 2013 at 19:57 UTC
talexb,
I think that would only happen if you were randomly selecting your seek positions and not ensuring they were evenly distributed.

Cheers - L~R
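The overlap question can be checked mechanically. The sketch below uses a hypothetical helper, `has_overlap`, that reports whether any two reads of X bytes share a byte. With evenly spaced start offsets the gap between neighbours is at least X whenever N * X does not exceed the file size; randomly chosen offsets carry no such guarantee.

```perl
use strict;
use warnings;

# True if any two reads of $x bytes starting at @offsets share a byte.
# has_overlap() is a hypothetical helper for illustration only.
sub has_overlap {
    my ($x, @offsets) = @_;
    my @sorted = sort { $a <=> $b } @offsets;
    for my $i (1 .. $#sorted) {
        # Neighbouring reads overlap when their starts are closer than $x.
        return 1 if $sorted[$i] - $sorted[$i - 1] < $x;
    }
    return 0;
}

# Evenly spaced offsets for 5 reads of 40 bytes in a 1000-byte file.
print has_overlap(40, 0, 240, 480, 720, 960) ? "overlap\n" : "no overlap\n";
# Arbitrary offsets where two starts are only 10 bytes apart.
print has_overlap(40, 0, 240, 250, 720, 960) ? "overlap\n" : "no overlap\n";
```

The first call prints "no overlap" and the second prints "overlap".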