Re: Frequency Analysis Of A Subset Of A File

This prints each line independently with probability 0.1, which gives a pretty good approximation to a randomly distributed 10% of the lines in any file, regardless of its size:

C:\test>wc -l 986831-01.dat
    268 986831-01.dat

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     33

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     26

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     32

C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
     24
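The run-to-run variation above (33, 26, 32, 24) is what you would expect from keeping each line independently. A minimal sketch of the same test as a script, using made-up stand-in lines rather than the real file; the helper name `sample_lines` is my own invention for illustration:

```perl
use strict;
use warnings;

# Keep each element independently with probability $p.
# (sample_lines is a hypothetical helper, not part of the original one-liner.)
sub sample_lines {
    my ( $p, @lines ) = @_;
    return grep { rand() < $p } @lines;
}

my $n = 268;    # line count reported by wc -l above
my $p = 0.1;

my @lines  = map { "line $_" } 1 .. $n;    # stand-in data, not the real file
my @picked = sample_lines( $p, @lines );

# The number of kept lines is binomial(n, p).
my $mean = $n * $p;                         # 26.8
my $sd   = sqrt( $n * $p * ( 1 - $p ) );    # about 4.9

printf "picked %d lines; expected %.1f +/- %.1f\n", scalar @picked, $mean, $sd;
```

All four counts shown above fall within about 1.3 standard deviations of 26.8, so nothing odd is going on.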

Once you have randomly selected X% of the lines in the file, you need only randomly select X% of the characters (or pairs/triples) in each of those lines to satisfy your overall goal.
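A rough sketch of how the two stages might be combined into a frequency count. The function name `sampled_tuple_counts` and its parameters are assumptions of mine, and it treats a "tuple" as `$n` consecutive characters starting at a sampled position:

```perl
use strict;
use warnings;

# Count $n-character tuples, keeping each line with probability $p and then
# each start position within a kept line with probability $p.
# (Hypothetical helper for illustration only.)
sub sampled_tuple_counts {
    my ( $fh, $p, $n ) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        next unless rand() < $p;    # stage 1: sample lines
        chomp $line;
        # The range is empty when the line is shorter than $n.
        for my $pos ( 0 .. length($line) - $n ) {
            next unless rand() < $p;    # stage 2: sample positions
            $count{ substr $line, $pos, $n }++;
        }
    }
    return \%count;
}

# Usage with an in-memory file; $p = 1 keeps everything, so the result is
# deterministic and easy to check by eye.
my $text = "abab\nba\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $counts = sampled_tuple_counts( $fh, 1, 2 );
close $fh;
print "$_ => $counts->{$_}\n" for sort keys %$counts;
```

Note that with both stages at probability p, any given tuple is counted with probability p squared overall, so choose the two rates to suit the fraction you actually want.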

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.


Replies are listed 'Best First'.
Re^2: Frequency Analysis Of A Subset Of A File by Limbic~Region (Chancellor) on Apr 24, 2013 at 18:51 UTC
BrowserUk, and what if the file contains 0 newlines? Update: Or what if you want newline characters to be included in your tuples? In this approach, each read can yield at most one newline. Cheers - L~R
Re^3: Frequency Analysis Of A Subset Of A File by BrowserUk (Patriarch) on Apr 24, 2013 at 20:45 UTC
"And if the file contains 0 newlines? Update: Or, you want newline characters to be included in your tuples."

Then read fixed-size blocks instead of lines:

C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
1024
1024
1024
1024
1024

C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
1024
1024
1024

C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
1024
1024
1024
1024
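For anyone who prefers a script to a one-liner, here is a minimal sketch of the same fixed-size-record idea. The name `sample_blocks` is made up for illustration; the only Perl feature it relies on is that setting `$/` to a reference to an integer makes the readline operator return records of at most that many bytes:

```perl
use strict;
use warnings;

# Read $fh in fixed-size records and keep each with probability $p.
# (sample_blocks is a hypothetical helper, not code from the thread.)
sub sample_blocks {
    my ( $fh, $p, $size ) = @_;
    local $/ = \$size;    # reference to an integer: fixed-size records
    binmode $fh;          # treat the data as bytes, newlines included
    my @blocks;
    while ( my $block = <$fh> ) {
        push @blocks, $block if rand() < $p;
    }
    return @blocks;
}

# Usage with an in-memory file of 2500 bytes; $p = 1 keeps every block.
my $data = 'x' x 2500;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @blocks = sample_blocks( $fh, 1, 1024 );
close $fh;
print length($_), "\n" for @blocks;    # 1024, 1024, 452
```

The last record is shorter when the file size is not a multiple of the block size. Tuples that straddle a block boundary are never seen by this approach, which may or may not matter for a sampled frequency count.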