Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
I am looking to do frequency analysis sampling on a file - assume it is bigger than 10 MB. I want to have a command line argument that tells the program what percentage of the file to sample. I will be examining the frequency of character tuples of sizes 2, 3 and 4.

One way would be to simply read from the start of the file until enough bytes had been read to satisfy the request. If the rest of the file differed greatly from what was sampled, the frequency analysis would suffer.

Another way to do it would be to pick N evenly distributed positions in the file, seek to each position and read X bytes, such that N * X equals the desired percentage of the file. This is the approach I am looking at, but I can't seem to convince myself of a good way of determining N and X. I know that the larger the percentage requested from the command line, the smaller N can be. I know N needs to be at least 3 because I want to ensure I get representative data from the start, middle and end.

Given a hypothetical file of 1000 bytes with a desired 20% sampling there are a number of possibilities:

4 evenly distributed reads of 50 bytes each
5 evenly distributed reads of 40 bytes each
8 evenly distributed reads of 25 bytes each
10 evenly distributed reads of 20 bytes each
20 evenly distributed reads of 10 bytes each
25 evenly distributed reads of 8 bytes each
40 evenly distributed reads of 5 bytes each
50 evenly distributed reads of 4 bytes each

My gut says to pick the values of N and X that are closest together and, in the case of ties (such as 10 reads of 20 bytes versus 20 reads of 10 bytes above), to choose the one with fewer reads of more bytes. What are your thoughts?
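For concreteness, here is a rough sketch of that heuristic. The name choose_reads is made up for illustration, and it assumes only exact factor pairs of the sample size are considered, as in the 1000 byte example:

```perl
use strict;
use warnings;

# Hypothetical helper: pick (N reads, X bytes per read) for a sample of
# $percent percent of a $size byte file. Only exact factor pairs of the
# sample size are considered, and N must be at least 3.
sub choose_reads {
    my ($size, $percent) = @_;
    my $total = int($size * $percent / 100);    # bytes to sample
    my @best;                                   # (N, X, |N - X|)
    for my $n (3 .. $total) {
        next if $total % $n;                    # skip non-factors
        my $x   = $total / $n;
        my $gap = abs($n - $x);
        # A strictly smaller gap wins. On a tie the earlier candidate
        # stays, which is the one with fewer reads of more bytes.
        @best = ($n, $x, $gap) if !@best || $gap < $best[2];
    }
    return @best[0, 1];
}

my ($n, $x) = choose_reads(1000, 20);
print "$n reads of $x bytes\n";    # prints "10 reads of 20 bytes"
```

The linear scan is fine for a sketch; for very large samples it could stop at the square root of the sample size.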

Cheers - L~R

Replies are listed 'Best First'.
Re: Frequency Analysis Of A Subset Of A File
by BrowserUk (Patriarch) on Apr 24, 2013 at 18:34 UTC

    This will print a pretty good approximation to a randomly distributed 10% of the lines in any file, regardless of its size:

    C:\test>wc -l 986831-01.dat
    268 986831-01.dat

    C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
    33

    C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
    26

    C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
    32

    C:\test>perl -nle" rand() < 0.1 and print" 986831-01.dat | wc -l
    24

    Once you have randomly selected X% of the lines in the file, you only need randomly select X% of the characters (pairs/triples) in each of those lines to satisfy your overall goal.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      And if the file contains 0 newlines? Update: Or, you want newline characters to be included in your tuples. In this approach, each read can result in, at most, one newline.

      Cheers - L~R

        And if the file contains 0 newlines? Update: Or, you want newline characters to be included in your tuples.

        Then read fixed sized blocks instead of lines:

        C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
        1024
        1024
        1024
        1024
        1024

        C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
        1024
        1024
        1024

        C:\test>perl -e"BEGIN{$/= \1024}" -nle" rand() < 0.1 and print length( $_ )" 986831-01.dat
        1024
        1024
        1024
        1024
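As a script rather than a one-liner, the same idea might look roughly like the sketch below. The names count_tuples and sample_blocks are invented for illustration, and the 1024 byte block size and 0.1 fraction are just the values used above. Note that tuples straddling a block boundary are not counted here.

```perl
use strict;
use warnings;

# Count every tuple of 2, 3 and 4 characters found inside one block.
sub count_tuples {
    my ($block, $freq) = @_;
    for my $len (2 .. 4) {
        # The range is empty when the block is shorter than $len.
        for my $pos (0 .. length($block) - $len) {
            $freq->{ substr $block, $pos, $len }++;
        }
    }
}

# Read $path in fixed-size records and keep each one with
# probability $fraction, as the one-liner does with rand() < 0.1.
sub sample_blocks {
    my ($path, $fraction, $block_size) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = \$block_size;    # record reads, same as $/ = \1024
    my %freq;
    while (my $block = <$fh>) {
        count_tuples($block, \%freq) if rand() < $fraction;
    }
    close $fh;
    return \%freq;
}

if (@ARGV) {
    my $freq = sample_blocks($ARGV[0], 0.1, 1024);
    printf "%d distinct tuples seen\n", scalar keys %$freq;
}
```

Because each block is kept or dropped independently, the sampled amount only approximates the requested fraction, as the varying counts above show.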

Re: Frequency Analysis Of A Subset Of A File
by talexb (Chancellor) on Apr 24, 2013 at 19:48 UTC

    I'm wondering if you need to worry about two of your samples overlapping, and thus skewing the results. Just a thought -- I don't have code to offer, but it should be fairly trivial. :)

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      talexb,
      I think that would only happen if you were randomly selecting your seek positions and not ensuring they were evenly distributed.
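To illustrate, here is a small sketch of evenly spaced seek positions. The function name read_offsets is hypothetical, and it assumes the first read starts at offset 0 and the last read ends exactly at the end of the file. As long as N * X does not exceed the file size, consecutive starts are at least X bytes apart, so the reads cannot overlap:

```perl
use strict;
use warnings;

# Hypothetical helper: start offsets for $n reads of $x bytes spread
# evenly over a $size byte file, first read at 0 and last read ending
# at the end of the file.
sub read_offsets {
    my ($size, $n, $x) = @_;
    die "sample is larger than the file\n" if $n * $x > $size;
    return (0) if $n == 1;
    # Multiply before dividing so exact results stay exact.
    return map { int($_ * ($size - $x) / ($n - 1)) } 0 .. $n - 1;
}

my @offsets = read_offsets(1000, 10, 20);
print "seek to $_, read 20 bytes\n" for @offsets;
```

With randomly chosen positions there is no such guarantee, which is where an explicit overlap check would be needed.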

      Cheers - L~R