Maire, what would you say is the maximum number of 5-grams in your chosen language in practice?

Theoretically, the number of ordered sequences of 5 words drawn from a language with X=500,000 words in its dictionary is X^5 ≈ 3.1E28 (late edit), or X!/(X-5)! (late edit: permutations of non-repeating words) if you want to err on the pedantic side. Now there is a huge number of never-occurring-in-practice n-grams to be subtracted from said huge number, but the number in practice, I suspect, is still enough to crash a 256 GB computer, given that you have multiple dates too. A huge matrix for each date x N days => computer says no (or rather sighs and dies).
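To check the arithmetic, here is a small sketch in Python (chosen only for brevity; the same few lines are easy to write in Perl with Math::BigInt). The dictionary size of 500,000 is the assumed figure from above, not a measured one.

```python
import math

X = 500_000   # assumed dictionary size, as in the estimate above
n = 5         # n-gram length

# Ordered 5-word sequences when a word may repeat: X^5
with_repeats = X ** n

# Ordered 5-word sequences with no repeated word: X!/(X-5)!
no_repeats = math.perm(X, n)

print(f"{with_repeats:.3e}")   # 3.125e+28
print(f"{no_repeats:.3e}")
```

The two counts differ only by a factor of roughly (1 - 10/X), so for this purpose the pedantic version changes nothing of substance.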

If I have understood your problem correctly so far, then what can you do to cope with that huge dataset?

Create information from data and then forget about the data. I.e. get your 1 KByte statistics for the day and forget about the day's ginormous matrix.

If you want to extract information = compute statistics across many days, then you could save each date to disk (i.e. in a file), hoping you have many terabytes of disk storage. Then you could calculate statistics on that data. But how will it all fit in memory to, say, calculate its mean or standard deviation (sd)?

It is amazing how few scientists know about running / online statistics. One might think that in order to calculate the mean AND sd of a collection of numbers one needs to store all these numbers in memory, in an array so to speak.

Not so!

There is another way, which calculates a running mean and sd as the numbers keep coming in, as if from a stream. There is no need to keep them in memory and, thanks to the work of B.P. Welford and later authors, one can do the calculation efficiently while avoiding the accumulation of floating point errors.

So, if you want to calculate the sd over 1,000,000 days you do not need to read all that cyber-huge data to memory in order to calculate a mean through a summation loop and then calculate the sd through another loop. Instead you read each day's data, update your running mean and sd and forget about that day's data, i.e. unload it from memory.
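The update step can be sketched as follows, in Python for brevity (it ports line for line to Perl). This is a minimal sketch of the Welford-style recurrence, not the code of any particular CPAN module. The class name `RunningStats` and the generator `daily_values` are hypothetical names made up for illustration, and the eight numbers stand in for whatever single statistic you extract from each day.

```python
import math

class RunningStats:
    """Running mean and sample sd, updated one value at a time."""

    def __init__(self):
        self.n = 0        # how many values have been seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # uses the old mean (in delta) and the updated mean together
        self.m2 += delta * (x - self.mean)

    def sd(self):
        # sample standard deviation (divides by n - 1)
        if self.n < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.n - 1))


def daily_values():
    # hypothetical stand-in for "load one day, compute its statistic, unload it"
    for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield v


stats = RunningStats()
for v in daily_values():
    stats.add(v)   # only three numbers are kept in memory between days

print(stats.n, stats.mean, stats.sd())
```

Only `n`, `mean` and `m2` survive from one day to the next, so memory use stays constant no matter how many days are processed.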

There are a few modules on CPAN which do online statistics as per B.P. Welford's paper. Search for the name and you will find them. Choose the one that fits your standards.

bw, bliako


In reply to Re: Reducing memory usage on n-grams script by bliako
in thread Reducing memory usage on n-grams script by Maire
