in reply to Reducing memory usage on n-grams script
Maire, what would you say is, in practice, the maximum number of n-grams (n=5) in your chosen language?
Theoretically, the number of sequences of 5 words from the dictionary of a language with X = 500,000 words in its dictionary is X^5 = 3.125E28 (or X!/(X-5)!, permutations of non-repeating words, if you want to err on the pedantic side). Now, there is a huge number of never-occurring-in-practice n-grams to be subtracted from said huge number, but the number in practice, I suspect, is still enough to crash a 256 GB computer, given that you have multiple dates too. A huge matrix for each date x N days => computer says no (or rather sighs and dies).
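To put numbers on it, here is a quick back-of-the-envelope in Perl (a sketch only; use bignum keeps the integers exact instead of collapsing them to floats):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use bignum;    # arbitrary-precision integers, so nothing overflows

my $X = 500_000;    # dictionary size

# all 5-word sequences (repetition allowed)
print $X**5, "\n";    # 31250000000000000000000000000 = 3.125E28

# permutations of 5 distinct words: X!/(X-5)!
print $X * ($X - 1) * ($X - 2) * ($X - 3) * ($X - 4), "\n";
```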
If I have understood your problem correctly so far, then what can you do to cope with such a huge dataset?
Create information from data and then forget about the data. I.e. get your 1 KByte of statistics for the day and forget about the day's ginormous matrix.
If you want to extract information, i.e. compute statistics across many days, then you could save each day to disk (i.e. in a file), hoping you have lots of terabytes of disk storage, and then calculate statistics on that data. But how will it all fit in memory in order to, say, calculate its mean or standard deviation (sd)?
It is amazing how few scientists know about running / online statistics. One may think that in order to calculate the mean AND sd of a collection of numbers one needs to store all those numbers in memory, in an array so to speak.
Not so!
There is another way, which calculates a running mean and sd as the numbers keep coming in, as if from a stream. There is no need to keep them in memory, and thanks to the work of B.P. Welford and later authors, one can do the calculation efficiently while avoiding the accumulation of floating-point errors.
So, if you want to calculate the sd over 1,000,000 days, you do not need to read all that cyber-huge data into memory in order to calculate a mean through a summation loop and then the sd through another loop. Instead you read each day's data, update your running mean and sd, and forget about that day's data, i.e. unload it from memory.
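For the record, here is a minimal sketch of Welford's update in Perl; the variable names are mine, and it reads one number per line from STDIN, which stands in for whatever your per-day data source really is:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Welford's online algorithm: three scalars carry the whole state,
# no matter how many values stream through.
my ( $n, $mean, $M2 ) = ( 0, 0, 0 );

while ( my $x = <STDIN> ) {    # one value per line, e.g. one day's count
    chomp $x;
    $n++;
    my $delta = $x - $mean;
    $mean += $delta / $n;                # running mean
    $M2   += $delta * ( $x - $mean );    # uses the freshly updated mean
}

my $sd = $n > 1 ? sqrt( $M2 / ( $n - 1 ) ) : 0;    # sample sd
printf "n=%d mean=%g sd=%g\n", $n, $mean, $sd;
```

Feed it each day's numbers in turn and the memory footprint stays constant; M2/(n-1) at the end is the usual sample variance.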
There are a few modules on CPAN which do online statistics as per B.P. Welford's paper. Search for the name and you will find them. Choose the one that fits your standards.
bw, bliako
Re^2: Reducing memory usage on n-grams script
by Maire (Scribe) on Sep 02, 2018 at 07:04 UTC