Maire has asked for the wisdom of the Perl Monks concerning the following question:

Good evening Monks,

I've been using the following script to identify and count the daily frequency of n-grams of 5 words in my data.

    use strict;
    use warnings;

    my %mycorpus = (
        a => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
        b => "date:#20180101# comment:#b1 b2 b3 b4 b5 b6 b7# comment:#c1 c2 c3 c4 c5 c6#",
        c => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
    );

    my %counts;

    foreach my $filename ( sort keys %mycorpus ) {
        my $date;
        my $dataset = '';
        my $word    = '';

        # keep the (last) date found in the record
        while ( $mycorpus{$filename} =~ /date:#(\d+)#/g ) {
            $date = $1;
        }

        # concatenate all comments into one string
        while ( $mycorpus{$filename} =~ /comment:#(.*?)#/g ) {
            $dataset .= "$1 ";
        }

        # count overlapping 5-word n-grams
        while ( $dataset =~ m/(\w+) \s (?= (\w+) \s (\w+) \s (\w+) \s (\w+) )/gx ) {
            $word = "$1 $2 $3 $4 $5";
            $counts{$date}{$word}++;
        }
    }

    use Data::Dumper;
    print Dumper \%counts;
The script, although it is (more than!) a little clunky with the regular expression, produces the desired output.
    $VAR1 = {
        '20180101' => {
            'c1 c2 c3 c4 c5' => 1,
            'b1 b2 b3 b4 b5' => 1,
            'b3 b4 b5 b6 b7' => 1,
            'b2 b3 b4 b5 b6' => 1,
            'd2 d3 d4 d5 d6' => 2,
            'd1 d2 d3 d4 d5' => 2,
            'c2 c3 c4 c5 c6' => 1
        }
    };

However, earlier this week I tried to use the script on a significantly larger hash/dataset than I've used it with before (approximately 45 million words), and my system killed the script after it had breached the maximum memory capacity of 256GB! My system usually copes fairly well when handling this large dataset, and I've never known it to use more than 32GB of memory when running Perl scripts on this data before.

Since then I've been trying to find alternative ways of returning the daily frequency of 5-word n-grams. I first explored Text::Ngrams with no luck, and then attempted to adapt the solutions provided by you guys to a similar question posed on here (see Create n-grams from tokenized text file). Specifically, the problem was that I couldn't work out how to adapt those scripts so that 1) they recognized that n-grams do not run across comments (e.g. in the example above, B's cannot be included in n-grams with C's; the B and C forms are stored with several spaces between them in $dataset, and my regular expression will not match words separated by more than one space), and 2) I can store the n-gram counts in a hash to be used later.

Any tips on reducing memory usage would be greatly appreciated. Thanks!

Re: Reducing memory usage on n-grams script
by Eily (Monsignor) on Aug 31, 2018 at 16:58 UTC

    Hello Maire

    while ( $mycorpus{$filename} =~ /date:#(\d+)#/g ) { $date = $1; }
    This will keep searching for a date and, each time one is found, overwrite the previous one. An if might be better than a while there. And if you want the last one, /.*date:#(\d+)#/ might do the trick (the .* will make perl read the whole string first, and then backtrack to try and match "date"). If you use an if, you might ask yourself whether an else is required (calling next to jump to the next file might be an option).
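    Something along these lines, for example (an untested sketch of that suggestion, to be adapted to your real data):

        # Take the last date in the record, or skip this entry if there is none.
        if ( my ($date) = $mycorpus{$filename} =~ /.*date:#(\d+)#/ ) {
            # ... process the comments for this $date ...
        }
        else {
            next;   # no date found, move on to the next key
        }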

    Then, rather than collecting all the datasets and then trying to search them separately anyway, you could process them as you find them:

    while ( $mycorpus{$filename} =~ /comment:#(.*?)#/g ) {
        my $dataset = $1;
        while ( $dataset =~ /(\w+) (?= ( (?:\s\w+){4} ) )/gx ) {
            $counts{$date}{"$1 $2"}++;
        }
    }

    Your code says "filename" when the keys are actually just hash keys, but if you had your input data in files, and didn't read them all at once, you would also save some memory. By the way, since your output is also a hash (i.e. no ordering), sorting the keys has no effect.
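    Putting those two ideas together, a rough, untested sketch of per-file, line-by-line processing could look like this (it assumes one record per line in whatever files you pass on the command line, and uses a split-and-slice instead of the regex above, which amounts to the same thing):

        use strict;
        use warnings;

        my %counts;
        while ( my $line = <> ) {                    # one record at a time, never the whole corpus
            my ($date) = $line =~ /.*date:#(\d+)#/   # last date in the record...
                or next;                             # ...or skip records without one
            while ( $line =~ /comment:#(.*?)#/g ) {
                my $comment = $1;
                my @words   = split ' ', $comment;
                # sliding window of 5 words; windows never cross a comment boundary
                $counts{$date}{"@words[$_ .. $_ + 4]"}++ for 0 .. $#words - 4;
            }
        }

    (%counts itself can of course still grow very large; that is a separate issue.)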

      Thank you very much for this!
Re: Reducing memory usage on n-grams script
by tybalt89 (Monsignor) on Aug 31, 2018 at 23:41 UTC
    #!/usr/bin/perl
    # https://perlmonks.org/?node_id=1221461
    use strict;
    use warnings;

    my %mycorpus = (
        a => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
        b => "date:#20180101# comment:#b1 b2 b3 b4 b5 b6 b7# comment:#c1 c2 c3 c4 c5 c6#",
        c => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
    );

    open my $fh, '|-', 'sort | uniq -c' or die;
    for ( values %mycorpus )
      {
      my ($date) = /date:#(\d+)#/;
      for ( /comment:#(.*?)#/g )
        {
        my @words = split;
        print $fh "$date @words[$_..$_+4]\n" for 0 .. @words - 5;
        }
      }
    close $fh;

    Outputs:

    1 20180101 b1 b2 b3 b4 b5
    1 20180101 b2 b3 b4 b5 b6
    1 20180101 b3 b4 b5 b6 b7
    1 20180101 c1 c2 c3 c4 c5
    1 20180101 c2 c3 c4 c5 c6
    2 20180101 d1 d2 d3 d4 d5
    2 20180101 d2 d3 d4 d5 d6
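    And if you still want the counts back in a Perl hash afterwards, something like this untested sketch could read them in (it assumes you redirected the script's output to a file, here hypothetically called counts.txt):

        my %counts;
        open my $in, '<', 'counts.txt' or die $!;
        while (<$in>) {
            # each line looks like: "   2 20180101 d1 d2 d3 d4 d5"
            my ($n, $date, $ngram) = /^\s*(\d+)\s+(\d+)\s+(.+)$/ or next;
            $counts{$date}{$ngram} = $n;
        }
        close $in;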
      Thanks!
Re: Reducing memory usage on n-grams script
by bliako (Abbot) on Aug 31, 2018 at 23:25 UTC

    Maire, what would you say is, in practice, the maximum number of distinct 5-word n-grams in your chosen language?

    Theoretically, the number of ordered 5-word sequences from the dictionary of a language with X = 500,000 words is X^5 ≈ 3E28 (or X!/(X-5)! if you want to err on the pedantic side and count only permutations of non-repeating words). There is a huge number of never-occurring-in-practice n-grams to subtract from that figure, but the number that does occur in practice is, I suspect, still enough to crash a 256 GB computer, given that you have multiple dates too. A huge matrix for each date x N days => computer says no (or rather sighs and dies).
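    (A quick back-of-the-envelope check of those figures, if anyone wants to reproduce them:)

        # ordered 5-word sequences allowing repeats: X^5
        perl -e 'printf "%g\n", 500_000 ** 5'                               # prints 3.125e+28

        # permutations of 5 distinct words: X!/(X-5)!
        perl -e '$p = 1; $p *= 500_000 - $_ for 0 .. 4; printf "%g\n", $p'  # ~3.1249e+28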

    If I have understood your problem correctly so far, then what can you do to cope with that huge dataset?

    Create information from data and then forget about the data. I.e. get your 1 KByte statistics for the day and forget about the day's ginormous matrix.

    If you want to extract information, i.e. compute statistics, across many days, then you could save each date to disk (i.e. in a file), hoping you have lots of terabytes of disk storage. Then you could calculate statistics on that data. But how will it all fit in memory to, say, calculate its mean or standard deviation (sd)?

    It is amazing how few Scientists know about running / online statistics. One may think that in order to calculate the mean AND sd of a collection of numbers one needs to store all these numbers in memory, in an array so to speak.

    Not so!

    There is another way, which calculates a running mean and sd as the numbers keep coming in, as if from a stream. There is no need to save them to memory, and thanks to the work of B. P. Welford and others one can do the calculation efficiently while avoiding the accumulation of floating point errors.

    So, if you want to calculate the sd over 1,000,000 days you do not need to read all that cyber-huge data to memory in order to calculate a mean through a summation loop and then calculate the sd through another loop. Instead you read each day's data, update your running mean and sd and forget about that day's data, i.e. unload it from memory.

    There are a few modules in CPAN which do online statistics as per B.P.Welford's paper. Search for the name and you will find. Choose the one that fits your standards.
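    For illustration only, the bare update step looks roughly like this (an untested sketch; the CPAN modules will be more robust in practice):

        # Welford's online algorithm: running mean and variance, one value at a time.
        my ( $n, $mean, $M2 ) = ( 0, 0, 0 );

        sub add_value {
            my ($x) = @_;
            $n++;
            my $delta = $x - $mean;
            $mean += $delta / $n;
            $M2   += $delta * ( $x - $mean );   # note: uses the *updated* mean
        }

        sub current_sd {
            return $n > 1 ? sqrt( $M2 / ( $n - 1 ) ) : 0;
        }

        # e.g. call add_value($todays_count) once per day, then discard that day's data.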

    bw, bliako

      This is incredibly helpful, thank you!