Good evening Monks,

I've been using the following script to identify and count the daily frequency of 5-word n-grams in my data.

use strict;
use warnings;

my %mycorpus = (
    a => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
    b => "date:#20180101# comment:#b1 b2 b3 b4 b5 b6 b7# comment:#c1 c2 c3 c4 c5 c6#",
    c => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
);

my %counts;
foreach my $filename ( sort keys %mycorpus ) {
    my $date;
    my $dataset = '';
    my $word    = '';

    while ( $mycorpus{$filename} =~ /date:#(\d+)#/g ) {
        $date = $1;
    }

    # Concatenate the comments with two spaces between them, so that
    # the n-gram regex (which allows only a single \s between words)
    # cannot match across a comment boundary.
    while ( $mycorpus{$filename} =~ /comment:#(.*?)#/g ) {
        $dataset .= "$1  ";
    }

    # The lookahead leaves the last four words unconsumed, so matches
    # overlap and every word can begin a new 5-gram.
    while ( $dataset =~ m/(\w+) \s (?= (\w+) \s (\w+) \s (\w+) \s (\w+) )/gx ) {
        $word = "$1 $2 $3 $4 $5";
        $counts{$date}{$word}++;
    }
}

use Data::Dumper;
print Dumper \%counts;
The script, although it is (more than!) a little clunky with the regular expression, produces the desired output:
$VAR1 = {
          '20180101' => {
                          'c1 c2 c3 c4 c5' => 1,
                          'b1 b2 b3 b4 b5' => 1,
                          'b3 b4 b5 b6 b7' => 1,
                          'b2 b3 b4 b5 b6' => 1,
                          'd2 d3 d4 d5 d6' => 2,
                          'd1 d2 d3 d4 d5' => 2,
                          'c2 c3 c4 c5 c6' => 1
                        }
        };

However, earlier this week I tried to use the script on a significantly larger hash/dataset than I've used it with before (approximately 45 million words), and my system killed the script after it breached the machine's maximum memory capacity of 256GB! My system usually copes fairly well with this large dataset, and I've never known it to use more than 32GB of memory when running Perl scripts on this data before.
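In case it helps pinpoint where the memory goes, this is the minimal check I have in mind (a sketch assuming the Devel::Size module is installed; total_size is the interface it documents):

use Devel::Size qw(total_size);

# Inside (or just after) the foreach loop: report how large the
# accumulating hash has grown so far, in megabytes.
printf "counts so far: %.1f MB\n", total_size( \%counts ) / 2**20;

Dropping that inside the loop should show whether %counts alone accounts for the blow-up, or whether the per-file $dataset strings are to blame.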

Since then I've been trying to find alternative ways of returning the daily frequency of 5-word n-grams. I first explored Text::Ngrams, with no luck, and then attempted to adapt the solutions you monks provided to a similar question posed on here (see Create n-grams from tokenized text file). Specifically, I couldn't work out how to adapt those scripts so that 1) they recognize that n-grams do not run across comments (e.g. in the example above the b words cannot be included in n-grams with the c words, which is why the comments are joined with more than one space in $dataset, and my regular expression will not match words separated by more than one space), and 2) the n-gram counts are stored in a hash for later use.
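For concreteness, here is a minimal sketch of the per-comment windowing I am aiming for (my own rough attempt, not code from the linked thread): each comment is split into its own word array and a 5-word window slides over it, so n-grams can never cross a comment boundary, and the counts still end up in the same %counts structure.

# Inside the foreach loop, once $date has been captured:
while ( $mycorpus{$filename} =~ /comment:#(.*?)#/g ) {
    my @words = split ' ', $1;            # words of this comment only
    for my $i ( 0 .. $#words - 4 ) {      # slide a 5-word window
        my $ngram = join ' ', @words[ $i .. $i + 4 ];
        $counts{$date}{$ngram}++;
    }
}

This does away with the big $dataset string and the lookahead regex entirely, though on the full corpus it still keeps every distinct 5-gram in %counts, which I suspect is the real memory sink.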

Any tips on reducing memory usage would be greatly appreciated. Thanks!

