Maire has asked for the wisdom of the Perl Monks concerning the following question:
Good evening Monks,
I've been using the following script to identify and count the daily frequency of 5-word n-grams in my data.

The script, although it is (more than!) a little clunky with the regular expression, produces the desired output.

```perl
use strict;
use warnings;

my %mycorpus = (
    a => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
    b => "date:#20180101# comment:#b1 b2 b3 b4 b5 b6 b7# comment:#c1 c2 c3 c4 c5 c6#",
    c => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
);

my %counts;
foreach my $filename ( sort keys %mycorpus ) {
    my $date;
    my $dataset = '';
    my $word    = '';

    # Capture the date (the last date:#...# field in the record wins).
    while ( $mycorpus{$filename} =~ /date:#(\d+)#/g ) {
        $date = $1;
    }

    # Join all comments into one string; the double space keeps
    # n-grams from running across comment boundaries, because the
    # single \s in the regex below cannot match two spaces in a row.
    while ( $mycorpus{$filename} =~ /comment:#(.*?)#/g ) {
        $dataset .= "$1  ";
    }

    # Consume one word, then look ahead at the next four, so the
    # 5-word windows overlap.
    while ( $dataset =~ m/(\w+) \s (?= (\w+) \s (\w+) \s (\w+) \s (\w+) )/gx ) {
        $word = "$1 $2 $3 $4 $5";
        $counts{$date}{$word}++;
    }
}

use Data::Dumper;
print Dumper \%counts;
```
```
$VAR1 = {
          '20180101' => {
                          'c1 c2 c3 c4 c5' => 1,
                          'b1 b2 b3 b4 b5' => 1,
                          'b3 b4 b5 b6 b7' => 1,
                          'b2 b3 b4 b5 b6' => 1,
                          'd2 d3 d4 d5 d6' => 2,
                          'd1 d2 d3 d4 d5' => 2,
                          'c2 c3 c4 c5 c6' => 1
                        }
        };
```
However, earlier this week I tried to run the script on a significantly larger hash/dataset than I've used it with before (approximately 45 million words), and my system killed the script after it breached the machine's maximum memory capacity of 256GB! My system usually copes fairly well with this large dataset, and I've never known it to use more than 32GB of memory when running Perl scripts on this data before.
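For anyone sizing up a similar blow-up, here is a minimal sketch, assuming the CPAN module Devel::Size is installed (it is not part of the original script), that measures the counts hash built from a small sample of the corpus so the full-corpus footprint can be projected:

```perl
use strict;
use warnings;
use Devel::Size qw(total_size);    # CPAN module, assumed installed

# Hypothetical sample: %counts as built by the script above from,
# say, 1% of the corpus.
my %counts = (
    '20180101' => { 'd1 d2 d3 d4 d5' => 2, 'd2 d3 d4 d5 d6' => 2 },
);

printf "counts hash: %.1f MB\n", total_size( \%counts ) / 1024**2;

# Scaling that figure up by the sampling factor gives a rough estimate.
# With ~45M words and mostly unique 5-grams, each hash key alone holds
# five words plus per-entry overhead, so the hash itself can be huge.
```

If the projected figure already dwarfs 32GB, that would suggest the counts hash, rather than the regular expression, is where the memory goes.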
Since then I've been trying to find alternative ways of returning the daily frequency of 5-word n-grams. I first explored Text::Ngrams with no luck, and then attempted to adapt the solutions you monks provided to a similar question posed here (see Create n-grams from tokenized text file). Specifically, the problem was that I couldn't work out how to adapt those scripts so that 1) they recognize that n-grams do not run across comments (e.g. in the example above, B's cannot be included in n-grams with C's; this is why the B and C forms are stored with extra spaces between them in $dataset, and my regular expression will not match words separated by more than one space), and 2) the n-gram counts are stored in a hash for later use.
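A minimal sketch of what those two constraints amount to in code (an illustration only, not the linked thread's solution): split each comment:#...# field into words and slide a 5-word window over each comment separately, so no n-gram can cross a comment boundary, while the counts still land in a hash keyed by date exactly as in the original script.

```perl
use strict;
use warnings;

# Sample data in the same format as %mycorpus above.
my %mycorpus = (
    a => "date:#20180101# comment:#d1 d2 d3 d4 d5 d6#",
    b => "date:#20180101# comment:#b1 b2 b3 b4 b5 b6 b7# comment:#c1 c2 c3 c4 c5 c6#",
);

my %counts;
for my $record ( values %mycorpus ) {
    my ($date) = $record =~ /date:#(\d+)#/;

    # Window each comment on its own, so no 5-gram spans two comments.
    while ( $record =~ /comment:#(.*?)#/g ) {
        my @words = split ' ', $1;
        for my $i ( 0 .. $#words - 4 ) {
            $counts{$date}{ join ' ', @words[ $i .. $i + 4 ] }++;
        }
    }
}

use Data::Dumper;
print Dumper \%counts;    # same shape of output as the original script
```

This avoids both the concatenated $dataset and the multi-space trick; whether it also reduces peak memory depends on how large %counts itself grows.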
Any tips on reducing memory usage would be greatly appreciated. Thanks!
Replies are listed 'Best First'.

Re: Reducing memory usage on n-grams script
by Eily (Monsignor) on Aug 31, 2018 at 16:58 UTC
  reply by Maire (Scribe) on Sep 02, 2018 at 07:02 UTC

Re: Reducing memory usage on n-grams script
by tybalt89 (Monsignor) on Aug 31, 2018 at 23:41 UTC
  reply by Maire (Scribe) on Sep 02, 2018 at 07:03 UTC

Re: Reducing memory usage on n-grams script
by bliako (Abbot) on Aug 31, 2018 at 23:25 UTC
  reply by Maire (Scribe) on Sep 02, 2018 at 07:04 UTC