c:\test>774421-2
Using words: 44501
Using numbers: 40801
Numbers array: 12776
Lookup hash: 8860
Lookup words: 0.000440120697021484 seconds
Lookup numbers: 0.000699043273925781 seconds
Lookups words: 515; numbers: 515
Lookup words: 0.000420093536376953 seconds
Lookup numbers: 0.000321149826049805 seconds
Lookups words: 515; numbers: 515
Lookup words: 0.000285863876342773 seconds
Lookup numbers: 0.000767946243286133 seconds
Lookups words: 515; numbers: 515
#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];
use Devel::Size qw[ total_size ];
my $abstract = do{ local $/; <DATA> };  ## Slurp the test text after __DATA__
my @words = split ' ', $abstract;
my %words;
## Hashes keyed by the actual 2-, 3- and 4-word phrases
$words{ doubles }{ join ' ', @words[ $_ .. $_ +1 ] } = undef
    for 0 .. $#words - 1;
$words{ triples }{ join ' ', @words[ $_ .. $_ +2 ] } = undef
    for 0 .. $#words - 2;
$words{ quads }{ join ' ', @words[ $_ .. $_ +3 ] } = undef
    for 0 .. $#words - 3;
say "Using words: ", total_size \%words;
## Assign each distinct word a small integer id, first come first served
my( $n, %lookup ) = 0;
my @numbers = map $lookup{ $_ } //= ++$n, @words;
my %numbers;
## The same hashes, keyed by the integer ids instead of the words
$numbers{ doubles }{ join ' ', @numbers[ $_ .. $_ +1 ] } = undef
    for 0 .. $#numbers - 1;
$numbers{ triples }{ join ' ', @numbers[ $_ .. $_ +2 ] } = undef
    for 0 .. $#numbers - 2;
$numbers{ quads }{ join ' ', @numbers[ $_ .. $_ +3 ] } = undef
    for 0 .. $#numbers - 3;
say "Using numbers: ", total_size \%numbers;
say "Numbers array: ", total_size \@numbers;
say "Lookup hash: ", total_size \%lookup;
## Time looking up every word-phrase key in the word-keyed hashes
my $start = time;
my $words = 0;
for my $size ( keys %words ) {
    exists $words{ $size }{ $_ } and $words++ for keys %{ $words{ $size } };
}
say "Lookup words: ", time() - $start, ' seconds';
## Time the same lookups in the id-keyed hashes
$start = time;
my $numbers = 0;
for my $size ( keys %numbers ) {
    exists $numbers{ $size }{ $_ } and $numbers++
        for keys %{ $numbers{ $size } };
}
say "Lookup numbers: ", time() - $start, ' seconds';
say "Lookups words: $words; numbers:$numbers";
__DATA__
There are, as I see it, two scalability issues in the above code.
The first of these has been addressed well by the various
suggestions to use a sliding window ($i .. $i+$n over the array of words); the first sketch below shows the idea.
The second is the keys themselves. You are currently using the actual words
in each phrase as a key. This means that searching for all sequences of N to M
words in a text with K word characters (e.g. alphanumerics) could
conceivably take up (N + (N+1) + ... + M)*K characters for the keys alone;
for N=2 and M=4, as above, that is (2+3+4)*K = 9K characters.
The actual amount depends on the frequency of particular word
runs: "a a a ..." will obviously use less space than "a b c d e ..."
or "a..z b..za c..zab ...".
If you intend to make N, the range N .. M, or K large,
you might want to consider assigning each word in the text an integer id
and composing your keys out of those integers rather than the word strings,
as you are doing now. Keys composed of numerical ids
would save space and possibly search time; the second sketch below shows the idiom.
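
For reference, here is a minimal sketch of the sliding-window suggestion. The ngrams() helper and the sample sentence are my own illustration, not code from the thread:

#! perl -sw
use strict;

## Illustrative helper: one window position per index 0 .. $#list - ($n - 1)
sub ngrams {
    my( $n, @list ) = @_;
    return map { join ' ', @list[ $_ .. $_ + $n - 1 ] } 0 .. $#list - ( $n - 1 );
}

my @words = split ' ', 'the quick brown fox jumps';
my %seen;
$seen{ $_ } = undef for ngrams( 2, @words ), ngrams( 3, @words );
print "$_\n" for sort keys %seen;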
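
And a minimal sketch of the integer-id suggestion, using the same $lookup{ $_ } //= ++$n idiom as the benchmark script above; the sample text is made up:

#! perl -sw
use 5.010;
use strict;

my @words = split ' ', 'a rose is a rose is a rose';

## Map each distinct word to a small integer the first time it is seen
my( $n, %lookup ) = 0;
my @ids = map { $lookup{ $_ } //= ++$n } @words;

## Keys built from the ids: '1 2' stands for 'a rose', however long the words get
my %doubles;
$doubles{ join ' ', @ids[ $_ .. $_ + 1 ] } = undef for 0 .. $#ids - 1;

say scalar keys %lookup, ' distinct words';   ## 3 distinct words
say for sort keys %doubles;                   ## 1 2 / 2 3 / 3 1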