c:\test>774421-2
Using words: 44501
Using numbers: 40801
####
Numbers array: 12776
Lookup hash: 8860
####
Lookup words: 0.000440120697021484 seconds
Lookup numbers: 0.000699043273925781 seconds
Lookups words: 515; numbers:515
Lookup words: 0.000420093536376953 seconds
Lookup numbers: 0.000321149826049805 seconds
Lookups words: 515; numbers:515
Lookup words: 0.000285863876342773 seconds
Lookup numbers: 0.000767946243286133 seconds
Lookups words: 515; numbers:515
####

#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];
use Devel::Size qw[ total_size ];

## Slurp the test text from the DATA section below.
my $abstract = do{ local $/; <DATA> };
my @words = split ' ', $abstract;

## Phrase hashes keyed by the words themselves: sliding windows of 2, 3 and 4.
my %words;
$words{ doubles }{ join ' ', @words[ $_ .. $_+1 ] } = undef for 0 .. $#words - 1;
$words{ triples }{ join ' ', @words[ $_ .. $_+2 ] } = undef for 0 .. $#words - 2;
$words{ quads   }{ join ' ', @words[ $_ .. $_+3 ] } = undef for 0 .. $#words - 3;

say "Using words: ", total_size \%words;

## Assign each distinct word an integer id on first sight ...
my( $n, %lookup ) = 0;
my @numbers = map $lookup{ $_ } //= ++$n, @words;

## ... and build the same phrase hashes keyed by the ids instead.
my %numbers;
$numbers{ doubles }{ join ' ', @numbers[ $_ .. $_+1 ] } = undef for 0 .. $#numbers - 1;
$numbers{ triples }{ join ' ', @numbers[ $_ .. $_+2 ] } = undef for 0 .. $#numbers - 2;
$numbers{ quads   }{ join ' ', @numbers[ $_ .. $_+3 ] } = undef for 0 .. $#numbers - 3;

say "Using numbers: ", total_size \%numbers;
say "Numbers array: ", total_size \@numbers;
say "Lookup hash: ", total_size \%lookup;

## Time looking up every key in each variant.
my $start = time;
my $words = 0;
for my $size ( keys %words ) {
    exists $words{ $size }{ $_ } and $words++ for keys %{ $words{ $size } };
}
say "Lookup words: ", time() - $start, ' seconds';

$start = time;
my $numbers = 0;
for my $size ( keys %numbers ) {
    exists $numbers{ $size }{ $_ } and $numbers++ for keys %{ $numbers{ $size } };
}
say "Lookup numbers: ", time() - $start, ' seconds';

say "Lookups words: $words; numbers:$numbers";

__DATA__
There are, as I see it, two scalability issues in the above code. The first of these has been addressed well by the various suggestions to use a sliding window ($i .. $i+$n on the array of words).

The second is the keys themselves. You are currently using the actual words in each phrase as a key. This means that searching for all sequences of N..M words in a text of K word characters (e.g. alpha-numerics) could conceivably take up (N + (N+1) + ... + M) * K characters for its keys alone. The actual amount depends on how often particular words repeat: "a a a ..." will obviously use less space than "a b c d e ..." or "a..z b..za c..zab ...".

If you intend to make N, the range N..M, or K large, you might want to consider assigning each word in the text an integer id and composing your key out of the integers rather than the word strings as you are doing now. Keys composed of numerical indexes would save space and possibly search time.
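
A footnote on the id scheme: joining the ids with spaces, as the benchmark does, still stores each id as decimal text plus a separator. If key space is the overriding concern, the ids could be packed into binary strings instead, e.g. with pack's BER 'w*' template, where every small id costs a single byte. This is a minimal, hypothetical sketch of that variation, not part of the original benchmark; the sample text and names are invented for illustration:

#! perl -sw
## Hypothetical sketch: phrase keys built by packing integer word-ids
## into binary strings rather than joining them with spaces.
use 5.010;
use strict;

my @words = split ' ', 'the quick brown fox jumps over the lazy dog';

## Same first-sight id assignment as the benchmark script above.
my( $n, %id ) = 0;
my @ids = map { $id{ $_ } //= ++$n } @words;

## 2-word phrase keys: one packed binary string per sliding window.
## 'w*' is pack's BER compressed-integer template: ids below 128 take 1 byte.
my %doubles;
$doubles{ pack 'w*', @ids[ $_ .. $_+1 ] } = undef for 0 .. $#ids - 1;

## Keys remain reversible: unpack recovers the ids, and %id (inverted) the words.
my @back = unpack 'w*', ( keys %doubles )[0];
say "one key decodes to ids: @back";

Whether the packed keys beat the space-joined ids in practice depends on the id magnitudes and on Perl's fixed per-key overhead; running Devel::Size's total_size over both variants on a real corpus would settle it.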