comment on

Keys composed out of numerical indexes would save space ...

Not so! Using the text from your post as a representative abstract, and building the the 3 hashes using both the original words and integers representing those words I get the following hash sizes:

c:\test>774421-2
Using words: 44501
Using numbers: 40801
[download]

Giving a total saving just under 10%. But... If you add in the requirements of the array required to hold the ordered numbers--ordering is a requirement--plus a hash used to unique the words:

Numbers array: 12776
Lookup hash:   8860
[download]

The memory requirement is nearly 50% higher.

As for "and possibly search time.". As the hash keys are the concatenation of the string representations of the integers assigned, and hash keys are hashed byte-wise; and the average (English) words length is a little under 6-chars; and the average length of the integers required to represent them a little under 5-chars; the difference in the time taken to actually hash the keys will be very minimal. And as the number of pairs, triples & quads will be the same, the hashes will contain the same number of keys, so the lookup times will be very similar. These are the timings I got using the above sample:

Lookup words:   0.000440120697021484 seconds
Lookup numbers: 0.000699043273925781 seconds
Lookups words: 515; numbers:515

Lookup words:   0.000420093536376953 seconds
Lookup numbers: 0.000321149826049805 seconds
Lookups words: 515; numbers:515

Lookup words:   0.000285863876342773 seconds
Lookup numbers: 0.000767946243286133 seconds
Lookups words: 515; numbers:515
[download]

Too wildly varying to consider making an assessment.

The code used to produce the figures:

#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];
use Devel::Size qw[ total_size ];

my $abstract = do{ local $/; <DATA> };
my @words = split ' ', $abstract;

my %words;
$words{ doubles }{ join ' ', @words[ $_ .. $_ +1 ] } = undef 
    for 0 .. $#words - 1;
$words{ triples }{ join ' ', @words[ $_ .. $_ +2 ] } = undef 
    for 0 .. $#words - 2;
$words{ quads   }{ join ' ', @words[ $_ .. $_ +3 ] } = undef 
    for 0 .. $#words - 3;

say "Using words: ", total_size \%words;

my( $n,  %lookup ) = 0;
my @numbers = map $lookup{ $_ } //= ++$n, @words;

my %numbers;
$numbers{ doubles }{ join ' ', @numbers[ $_ .. $_ +1 ] } = undef 
    for 0 .. $#numbers - 1;
$numbers{ triples }{ join ' ', @numbers[ $_ .. $_ +2 ] } = undef 
    for 0 .. $#numbers - 2;
$numbers{ quads   }{ join ' ', @numbers[ $_ .. $_ +3 ] } = undef 
    for 0 .. $#numbers - 3;

say "Using numbers: ", total_size \%numbers;

say "Numbers array: ", total_size \@numbers;
say "Lookup hash:   ", total_size \%lookup;

my $start = time;
my $words = 0;
for my $size ( keys %words ) {
    exists $words{ $size }{ $_ } and $words++ for keys %{ $words{ $siz
+e } };
}
say "Lookup words:   ", time() - $start, ' seconds';

$start = time;
my $numbers = 0;
for my $size ( keys %words ) {
    exists $numbers{ $size }{ $_ } and $numbers++
        for keys %{ $numbers{ $size } };
}
say "Lookup numbers: ", time() - $start, ' seconds';

say "Lookups words: $words; numbers:$numbers";

__DATA__
There are as I see it two scalability issues in the above code.
The first of these has been addressed well by the various
suggestions to use a sliding window ($i...$i+$n on the array of words)
+.

The second is the keys themselves. You are currently using the actual 
+words
in each phrase as a key. This means that searching for all sequences o
+f N-M
words in a text with K word characters (e.g. alpha-numerics) could
concievably take up (N+N+1+...+M)*K characters for its keys alone.
The actual amount depends on the frequency of each particular word
runs: "a a a ..." will obviously use less space than "a b c d e..."
or "a..z b..za c..zab ...".

If you intend to make either N, the range from N...M or K large,
you might want to consider assigning each word in the text a integer i
+d
and composing your key out of the integers rather than the word string
+s
as you are doing now. Keys composed out of numerical indexes
would save space and possibly search time.
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

RIP PCW

In reply to Re^2: extract phrases of n-words length by BrowserUk
in thread extract phrases of n-words length by arun_kom

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.