in reply to Re: extract phrases of n-words length
in thread extract phrases of n-words length
> Keys composed out of numerical indexes would save space ...
Not so! Using the text from your post as a representative abstract, and building the 3 hashes using both the original words and integers representing those words, I get the following hash sizes:
```
c:\test>774421-2
Using words: 44501
Using numbers: 40801
```
Giving a total saving of just under 10%. But... if you add in the array required to hold the ordered numbers (ordering is a requirement), plus the hash used to uniquify the words:
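For anyone following along, those two extra structures come straight out of the id-assignment step itself. A minimal sketch (variable names mine, sample text invented) of mapping words to integer ids:

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# Assign each distinct word a small integer id.  The @numbers array
# (the ordered ids) and %lookup hash (word => id) are exactly the two
# extra structures the integer-key scheme has to carry around.
my @words = split ' ', "the quick brown fox jumps over the lazy dog the";

my ( $n, %lookup ) = ( 0 );
my @numbers = map { $lookup{ $_ } //= ++$n } @words;

say "@numbers";                          # repeated words share an id
say scalar( keys %lookup ), " distinct words";
```

The `//=` (defined-or assignment, 5.10+) hands out a fresh id only the first time a word is seen, which is why repeats map to the same integer.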
```
Numbers array: 12776
Lookup hash: 8860
```
The total memory requirement (40801 + 12776 + 8860 = 62437 bytes vs. 44501) comes out roughly 40% higher.
As for "and possibly search time": the integer-scheme hash keys are the concatenated string representations of the assigned integers, and hash keys are hashed byte-wise. Since the average (English) word length is a little under 6 characters, and the average length of the integers required to represent them is a little under 5 characters, the difference in the time taken to actually hash the keys is minimal. And as the number of pairs, triples & quads is the same either way, the hashes contain the same number of keys, so the lookup times will be very similar. These are the timings I got using the above sample:
```
Lookup words:   0.000440120697021484 seconds
Lookup numbers: 0.000699043273925781 seconds
Lookups words: 515; numbers: 515

Lookup words:   0.000420093536376953 seconds
Lookup numbers: 0.000321149826049805 seconds
Lookups words: 515; numbers: 515

Lookup words:   0.000285863876342773 seconds
Lookup numbers: 0.000767946243286133 seconds
Lookups words: 515; numbers: 515
```
Too wildly varying to consider making an assessment.
The code used to produce the figures:
```perl
#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];
use Data::Dump qw[ pp ];
use Devel::Size qw[ total_size ];

my $abstract = do{ local $/; <DATA> };
my @words = split ' ', $abstract;

my %words;
$words{ doubles }{ join ' ', @words[ $_ .. $_ +1 ] } = undef for 0 .. $#words - 1;
$words{ triples }{ join ' ', @words[ $_ .. $_ +2 ] } = undef for 0 .. $#words - 2;
$words{ quads   }{ join ' ', @words[ $_ .. $_ +3 ] } = undef for 0 .. $#words - 3;

say "Using words: ", total_size \%words;

my( $n, %lookup ) = 0;
my @numbers = map $lookup{ $_ } //= ++$n, @words;

my %numbers;
$numbers{ doubles }{ join ' ', @numbers[ $_ .. $_ +1 ] } = undef for 0 .. $#numbers - 1;
$numbers{ triples }{ join ' ', @numbers[ $_ .. $_ +2 ] } = undef for 0 .. $#numbers - 2;
$numbers{ quads   }{ join ' ', @numbers[ $_ .. $_ +3 ] } = undef for 0 .. $#numbers - 3;

say "Using numbers: ", total_size \%numbers;
say "Numbers array: ", total_size \@numbers;
say "Lookup hash: ",   total_size \%lookup;

my $start = time;
my $words = 0;
for my $size ( keys %words ) {
    exists $words{ $size }{ $_ } and $words++ for keys %{ $words{ $size } };
}
say "Lookup words: ", time() - $start, ' seconds';

$start = time;
my $numbers = 0;
for my $size ( keys %words ) {
    exists $numbers{ $size }{ $_ } and $numbers++ for keys %{ $numbers{ $size } };
}
say "Lookup numbers: ", time() - $start, ' seconds';

say "Lookups words: $words; numbers:$numbers";

__DATA__
There are as I see it two scalability issues in the above code. The
first of these has been addressed well by the various suggestions to
use a sliding window ($i...$i+$n on the array of words). The second is
the keys themselves. You are currently using the actual words in each
phrase as a key. This means that searching for all sequences of N-M
words in a text with K word characters (e.g. alpha-numerics) could
concievably take up (N+N+1+...+M)*K characters for its keys alone. The
actual amount depends on the frequency of each particular word runs:
"a a a ..." will obviously use less space than "a b c d e..." or
"a..z b..za c..zab ...". If you intend to make either N, the range
from N...M or K large, you might want to consider assigning each word
in the text an integer id and composing your key out of the integers
rather than the word strings as you are doing now. Keys composed out
of numerical indexes would save space and possibly search time.
```
Replies are listed 'Best First'.

Re^3: extract phrases of n-words length
  by ELISHEVA (Prior) on Jun 25, 2009 at 09:11 UTC
  by BrowserUk (Patriarch) on Jun 25, 2009 at 17:11 UTC