in reply to extract phrases of n-words length
As I see it, there are two scalability issues in the above code. The first has been addressed well by the various suggestions to use a sliding window ($i...$i+$n on the array of words).
The second is the keys themselves. You are currently using the actual words in each phrase as a key. This means that searching for all sequences of N to M words in a text with K word characters (e.g. alpha-numerics) could conceivably take up (N+(N+1)+...+M)*K characters for its keys alone, since each word can appear in up to n distinct phrases of length n. The actual amount depends on how often each particular run of words repeats: "a a a ..." will obviously use less space than "a b c d e..." or "a..z b..za c..zab ...".
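To make that estimate concrete, here is a minimal sketch that counts the word characters stored in word-string keys. The helper name `key_chars` and the sample texts are assumptions made up for illustration; they are not part of the original code.

```perl
use strict;
use warnings;

# Count the word characters held in the keys of a hash that stores
# every distinct phrase of $n_min..$n_max words (separators ignored).
# key_chars is a hypothetical helper, not from the original post.
sub key_chars {
    my ($text, $n_min, $n_max) = @_;
    my @words = split ' ', $text;
    my %seen;
    for my $n ($n_min .. $n_max) {
        for my $i (0 .. $#words - $n + 1) {
            $seen{ join ' ', @words[$i .. $i + $n - 1] } = 1;
        }
    }
    my $total = 0;
    for my $key (keys %seen) {
        my $count = () = $key =~ /\w/g;   # number of word characters
        $total += $count;
    }
    return $total;
}

# K = 6 word characters, N = 2, M = 3, so the bound is (2+3)*6 = 30.
print key_chars("a b c d e f", 2, 3), "\n";   # all words distinct: 22
print key_chars("a a a a a a", 2, 3), "\n";   # one repeated word:  5
```

The all-distinct text lands close to the bound, while the repetitive one stores only two keys.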
If you intend to make either N, the range from N...M, or K large, you might want to consider assigning each word in the text an integer id and composing your key out of the integers rather than the word strings as you are doing now. Keys composed of numerical indexes would save space and possibly search time.
In pseudocode, this would work something like this:
    # $n and $m are the shortest and longest phrase lengths (N and M)
    my @aWords;              # words from the abstract
    my %hWords;              # maps words to their id
    my $iUniqueWordsSoFar;   # cheap way to assign ids
    my @aIds;                # sliding window with ids from last N-M words

    for (0..($#aWords-$n)) {

        # look up/assign word id
        my $sWord = $aWords[$_];
        my $iWord;
        if (exists($hWords{$sWord})) {
            $iWord = $hWords{$sWord};
        } else {
            $iWord = $hWords{$sWord} = ++$iUniqueWordsSoFar;
        }

        # update sliding window of ids for last M words:
        # drop the oldest id only once the window is full
        shift(@aIds) if scalar(@aIds) >= $m;
        push @aIds, $iWord;

        # add key to hash for N..M length phrases by taking
        # first X elements of sliding window to construct
        # the key.
    }

    # final pass: convert what is left in @aIds to keys and
    # update appropriate phrase hashes.
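For anyone who wants something runnable, here is one way the idea could be fleshed out. It is a sketch under stated assumptions, not the definitive version: the names `count_phrases_by_id` and `decode_key` are invented for this example, ids are joined with commas to form the key, and phrases are counted as they *end* at the current word (the last X ids in the window rather than the first X), which removes the need for a final pass.

```perl
use strict;
use warnings;

# Count every phrase of $n_min..$n_max words, keyed by word ids
# rather than word strings. Returns (\%counts, \@word_of).
# Both sub names are hypothetical, chosen for this sketch.
sub count_phrases_by_id {
    my ($words, $n_min, $n_max) = @_;
    my %id_of;      # word => integer id
    my @word_of;    # integer id => word, for decoding keys later
    my @window;     # ids of the last $n_max words
    my %counts;     # "id,id,..." => number of occurrences

    for my $word (@$words) {
        # look up or assign the word id
        my $id = $id_of{$word};
        if (!defined $id) {
            push @word_of, $word;
            $id = $id_of{$word} = $#word_of;
        }

        # slide the window: drop the oldest id once it is full
        shift @window if @window == $n_max;
        push @window, $id;

        # count each phrase of $n_min..$n_max words ending here
        for my $n ($n_min .. $n_max) {
            last if $n > @window;
            $counts{ join ',', @window[-$n .. -1] }++;
        }
    }
    return (\%counts, \@word_of);
}

# Turn an id key back into the phrase it stands for.
sub decode_key {
    my ($key, $word_of) = @_;
    return join ' ', map { $word_of->[$_] } split /,/, $key;
}

my ($counts, $word_of) =
    count_phrases_by_id([qw(the cat sat on the cat)], 2, 3);
for my $key (sort keys %$counts) {
    printf "%-12s %d\n", decode_key($key, $word_of), $counts->{$key};
}
```

Comma-joined decimal ids are easy to read and debug, but they only save space when the words are longer than their ids. If that matters, packing the ids into fixed-width binary keys with `pack 'N*', @ids` is another option worth measuring.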
Best, beth
Replies are listed 'Best First'.

Re^2: extract phrases of n-words length
by BrowserUk (Patriarch) on Jun 25, 2009 at 06:12 UTC
by ELISHEVA (Prior) on Jun 25, 2009 at 09:11 UTC
by BrowserUk (Patriarch) on Jun 25, 2009 at 17:11 UTC