Re: extract phrases of n-words length

There are as I see it two scalability issues in the above code. The first of these has been addressed well by the various suggestions to use a sliding window ($i...$i+$n on the array of words).

The second is the keys themselves. You are currently using the actual words in each phrase as a key. This means that searching for all sequences of N-M words in a text with K word characters (e.g. alpha-numerics) could concievably take up (N+N+1+...+M)*K characters for its keys alone. The actual amount depends on the frequency of each particular word runs: "a a a ..." will obviously use less space than "a b c d e..." or "a..z b..za c..zab ...".

If you intend to make either N, the range from N...M or K large, you might want to consider assigning each word in the text a integer id and composing your key out of the integers rather than the word strings as you are doing now. Keys composed out of numerical indexes would save space and possibly search time.

In pseudocode, this would work something like this:

my @aWords;  # words from the abstract
my %hWords;  # maps words to their id
my $iUniqueWordsSoFar;  # cheap way to assign ids
my @aIds;    # sliding window with ids from last N-M words

for (0..($#aWords-$n)) {
   #look up/assign word id
   my $sWord = $aWords[$_];
   my $iWord;
   if (exists($hWords{$sWord})) {
     $iWord = $hWords{$sWord};
   } else {
     $iWord = $hWords{$sWord} = ++$iUniqueWordsSoFar;
   }

   # update sliding window of ids for last M words
   shift(@aIds) if scalar(@aIds);
   push @aIds, $iWord;

   # add key to hash for N..M length phrases by taking
   # first X elements of sliding window to construct 
   # the key.
}

# final pass: convert what is left in @aIds to keys and
# update appropriate phrase hashes.
[download]

Best, beth

Comment on Re: extract phrases of n-words length Select or Download Code

Replies are listed 'Best First'.
Re^2: extract phrases of n-words length by BrowserUk (Patriarch) on Jun 25, 2009 at 06:12 UTC
Keys composed out of numerical indexes would save space ... Not so! Using the text from your post as a representative abstract, and building the the 3 hashes using both the original words and integers representing those words I get the following hash sizes: `c:\test>774421-2 Using words: 44501 Using numbers: 40801` [download] Giving a total saving just under 10%. But... If you add in the requirements of the array required to hold the ordered numbers--ordering is a requirement--plus a hash used to unique the words: `Numbers array: 12776 Lookup hash: 8860` [download] The memory requirement is nearly 50% higher. As for "and possibly search time.". As the hash keys are the concatenation of the string representations of the integers assigned, and hash keys are hashed byte-wise; and the average (English) words length is a little under 6-chars; and the average length of the integers required to represent them a little under 5-chars; the difference in the time taken to actually hash the keys will be very minimal. And as the number of pairs, triples & quads will be the same, the hashes will contain the same number of keys, so the lookup times will be very similar. These are the timings I got using the above sample: `Lookup words: 0.000440120697021484 seconds Lookup numbers: 0.000699043273925781 seconds Lookups words: 515; numbers:515 Lookup words: 0.000420093536376953 seconds Lookup numbers: 0.000321149826049805 seconds Lookups words: 515; numbers:515 Lookup words: 0.000285863876342773 seconds Lookup numbers: 0.000767946243286133 seconds Lookups words: 515; numbers:515` [download] Too wildly varying to consider making an assessment. The code used to produce the figures: Read more... (3 kB) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP PCW	[reply] [d/l] [select]
Re^3: extract phrases of n-words length by ELISHEVA (Prior) on Jun 25, 2009 at 09:11 UTC
What array to hold the ordered numbers? The algorithm I proposed had a moving window - an array that never held more than M integers where M is the maximum phrase length - it used the original word array to keep track of ordering. The code above translates the entire word array into numbers before generating the phrase combinations. That is like comparing slurping to line at a time file reads. The key question is: is my post representative of the word distribution of abstracts? Truthfully, I don't know, but you raise an interesting point. The space efficiency of the algorithm described above is highly dependent on the length of repeated words and the rate of word repetition without word order repetition - of which there is little in that text. It is hardly surprising that you would get the minimal space benefits you did with my post's text (only 10%). There are 186 distinct words (defined as runs bounded by whitespace, start of text or end of text) and both the median, average word repetition is less than 2 and word length is about 4.6 rather than 6. The average 3 word phrase contains only 11 characters including whitespace. This is only slightly longer than its numeric representation. Given that there are 186 words, the average numeric id is closer to three characters. For a phrase, that means an id of between 8 and 11 characters (2+32=8, 2+33=11). As an extreme example of why word order and word length matters, counting word runs in "a b a b a b" extended even ad-infinitum is going to cost more using indexing than not. Alhough words repeat greatly, only two phrases ever appear: "a b" and "b a". As words are the same length as the integers that represent them, the indexing mechanism would save nothing. The indexing hash takes up space and there is 0 saving in key length: key length of "1,2" and "2,1" is exactly the same as "a b", "b a". Best, beth Update:initially misunderstood what BrowserUK meant by array to hold the sequence. Update:added more statistics on text sample; added acknolwegement to BrowserUK.	[reply]
Re^4: extract phrases of n-words length by BrowserUk (Patriarch) on Jun 25, 2009 at 17:11 UTC
I guess the question is: are "abstracts" likely to be hugely long, and contain many repetitions of above average length words? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP PCW	[reply]