What array to hold the ordered numbers? The algorithm I proposed had a moving window - an array that never held more than M integers where M is the maximum phrase length - it used the original word array to keep track of ordering. The code above translates the entire word array into numbers before generating the phrase combinations. That is like comparing slurping to line at a time file reads.

The key question is: is my post representative of the word distribution of abstracts? Truthfully, I don't know, but you raise an interesting point. The space efficiency of the algorithm described above is highly dependent on the length of repeated words and the rate of word repetition without word order repetition - of which there is little in that text.

It is hardly surprising that you would get the minimal space benefits you did with my post's text (only 10%). There are 186 distinct words (defined as runs bounded by whitespace, start of text or end of text) and both the median, average word repetition is less than 2 and word length is about 4.6 rather than 6. The average 3 word phrase contains only 11 characters including whitespace. This is only slightly longer than its numeric representation. Given that there are 186 words, the average numeric id is closer to three characters. For a phrase, that means an id of between 8 and 11 characters (2+3*2=8, 2+3*3=11).

As an extreme example of why word order and word length matters, counting word runs in "a b a b a b" extended even ad-infinitum is going to cost more using indexing than not. Alhough words repeat greatly, only two phrases ever appear: "a b" and "b a". As words are the same length as the integers that represent them, the indexing mechanism would save nothing. The indexing hash takes up space and there is 0 saving in key length: key length of "1,2" and "2,1" is exactly the same as "a b", "b a".

Best, beth

Update:initially misunderstood what BrowserUK meant by array to hold the sequence.

Update:added more statistics on text sample; added acknolwegement to BrowserUK.


In reply to Re^3: extract phrases of n-words length by ELISHEVA
in thread extract phrases of n-words length by arun_kom

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.