FWIW, my inclination would be to open both files and read and process them in parallel, line by line. Dunno how efficient it is, but I'd build a hash of array refs of hash refs, where the basic layout is something like this...

$hash->{word is key}->[each word gets an index]->{linenumber}, where the value is a list of word numbers within that line (adding another ->[index for each wordnum]). Or, if that's too complex, just keep a tally of how many times the word appears in the line, and then search the line when you need the positions.
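Here's a rough sketch of how building that could look. It's only one reading of the layout above: each word maps to an array ref with one element per occurrence, and each element is a small hash ref holding the line number and word position. The key names `line` and `pos`, the `index_line` helper, and the sample text are all made up for illustration, and the two "files" are in-memory strings so the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-ins for the two input files (hypothetical sample data). In-memory
# filehandles keep the sketch self-contained; swap in real paths as needed.
my $text_a = "the cat sat\nthe dog ran\n";
my $text_b = "le chat assis\nle chien court\n";

open my $fh_a, '<', \$text_a or die "cannot open first input: $!";
open my $fh_b, '<', \$text_b or die "cannot open second input: $!";

my (%index_a, %index_b);

# Record every word on one line: one array element per occurrence,
# each holding the line number and the word's position in that line.
sub index_line {
    my ($index, $line, $line_no) = @_;
    my @words = split ' ', $line;
    for my $pos (0 .. $#words) {
        push @{ $index->{ $words[$pos] } }, { line => $line_no, pos => $pos };
    }
}

# Read both inputs in lockstep, stopping when either one runs out.
my $line_no = 0;
while (1) {
    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;
    last unless defined $line_a && defined $line_b;
    $line_no++;
    chomp($line_a, $line_b);
    index_line(\%index_a, $line_a, $line_no);
    index_line(\%index_b, $line_b, $line_no);
}

# The array in scalar context gives the number of occurrences in the file.
printf "'%s' occurs %d time(s) in the first file\n",
    'the', scalar @{ $index_a{the} };
```

Keeping one index per file means the same line number can be looked up in both, which is what the line-by-line pairing relies on.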

(If the hash of array of hash of array of hash of ... etc. makes you squeamish, you might also consider wrapping it in objects to make it more readable.)
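If you go the object route, a minimal sketch might look like the following. The class name `WordIndex` and its methods `add_line`, `count`, and `lines` are invented here purely for illustration; the point is only that the nested dereferencing ends up hidden behind a few named methods.

```perl
use strict;
use warnings;

package WordIndex;    # hypothetical class name

sub new {
    my ($class) = @_;
    return bless { words => {} }, $class;
}

# Store one { line, pos } record per occurrence of each word on the line.
sub add_line {
    my ($self, $line_no, $line) = @_;
    my @words = split ' ', $line;
    for my $pos (0 .. $#words) {
        push @{ $self->{words}{ $words[$pos] } },
            { line => $line_no, pos => $pos };
    }
    return $self;
}

# Total occurrences of a word; 0 if it was never seen.
sub count {
    my ($self, $word) = @_;
    return scalar @{ $self->{words}{$word} || [] };
}

# Distinct line numbers on which the word appears, in ascending order.
sub lines {
    my ($self, $word) = @_;
    my %seen;
    return sort { $a <=> $b }
           grep { !$seen{$_}++ }
           map  { $_->{line} } @{ $self->{words}{$word} || [] };
}

package main;

my $idx = WordIndex->new;
$idx->add_line(1, 'the cat saw the dog');
$idx->add_line(2, 'the dog ran');

print "the: ", $idx->count('the'), " occurrence(s) on line(s) ",
    join(', ', $idx->lines('the')), "\n";
```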

That would make the second part of your problem not so difficult: you could immediately access every word in a file, you'd know how many times it occurs in the full file by evaluating the array of hash refs in scalar context, and you'd have a line number for each entry.
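As a sketch of those lookups, assuming an index with one array element per occurrence and each element a hash ref with `line` and `pos` keys (names chosen here for illustration, as are the `total_count` and `per_line_tally` helpers and the hand-built sample data):

```perl
use strict;
use warnings;

# Small hand-built index (hypothetical data): word => [ one record per occurrence ].
my %index = (
    the => [
        { line => 1, pos => 0 },
        { line => 1, pos => 3 },
        { line => 2, pos => 0 },
    ],
    cat => [ { line => 1, pos => 1 } ],
);

# Total occurrences in the file: the array evaluated in scalar context.
sub total_count {
    my ($index, $word) = @_;
    return exists $index->{$word} ? scalar @{ $index->{$word} } : 0;
}

# Per-line tally: returns a hash ref of line number => occurrences on that line.
sub per_line_tally {
    my ($index, $word) = @_;
    my %tally;
    $tally{ $_->{line} }++ for @{ $index->{$word} || [] };
    return \%tally;
}

my $tally = per_line_tally(\%index, 'the');
printf "'the' appears %d time(s) in total\n", total_count(\%index, 'the');
printf "  line %d: %d time(s)\n", $_, $tally->{$_}
    for sort { $a <=> $b } keys %$tally;
```

The `exists` check and the `|| []` fallback keep a lookup for an unseen word from autovivifying an empty entry in the index.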

I assume in your second example you meant "un" and "an", not and...

Curious, but is this to draw a correlation algorithmically between the meanings of words, based on how often they appear in the same lines? IOW, is the intent to look at all the words on a line, see whether they consistently show up on each corresponding line, and thus draw out the meaning?


In reply to Re: term frequency and mutual info by raybies
in thread term frequency and mutual info by perl_lover_always
