in reply to term frequency and mutual info
FWIW, my inclination would be to open both files and read/process them simultaneously, line by line. I don't know how efficient it is, but I'd build a hash of array refs of hash refs, with a basic layout something like this:

$hash->{word}->[occurrence index]->{line_number} = [ list of word positions ]

That is: each word is a key, each occurrence of the word gets an array index, and each occurrence records the line number it appeared on, with the value being a list of word-number positions on that line (add one more ->[index] per word number). If that's too complex, just keep a tally of how many times the word appears on each line, and re-search the line when you need the exact positions.
(You might also consider wrapping this in objects to make the hash of array of hash of array of hash of ... more readable, if the raw nesting makes you squeamish.)
That would make the second part of your problem less difficult: you could immediately access all words in a file, you'd know how many times each word occurred in the full file by evaluating its array of hash refs in scalar context, and you'd have a line-number entry for every occurrence.
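For concreteness, here's a minimal Perl sketch of that index. I've dropped the per-occurrence array level to keep it short, so the layout is just word -> line number -> list of word positions; the sample lines and names are made up for illustration:

```perl
use strict;
use warnings;

# Simplified layout: $index{word}{line_number} = [ word positions on that line ]
my @lines = (
    "un exemple de ligne",
    "an example line",
);

my %index;
my $line_no = 0;
for my $line (@lines) {
    $line_no++;
    my $word_no = 0;
    for my $word (split ' ', $line) {
        $word_no++;
        # Record the position of this word on this line
        push @{ $index{$word}{$line_no} }, $word_no;
    }
}

# Total frequency of a word in the file: sum the sizes of its
# position lists across all lines it appears on.
sub tf {
    my ($word) = @_;
    my $total = 0;
    $total += scalar @$_ for values %{ $index{$word} || {} };
    return $total;
}

printf "tf(line) = %d\n", tf("line");   # "line" appears once, on line 2
```

With the full per-occurrence array level restored, `scalar @{ $hash->{$word} }` would give you the total count directly, which is the scalar-context trick mentioned above.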
I assume in your second example you meant "un" and "an", not "and"...
Curious, but is this meant to draw a correlation algorithmically between the meanings of words, based on how often they appear on the same lines? IOW, is the intent to look at all words on a line, see whether they consistently show up on each corresponding line, and thus draw out the meaning?
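If that is the intent, one standard way to quantify it is pointwise mutual information over aligned lines. A rough sketch follows; the sample data and all names are invented here, and I'm assuming the two files have line-for-line correspondence:

```perl
use strict;
use warnings;

# Toy aligned "files": line i of @file_a corresponds to line i of @file_b.
my @file_a = ("un chat", "un chien", "le chat");
my @file_b = ("a cat",   "a dog",   "the cat");

my $n = scalar @file_a;   # number of aligned line pairs

# Return a set (hash) of line indices where $word occurs.
sub lines_with {
    my ($word, @lines) = @_;
    my %hit;
    for my $i (0 .. $#lines) {
        $hit{$i} = 1 if grep { $_ eq $word } split ' ', $lines[$i];
    }
    return %hit;
}

# Pointwise mutual information (in bits) between a word from file A
# and a word from file B, estimated from line co-occurrence.
sub pmi {
    my ($wa, $wb) = @_;
    my %a = lines_with($wa, @file_a);
    my %b = lines_with($wb, @file_b);
    my $both = grep { $b{$_} } keys %a;   # lines where both occur
    return undef unless $both;
    my ($pa, $pb, $pab) =
        map { $_ / $n } scalar(keys %a), scalar(keys %b), $both;
    return log( $pab / ($pa * $pb) ) / log(2);
}

printf "PMI(un, a) = %.3f bits\n", pmi("un", "a");
```

A high PMI means the pair co-occurs on corresponding lines more often than chance would predict, which is exactly the "consistently show up together" signal.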
Re^2: term frequency and mutual info
by perl_lover_always (Acolyte) on Oct 22, 2010 at 08:27 UTC