in reply to Re: Advice for optimizing lookup speed in gigantic hashes
in thread Advice for optimizing lookup speed in gigantic hashes


Re^3: Advice for optimizing lookup speed in gigantic hashes
by BrowserUk (Patriarch) on Aug 23, 2011 at 15:25 UTC
    the virtual-memory thrashing that occurs by trying to keep millions of words “in memory” at the same time.

    Do you actually read the posts you reply to, or just divine their content through your lower lumbar regions? Because that's what you are talking out of right now.

    He isn't storing millions of words “in memory”. He clearly states that his dictionary contains 100,000 words. Mine contains 178,000 words:

        @words = do{ local( @ARGV, $/ ) = 'words.txt'; <> };;
        print scalar @words;;
        178691

        undef @words{ @words };;        ## words become hash keys, values stay undef
        print total_size( \%words );;   ## total_size() comes from Devel::Size
        14 224 469                      ## Spaces added for clarity!

    Mine occupies just 14MB (his, less than 10MB). My 3-yr-old, decidedly unsmart £30 cellphone has 10 times that amount of memory. In case that hasn't sunk in, let me make this clear:

    None of the millions of words from the huge file is ever stored in the hash!

    They are read from the file a line at a time, split into a list of words, looked up in the hash (NOT STORED), and then discarded.
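    A minimal sketch of that loop, with a toy in-line dictionary and a toy line of text standing in for words.txt and the huge file (both hypothetical data, not the OP's):

        use strict;
        use warnings;

        ## Build the dictionary hash once. Only the KEYS matter; the values
        ## stay undef, so the hash costs little more than the words themselves.
        my %words;
        undef @words{ qw( the quick brown fox jumps over lazy dog ) };

        ## Scan text line by line: split, look up, discard. Nothing from the
        ## scanned text is ever stored in the hash.
        my $text = "teh quick brown fox jumpz over the lazy dog\n";
        open my $fh, '<', \$text or die $!;

        my $misspelled = 0;
        while ( my $line = <$fh> ) {
            for my $word ( grep length, split /\W+/, lc $line ) {
                ++$misspelled unless exists $words{ $word };
            }
        }
        print "$misspelled\n";    ## 2: 'teh' and 'jumpz' are not in the hash

    Memory use is fixed by the dictionary size, no matter how big the scanned file is.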

    So will you please, please, please stop trotting out your "virtual memory is disk" missive at every inopportune moment. Read what you are responding to. (Follow your own advice: read it twice, and then once more). Think about it for a while. AND THEN SHUT THE F*** UP. Because it is getting boring trying to keep correcting you over and over and over.

    You remind me for all the world of my dear ol' Nan. Loved her to bits, but towards the end she was getting a little bit fixated.

    How are you Nan?

    During the war we had to make ends meet, so we stained our legs with used coffee grounds and drew lines up the back of them with eye-liner, so we didn't have to buy stockings.

    No, no, no. Nan. Nan. Sit down Nan, you need your rest. You don't need to demonstrate. Really no. We believe you.

    So, how did you get on at the doctors Nan?

    During the war we ....


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The only good answer to this posting is a humble "I stand corrected". :)

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^3: Advice for optimizing lookup speed in gigantic hashes
by tobek (Novice) on Aug 23, 2011 at 19:30 UTC

    BrowserUk appears to have this covered on technical grounds, but I also wanted to note: I'm not trying to re-invent the spell checker. If counting typos was all I wanted to do then your idea would certainly get the job done. But, as my post says, I want to spell check individual words during a lot of other processing.

    If you must know: I have a million books digitally scanned by the Open Library, and a lot of these books are really old, so the character recognition isn't good enough to make the result even slightly useful. I'm running a boatload of processing on this big corpus (one-time preprocessing as well as, later, as-needed lookups), so I wanted to chuck out entirely those books that are too garbled to be meaningful, to save time and space. So, while I process them I spell check them, and if the number of spelling errors per line reaches a certain threshold after a certain number of lines, I discard the book.

    I had forgotten about aspell, though, so I did give it a go. However, unless I'm missing something, I need to do a bit of parsing to understand the response aspell gives. In addition, I'd have to start a separate instance of aspell for every sentence, and there are literally billions of sentences, so that would be some real overhead.
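    For what it's worth, aspell has an ispell-compatible pipe mode (`aspell -a`) that lets one persistent process serve the whole run instead of one process per sentence: it answers each input line with one result line per word ('*' for correct, '&' or '#' for a misspelling) followed by a blank line. A sketch of the parsing side, with the process-driving part commented out since it assumes aspell is installed:

        use strict;
        use warnings;

        ## Count the misspellings in one aspell -a result block
        ## ('&' = misspelled with suggestions, '#' = misspelled without).
        sub count_misses {
            return scalar grep { /^[&#]/ } @_;
        }

        ## Driving ONE persistent aspell process for the whole corpus
        ## (hypothetical sketch; requires aspell on the box):
        #
        #   use IPC::Open2;
        #   my $pid = open2( my $out, my $in, 'aspell', '-a' );
        #   my $banner = <$out>;            # discard the version line
        #   print {$in} "^teh quick fox\n"; # leading '^' marks the line as data
        #   my @block;
        #   while ( defined( my $l = <$out> ) ) {
        #       last if $l =~ /^\s*$/;      # blank line ends the block
        #       push @block, $l;
        #   }
        #   my $misses = count_misses(@block);

        ## The parser alone, on a sample result block:
        print count_misses( "& teh 5 0: the, tech, ten\n", "*\n", "*\n" ), "\n";   ## 1

    That removes the per-sentence startup cost; the per-line threshold logic stays exactly where it is now.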

    If you do have any ideas, however, about how I could restructure the process above so that running aspell once over a single file gives me my results, I'd be interested to hear them.