Thanks for all of your help!

I tried all of this out, initially testing it on 100 files of about 1MB each. That's inefficient, I know, but it's the format my data originally came in (each file is a book). As you foresaw, threading for zillions of small files wasn't great, and your script actually increased my running time from 11 seconds to 17 seconds. However, when I took out threading but left the other changes in, it cut my running time in half, to 5.5 seconds! Wonderful! I had grown accustomed to declaring variables separately rather than creating them on the fly, because I find my code easier to read and maintain that way, and I never quite noticed how inefficient (at least in Perl) that is.
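For anyone who wants to check the declaration-style difference on their own machine, here is a minimal sketch using the core Benchmark module. It is not the script from this thread: the sub names `separate` and `on_the_fly` and the input data are made up for illustration, and the numbers will vary with Perl version and hardware.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up input: 200 five-letter words (an assumption, not real corpus data).
my $line = join ' ', ('lorem') x 200;

# Style A: every variable is declared up front, then assigned later.
sub separate {
    my ($text) = @_;
    my @words;
    my $word;
    my $total;
    $total = 0;
    @words = split ' ', $text;
    foreach $word (@words) {
        $total += length($word);
    }
    return $total;
}

# Style B: values are created on the fly, where they are used.
sub on_the_fly {
    my ($text) = @_;
    my $total = 0;
    $total += length($_) for split ' ', $text;
    return $total;
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    separate   => sub { separate($line) },
    on_the_fly => sub { on_the_fly($line) },
});
```

Both subs compute the same total, so any difference in the table comes from how the variables are created and copied rather than from the work being done.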

When I combined my input from 100 1MB files into 4 25MB files, the improved script without threading ran in 4 seconds, and putting threading back in only slowed it down a little, to 4.7 seconds. So I guess opening a file takes a little longer than spell checking 25MB, or there is overhead for threading, or both. When I process the whole corpus I plan to combine everything into about 1000 1GB files, and I presume that threading would then only make things quicker. (Though it leaves me with a lingering, generic question that I was considering asking on Stack Overflow or somewhere, because it's not specifically Perl related: why not just have a single 1TB file if I'm only ever processing the whole thing? Not that 1000 file opens take up a large portion of an operation that takes several days, assuming that the time to open a filehandle is constant regardless of file size.)
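The assumption that opening a filehandle costs about the same regardless of file size can be tested directly. Below is a rough sketch, assuming the core modules File::Temp and Time::HiRes; the helpers `make_file` and `time_open` and the two file sizes are hypothetical, and the results will depend on the filesystem and on OS caching.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: create a temporary file of the given size in bytes.
sub make_file {
    my ($bytes) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);   # removed when the program exits
    binmode $fh;
    print {$fh} 'x' x $bytes;
    close $fh or die "close failed: $!";
    return $path;
}

# Hypothetical helper: average seconds per open/close pair over $reps runs.
sub time_open {
    my ($path, $reps) = @_;
    my $start = [gettimeofday];
    for (1 .. $reps) {
        open my $in, '<', $path or die "Cannot open $path: $!";
        close $in;
    }
    return tv_interval($start) / $reps;
}

# Sizes chosen only for illustration: 1KB versus 10MB.
my %size = (small => 1_024, large => 10 * 1_024 * 1_024);
my %path = map { $_ => make_file($size{$_}) } keys %size;

for my $name (sort keys %size) {
    printf "%-5s (%9d bytes): %.6f s per open\n",
        $name, $size{$name}, time_open($path{$name}, 1_000);
}
```

This only times open and close, not reading, so it isolates the per-file cost that 1000 files would pay and a single 1TB file would pay once.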

Anyway, thanks so much for your help - you cut my execution time in half and pointed me in the right direction for future savings. Cheers!


In reply to Re^6: Advice for optimizing lookup speed in gigantic hashes by tobek
in thread Advice for optimizing lookup speed in gigantic hashes by tobek
