PerlMonks  


Pre-allocation is not your problem. Consider that in a hash with 4 entries, the entire wastage is probably no more than 16 bytes per hash. 16 bytes x 120,000 words = 1,920,000 bytes, or 1,875 Kbytes (about 1.8 Mbytes). That is under 3% of 70 Mbytes, nowhere near a significant share. What you are observing is the overhead required for each hash, each string, and even each integer.

Definitely, as other people have suggested, moving the data to disk will deal with your memory problem. However, it does not deal with the size of the data itself. Even the most space-efficient *DBM_File modules cannot pack the data on disk more than a few percent tighter than Perl packs it in memory, and there will be a significant performance hit.

I spent some time coming up with an alternative that stores each word=>score hash as a string of encoded 32-bit integer pairs (id, score). In the end I decided that this is probably not a good approach unless one knows exactly what one is doing, and that the best way to show that is to implement it. I leave it to you to decide whether research in this area is desirable. Basically, you flatten the hash of hashes into 3 hashes of strings. The idscores string could encode the pairs so that they can be found with a binary search, or perhaps a linear search is adequate. Ideas, ideas...
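
For what it is worth, here is a rough sketch of how the flattening might look. It is only one reading of the idea, not a tested design. The names %id_of, %word_of, %idscores, word_id, store_scores and lookup_score are all made up for illustration, and it assumes every score is a non-negative integer that fits in 32 bits. Each inner hash becomes one packed string of (id, score) pairs, sorted by id so that a binary search works.

```perl
use strict;
use warnings;

# The three flat hashes (hypothetical names, not from the original post):
#   %id_of    word => integer id
#   %word_of  integer id => word
#   %idscores key word => packed string of (id, score) pairs
my (%id_of, %word_of, %idscores);
my $next_id = 0;

# Return the integer id for a word, assigning a new one on first sight.
sub word_id {
    my ($word) = @_;
    unless (exists $id_of{$word}) {
        $id_of{$word}      = $next_id;
        $word_of{$next_id} = $word;
        $next_id++;
    }
    return $id_of{$word};
}

# Replace one inner hash (word => score) with a packed string.
# Assumes scores are unsigned integers below 2**32.
sub store_scores {
    my ($key, $scores) = @_;
    my %by_id;
    $by_id{ word_id($_) } = $scores->{$_} for keys %$scores;
    # 'N' is a 32-bit unsigned big-endian integer; each pair takes 8 bytes.
    $idscores{$key} = pack 'N*',
        map { ($_, $by_id{$_}) } sort { $a <=> $b } keys %by_id;
    return;
}

# Binary search the packed pairs for the score of $word under $key.
# Returns undef if either the key or the word is absent.
sub lookup_score {
    my ($key, $word) = @_;
    my $packed = $idscores{$key};
    my $want   = $id_of{$word};
    return unless defined $packed && defined $want;

    my ($lo, $hi) = (0, length($packed) / 8 - 1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($id, $score) = unpack 'N2', substr($packed, $mid * 8, 8);
        if    ($id < $want) { $lo = $mid + 1 }
        elsif ($id > $want) { $hi = $mid - 1 }
        else                { return $score }
    }
    return;
}

store_scores('apple', { pie => 3, tree => 7, juice => 12 });
store_scores('pear',  { tree => 2 });
print lookup_score('apple', 'tree'), "\n";   # prints 7
```

Each pair costs exactly 8 bytes inside one string, so the per-entry overhead of the inner hashes goes away. The price is that every update has to rebuild or splice the string, which is where the "knows exactly what one is doing" warning comes in.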


In reply to Re: A memory efficient hash, trading off speed - does it already exist? by MarkM
in thread A memory efficient hash, trading off speed - does it already exist? by JPaul
