I work for a financial company, and due to some changes in the industry we have to take some processing that used to run over two hours and make it run in 10-15 minutes instead, fun! I have been assigned to streamline it, and thought I would bounce some ideas off the monks.

Background: The current system uses what we call shells (just glorified ordered files: fixed-length records with honking amounts of data in them) and holdings files, which keep track of what a user wants us to process for them. The holdings file is iterated through, and the shell is searched with a binary search for smaller holdings, or by straight iteration for larger holdings.
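
To make the current scheme concrete, the seek-based binary search looks roughly like the sketch below. The record length, key width, and numeric keys are assumptions for illustration, not our actual layout:

<code>
use strict;
use warnings;

my $REC_LEN = 512;   # hypothetical fixed record length
my $KEY_LEN = 8;     # hypothetical key width at the start of each record

# Binary-search a sorted, fixed-length-record file by seeking to the
# midpoint record, comparing its key, and halving the window.
sub find_record {
    my ($fh, $want) = @_;
    my ($lo, $hi) = (0, int((-s $fh) / $REC_LEN) - 1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid * $REC_LEN, 0 or die "seek: $!";
        read $fh, my $rec, $REC_LEN or die "read: $!";
        my $key = substr $rec, 0, $KEY_LEN;
        if    ($key < $want) { $lo = $mid + 1 }
        elsif ($key > $want) { $hi = $mid - 1 }
        else                 { return $rec }
    }
    return;   # not found
}
</code>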

My first thought was maybe to database it. I am pretty good with MySQL, but I am having trouble actually getting this to go faster, even using HANDLER calls, so I may abandon this approach.
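
For the curious, the HANDLER attempt looked roughly like this; the database, table, and index names are invented for the example:

<code>
use strict;
use warnings;
use DBI;

my ($user, $pass) = ('reportuser', 'secret');   # hypothetical credentials
my $dbh = DBI->connect('dbi:mysql:database=shells', $user, $pass,
                       { RaiseError => 1 });

# HANDLER bypasses the optimizer and reads straight from the index,
# which is about as cheap as MySQL point lookups get.
$dbh->do('HANDLER shell_table OPEN');
my $qid = $dbh->quote(8000);                    # example identifier
my $row = $dbh->selectrow_arrayref(
    "HANDLER shell_table READ `PRIMARY` = ($qid)");
$dbh->do('HANDLER shell_table CLOSE');
</code>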

My next thought is to daemonize the process. Right now each of over 1000 reports is started as its own process and handles all its own reading of the shell... I figure I could daemonize the process so that some caching could be done, and fewer Perl procs would need to be started. I figure there are two possibilities for speeding this up: 1) key caching, 2) reading the whole damn shell into memory.
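
The key-caching half is simple enough to sketch: a hash that lives for the daemon's lifetime, fronting the find_record() search above, so repeated identifiers across reports skip the file search entirely:

<code>
# Memoize lookups across reports; %key_cache persists in the daemon.
my %key_cache;

sub cached_lookup {
    my ($fh, $key) = @_;
    $key_cache{$key} = find_record($fh, $key)
        unless exists $key_cache{$key};
    return $key_cache{$key};
}
</code>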

Now, as to reading the whole shell into memory... we have different shells to work with, the largest being 700M... this stuff is running on big Sun boxes with 8 procs and 16GB of memory, both of which can be bumped up some. The only thing that would make this really work is if I am right in remembering that when you fork a process, memory from the parent is shared by all the children until that memory is written to... is that right? If so, I could read the whole shell, fork off X children, and have the info read in parallel by multiple children without completely blowing out the system memory.
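
(That recollection is correct on modern Unixes, Solaris included: fork shares pages copy-on-write.) The pattern would be something like the sketch below; process_reports() is a hypothetical stand-in for the real per-report work. One caveat worth flagging: Perl's reference counting can dirty pages you never explicitly write, so a forest of small hash entries slowly gets copied anyway, while one big flat scalar shares much better:

<code>
use strict;
use warnings;

my $N_WORKERS = 8;   # one per proc, say

open my $fh, '<', 'shell.dat' or die "open: $!";
my $shell = do { local $/; <$fh> };   # slurp the whole shell once
close $fh;

my @pids;
for my $slice (0 .. $N_WORKERS - 1) {
    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: pages of $shell stay shared until written to.
        process_reports($slice, \$shell);   # hypothetical per-report work
        exit 0;
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;
</code>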

Another thing... if I do read the whole thing into memory, should I still use a binary search? I am thinking that if the identifier list I am working from is also in order, there is a lot of opportunity for speeding up a binary search with some custom code. For instance, if I see identifier 8000, I know that no further identifiers will be below 8000, so I can search only from there forward. I could also probably compare the two keys and guess how far forward to set my midpoint, to try to shorten my search.
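
That shrinking-window idea might look like the closure below, assuming the shell is already in memory as an array of records sorted by key; key_of() is a hypothetical extractor for whatever field holds the identifier:

<code>
# Returns a search sub that exploits sorted lookup keys: every search
# raises $lo, so later searches scan an ever-smaller window.
sub make_sorted_searcher {
    my ($recs) = @_;   # array ref, sorted by key
    my $lo = 0;
    return sub {
        my ($want) = @_;
        my $hi = $#$recs;
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $key = key_of($recs->[$mid]);   # hypothetical key extractor
            if    ($key < $want) { $lo = $mid + 1 }
            elsif ($key > $want) { $hi = $mid - 1 }
            else                 { $lo = $mid; return $recs->[$mid] }
        }
        return;   # not found; $lo already sits past $want
    };
}
</code>

The midpoint-guessing variation is essentially interpolation search, and it layers onto the same window: instead of the arithmetic midpoint, probe at a position proportional to where the wanted key falls between the keys at $lo and $hi.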

But is there a better in-memory method, or would simple key caching against the file be better in the long run? Whatever I use to search can't take too long to preload, since report processing must begin immediately after the shell is updated...

Any thoughts would be appreciated.

                - Ant
                - Some of my best work - (1 2 3)

