in reply to Speeding up data lookups

This really is interesting. What you are trying to do is get a tenfold speed increase by changing the way you handle these data. It is clear that unless your present programs are extremely inefficient, you will not solve this assignment by making some small changes.

As has already been pointed out before (++ to all my predecessors) a complete rethinking is necessary.

A databased approach seems promising, but will really only come into its own when you can amortize the cost of setting up the data into the database over many queries, otherwise the loading of the data becomes the bottleneck and you are worse of than before.

The fact that until now nobody came up with a winning solution is perhaps because we are all groping around in the dark. Other than that you have huge data-files and smaller holdings-files, we know not much of the actual tasks you are being asked ot do. E.g. why are smaller holding files using binary search and larger holding files using an iterative approach? How small is smaller and how large is larger as far as holding files go? What process is making the "shells"? Can this process be changed to load a database or provide a separate index file with the security identifiers directly mapping to the data in the shell? Can the security identifiers be hashed so you have an even faster search mechanism than binary search? B-trees perhaps?, ... A lot of these problems have already been solved by databases of course, but perhaps you do not need the overhead of a real database engine (if you are only searching you do not need the code to update, delete, index, ...) and can extract only the necessary knowledge to do the minimum you need.

Another question: how much do the shells differ from one another each next run? Id. for the holdings file. Can you somehow only deal with the differences and calculate some delta between the previous value and the new value? (I was inspired by some video codecs that only store the differences between frames in order to save on storage and processor resources). You will have to do some "resyncing" from time to time to see that you are still on track, but perhaps that can be done in a quiet moment when you can spend a few hours to run a full update?

So many questions, so few answers.

If you can be a bit more concrete, I'm sure our collective mind will give you some valuable pointers.

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Replies are listed 'Best First'.
Re^2: Speeding up data lookups
by suaveant (Parson) on Sep 19, 2005 at 21:16 UTC
    Sorry.. again, as I said... I wrote that part poorly... the processing is happening over a 2 hour time, but isn't actually taking two hours... what came in in bits and pieces is now all going to be coming in at once... a conclusive test has not been made yet, but I would guess the speedup needed is more like 2-3 times.

    The binary versus iterative is based on a comparison of the holdings size versus the shell size... it uses a calculation based on log and record numbers to work out the costs of each and compare.

    The shells will not change once the reports start, the reports wait for the data delivery and update, then run. Unfortunately every single report is customized to the user... if they were all the same things would be much, much easier.

                    - Ant
                    - Some of my best work - (1 2 3)

      Are you allowed to show a little glimpse of what a shell and a holding file look like?

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        Well... holding certainly... I will put up a shell line with altered data... since it is financial data who knows what exactly I am allowed to show or not...

        Both fixed width, both sorted on security identifier (in ths case 31283GZY). These are the only two files used in a run.. everything is crammed into the shell as needed ahead of time.

        Holding sample:

        CUSIP 31283GZY HB31283GZY 200409162300 BDABS ABS US +D 16:37:05 S1

        Matching shell line:

        31283GZY6 FHLMC MV DATA COMB 56 324-s234 5.670 +20290201 20170301 9.000000 19830201 DFG + + C V1 + + + + + + + + + + + + + + + + + 20050231M 11 2002091100100010553400100000210435 +5200130000109986500000000026940000000000056960000000000004910000 + + + + + 20071121M 20080321000 +010102900099000000104331349000000104562599000000005014126000000005002 +242000000004190337

                        - Ant
                        - Some of my best work - (1 2 3)