The mystery 300MB file is a trimmed, translated, and sorted EDI dataset. Although it would make much more sense to divide the file up by daily transactions, that is just a little too risky for me, as the file is appended to by several other scripts. The file (which I lovingly refer to as "The Mother Log") is not only written to by several other Perl scripts, but also read by several CGI scripts (not all Perl) and queried by Access (lol). I would prefer to load the file into memory once a day and then, when asked for data, grab it out of RAM instead of thrashing the HD.
UPDATE: I've decided for now that I'll rotate the log more often to keep it below 150MB (a rough rotation sketch follows the list below), and set up a RAM disk that mirrors the backup file. I think this setup is best for now for several reasons:
1. The log isn't meant to replace existing datasets; it is meant to remove duplicate transactions in order to keep the file size down (one day's worth of EDI data for my company is around 700MB; the MotherLog will be about 150MB and will contain a month's worth of transactions).
2. I don't want to rearrange the data too much, because the log has in the past been used to extract chargebacks from suppliers breaking their contracts. I'd rather the file remain extremely simple so that errors can be recognized immediately.
3. I'm not getting paid enough to reprogram this thing again :)
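Here's a rough sketch of the size-based rotation, for the curious; the paths and the archive naming are just placeholders, and it assumes the writer scripts open, append, and close per transaction rather than holding the file open:

    #!/usr/bin/perl
    # rotate_motherlog.pl -- size-based rotation sketch; paths are placeholders.
    # Run from a scheduled task; rotates once the log passes roughly 150MB.
    use strict;
    use warnings;
    use POSIX qw(strftime);

    my $log   = 'C:/edi/motherlog.txt';      # assumed location of the MotherLog
    my $limit = 150 * 1024 * 1024;           # ~150MB threshold

    if ( -e $log and -s $log > $limit ) {
        my $stamp    = strftime( '%Y%m%d-%H%M%S', localtime );
        my $archived = "C:/edi/archive/motherlog.$stamp.txt";

        # Rename is cheap on the same volume; writers that open/append/close
        # per transaction will simply start a fresh file on their next write.
        rename $log, $archived
            or die "Could not rotate $log -> $archived: $!";
    }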
Thanks everyone!
Given the disparate processes needing access, and that you need both read and write access, a RAM disc is by far your easiest and sanest option, provided you have the memory to hold 300MB.
RAM discs have minimal overhead, and the only change required of the applications would be updated paths. Even that could be eliminated if you can use a symbolic link. The only downside is the risk of data loss in the event that the server crashes.
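As a rough illustration of the symbolic-link idea (Unix-style links, with invented paths; on Windows you would need junctions or similar), the scripts keep opening one well-known path while the link decides where the bytes actually live:

    # Invented paths -- every script keeps opening /data/motherlog.txt,
    # and the symlink decides whether that resolves to disk or to the RAM disc.
    use strict;
    use warnings;
    use File::Copy qw(copy);

    my $disk_copy = '/var/edi/motherlog.txt';      # persistent copy on disk
    my $ram_copy  = '/mnt/ramdisk/motherlog.txt';  # working copy on the RAM disc
    my $app_path  = '/data/motherlog.txt';         # the path the scripts open

    copy( $disk_copy, $ram_copy ) or die "Could not seed the RAM disc: $!";

    # Point the well-known path at the RAM disc copy; rename over the old
    # link so readers never see a moment with no file at all.
    unlink "$app_path.new";
    symlink $ram_copy, "$app_path.new" or die "symlink: $!";
    rename "$app_path.new", $app_path  or die "rename: $!";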
There are several ways you could approach mitigating the data loss, some simplistic, others quite involved, depending on the level of reliability you need.
An interesting thought, though I emphasise it's nothing more than a thought, would be to create a partition/filesystem on disc the same size as the RAM disc and use mirroring to reflect the RAM disc onto the filesystem. Whether there is any mileage in the idea depends on the OS, the mirroring software, etc., and whether the latter can be configured to use a RAM disc.
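At the simplistic end of the scale, a periodic copy from the RAM disc back to a disk partition would limit the exposure to one interval of appended data; a sketch, with the paths and interval invented:

    # Simplistic mitigation sketch: flush the RAM disc copy back to disk every
    # N minutes, so a crash costs at most one interval's worth of appends.
    use strict;
    use warnings;
    use File::Copy qw(copy);

    my $ram_copy  = '/mnt/ramdisk/motherlog.txt';  # invented path
    my $disk_copy = '/var/edi/motherlog.bak';      # invented path
    my $interval  = 15 * 60;                       # seconds between flushes

    while (1) {
        # Copy to a temporary name first, then rename, so a crash mid-copy
        # never leaves a truncated backup behind.
        if ( copy( $ram_copy, "$disk_copy.tmp" ) ) {
            rename "$disk_copy.tmp", $disk_copy or warn "rename: $!";
        }
        else {
            warn "flush failed: $!";
        }
        sleep $interval;
    }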
Attempting to use any form of caching is likely to be a problem, as caching only really benefits you if the same sections of the data are being repeatedly accessed. Given all the different processes that would be vying for cache space in your scenario, you're likely to slow things down with cache thrash rather than speed them up.
Examine what is said, not who speaks.
I would still think that a DB would be a prime fit for this. Access is not too shabby at getting around DBs, and if nothing else, an export process should be able to keep a copy in Access format.
The problem you should be having is not so much disk read time (although that must be considerable) but the actual retrieval of data from a 300 MB variable. I cannot understand why it would be risky to split the transactions. If they are transactions, they should be sequential and without dependency. (Don't you at least have to split the variable within your script??) I would think that even splitting transactions by some other method, like by date, would have some benefit.
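For example, a one-pass split by date is cheap if (and this is an assumption about your layout) each record carries a YYYYMMDD stamp it can be keyed on:

    # Sketch: split the flat file into one file per day. It assumes each
    # record begins with a YYYYMMDD field, which may not match your layout.
    use strict;
    use warnings;

    my %out;    # date => open filehandle

    open my $in, '<', 'motherlog.txt' or die "motherlog.txt: $!";
    while ( my $line = <$in> ) {
        my ($date) = $line =~ /^(\d{8})/ or next;    # skip malformed records
        unless ( $out{$date} ) {
            open $out{$date}, '>>', "motherlog.$date.txt"
                or die "motherlog.$date.txt: $!";
        }
        print { $out{$date} } $line;
    }
    close $_ for values %out;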
If nothing else, sorting your current records and putting them in a DB would help. Then just update the other scripts to write to a DB instead of a file. If you still have one or two scripts that need the flat file, do an export on each transaction, or less often if you need less. Even halving the disk reads would have to help. A 300 MB dataset is roughly equivalent to 600 copies of Gulliver's Travels. I personally panic when datasets get into the 30-40 MB range.
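A minimal sketch of the write-to-a-DB side using DBI; the DSN, table, and column names here are placeholders rather than your actual schema:

    # Minimal DBI sketch: append a transaction to a table instead of the flat
    # file. The DSN, table, and column names are placeholders, not your schema.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:ODBC:EDI', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    my $sth = $dbh->prepare(
        'INSERT INTO transactions (trans_date, partner, doc_type, raw_record)
         VALUES (?, ?, ?, ?)'
    );

    # One row per EDI transaction; the writer scripts would call something
    # like this instead of appending a line to the MotherLog.
    my ( $trans_date, $partner, $doc_type, $raw_record )
        = ( '2003-06-01', 'ACME', '850', 'ST*850*0001~...' );
    $sth->execute( $trans_date, $partner, $doc_type, $raw_record );

    $dbh->disconnect;

The scripts that still need the flat file could then be fed by a nightly SELECT dumped back out to text.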
While caching might work, it is not a long-term solution. Heck, just maintaining a 300 MB file in RAM without swapping on Win2k means you need more than 1 GB of RAM. If that dataset grows too large.... Also, if you are noticing a lot of HDD thrashing, it is most likely because of swapping. For a standard file read, the process is quick and one-time. The problem is that as you load that 300 MB dataset into memory, many things need to be swapped out. Windows has a very aggressive swapping system and will swap well before memory is full. You also start to contend with growing and shrinking the swap file if it is dynamic in size.
If the machine running this has less than 1 GB of RAM, look at your swapping performance. Another test is to manually set virtual memory to twice RAM and wait to see if the machine carps about being out of memory. If it does, you need more memory; more swap is a losing proposition.
~Hammy