Re: Moving from hashing to tie-ing.
by Fletch (Bishop) on Jul 31, 2006 at 15:21 UTC
Given the structure of what you've given, consider either stepping up to a real DBMS (Postgres, MySQL) or take a half step like SQLite. Then just use DBI.
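For what it's worth, the SQLite route can be as small as this; the table layout, field positions, and file names here are made up for illustration:

    use strict;
    use warnings;
    use DBI;

    # One local file, no server to administer.
    my $dbh = DBI->connect('dbi:SQLite:dbname=lookup.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS names (pin TEXT PRIMARY KEY, name TEXT)');

    # Load the fixed-width dump once; the substr offsets are illustrative.
    my $ins = $dbh->prepare('INSERT OR REPLACE INTO names (pin, name) VALUES (?, ?)');
    open my $fh, '<', 'a_supporting_data_file' or die $!;
    while (my $rec = <$fh>) {
        $ins->execute(substr($rec, 0, 9), substr($rec, 9, 9));
    }
    close $fh;
    $dbh->commit;

    # Afterwards, look records up one at a time instead of hashing the whole file.
    my ($name) = $dbh->selectrow_array(
        'SELECT name FROM names WHERE pin = ?', undef, '123456789');

DBD::SQLite bundles the SQLite engine, so there is nothing to install beyond the module itself.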
Re: Moving from hashing to tie-ing.
by ptum (Priest) on Jul 31, 2006 at 15:24 UTC
I've found that using Tie::File for large files (like yours) has not given me the performance I was seeking. Is it possible for you to divide your process into two steps, one to parse the file and store the relevant components into a relational database, and the second to process the various components according to your needs, using the power of the relational database? I've had good success with Oracle and MySQL ...
Update: Yeah, like Fletch said. Curse those johnny-on-the-spot monks! :)
No good deed goes unpunished. -- (attributed to) Oscar Wilde
Re: Moving from hashing to tie-ing.
by CountZero (Bishop) on Jul 31, 2006 at 16:17 UTC
A 2.5-gigabyte dataset usually cannot be kept in memory all at once (unless you have an extraordinary amount of memory in your computer), so whether you use a single hash or the hash-of-hashes you suggested, your OS is going to swap data to disk all the time, severely hurting performance. If you don't want to go the database way yet, tie-ing is perhaps not so bad, as it really uses a simple database behind the scenes. Much will depend on how large your key space needs to be; if you cannot keep the keys in memory at once, performance will still suffer. How many records do you have in the file, and how long is each pin?
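For example, tie-ing a hash to DB_File keeps the data on disk and only reads the pages that are touched; a minimal sketch, with the file name and fields made up:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # The hash lives in pins.db on disk; only accessed pages are read into memory.
    tie my %pin, 'DB_File', 'pins.db', O_CREAT|O_RDWR, 0644, $DB_HASH
        or die "Cannot tie pins.db: $!";

    # DBM values must be flat strings, so join the fields going in ...
    $pin{'123456789'} = join '|', 'John', 'Doe';

    # ... and split them coming back out.
    my ($first, $last) = split /\|/, $pin{'123456789'};

    untie %pin;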
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Moving from hashing to tie-ing.
by Ieronim (Friar) on Jul 31, 2006 at 15:54 UTC
Re: Moving from hashing to tie-ing.
by eff_i_g (Curate) on Jul 31, 2006 at 15:45 UTC
Honestly, that is what should be done--moving their data into a database. In fact, the file we receive is actually an extract from their database. At the moment I am supposed to understand the project and try to "improve" it before really improving it. The whole thing needs to be rewritten in my opinion, yet my time is limited.
Until I am able to spend time on this, it seems like the old method should stay in place?
Thanks again.
eff_i_g,
Presumably your problem is one of performance, though you never come out and say it. You haven't told us enough about how this data structure works within the program to provide any ideas beyond just changing your data structure.
Here are some things to think about. Perhaps instead of changing the pipes into hashes (HoH), you could use arrays (HoA), as they take up less space. You don't indicate whether having the entire data structure in memory at once is even necessary. One possibility would be to load only the portion of the structure needed for any one unit of work at a time. While this will add I/O, it lets you trade some speed for memory, since your memory requirements will then be manageable.
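A tiny illustration of the HoH-versus-HoA trade, with made-up field names; an anonymous array carries less per-record overhead than an anonymous hash:

    use strict;
    use warnings;
    use constant { FIRST => 0, LAST => 1 };   # readable indexes into the arrays

    # Hash of hashes: every record gets its own anonymous hash.
    my %hoh;
    $hoh{'123456789'} = { first => 'John', last => 'Doe' };

    # Hash of arrays: lighter per record; the field names live once, in the constants.
    my %hoa;
    $hoa{'123456789'} = [ 'John', 'Doe' ];

    print $hoa{'123456789'}[FIRST], "\n";     # prints "John"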
I have lots of other ideas, but they are all what-ifs until you share more about how your program works and how the data structure is used within it.
Limbic,
The files we receive are fixed length, and many of the fields are not even used. A file may contain a dozen fields, but we may only need 2 or 3. Since these files are extracted from their database, there is a lot of ID matching to be done, which is mainly what the hashes do, such as $name_hash{'123'} = { first => 'John', last => 'Doe' };. Once all of the supporting files are hashed in this manner, a main script uses them as lookup tables. Therefore, every time it sees a record that has '123' in a certain field, it knows to use "John Doe" during the processing.
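Simplified, the hashing looks something like this (the unpack template and file name are stand-ins for the real documented layouts):

    use strict;
    use warnings;

    # Build one lookup table per supporting file ('A9 A10 A10' is a placeholder
    # for the real field positions in the documentation).
    my %name_hash;
    open my $fh, '<', 'a_supporting_data_file' or die $!;
    while (my $rec = <$fh>) {
        my ($id, $first, $last) = unpack 'A9 A10 A10', $rec;
        $name_hash{$id} = { first => $first, last => $last };
    }
    close $fh;

    # The main script then treats it as a lookup table:
    my $id = '123456789';
    print "$name_hash{$id}{first} $name_hash{$id}{last}\n" if exists $name_hash{$id};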
Having all of the data in memory is not necessary; however, I do not know how this could be done without using a database. Each record, from a few dozen to a few thousand, needs to use these supporting hashes.
Let me know if you need more information; I appreciate your help.
Re: Moving from hashing to tie-ing.
by kwaping (Priest) on Jul 31, 2006 at 17:10 UTC
Re: Moving from hashing to tie-ing.
by BrowserUk (Patriarch) on Jul 31, 2006 at 16:00 UTC
How many fields are there per record? How long is each record?
Update: Also, what is $pin? 4-digit numeric? Alpha? Is there a defined range for $pin?
Also, is this a single-user-at-a-time or multi-user application?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Browser,
The number of fields per record ranges from about 3 to 100. There are many types of pins; these are no more than a dozen characters. Many users can run the script, which works in this fashion:
- Determine the section and id being processed.
- Hash all of the needed customer files for this section.
- Process.
If user A and user B are running the same section but different ids, there are two instances of the script, each hashing all of the same information into memory. The processing is only slightly different due to the ids.
Re: Moving from hashing to tie-ing.
by rodion (Chaplain) on Jul 31, 2006 at 16:18 UTC
I assume the issue is interactive lookup access time.
Reloading the DB dump you receive into a local DB, as Fletch suggests, is the best option, but if you can't afford to set up a database server, then your only other choice is to use something like Search::Binary and implement its read() function using seek. It wouldn't be as fast in searching, but it might be acceptable, and way better than a linear search, which would take a long time on a file that size, even with a very big disk cache. Other than a DB or binary search, trying to squeeze it into memory is your only option, and storing it as strings with separators and/or with some compression is the thing to try.
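The seek-based lookup, Search::Binary's own interface aside, boils down to something like this; it assumes the file is sorted by pin, every record is exactly $RECLEN bytes, and the pin sits at the front of each record (the lengths are invented):

    use strict;
    use warnings;

    my $RECLEN = 80;   # bytes per record, newline included (invented)
    my $PINLEN = 9;    # width of the leading pin field (invented)

    # Binary search over a sorted, fixed-length-record file.
    sub lookup_record {
        my ($fh, $want) = @_;
        my ($lo, $hi) = (0, int((-s $fh) / $RECLEN) - 1);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            seek $fh, $mid * $RECLEN, 0 or die "seek failed: $!";
            read $fh, my $rec, $RECLEN or die "read failed: $!";
            my $pin = substr $rec, 0, $PINLEN;
            if    ($pin lt $want) { $lo = $mid + 1 }
            elsif ($pin gt $want) { $hi = $mid - 1 }
            else                  { return $rec }
        }
        return;                    # pin not present
    }

    open my $fh, '<', 'a_supporting_data_file' or die $!;
    my $rec = lookup_record($fh, '123456789');

At that (invented) record size a 2.5 GB file is roughly 25 probes per lookup instead of a full scan.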
On the other hand, if you have a batch process with a list of keys (pins), you can do an fgrep against the full file with the list of pins, if it isn't too big, and then pipe the matching records into a hash.
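Something along these lines, where pins.txt (one pin per line) and the key position are hypothetical:

    use strict;
    use warnings;

    # Let grep -F do the heavy filtering; only the matching records get hashed.
    open my $matches, '-|', 'grep', '-F', '-f', 'pins.txt', 'a_supporting_data_file'
        or die "Cannot run grep: $!";

    my %record_for;
    while (my $rec = <$matches>) {
        $record_for{ substr $rec, 0, 9 } = $rec;   # key position is hypothetical
    }
    close $matches;

(grep -F matches anywhere on a line, so expect a few extra records if a pin can also appear in other fields.)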
Re: Moving from hashing to tie-ing.
by Ieronim (Friar) on Jul 31, 2006 at 17:27 UTC
An idea nobody has suggested yet -- maybe because it's completely wrong.
If your big file is static and contains 100 different fields referring to pin as a key, you can replace it with 100 DBM files, each file containing only pin-field pairs.
Then you can create a big hash containing references to the DBM-tied hashes, and get a value for a certain field like $name = $fields{name}{$pin}. As you know which fields you need, you can load only the corresponding DBM files.
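A sketch of that layout, with invented field names (SDBM_File is in core; DB_File or BerkeleyDB cope better with large files and values):

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    # One small DBM file per field we actually need -- not all 100.
    my %fields;
    for my $field (qw(first_name last_name)) {      # invented field names
        tie my %h, 'SDBM_File', "$field.dbm", O_CREAT|O_RDWR, 0644
            or die "Cannot tie $field.dbm: $!";
        $fields{$field} = \%h;
    }

    # A lookup touches only the DBM pages for that pin and field.
    my $pin  = '123456789';
    my $name = "$fields{first_name}{$pin} $fields{last_name}{$pin}";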
s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print
Re: Moving from hashing to tie-ing.
by srdst13 (Pilgrim) on Jul 31, 2006 at 19:49 UTC
If I understand correctly, you have a single key ($pin) that has an associated hashref as the value. Could you use BerkeleyDB and Storable? This should be pretty fast, works as a substitute for in-memory processing, and doesn't require writing SQL/DBI code, so your code could remain hash/tie based.
Sean
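A minimal sketch of that combination; the file name and record contents are illustrative:

    use strict;
    use warnings;
    use BerkeleyDB;
    use Storable qw(freeze thaw);

    tie my %db, 'BerkeleyDB::Hash',
        -Filename => 'pins.db',
        -Flags    => DB_CREATE
        or die "Cannot open pins.db: $BerkeleyDB::Error";

    # Store: serialise the hashref, since DBM values must be flat strings.
    $db{'123456789'} = freeze({ first => 'John', last => 'Doe' });

    # Fetch: thaw it back into a hashref on demand.
    my $info = thaw($db{'123456789'});
    print "$info->{first} $info->{last}\n";

    untie %db;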
Re: Moving from hashing to tie-ing.
by GrandFather (Saint) on Aug 01, 2006 at 03:50 UTC
It's still not clear to me what the structure of your data file (or maybe files - even that's not clear) is. But it seems that the main issue is that you have data associated with keys (pins) and that a lookup needs to be done using the keys. Can you parse your data file(s) to find the pins and generate a lookup hash keyed by pin and storing an index (and possibly file name) to the data associated with the pin? You can use Storable to write out a copy of the hash which can then be used as required.
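Something like this, where the pin position and file names are only placeholders:

    use strict;
    use warnings;
    use Storable qw(nstore retrieve);

    # One-off pass: record where each pin's data starts in the big file.
    my %offset_of;
    open my $fh, '<', 'a_supporting_data_file' or die $!;
    while (1) {
        my $pos = tell $fh;
        my $rec = <$fh>;
        last unless defined $rec;
        $offset_of{ substr $rec, 0, 9 } = $pos;    # pin position is a placeholder
    }
    nstore \%offset_of, 'a_supporting_data_file.idx';

    # Any later run: load the much smaller index and seek straight to a record.
    my $idx = retrieve 'a_supporting_data_file.idx';
    if ( exists $idx->{'123456789'} ) {
        seek $fh, $idx->{'123456789'}, 0;
        my $record = <$fh>;
    }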
If that is way off the mark, please describe the actual structure of the data and how it needs to be accessed. So far this is a very XY Problemish thread!
DWIM is Perl's answer to Gödel
Re: Moving from hashing to tie-ing.
by eff_i_g (Curate) on Aug 02, 2006 at 04:53 UTC
Everyone,
Apparently I'm thick-skinned this week. I'll give it a try starting from the top.
- The entire project is already written (poorly) in Perl, and it runs upwards of 30,000 lines. I have inherited it and am reworking it as best I can.
- We are provided a number of files (database dumps) from the customer. These are alike in that they all have fixed-width fields; however, the number of fields and the sizes of these files differ. The files arrive in ASCII and the records are separated by newlines. We have documentation on the layout of each file, so we know how to get certain fields; this is currently being done with substr. For example, substr($record, 0, 10) would give you the ID.
- Multiple (hundreds of) types of output can be generated from these files based on certain field criteria; e.g., pulling ID=123 and TYPE=this is going to give you a different data set than pulling ID=456 and TYPE=that.
- The challenge is that the source doesn't have everything I need. I need to print a name, but I only have the ID in the source; the name is in another file. If this were SQL I would be using JOIN, but it's not: it's a group of oftentimes huge files, like the aforementioned 2.5 gigger.
A very small example:
a_supporting_data_file:
123456789John Doe LotsOfOtherFields
987654321Bob SmithLotsOfOtherFields
...hundreds of thousands of more records...
source_file:
LotsOfFields123456789LotsOfFields
LotsOfFields987654321LotsOfFields
...tens of thousands of more records...
When I'm while-ing through the source file and I substr out the ID of 123456789 (I know this is the ID due to the documentation that says field at position X with a length of Y is the such-and-such), I need to have the data "John Doe" available to me. When I substr out the ID of 987654321, I need to have the data "Bob Smith" available to me.
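In Perl terms, the join I'm doing by hand amounts to this (the offsets match the toy example above, not the real layouts):

    use strict;
    use warnings;

    # Hash the supporting file: 9-character ID, then the name.
    my %name_of;
    open my $sup, '<', 'a_supporting_data_file' or die $!;
    while (my $rec = <$sup>) {
        $name_of{ substr $rec, 0, 9 } = substr $rec, 9, 9;   # "John Doe ", "Bob Smith"
    }
    close $sup;

    # Walk the source file and look each ID up as it goes by.
    open my $src, '<', 'source_file' or die $!;
    while (my $rec = <$src>) {
        my $id = substr $rec, 12, 9;
        unless (exists $name_of{$id}) {
            warn "No match for ID $id\n";
            next;
        }
        my $name = $name_of{$id};
        # ... build the output for this record using $name ...
    }
    close $src;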
What if user A and user B are running separate instances of the program and they both need to use a_supporting_data_file? It is re-parsed and re-hashed.
I hope things are clearer. Thanks for your help thus far.
A lot of your answers have been helpful. I just have to see how they would fit into the existing code and the existing time frame. As I said before, I think it needs to be rewritten, but here's the situation: this project is no longer the responsibility of the old programmer, so I'm in the seat of "you need to understand as much as possible as soon as possible." I've already spent so much time making structures, routines, modules, and documentation (and adding "use warnings;" and "use strict;") that my timeline for improvements is slowly shrinking. I guess you could say that I'm looking for a half-way solution. I can always leave the code as is, because it does work; it's just hideously inefficient, and I'd like to learn new methods of handling this.
I'm in agreement with your approach, kwaping; thank you. Well, at least I learned a little bit about tie through this whole debacle (don't use it on huge files), and I started reading the chapter.
When and if the rewrite takes place, I think putting the data back into a database is the way to go. Some advanced SQL statements would eliminate a great deal of the Perl coding. I'll see.
Thanks for your time.
eff_i_g,
Ok, this still doesn't answer my questions, but I do have what I believe to be a half-decent suggestion for you. You do not indicate how often you are provided these dumps from the customer or how many "runs" are done on the data in between new dump files. Assuming the dumps arrive no more than once a day and that the number of "runs" in between new dumps is more than a few, the following methodology should improve the efficiency of the existing code with only minor modifications:
First, create a pre-process script that parses the huge source file and the supporting data file one time. Its job is to index the position of each ID in the file. This information should be stored in a database (DBD::SQLite or some such) or in a serialized data structure (Storable or some such). What this buys you is the ability, given an ID, to open the two files and quickly read in just the record associated with that ID. No searching required and no parsing of unrelated IDs necessary.
Second, make a minor modification to the current script so that it uses the pre-processed index to pull in just the record(s) associated with that ID. Now you can create as complex a data structure as makes sense and need not constantly re-split.
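The DBD::SQLite flavour of that pre-process step might look like this; the ID position, file names, and table name are all placeholders:

    use strict;
    use warnings;
    use DBI;

    # Pre-process script: run once per new dump, index ID => byte offset.
    my $dbh = DBI->connect('dbi:SQLite:dbname=offsets.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS idx (id TEXT PRIMARY KEY, offset INTEGER)');
    my $ins = $dbh->prepare('INSERT OR REPLACE INTO idx (id, offset) VALUES (?, ?)');

    open my $fh, '<', 'a_supporting_data_file' or die $!;
    while (1) {
        my $pos = tell $fh;
        my $rec = <$fh>;
        last unless defined $rec;
        $ins->execute(substr($rec, 0, 9), $pos);   # ID position is a placeholder
    }
    $dbh->commit;

    # In the existing script: fetch one record by ID with a single seek.
    my ($offset) = $dbh->selectrow_array(
        'SELECT offset FROM idx WHERE id = ?', undef, '123456789');
    if (defined $offset) {
        seek $fh, $offset, 0;
        my $record = <$fh>;
    }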
This ultimately is not what I would like to suggest, but given the lack of details it is the best I can offer.
What is the objection to using a database? You told us the files are database table dumps.
Maybe you can get the SQL commands to recreate the database locally, then just load up the files and use the recommended CPAN database stack (e.g. DBI, DBD::mysql, SQL::Translator, DBIx::Class).
What sort of time frame are we looking at?
Re: Moving from hashing to tie-ing.
by Moron (Curate) on Aug 02, 2006 at 13:12 UTC
I can't reproduce this sensibly on my machine (32G of memory being too much to stress-test). But my idea is this: you could split the pins into large groups, e.g. based on the first four digits, and put each group in a Storable. Then maintain a hash of objects that references all the Storables; this hash will only have 10,000 keys at most. This might be enough preparation for the machine's own VM architecture to resolve the problem for you.
If not, enhance this with a fixed-length league table (of length 1000 * memory-limit / 2.5 GB; 1000 rather than 10,000 to allow for uneven grouping of pins), held as an array of arrays of recently accessed Storables and their hit counts, and do your own virtual-memory management of the 10,000 or so stored hashes on that basis (you can refine this by including timestamps, i.e. a hit-ageing factor). It's what a DBMS would do for you anyway, but such a targeted solution will perform faster than a general-purpose DBMS.
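Roughly like this; the pin width, bucket prefix, and cache size are invented, and the build pass shown here still holds everything in memory at once, so in reality it would need to go prefix by prefix or work from a sorted dump:

    use strict;
    use warnings;
    use Storable qw(nstore retrieve);

    # Build step: bucket records by the first four characters of the pin.
    my %bucket;
    open my $fh, '<', 'a_supporting_data_file' or die $!;
    while (my $rec = <$fh>) {
        my $pin = substr $rec, 0, 9;                 # pin position is invented
        $bucket{ substr $pin, 0, 4 }{$pin} = $rec;
    }
    nstore($bucket{$_}, "bucket_$_.sto") for keys %bucket;

    # Lookup step: pull buckets in lazily and cap how many stay resident.
    my %loaded;
    my @order;                                       # crude FIFO in place of hit counts
    sub lookup {
        my ($pin)  = @_;
        my $prefix = substr $pin, 0, 4;
        unless ($loaded{$prefix}) {
            $loaded{$prefix} = retrieve("bucket_$prefix.sto");
            push @order, $prefix;
            delete $loaded{ shift @order } if @order > 100;   # resident-bucket cap
        }
        return $loaded{$prefix}{$pin};
    }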
Re: Moving from hashing to tie-ing.
by Anonymous Monk on Aug 04, 2006 at 17:59 UTC
FWIW, I think the following approach will help you while preserving the existing code base (mostly, anyway). As I understand it, the main bottleneck is looking up a record based on its key (in this case the customer pin) efficiently.
NOTE: The proposed approach depends on the data being stable, i.e. it won't work if the dataset is modified in the middle of processing.
First, add a preprocessing step when the data arrives and build a simple 'pin' => 'record-offset' index file with a linear scan of the file. You can use any available DBM storage. Even if your dataset is huge, the resulting index file should be small enough to allow quick processing (further improvements are possible).
Second, modify your original scripts to use this index file to quickly look up records in the original dataset based on the file-offset value in the index, i.e. a simple seek operation.
Hope this helps.