Re: Fast seeking in a large array of hashes to generate a report.
by Zaxo (Archbishop) on Jun 23, 2005 at 07:29 UTC
Your splice is one obvious inefficiency. You could instead push matching indexes onto another array, or use grep; the resulting array of indexes can then be used in a slice of @Persons.
Where is @Persons populated from? If a file, you could do your filtering as you read it. If a database, you can make it do all the work.
Be cautious about making executable filters from form data. You'll be tempted to eval, but that is inviting trouble. Instead, try setting up a dispatch table (a hash of code references) keyed by field name, calling each function with the field name and value as arguments. Where you have several fields, grep is again your friend.
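For instance, a minimal sketch of such a dispatch table might look like this (the sample records, field names and form values are made up for illustration):

use strict;
use warnings;

# Toy data standing in for the real @Persons array of hashes.
my @Persons = (
    { surname => 'Wall',     age => 35, place => 'london'   },
    { surname => 'Schwartz', age => 42, place => 'portland' },
);

# Dispatch table: one code ref per allowed form field.  Each sub gets a
# record and the requested value and returns true if the record passes.
my %filter_for = (
    surname => sub { my ($rec, $want) = @_; $rec->{surname} eq $want },
    age     => sub { my ($rec, $want) = @_; $rec->{age}     == $want },
    place   => sub { my ($rec, $want) = @_; $rec->{place}   eq $want },
);

# Submitted field => value pairs (assumed to contain only known fields here).
my %form = ( age => 35, place => 'london' );

# Keep only the records that pass every requested filter.
my @selected = grep {
    my $rec = $_;
    !grep { !$filter_for{$_}->($rec, $form{$_}) } keys %form;
} @Persons;

print "Matched: $_->{surname}\n" for @selected;

Only fields that actually appear in the dispatch table can ever be used as filters, so nothing coming in from the form is ever executed as code.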
Re: Fast seeking in a large array of hashes to generate a report.
by demerphq (Chancellor) on Jun 23, 2005 at 08:26 UTC
One thing to do is to construct a subroutine to do the processing logic. Your XML input file presumably isn't changing for the duration of the run, so there's no need to loop over the possibilities for each input record. Instead, your loop over the XML file should happen once, manufacturing Perl code that does the condition logic. This is then wrapped up in a subroutine and eval()ed into existence. The resulting sub will be considerably faster than doing all of that processing over and over.
IMO you are encountering a common problem for Perl programmers: using low-level coding tactics inside a scripting language. Consider that you are in essence writing an interpreter, and you aren't doing it particularly efficiently. Not only that, you are writing an interpreter in a language that is itself interpreted (although at an opcode level). So, to bypass the problem of an interpreter running inside an interpreter, what you should be doing is exploiting the Perl compiler/interpreter to do this work for you.
IOW, what you want to be writing is a translator that converts from your mini-language to Perl. Then let Perl rip. Also, since the code you write will itself be writing code, some of the normal style and good-practice rules don't apply to the generated code, which allows you to strip out all of the unnecessary work and do things very efficiently. For instance, by unrolling the field-processing loop you might end up with hundreds of statements that are almost, but not quite, the same; the sort of thing one would normally advise against, but since it's generated code, who cares? The code doing the generation doesn't look like that, so it's not really a problem.
I've seen approaches like this cut per-record processing times massively. Make the processing loop bare: unroll every loop, method call and subroutine inside it that you can. IOW, treat subroutines as macros and do the looping in the code generation if you possibly can. Reuse lexical vars when you know you can, and precompute as much as you can outside of the subroutine, binding it in via Perl's closure mechanism. Even things like using hashes as records can be sacrificed: in the generated code, use fetchrow_array and put the results into lexicals, and then where your logic would normally have been $rec->{field}, replace it with $field to avoid the unnecessary hash lookup (i.e., where you can preresolve a hash lookup at compile time, do so). All of this can make a really serious difference to how fast your code runs.
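A minimal sketch of that idea (the record layout and the filter structure are assumed here; real generated code would unroll much more aggressively):

use strict;
use warnings;

# Stand-ins for the real data and the filters read from the XML file.
my @Persons = (
    { surname => 'Wall',  age => 35, place => 'london'    },
    { surname => 'Vroom', age => 29, place => 'amsterdam' },
);
my @Fields = (
    { key => 'age',   content => 35       },
    { key => 'place', content => 'london' },
);

# Generate the condition code once.  The wanted values are bound in via a
# closure over @wanted, so nothing has to be quoted into the source text.
my @wanted = map { $_->{content} } @Fields;
my $conditions = join "\n        && ",
    map { sprintf q{$rec->{'%s'} eq $wanted[%d]}, $Fields[$_]{key}, $_ }
        0 .. $#Fields;

my $src     = "sub { my (\$rec) = \@_; return $conditions; }";
my $matcher = eval $src or die "generated code failed: $@\n$src";

# The per-record work is now a single compiled boolean expression.
my @selected = grep { $matcher->($_) } @Persons;
print "$_->{surname}\n" for @selected;

The string eval happens exactly once, before the data loop, so the cost of compiling the generated code is paid only once.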
Update: A note about using a DB: if you can possibly do your filtering in a DB, then you should. But it's worthwhile learning techniques for when using the DB is not possible or practical. For instance, if the filtering rules were complex, it might mean a fetch per rule, which on a large table, or one without relevant indexes, could be quite problematic. Likewise, the data source could be a flat file, or some other medium where queries against pre-established indexes aren't possible.
---
$world=~s/war/peace/g
Re: Fast seeking in a large array of hashes to generate a report.
by Tanalis (Curate) on Jun 23, 2005 at 07:20 UTC
If you're populating the dataset from a database and then restricting it using rules from a config file, why not simply generate some SQL so that the dataset you download from the DB is restricted in the first place?
The database can almost certainly perform this search-and-restrict quicker server-side than you can by downloading all data and then restricting what you display.
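A rough sketch of what that could look like with DBI (the DSN, table and column names are invented, and the filters are assumed to be key/value pairs from the config):

use strict;
use warnings;
use DBI;

# Hypothetical connection details and filter structure.
my $dbh = DBI->connect('dbi:mysql:reportdb', 'user', 'pass',
                       { RaiseError => 1 });
my @Fields = (
    { key => 'age',   content => 35       },
    { key => 'place', content => 'london' },
);

# The field names come from the config file, not from users, so using them
# as column names is assumed to be safe; the values go in as placeholders,
# so no quoting by hand is needed.
my $where = join ' AND ', map { "$_->{key} = ?" } @Fields;
my @bind  = map { $_->{content} } @Fields;

my $sth = $dbh->prepare("SELECT surname, firstname, place, age
                           FROM persons WHERE $where");
$sth->execute(@bind);

while ( my $row = $sth->fetchrow_hashref ) {
    # Only the matching records ever leave the database.
    print "$row->{surname}, $row->{age}\n";
}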
Edited for clarity.
Re: Fast seeking in a large array of hashes to generate a report.
by barrachois (Pilgrim) on Jun 23, 2005 at 07:37 UTC
First, you might consider using a relational database (SQLite, MySQL, ...) - that's what they're for, eh?
Second, if you really want to code it yourself, and depending on how large the menu of possible selections is, you could generate lookup hashes ahead of time. If, for example, you have 200 possible surnames and 50 possible ages, then you could generate a hash with keys 'surname-age' (e.g. 'smith-25') whose values are something like a reference to a list of indices into @Persons (e.g. [ 10, 102, 145, ... ] ). Seems like a classic space-vs-speed trade-off.
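A small sketch of building and using such an index (the records are made up):

use strict;
use warnings;

my @Persons = (
    { surname => 'smith', age => 25, place => 'london' },
    { surname => 'smith', age => 25, place => 'berlin' },
    { surname => 'jones', age => 40, place => 'oslo'   },
);

# Build the composite-key index once, ahead of time.
my %by_surname_age;
for my $i ( 0 .. $#Persons ) {
    push @{ $by_surname_age{ "$Persons[$i]{surname}-$Persons[$i]{age}" } }, $i;
}

# A later lookup is then one hash fetch plus an array slice.
my @hits = @Persons[ @{ $by_surname_age{'smith-25'} || [] } ];
print scalar(@hits), " record(s) for smith-25\n";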
An intermediate approach might be to somehow cache the results of each filter loop, so that the slow part only happens once.
Re: Fast seeking in a large array of hashes to generate a report.
by rev_1318 (Chaplain) on Jun 23, 2005 at 07:23 UTC
Why would you store all your records in an array of hashes and then drop all you don't need? Wouldn't it be faster to query the database based on the match requirements? That saves you the trouble of walking through the array.
If database queries aren't an option, you could look for a unique key and transform the array into a hash based on that key. That can save you a lot of time.
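For instance, something along these lines (the 'id' field is invented; use whatever field really is unique in your data):

use strict;
use warnings;

my @Persons = (
    { id => 1, surname => 'Wall',     age => 35 },
    { id => 2, surname => 'Schwartz', age => 42 },
);

# One pass to index the records by the unique key...
my %person_by_id = map { $_->{id} => $_ } @Persons;

# ...and every later lookup is a hash fetch instead of an array scan.
my $rec = $person_by_id{2};
print "$rec->{surname}\n" if $rec;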
HTH
Re: Fast seeking in a large array of hashes to generate a report.
by robartes (Priest) on Jun 23, 2005 at 08:49 UTC
Perhaps you could change the way you build your Dataobject. Instead of one array of hashes, you could build four hashes of (arrays of) hashes, each with a different field as the key. So, you could for example build a HoAoH with surname as key:
%Persons_by_surname = (
    'Schwartz' => [ { ..person 1.. },
                    { ..person 2.. },
                  ],
    'Wall'     => [ { ..person 1.. } ],
);
And the same for the other fields (except perhaps for age, where there will be a limited number of discrete values). Instead of looping over an array, this changes your algorithm to a hash lookup, which should be faster.
To facilitate your later lookup phase, you could put these data hashes in a wrapper hash like this:
my %data = ( 'surname'   => \%Persons_by_surname,
             'firstname' => \%Persons_by_firstname,
             # ... and so on
           );
Doing this, you'll have an initial hit on performance when building the extra hashes, but you should have faster lookups in the end. You'll have to benchmark your scripts to see whether this is worth it.
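As a rough illustration, the lookup phase could then use %data like this (toy records, and the filters are assumed to be key/content pairs):

use strict;
use warnings;

my @Persons = (
    { surname => 'Wall', firstname => 'Larry',  age => 35, place => 'london' },
    { surname => 'Wall', firstname => 'Junior', age => 12, place => 'berlin' },
);

my %Persons_by_surname;
push @{ $Persons_by_surname{ $_->{surname} } }, $_ for @Persons;

my %data = ( surname => \%Persons_by_surname );   # ...plus the other fields

my @Fields = (
    { key => 'surname', content => 'Wall' },
    { key => 'age',     content => 35     },
);

# The first filter that has a prebuilt hash narrows the candidate set;
# the remaining filters are then checked with grep over that short list.
my ($first, @rest) = @Fields;
my @candidates = @{ $data{ $first->{key} }{ $first->{content} } || [] };

my @selected = grep {
    my $rec = $_;
    !grep { $rec->{ $_->{key} } ne $_->{content} } @rest;
} @candidates;

print scalar(@selected), " match(es)\n";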
Re: Fast seeking in a large array of hashes to generate a report.
by themage (Friar) on Jun 23, 2005 at 11:34 UTC
Hi ppl,
Hi jbrugger,
I created a small script that generates an array with 1,000,000 records like the ones in your example, as follows:
use strict;
use Time::HiRes qw(time);

my @data = ();
my @names    = qw(albert arthur bernard bill charlie david jack jonh joseph
                  mark michael peter steven);
my @surnames = qw(bell brown canfield cornwell devlin doyle golden hoffman
                  maclean powell rowling tovard twain warlick);
my @places   = qw(amsterdan athens belfast berlin bern brussels copenhagen
                  helsinki lisbon london luxenbourg madrid oslo paris rome
                  stockholm vaduz vienna);

# Build the test data: a million hash records with random field values.
for (my $i = 0; $i <= 1000000; $i++) {
    my $name    = $names[ int rand @names ];
    my $surname = $surnames[ int rand @surnames ];
    my $place   = $places[ int rand @places ];
    my $age     = int(rand() * 50) + 25;
    my $rec = { name => $name, surname => $surname, place => $place, age => $age };
    push @data, $rec;
}
I copied @data to @Persons and @Person2, and set up the @Fields filter array as follows:
my @Fields = (
    { key => 'age',   content => 35       },
    { key => 'place', content => 'london' },
);

my @Persons = @data;
my @Person2 = @data;
print "Persons: $#Persons\nPerson2: $#Person2\n\n";

my $field;
my $person;
my $i = 0;
I timed your implementation with:
my $ta = time();
foreach $field (@Fields) {    # <- age and place in this case
    scalar(@Persons) == 0 && last;
    for ($i = 0; $i < scalar(@Persons); $i++) {
        ($Persons[$i]->{ $field->{key} } eq $field->{content}) && next;
        splice(@Persons, $i, 1);
        $i--;
    }
}
my $tb = time();
And then I made a small change in your filtering algorithm, so that the outer loop runs over the data rather than over the filters, as follows:
my $use = 1;
my @Rset = ();
foreach $person (@Person2) {
    $use = 1;
    foreach $field (@Fields) {
        if ($person->{ $field->{key} } ne $field->{content}) {
            $use = 0;
            last;
        }
    }
    if ($use) {
        push @Rset, $person;
    }
}
my $tc = time;
And then I printed some stats:
my $iab=$tb-$ta;
my $ibc=$tc-$tb;
print "TIMES:\nab: $iab\nbc: $ibc\n";
print "COUNTS:\nab: ", $#Persons,"\nbc: ",$#Rset,"\n";
I was expecting some difference (I didn't know whether it would be better or worse; it was just a test), and I got one. Unexpectedly large, I admit.
On my P4 3.2GHz with 1GB of memory, these are some results:
mpneves@voyager perl$ ./test.pl
Persons: 1000000
Person2: 1000000
TIMES:
ab: 17.5640239715576
bc: 1.96199607849121
COUNTS:
ab: 1153
bc: 1153
mpneves@voyager perl$ ./test.pl
Persons: 1000000
Person2: 1000000
TIMES:
ab: 18.5665609836578
bc: 1.95595502853394
COUNTS:
ab: 1185
bc: 1185
mpneves@voyager perl$ ./test.pl
Persons: 1000000
Person2: 1000000
TIMES:
ab: 24.4221110343933
bc: 3.58428502082825
COUNTS:
ab: 1213
bc: 1213
Which means that my version performs considerably better than yours: roughly 7 to 10 times faster in these runs.
Obviously, this is a test on my notebook, with random data. On your computers, with real data and filters, it may perform differently.
All the code I used is here: if you join the code segments in the order they are presented, you will have my complete script.
Re: Fast seeking in a large array of hashes to generate a report.
by srdst13 (Pilgrim) on Jun 23, 2005 at 10:43 UTC
It seems to me that what is missing here is a comparison of the relative speed of direct database queries versus the "hash-in-memory" method. Since you seem most worried about speed, it seems reasonable to just code up the database-query version and see whether it is faster. I'm guessing it will be, provided you have indexes on all your lookup columns and you use appropriate DBI calls (sticking to arrayrefs, fetchall, etc.). But there really isn't any way to know unless you try, is there? Perl is a wonderful tool, but so are modern relational databases; using both to their fullest is the goal. Also, note that you may not have to fetch all records in every case: if some queries only generate summary statistics, the DB can usually compute those quickly as well, so you never have to retrieve the underlying records at all.
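As a rough sketch of the kind of DBI usage meant here (the DSN, table and column names are invented):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:reportdb', 'user', 'pass',
                       { RaiseError => 1 });

# Let the database do the filtering, and pull the whole result set back
# in one bulk arrayref call instead of row-by-row hash fetches.
my $rows = $dbh->selectall_arrayref(
    'SELECT surname, firstname, place, age FROM persons
      WHERE age = ? AND place = ?',
    undef, 35, 'london',
);

for my $row (@$rows) {
    my ($surname, $firstname, $place, $age) = @$row;
    print "$surname, $age\n";
}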
Sean
Sorry. Despite your explanation above, I didn't really grasp the complexity of the problem. Thanks for clarifying....
Sean