Optimizing PDB data structures

seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I'm writing data structures to contain atomic coordinates from a PDB file, and I'm trying to preserve the structural hierarchy while keeping the data easily accessible at any LEVEL of structure, hence this:

foreach my $ref (keys %$data){
    if($data->{$ref}->type eq 'ATOM'){
        my $atom = $data->{$ref}->atom;
        $self->{'atoms'}{$ref}=$atom;
        $self->{'residues'}{$atom->chainId.$atom->resNumber}{$ref}=1;
        $self->{'chain'}{$atom->chainId}{$atom->chainId.$atom->resNumber}{$ref}=1;
    }
}

As you can see, the three levels are chain, residue and atom. The data itself is atomic, so every $ref points to atomic data. I added the residue and chain structures so that I could step through the chains or the residues as desired. I'm not copying the data, just the references.
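To make the layout concrete, here is a minimal, self-contained sketch of the same three-level index. It assumes plain hash records in place of the atom objects, and the names %atoms, %index and the sample values are hypothetical, made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed records: atom serial number => fields.
my %atoms = (
    1 => { name => 'N',  chainId => 'A', resNumber => 1 },
    2 => { name => 'CA', chainId => 'A', resNumber => 1 },
    3 => { name => 'N',  chainId => 'A', resNumber => 2 },
    4 => { name => 'N',  chainId => 'B', resNumber => 1 },
);

# Build the atom, residue and chain levels; only references are stored.
my %index;
for my $serial (keys %atoms) {
    my $atom   = $atoms{$serial};
    my $resKey = $atom->{chainId} . $atom->{resNumber};
    $index{atoms}{$serial}                              = $atom;
    $index{residues}{$resKey}{$serial}                  = 1;
    $index{chain}{ $atom->{chainId} }{$resKey}{$serial} = 1;
}

# Walk chain A residue by residue without touching chain B.
my @chainA;
for my $resKey (sort keys %{ $index{chain}{A} }) {
    for my $serial (sort { $a <=> $b } keys %{ $index{chain}{A}{$resKey} }) {
        push @chainA, join ':', $resKey, $index{atoms}{$serial}{name};
    }
}
print join(' ', @chainA), "\n";
```

The residue key here is a plain string concatenation of chain ID and residue number, as in the snippet above, so it sorts as a string rather than numerically.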

My question is: can anyone see a better way of doing this?

Thanks
Sam Seaver

20030923 Edit by jeffa: Changed title from 'Data structures '


Re: Optimizing PDB data structures by kvale (Monsignor) on Sep 22, 2003 at 21:47 UTC
You don't tell us what $ref is, so I'll assume that it is some primary key in a flat database. From the PDB format spec, each atom has one chainID and one resSeq, which I guess you are calling resNumber. So I'd rejig the data structure as

foreach my $ref (keys %$data){
    if($data->{$ref}->type eq 'ATOM'){
        my $atom = $data->{$ref}->atom;
        $self->{$ref}{atoms} = $atom;
        $self->{$ref}{'residues'} = $atom->resNumber;
        $self->{$ref}{'chain'} = $atom->chainId;
    }
}

Using this, one can still step through chainIDs and resNumbers by extracting all $ref keys.

-Mark
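As a rough illustration of stepping through chains and residues from the flat table, the sketch below filters the $ref keys with grep. The %flat hash and its values are hypothetical sample data, and the atoms field holds a plain atom name rather than an object to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical flat table in the shape suggested above: serial => fields.
my %flat = (
    1 => { atoms => 'N',  residues => 1, chain => 'A' },
    2 => { atoms => 'CA', residues => 1, chain => 'A' },
    3 => { atoms => 'N',  residues => 2, chain => 'A' },
    4 => { atoms => 'N',  residues => 1, chain => 'B' },
);

# Serial numbers of every atom in chain A, in numeric order.
my @chain_a = sort { $a <=> $b } grep { $flat{$_}{chain} eq 'A' } keys %flat;

# Distinct residue numbers seen in chain A, in order of first appearance.
my %seen;
my @residues_a = grep { !$seen{$_}++ } map { $flat{$_}{residues} } @chain_a;

print "chain A atoms: @chain_a\n";
print "chain A residues: @residues_a\n";
```

Every selection scans all keys, which is the cost under discussion in the rest of this thread.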
Re: Re: Optimizing PDB data structures by seaver (Pilgrim) on Sep 23, 2003 at 13:15 UTC
Mark,

Thanks for your reply, and I apologise for not explaining further. $ref is simply the atom number in the PDB file, which is a unique number for every atom. Hence your solution would work just as well, since every line carries its own chain ID and residue details.

However, PDB files can be large. For example, the one I'm dealing with right now has 30 models of two chains with thousands of atoms, so that's 50k $refs to step through, and in many cases I'm just stepping through the models, chains, or residues. Being able to choose one chain directly would halve the number of atoms to step through.

Any more suggestions?

Thanks
Sam Seaver
Re: Re: Re: Optimizing PDB data structures by kvale (Monsignor) on Sep 23, 2003 at 17:17 UTC
Ah, I see your goal now. I have two answers to your problem. The first is to simply ignore this possible speed optimization. If you are picking one of two chains, use

foreach my $ref (keys %$self){
    next unless $self->{$ref}{'chain'} eq '1';
    # process chain 1 atoms
}

The cost of looping and one nested dereference is probably negligible compared with the other processing you need to do, so don't waste your time on it until you have verified that this is a bottleneck and that the slowdown matters to you.

If the bottleneck is a real problem, you will have to promote the variables you will subset on and create an uglier data structure:

foreach my $ref (keys %$data){
    next unless $data->{$ref}->type eq 'ATOM';
    my $atom = $data->{$ref}->atom;
    $self->{$atom->chainId}{$ref}{atoms} = $atom;
    $self->{$atom->chainId}{$ref}{'residues'} = $atom->resNumber;
}
# ...
foreach my $ref (keys %{$self->{1}}) {
    # process chain 1 atoms
}

With an extra dereference per atom, I am not convinced that this will be noticeably faster.

-Mark
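One way to check whether the loop is really a bottleneck is to time both layouts with the core Benchmark module. The sketch below is only an illustration under stated assumptions: the data is synthetic (20,000 made-up atoms split over two chains labelled A and B), the helper names count_flat and count_nested are hypothetical, and the timings will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic data (an assumption, not real PDB content): 20_000 atoms, two chains.
my (%flat, %nested);
for my $serial (1 .. 20_000) {
    my $chain = $serial % 2 ? 'A' : 'B';
    my $res   = int($serial / 10);
    $flat{$serial}           = { chain => $chain, residues => $res };
    $nested{$chain}{$serial} = { residues => $res };
}

# Flat layout: visit every atom and skip those outside chain A.
sub count_flat {
    my $n = 0;
    for my $serial (keys %flat) {
        next unless $flat{$serial}{chain} eq 'A';
        $n++;
    }
    return $n;
}

# Nested layout: visit only the atoms stored under chain A.
sub count_nested {
    my $n = 0;
    $n++ for keys %{ $nested{A} };
    return $n;
}

# Both walks must agree before the timing means anything.
die "mismatch" unless count_flat() == count_nested();

# Run each walk 50 times and print a comparison table.
cmpthese(50, { flat => \&count_flat, nested => \&count_nested });
```

The loop bodies here do no real work, so any difference is an upper bound on what the restructuring could save; with actual per-atom processing the gap would be proportionally smaller.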
Re: Optimizing PDB data structures by Anonymous Monk on Sep 23, 2003 at 20:58 UTC
Incidentally, as best as I can see (you could double-check, though), BioPerl doesn't have a PDB parser or object model. I would strongly encourage you to contribute your work back, if at all possible.
Re: Re: Optimizing PDB data structures by seaver (Pilgrim) on Sep 25, 2003 at 13:43 UTC
I did have a look at BioPerl in the hope of finding a PDB parser. I haven't taken the time yet to investigate their requirements.

As it happens, following a bit of Mark's advice, I've created data structures very specific to my project's needs, and I have yet to test them extensively against the variety of typos within the PDB.

As it is, since I hope to run the parser on every file in the PDB (creating my own database), I should be able to give BioPerl something very reliable, but not yet.

Cheers
Sam