seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I'm writing data structures to contain atomic co-ordinates from a PDB file, and am trying to keep an idea of structure whilst at the same time making the data very accessible for any LEVEL of structure, hence this:

    foreach my $ref (keys %$data) {
        if ($data->{$ref}->type eq 'ATOM') {
            my $atom = $data->{$ref}->atom;
            $self->{'atoms'}{$ref} = $atom;
            $self->{'residues'}{$atom->chainId.$atom->resNumber}{$ref} = 1;
            $self->{'chain'}{$atom->chainId}{$atom->chainId.$atom->resNumber}{$ref} = 1;
        }
    }

As you can see, the three levels are chain, residue and atom. The data itself is atomic, hence every $ref points to atomic data. I added the residue and chain data structures so I could step through the chains or the residues as desired. I'm not copying the data, just the references.

My question is, can anyone see a better way of doing this?
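For readers following along, here is a minimal sketch of how such a three-level index could be walked. The real atom class is not shown in the thread, so plain hashes with `chainId` and `resNumber` keys stand in for the atom objects, and `atoms_in_chain` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical sample data: atom serial number => plain hash standing in
# for the poster's atom objects (the real class is not shown in the thread).
my %atoms = (
    1 => { chainId => 'A', resNumber => 1, name => 'N'  },
    2 => { chainId => 'A', resNumber => 1, name => 'CA' },
    3 => { chainId => 'A', resNumber => 2, name => 'N'  },
    4 => { chainId => 'B', resNumber => 1, name => 'N'  },
);

# Build the same three-level index as in the loop above.
my %self;
for my $ref (keys %atoms) {
    my $atom   = $atoms{$ref};
    my $resKey = $atom->{chainId} . $atom->{resNumber};
    $self{atoms}{$ref}                              = $atom;
    $self{residues}{$resKey}{$ref}                  = 1;
    $self{chain}{ $atom->{chainId} }{$resKey}{$ref} = 1;
}

# Step through one chain, residue by residue, touching only its atoms.
# Residue keys are sorted as strings, which is good enough for a sketch.
sub atoms_in_chain {
    my ($index, $chain) = @_;
    my @serials;
    for my $resKey (sort keys %{ $index->{chain}{$chain} || {} }) {
        push @serials,
            sort { $a <=> $b } keys %{ $index->{chain}{$chain}{$resKey} };
    }
    return @serials;
}

print join(',', atoms_in_chain(\%self, 'A')), "\n";
```

Only the atoms of chain A are visited here; the atoms of chain B are never looked at, which is the point of keeping the extra indexes.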

Thanks
Sam Seaver

20030923 Edit by jeffa: Changed title from 'Data structures '

Re: Optimizing PDB data structures
by kvale (Monsignor) on Sep 22, 2003 at 21:47 UTC
    You don't tell us what $ref is, so I'll assume that it is some primary key in a flat database. From the PDB format spec, each atom has one chainID and one resSeq, which I guess you are calling resNumber.

    So I'd rejig the data structure as

        foreach my $ref (keys %$data) {
            if ($data->{$ref}->type eq 'ATOM') {
                my $atom = $data->{$ref}->atom;
                $self->{$ref}{atoms}      = $atom;
                $self->{$ref}{'residues'} = $atom->resNumber;
                $self->{$ref}{'chain'}    = $atom->chainId;
            }
        }
    Using this, one can still step through chainIDs and resNumbers by extracting all $ref keys.
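As a rough illustration of "extracting all $ref keys" with the flat layout, the sketch below uses made-up records in that shape. The sample values and the helper name `atoms_of_residue` are assumptions, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical flat records keyed by atom serial number, in the layout above.
# The 'atoms' slot would normally hold the atom object; a name stands in here.
my %self = (
    1 => { atoms => 'N',  residues => 1, chain => 'A' },
    2 => { atoms => 'CA', residues => 1, chain => 'A' },
    3 => { atoms => 'N',  residues => 2, chain => 'A' },
    4 => { atoms => 'N',  residues => 1, chain => 'B' },
);

# Collect the distinct chain IDs by scanning every record once.
my %seen;
$seen{ $self{$_}{chain} } = 1 for keys %self;
my @chains = sort keys %seen;

# Select the serial numbers belonging to one residue of one chain.
# Every record is examined, so the cost grows with the total atom count.
sub atoms_of_residue {
    my ($records, $chain, $res) = @_;
    my @hits = grep {
        $records->{$_}{chain} eq $chain && $records->{$_}{residues} == $res
    } keys %$records;
    my @sorted = sort { $a <=> $b } @hits;
    return @sorted;
}

print "chains: @chains\n";
my @res1 = atoms_of_residue(\%self, 'A', 1);
print "chain A, residue 1: @res1\n";
```

The structure stays simple, at the price of a full scan for each selection.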

    -Mark

      Mark

      Thanks for your reply, I apologise for not explaining further. $ref is simply the atom number in the PDB file, which is a unique number for every atom.

      Hence, your solution would work just as well, as every line carries its own chain id and residue details.

      However, PDB files can be large. For example, the one I'm dealing with right now has 30 models of two chains with thousands of atoms, so that's 50k $refs to step through, and in many cases I'm just stepping through the models, chains, or residues.

      Being able to choose one chain directly would halve the number of atoms to step through.

      Any more suggestions?

      Thanks
      Sam Seaver

        Ah, I see your goal now. I have two answers to your problem.

        The first is to simply ignore this possible speed optimization. If you are picking one of two chains, use

            foreach my $ref (keys %$self) {
                # compare with eq; a single = would assign 1 to every atom's chain
                next unless $self->{$ref}{'chain'} eq '1';
                # process chain 1 atoms
            }
        The cost of looping and one nested dereference is probably negligible compared with the other processing you need to do, so don't waste your time on it until you have verified that this is a bottleneck and that the slowdown matters to you.

        If the bottleneck is a real problem, you will have to promote the variables you will subset on and create a more ugly data structure:

            foreach my $ref (keys %$data) {
                next unless $data->{$ref}->type eq 'ATOM';
                my $atom = $data->{$ref}->atom;
                $self->{$atom->chainId}{$ref}{atoms}      = $atom;
                $self->{$atom->chainId}{$ref}{'residues'} = $atom->resNumber;
            }
            # ...
            foreach my $ref (keys %{$self->{1}}) {
                # process chain 1 atoms
            }
        With an extra dereference per atom, I am not convinced that this will be noticeably faster.
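One way to check before committing to either layout is to time both with the core Benchmark module. The sketch below is only an illustration: the data is synthetic, the sizes and the chain labels 'A' and 'B' are assumptions, and the per-atom work is a placeholder for whatever processing is really done.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in for the real data: 50_000 atoms split over two chains.
# The sizes and the chain labels 'A'/'B' are assumptions for illustration.
my (%flat, %nested);
for my $ref (1 .. 50_000) {
    my $chain = $ref % 2 ? 'A' : 'B';
    my $res   = int($ref / 10);
    $flat{$ref}           = { residues => $res, chain => $chain };
    $nested{$chain}{$ref} = { residues => $res };
}

# Flat layout: visit every atom and skip those on the wrong chain.
sub count_flat {
    my $n = 0;
    for my $ref (keys %flat) {
        next unless $flat{$ref}{chain} eq 'A';
        $n++ if defined $flat{$ref}{residues};    # placeholder per-atom work
    }
    return $n;
}

# Nested layout: visit only the atoms stored under the wanted chain.
sub count_nested {
    my $n = 0;
    for my $ref (keys %{ $nested{A} }) {
        $n++ if defined $nested{A}{$ref}{residues};    # placeholder work
    }
    return $n;
}

# Run each version for about one CPU second and print a comparison table.
cmpthese(-1, { flat => \&count_flat, nested => \&count_nested });
```

The numbers will depend on the machine and on how heavy the real per-atom processing is, so the table is only meaningful when run against representative data.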

        -Mark

Re: Optimizing PDB data structures
by Anonymous Monk on Sep 23, 2003 at 20:58 UTC

    Incidentally, as best as I can see (you could double-check, though), BioPerl doesn't have a PDB parser or object model. I would strongly encourage you to contribute your work back, if at all possible.

      I did have a look at BioPerl in the hope of finding a PDB parser.

      I haven't taken the time yet to investigate their requirements. As it happens, following a bit of Mark's advice, I've created data structures very specific to my project needs, and also have yet to test extensively against the variety of typos within the PDB.

      As it is, since I hope to run the parser on every file in the PDB (creating my own database), I should be able to give BioPerl something very reliable, but not yet.

      Cheers
      Sam