Re: Re: Re: sorting according to greek alphabet in roman letters

Sorry, but you are still couching this in terms I don't really understand. They may be second nature to you, but the terms "distances", "N-terminals", "main chains" etc. all have a meaning to me, just not one that makes any sense in this context. You should try to phrase your questions in terms that people without your specialist knowledge can understand if you are to get a good response.

Every atom name breaks down into 4 parts, which I already have broken up. The function atomName simply returns the concatenation of all 4 parts.
Columns 1 & 2 are the actual atomic element (1 or 2 letters, right-justified), except in the case of hydrogens, where column 1 is actually the number of the hydrogen.
Column 3 is the distance, specified by a Greek letter.
Column 4, mostly empty, is the number of the heavy atom at that distance.
E.g. leucine has two 'CD's, so they are CD1 and CD2. The hydrogens are, respectively: 1HD1, 2HD1, 3HD1, 1HD2, 2HD2 and 3HD2.
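The column layout described above can be sketched as a small splitter. This is only an illustration, assuming names are at most 4 characters wide and that a digit in column 1 always means a hydrogen number; `split_atom_name` and its hash keys are names made up for this sketch, not part of the original code.

```perl
use strict;
use warnings;

# Split a 4-column atom name into its parts (hypothetical helper).
sub split_atom_name {
    my ($name) = @_;

    # Pad to 4 columns, then take one character per column.
    # 'A1' strips a lone space, so empty columns come back as ''.
    my ( $c1, $c2, $remote, $branch ) =
        unpack 'A1 A1 A1 A1', sprintf( '%-4s', $name );

    # A digit in column 1 is the hydrogen number, not part of the element.
    my ( $hyd, $element ) =
        $c1 =~ /^\d$/ ? ( $c1, $c2 ) : ( '', $c1 . $c2 );

    return (
        hyd     => $hyd,
        element => $element,
        remote  => $remote,
        branch  => $branch,
    );
}

for my $name ( ' N  ', ' CA ', ' CD1', '1HD1' ) {
    my %part = split_atom_name($name);
    print join( ' ', map { "$_='$part{$_}'" } qw( hyd element remote branch ) ), "\n";
}
```

Keeping the four parts in a hash like this means the sort never has to guess which character position holds which field.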

What this tells me is that you have the four parts that decide the sort order already separated, but that you are concatenating these together in order to sort them. This results in a variable-length entity, where

the first 1 or 2 characters are a single element
...except that sometimes the first character actually is a number which acts as a multiplier.
The 3rd character is the second element,
...except that if the first element is only a single character, then this element is actually in the second character position,
unless of course the first element is a single character and is prefixed by this "multiplier" number, in which case it's back to the 3rd character position.
What about the possibility that the first element is two characters (CA?) and is also prefixed by a "multiplier" number?
Does this mean that the second element could also be in the 4th character position?
Can the "multiplier" number be 2 digits long? Could the second element (one of your 6 single-character "distances") actually be located in either the 2nd, 3rd or 4th character position?
Then there is your 3rd element, which may or may not exist, but if it does, probably lives in the 4th character position. You don't make it clear whether it might also be present when the first element is a single character that isn't preceded by "multiplier" digit(s), so maybe it could be in the 3rd character position sometimes?

I realise that I have exaggerated this somewhat. Some of these possibilities can be inferred or ruled out from your description, but there are still enough left to make it extremely difficult to know where to begin in trying to help you.

Suffice to say, if the rest of your application could be bent to handling the atoms as arrays or hashes where the parts are separate, it would make the sorting easier. Nevertheless, I *think* that this would do the trick. It's a fairly standard Schwartzian Transform (ST) in its basic layout, but the devil is in the details. The values used for the comparisons are numerical weights, which are calculated in the first mapping.

The awkward part was coming up with a mapping that spreads the possible variations and combinations into a number space such that the result meets your requirements. It uses two hash table lookups. The first, %mainAtoms, establishes the basic ordering in the range 1 .. 9. The second, %distances, uses floating point multipliers to map N[BGDEZH] to 10.0 .. 10.5, CA[BGDEZH] to 20.0 .. 20.5 and so on up to 5H[BGDEZH] to 90.0 .. 90.5. The third, numeric element of the atomNames is divided by 100 and added, thus providing the final arbitration.

#! perl -slw
use strict;

## Lookup table to map main atoms to a numerical value

my %mainAtoms = (
    ''  => 0,    N  => 1,  CA  => 2,   C  => 3,   O  => 4,
     H  => 5,  '2H' => 6, '3H' => 7, '4H' => 8, '5H' => 9
);

## Lookup table for "distance"  multiplier
## Using '' => 1 ensures that unadorned main atom weights
## remain in the range 1 to 9.
## Using 10.n  for the distance weights
## maps the weights to 10.0 .. 10.5 for N,
##                     20.0 .. 20.5 for CA etc.

my %distances = (
    '' => 1, B => 10.0, G => 10.1, D => 10.2,
    E => 10.3,  Z => 10.4, H => 10.5
);

## Some test data.

my @unordered = qw[
    2HB 3HB C CA CB CG CD1 CD2 CE1
    CZ CE2 HE2 HE1 HH HD1 HD2 N O OH
];

## The following is a 'standard' Schwartzian Transform
## You have to read the blocks backwards to understand the process.

my @sorted = map{

## This just maps the original value back
## from the anonymous array created below

    $_->[ 1 ]

} sort {

## This sorts the anonymous arrays according to
## the numerical value in element 0 of the Anon. arrays
## This is the weight calculated below

    $a->[ 0 ] <=> $b->[ 0 ]

} map {

## The first part of the transform extracts the 3 fields
## from the catenated atomName into $1, $2, $3 or dies if it fails

    m[

        ( N | CA | O | C | (?: \d?H ) )
        ( [BGDEZH] )?
        ( \d  )?
    ]x or die "Failed to separate '$_'";

## This builds the anon. arrays. The atomName is in ->[ 1 ]
## The calculated weight is in ->[ 0 ]

    [
        $mainAtoms{ $1 }             ## 1 .. 9

        * $distances{ $2 || ''}      ## 1 or 10.x

        + ( $3 || 0 )/100            ## 0 or 0.0n

        , $_                         ## The atomName
    ]
} @unordered; ## The unordered data.

## Display the results

print join ' | ', @sorted;
__END__

P:\test>295668
N | CA | C | O | CB | CG | CD1 | CD2 | CE1 | CE2 | CZ | OH | HD1 | HD2 | HE1 | HE2 | HH | 2HB | 3HB
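
To see why that order comes out, it can help to print the weight next to each name. The sketch below reuses the two lookup tables and the weight formula from the listing; `atom_weight` is a helper name invented here for illustration.

```perl
use strict;
use warnings;

my %mainAtoms = (
    ''  => 0,    N  => 1,  CA  => 2,   C  => 3,   O  => 4,
     H  => 5,  '2H' => 6, '3H' => 7, '4H' => 8, '5H' => 9,
);

my %distances = (
    '' => 1, B => 10.0, G => 10.1, D => 10.2,
    E => 10.3, Z => 10.4, H => 10.5,
);

# Same formula as the first map of the transform, pulled into a helper.
sub atom_weight {
    my ($name) = @_;
    $name =~ m[
        ( N | CA | O | C | (?: \d?H ) )
        ( [BGDEZH] )?
        ( \d )?
    ]x or die "Failed to separate '$name'";

    return $mainAtoms{$1} * $distances{ $2 || '' } + ( $3 || 0 ) / 100;
}

# Print each name with its weight, to two decimal places.
printf "%-4s %6.2f\n", $_, atom_weight($_)
    for qw[ N CA O CB CD1 CZ OH HD1 HH 2HB ];
```

For example, CD1 works out as 3 * 10.2 + 1/100 = 30.61, which places it after CG (30.30) and before CE1 (30.91).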

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.


Replies are listed 'Best First'.
Re: Re: Re: Re: sorting according to greek alphabet in roman letters by seaver (Pilgrim) on Oct 02, 2003 at 16:50 UTC
I do apologise, I've spent so much time thinking biochemically that it's all I can speak sometimes!

For starters, I'm going to rewrite the tyrosine list I posted in my original reply, but formatted according to the columns so that it can be seen better, and I've added a little ASCII art to show the residue in '3D':

1234
 N
 CA
 C
 O
 CB
 CG
 CD1
 CD2
 CE1
 CE2
 CZ
 OH
 H
2HB
3HB
 HD1
 HD2
 HE1
 HE2
 HH
             HD1 HE1
              |   |
      1HB    CD1-CE1
 H-N   |   /        \
   CA-CB-CG          CZ-OH-HH
   C   |   \        /
   O  2HB    CD2-CE2
              |   |
             HD2 HE2

I'm only dealing with single-letter elements here, and those elements are:

C O N H

so in column two, you would only find those letters.

The main chain is the ONLY part that has single letters for the name: C, O & N. In ADDITION, however, CA is in the main chain; it's the first atom of the side chain, so it's 'alpha'.

All other names are AT LEAST two letters, indicating the element, and its distance.

If more than one atom would have the SAME NAME, and it's not 'H', then a quantifier is added at the end of the name to distinguish between them: 1, 2 and so on.

If it's 'H', then even with the quantifier, there can be more than one 'H' with the same name, so 'H' names have an extra quantifier if needed, which is added at the beginning of the name.
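
Those naming rules could also be used to sort field by field instead of through one combined weight. What follows is a rough sketch only, assuming single-letter elements, at most one leading and one trailing digit, and an ordering of main chain first, then side-chain heavy atoms, then hydrogens; `atom_key` and the two lookup tables are invented for this example.

```perl
use strict;
use warnings;

my %main  = ( N => 1, CA => 2, C => 3, O => 4 );
my %greek = ( '' => 0, A => 1, B => 2, G => 3, D => 4, E => 5, Z => 6, H => 7 );

# Build [ group, rank, rank, rank, name ] for one atom name.
sub atom_key {
    my ($name) = @_;

    # Group 0: main-chain heavy atoms, in their fixed order.
    return [ 0, $main{$name}, 0, 0, $name ] if exists $main{$name};

    my ( $hyd, $el, $remote, $branch ) =
        $name =~ /^(\d?)([CONH])([ABGDEZH]?)(\d?)$/
        or die "Unrecognised atom name '$name'";

    # Group 1: side-chain heavy atoms; group 2: hydrogens.
    # Within a group: Greek distance, then branch, then hydrogen number.
    return [ $el eq 'H' ? 2 : 1, $greek{$remote}, $branch || 0, $hyd || 0, $name ];
}

my @unordered = qw[
    2HB 3HB C CA CB CG CD1 CD2 CE1 CZ
    CE2 HE2 HE1 HH HD1 HD2 N O OH H
];

my @sorted =
    map  { $_->[4] }
    sort {
           $a->[0] <=> $b->[0]
        or $a->[1] <=> $b->[1]
        or $a->[2] <=> $b->[2]
        or $a->[3] <=> $b->[3]
    }
    map  { atom_key($_) } @unordered;

print join( ' ', @sorted ), "\n";
```

With the tyrosine names this reproduces the column listing above: N CA C O CB CG CD1 CD2 CE1 CE2 CZ OH H 2HB 3HB HD1 HD2 HE1 HE2 HH. Whether side-chain N and O atoms at the same distance should be ordered by element rather than by branch number is not settled by the rules as stated, so that part would need adjusting.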

Thanks for the code though, you're right in saying that the devil is in the details, and from what you coded I can 'fine-tune' to get the result I'd like.

I'd already started to code my own way: I'd separated the atoms into main chain, side chain and protons. The actual length of the numbers used can be sorted using the default 'cmp', so for the main chain I only used one number, 1-9. Then for the side chain I used double digits, 51-59. Then for the protons I used triple digits:

#########sort atoms into three groups#########
foreach my $a (@names){
    push @main, $self->{'all'}{$a} if $self->{'all'}{$a}->mainHeavy;
    push @heavy, $self->{'all'}{$a} if $self->{'all'}{$a}->sideHeavy;
    push @proton, $self->{'all'}{$a} if $self->{'all'}{$a}->proton;
    $names{$self->{'all'}{$a}->atomName}=$a;
}

##########do main chains, single digits############
foreach my $a (@main){
    if($a->atomName eq 'N'){
        $main{1}=$a;
    }elsif($a->atomName eq 'CA'){
        $main{2}=$a;
    }elsif($a->atomName eq 'C'){
        $main{3}=$a;
    }elsif($a->atomName eq 'O'){
        $main{4}=$a;
    }
}

#########Do side chain################
my %heavyweights = ( 'C' => 1,
                     'N' => 2,
                     'O' => 3 );
my %greekweights = ( 'A' => 1,
                     'B' => 2,
                     'G' => 3,
                     'D' => 4,
                     'E' => 5,
                     'Z' => 6,
                     'H' => 7 );

foreach my $a (@heavy){
    $main{$heavyweights{$a->atomEl}.$greekweights{$a->atomRemote}} = $a;
}

####Do protons############
#I actually used 4 digits, because of the proton quantifier itself
foreach my $a (@proton){
    print $a->atomName;
    my $n = '9';
    $n .= $greekweights{$a->atomRemote} if $a->atomRemote;
    $n .= 0 unless $a->atomRemote;
    $n .= $a->atomBranch if $a->atomBranch;
    $n .= 0 unless $a->atomBranch;
    $n .= $a->hydNumber if $a->hydNumber;
    $n .= 0 unless $a->hydNumber;
    $main{$n} = $a;
}

#numeric sort on the keys built above
foreach my $i ( sort{ $a<=>$b } keys %main ){
    print $main{$i}->atomName." $i\n";
    push @result, $main{$i};
}
__END__

N 1
CA 2
C 3
O 4
CB 12
CG 13
CD 14
NE2 25
OE1 35
H 9000
HA 9100
HB 9202
HB 9203
HG 9302
HG 9303
HE2 9521
HE2 9522
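
On sorting keys of different lengths with 'cmp': a small sketch, using a handful of the keys printed above, suggests that plain string comparison does not give this order unless the keys are padded to a common width. The width of 4 is an assumption taken from the longest key here.

```perl
use strict;
use warnings;

my @keys = qw( 1 2 3 4 12 13 14 25 35 9000 9100 9202 );

my @by_cmp = sort { $a cmp $b } @keys;    # string comparison
my @by_num = sort { $a <=> $b } @keys;    # numeric comparison

# Zero-padding every key to the same width makes string order
# agree with numeric order.
my @padded = sort { $a cmp $b } map { sprintf '%04d', $_ } @keys;

print "cmp:    @by_cmp\n";
print "<=>:    @by_num\n";
print "padded: @padded\n";
```

The 'cmp' line comes out as 1 12 13 14 2 25 3 35 4 9000 9100 9202, because "12" sorts before "2" as a string. The listing above avoids this by sorting with <=>.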

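Mine needed fine tuning, but I posted the code here because you could use the same method of numerical lengths to do a sort using 'cmp', am I right?

But cheers mate!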
Sam
