hellworld has asked for the wisdom of the Perl Monks concerning the following question:

Greetings everyone, I'm working on a Summer project and I have a little problem with my PERL code. I'm supposed to print all "ATOM" lines of a Protein PDB file and the parts I am supposed to print should not contain Hydrogen "H" atoms in the atom part, which is in between the columns 13-16. I think I got the regex for it right but it prints all the ATOM lines including ones with hydrogen atoms so I definitely got the syntax wrong. Any ideas on how to fix it ? Thanks. This is the code for printing -to a file- the ATOM lines that don't have Hydrogen
while ($text =~ m/((ATOM)(\s{1,})(\d{1,})(\s{1,})(\w+)(\s{1,})[A-Z]{1,}(\s{1,})[A-Z]{1,}(\s{1,})(\d{1,})(\s{1,})(\W+\d+.\d+)(\s{1,})(\W+\d+.\d+)(\s{1,})(\W+\d+.\d+)(\s{1,})(\W+\d+.\d+)(\s{1,})(\d+.\d+)(\s+)([A-Z]{1}))/gi) {
    print MYFILE "$1";
    print MYFILE "\n";
}
And this is for printing HETATM lines that don't have HOH string and it has the same syntax with the one above, yet this works while the one at the top doesn't..
while ($text =~ m/((ATOM)(\s{1,})(\d{1,})(\s{1,})(\w+)(\s{1,})[A-Z]{1,}(\s{1,})[A-Z]{1,}(\s{1,})(\d{1,})(\s{1,})(\W+\d+.\d+)(\s{1,})(\W+\d+.\d+)(\s{1,})(\W+\d+.\d+)(\s{1,})(\W+\d+.\d+)(\s{1,})(\d+.\d+)(\s+)([A-Z]{1}))/gi) {
    print MYFILE "$1";
    print MYFILE "\n";
}

Re: Help for a regex problem ?
by CountZero (Bishop) on Jul 13, 2009 at 20:17 UTC
    It is "Perl" (the language) or "perl" (the program that implements Perl), but it is never, ever "PERL". Now go and write a thousand times "I will never write 'PERL'" (you may use Perl to write it).

    If you want us to really help you, it would assist if you could give a few examples of your input file: some lines which have to match and some lines which should not match.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I will start writing that now. Um, the input is like this:

      ATOM 16 NZ LYS A 7 -19.664 15.558 -9.499 1.00 18.80 N
      ATOM 17 H LYS A 7 -19.967 21.014 -14.224 1.00 0.00 H

      -The 18.80 and N parts are supposed to follow 1.00 directly, not from a space below but the comment spaces, doh.- And I'm supposed to print out ATOM lines that don't have "H" near the 17 part. It keeps printing out all lines despite the syntax.

      HETATM is working properly and a sample of it -from another input pdb file since this 1GRL.pdb doesn't have HETATM's in it, this is from the input file 1FFL.pdb-

      HETATM 1 N CXM A 1 -12.588 -1.070 15.591 1.00 25.28 N
      HETATM 2 CA CXM A 1 -11.877 -0.094 16.395 1.00 25.28 C

      and I'm supposed to print HETATMS that don't have HOH on the part where CXM stands.

      HETATM 2153 O HOH A 300 -38.403 0.000 33.125 0.50 13.41 O
      HETATM 2154 O HOH A 301 -29.459 12.090 33.186 1.00 31.37 O

      these ones are successfully ignored by the program.
        Am I right in assuming that these lines are produced by some other program or machine? So there is a certain format it adheres to? We have to analyse the format.

        Guessing from what you gave here, it *seems* as if every record starts with ATOM and goes like this:

        ATOM  16  NZ  LYS  A  7  -19.664  15.558   -9.499  1.00  18.80  N
        ATOM  17  H   LYS  A  7  -19.967  21.014  -14.224  1.00   0.00  H
        I have added some spaces to align the fields.

        If my guess is correct, you need to find lines which do not have an "H" in the third field.

        use strict;

        # Print every record whose third whitespace-separated field is not 'H'.
        while (<DATA>) {
            print if (split ' ')[2] ne 'H';
        }
        __DATA__
        ATOM 16 NZ LYS A 7 -19.664 15.558 -9.499 1.00 18.80 N
        ATOM 17 H LYS A 7 -19.967 21.014 -14.224 1.00 0.00 H
        Update: I leave it up to you to apply this to the "HETATM" file.
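A minimal sketch of how the same field test might be extended to both record types, assuming the fields are separated by whitespace exactly as in the samples posted above. The helper name `keep_line` is hypothetical. Keep in mind that real PDB files are fixed-column, so neighbouring fields can run together (for example a chain ID directly followed by a four-digit residue number), and counting whitespace-separated fields can then go wrong.

```perl
use strict;
use warnings;

# Hypothetical helper: returns true if the line should be printed.
# Assumption: fields are whitespace-separated, as in the posted samples.
sub keep_line {
    my ($line) = @_;
    my @f = split ' ', $line;
    return 0 unless @f >= 4;
    return $f[2] ne 'H'   if $f[0] eq 'ATOM';      # third field: atom name
    return $f[3] ne 'HOH' if $f[0] eq 'HETATM';    # fourth field: residue name
    return 0;                                      # skip every other record type
}

my @sample = (
    'ATOM 16 NZ LYS A 7 -19.664 15.558 -9.499 1.00 18.80 N',
    'ATOM 17 H LYS A 7 -19.967 21.014 -14.224 1.00 0.00 H',
    'HETATM 1 N CXM A 1 -12.588 -1.070 15.591 1.00 25.28 N',
    'HETATM 2153 O HOH A 300 -38.403 0.000 33.125 0.50 13.41 O',
);

print "$_\n" for grep { keep_line($_) } @sample;
```

Only the NZ and CXM lines should come through; the bare H atom and the HOH residue are filtered out.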

        CountZero


Re: Help for a regex problem ?
by biohisham (Priest) on Jul 14, 2009 at 00:48 UTC
    Something's strange in the way you have used these regexes. A standard PDB file does not look much like the example you have mentioned: the data generated from PDB comes with ATOM or HETATM at the beginning of the line, then the atom position and type, and so on until the line ends in an atom name again. You also seem to have included a lot of brackets for backreferencing, which is not really advisable unless you strictly have to. Why don't you look for another way of reading these lines, with a condition that lets you select the lines you want printed and discard those you do not want, instead of squeezing your brains this way? In any case, try to show us a couple of full PDB lines between code tags; that would give us a better picture and get us closer to understanding what you need to do.
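One possible "other way of reading these lines" is to cut the fixed columns out with `substr` instead of matching the whole record. This is only a sketch: the helper names `pdb_fields` and `keep_record` are hypothetical, the column positions (1-6 record name, 13-16 atom name, 18-20 residue name) follow the wwPDB format description, and treating every atom name that starts with H as hydrogen is an assumption that may be too broad for some HETATM records.

```perl
use strict;
use warnings;

# Hypothetical helper: extract fixed-width fields from one PDB record.
# Columns are 1-based in the format description, so substr offsets are col - 1.
sub pdb_fields {
    my ($line) = @_;
    chomp $line;
    $line = sprintf('%-20s', $line);     # pad short lines so substr stays in range
    my %f = (
        record  => substr($line, 0, 6),  # columns 1-6
        atom    => substr($line, 12, 4), # columns 13-16
        residue => substr($line, 17, 3), # columns 18-20
    );
    s/^\s+|\s+$//g for values %f;        # trim the column padding
    return \%f;
}

# Hypothetical filter: drop hydrogens from ATOM records and HOH from HETATM.
# Assumption: an atom name beginning with H denotes hydrogen.
sub keep_record {
    my ($line) = @_;
    my $f = pdb_fields($line);
    return $f->{atom} !~ /^H/     if $f->{record} eq 'ATOM';
    return $f->{residue} ne 'HOH' if $f->{record} eq 'HETATM';
    return 0;
}

my @sample = (
    'ATOM     16  NZ  LYS A   7     -19.664  15.558  -9.499  1.00 18.80           N',
    'ATOM     17  H   LYS A   7     -19.967  21.014 -14.224  1.00  0.00           H',
    'HETATM    1  N   CXM A   1     -12.588  -1.070  15.591  1.00 25.28           N',
    'HETATM 2153  O   HOH A 300     -38.403   0.000  33.125  0.50 13.41           O',
);

print "$_\n" for grep { keep_record($_) } @sample;
```

Because the test looks only at fixed positions, it does not care how many fields happen to be separated by whitespace, which is the main reason to prefer it over a long capture-heavy pattern.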
    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind
Re: Help for a regex problem ?
by ashish.kvarma (Monk) on Jul 14, 2009 at 06:10 UTC
    I have never looked at a PDB file before, so I am referring to http://www.wwpdb.org/documentation/format32/sect9.html for it. I am also assuming (from the write-ups above) that you are looking to exclude lines with '\sH\s\s' (when the line starts with ATOM) and '\sHOH' (when it starts with HETATM).
    It can be done with the following two regexes:
    /ATOM  .{6} H  /
    /HETATM.{6} HOH/
    or they can be combined as shown below:
    use strict;
    while (<DATA>) {
        my $data = $_;
        # Print the line unless it is an ATOM record named H or a HETATM record named HOH.
        print $data if ($data !~ /(?:ATOM  .{6} H  )|(?:HETATM.{6} HOH)/);
    }
    __DATA__
    ATOM    601  H   LEU A  75     -17.070 -16.002   2.409  1.00 55.63           N
    ATOM    602  CA  LEU A  75     -16.343 -16.746   3.444  1.00 55.50           C
    ATOM    603  C   LEU A  75     -16.499 -18.263   3.300  1.00 55.55           C
    ATOM    604  H   LEU A  75     -16.645 -18.789   2.195  1.00 55.50           O
    ATOM    605  CB  LEU A  75     -16.776 -16.283   4.844  1.00 55.51           C
    TER     606      LEU A  75
    ATOM   1185  O   LEU B  75      26.292  -4.310  16.940  1.00 55.45           O
    ATOM   1186  CB  LEU B  75      23.881  -1.551  16.797  1.00 55.32           C
    TER    1187      LEU B  75
    HETATM 1188  HOH SRT A1076     -17.263  11.260  28.634  1.00 59.62           H
    HETATM 1189  HA  SRT A1076     -19.347  11.519  28.341  1.00 59.42           H
    HETATM 1190  H3  SRT A1076     -17.157  14.303  28.677  1.00 58.00           H
    HETATM 1191  HOH SRT A1076     -15.110  13.610  28.816  1.00 57.77           H
    HETATM 1192  O1  SRT A1076     -17.028  11.281  31.131  1.00 62.63           O
    ATOM    295  HB2 ALA A  18       4.601  -9.393   7.275  1.00  0.00           H
    ATOM    296  HB3 ALA A  18       3.340  -9.147   6.043  1.00  0.00           H
    TER     297      ALA A  18
    The output is shown below:
    # Output
    # ATOM    602  CA  LEU A  75     -16.343 -16.746   3.444  1.00 55.50           C
    # ATOM    603  C   LEU A  75     -16.499 -18.263   3.300  1.00 55.55           C
    # ATOM    605  CB  LEU A  75     -16.776 -16.283   4.844  1.00 55.51           C
    # TER     606      LEU A  75
    # ATOM   1185  O   LEU B  75      26.292  -4.310  16.940  1.00 55.45           O
    # ATOM   1186  CB  LEU B  75      23.881  -1.551  16.797  1.00 55.32           C
    # TER    1187      LEU B  75
    # HETATM 1189  HA  SRT A1076     -19.347  11.519  28.341  1.00 59.42           H
    # HETATM 1190  H3  SRT A1076     -17.157  14.303  28.677  1.00 58.00           H
    # HETATM 1192  O1  SRT A1076     -17.028  11.281  31.131  1.00 62.63           O
    # ATOM    295  HB2 ALA A  18       4.601  -9.393   7.275  1.00  0.00           H
    # ATOM    296  HB3 ALA A  18       3.340  -9.147   6.043  1.00  0.00           H
    # TER     297      ALA A  18
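For what it is worth, the pattern in the original question prints every ATOM line because its atom-name group is `(\w+)`, which accepts H just as readily as NZ; nothing in it excludes hydrogen. The sketch below uses deliberately shortened stand-ins for that pattern (an assumption: only the first three fields matter for the comparison) and shows one possible fix, a negative lookahead in front of the name. The variable names and the `matches` helper are hypothetical.

```perl
use strict;
use warnings;

# Shortened stand-ins for the original pattern, covering only the
# record name, serial number and atom name.
my $any_name   = qr/^ATOM\s+\d+\s+(\w+)\s/;           # (\w+) also matches a bare H
my $not_bare_h = qr/^ATOM\s+\d+\s+(?!H\s)(\w+)\s/;    # lookahead rejects a bare H

# Hypothetical helper: 1 if the line matches the pattern, 0 otherwise.
sub matches {
    my ($re, $line) = @_;
    return $line =~ $re ? 1 : 0;
}

for my $line ('ATOM 16 NZ LYS A 7', 'ATOM 17 H LYS A 7') {
    printf "%-20s any_name=%d not_bare_h=%d\n",
        $line, matches($any_name, $line), matches($not_bare_h, $line);
}
```

The lookahead only rejects a name that is exactly H; names such as HB2 still pass, which mirrors the behaviour of the two regexes above.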