Distant Global Regex Challenge

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, monks. I'm a fairly old hand at Perl regexes, but I've stumbled across a problem I can't seem to get past. Yes, there is an easier solution, but now that I've come across the problem, I'd like to figure it out just for the sake of solving it.

Here's the issue: I have the contents of a file in a string, and I want to remove lines that have near identical data, but sometimes, these lines are separated by other lines. To complicate matters, I would like to be able to do a single global match to identify and remove all duplicate lines.

Here's a bit of the actual data. Specifically, this is a single amino acid in a protein data bank (PDB) file, followed by a quick breakdown of the columns:

ATOM   1489  CA AARG A 181      21.615  11.671  -0.581  0.50 14.29 
ATOM   1490  C  AARG A 181      21.176  11.705   0.880  0.50 13.13 
ATOM   1491  O  AARG A 181      21.097  10.666   1.534  0.50 11.72 
ATOM   1492  CB AARG A 181      20.905  10.524  -1.299  0.50 15.58
ATOM   1493  CG AARG A 181      19.464  10.823  -1.680  0.50 18.22
ATOM   1494  CD AARG A 181      19.399  11.628  -2.968  0.50 21.19 
ATOM   1495  NE AARG A 181      20.181  10.979  -4.017  0.50 24.97
ATOM   1496  CZ AARG A 181      19.785  10.842  -5.278  0.50 25.56 
ATOM   1497  NH1AARG A 181      18.606  11.311  -5.660  0.50 27.74 
ATOM   1498  NH2AARG A 181      20.567  10.230  -6.156  0.50 25.96 
ATOM   1499  N  BARG A 181      23.059  11.454  -0.580  0.50 14.86 
ATOM   1500  CA BARG A 181      21.613  11.672  -0.589  0.50 14.84
ATOM   1501  C  BARG A 181      21.172  11.705   0.874  0.50 13.78 
ATOM   1502  O  BARG A 181      21.092  10.664   1.525  0.50 13.73
ATOM   1503  CB BARG A 181      20.908  10.523  -1.319  0.50 18.29 
ATOM   1504  CG BARG A 181      19.470  10.822  -1.731  0.50 22.11 
ATOM   1505  CD BARG A 181      19.428  11.725  -2.959  0.50 23.55
ATOM   1506  NE BARG A 181      19.985  11.063  -4.138  0.50 24.53  
ATOM   1507  CZ BARG A 181      19.322  10.200  -4.904  0.50 23.44 
ATOM   1508  NH1BARG A 181      18.062   9.888  -4.628  0.50 21.30 
ATOM   1509  NH2BARG A 181      19.926   9.642  -5.944  0.50 24.00

1      2     3  45   6 7        8

1 -- denotes record type (e.g., ATOM); 2 -- atom number; 3 -- atom name; 4 -- alternate location; 5 -- amino acid; 6 -- chain; 7 -- residue number; 8 -- extra stuff. The important part is #4 -- I want to keep only one version, i.e., all atoms without any alternate data and only one of the alternate versions (for simplicity, the first).
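For reference, here is a minimal sketch of pulling those columns out of one line by position. The offsets are read off the sample data above rather than taken from any library, and the sub name parse_atom_line and the hash keys are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split one ATOM line into the fields listed above.
# Offsets are 0-based and inferred from the sample lines in this post.
sub parse_atom_line {
    my ($line) = @_;
    return unless $line =~ /^ATOM/ && length($line) >= 26;

    my %f = (
        record  => substr($line, 0, 6),    # 1: record type
        serial  => substr($line, 6, 5),    # 2: atom number
        name    => substr($line, 12, 4),   # 3: atom name
        altloc  => substr($line, 16, 1),   # 4: alternate location
        resname => substr($line, 17, 3),   # 5: amino acid
        chain   => substr($line, 21, 1),   # 6: chain
        resseq  => substr($line, 22, 4),   # 7: residue number
    );
    s/^\s+|\s+$//g for values %f;          # trim the padding in place
    return \%f;
}

my $atom = parse_atom_line(
    'ATOM   1489  CA AARG A 181      21.615  11.671  -0.581  0.50 14.29');
print join(' ', @{$atom}{qw(name altloc resname chain resseq)}), "\n";
# prints "CA A ARG A 181"
```

An atom with no alternate location has a blank in field 4, so altloc comes back as an empty string for those lines.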

This example is fairly simple -- there are only two alternate versions (A and B) and every atom has an alternate location. It is fairly common for only a few atoms to have alternate versions, and for those that do to have 2-3 different versions.

Here's my problem: I can easily identify matches, e.g., the CA atoms for both the A and B versions, but
1) I also have to match everything in between, so a single global regex won't work,
2) a lookbehind would work great...except that it's variable length, and
3) I can't seem to get the lookahead to work.

Here's a regex that works if the lines are grouped according to atom type (i.e., the two CA lines are consecutive). It matches any atom record (including those without a specific altloc -- this does happen) and checks the next line for an alternate version. I've broken it up and commented for this post, but it should still work. There are a couple of extra capture groups in there -- just ignore 'em.

my $rx_altloc = qr/
   ^(ATOM.{9}   # match only atom records
    (.{3})      # capture atom name (and any space included)
    (.)         # capture alternate location identifier
    \w{3}       # match the amino acid name
    (.{6})      # capture chain and residue number
    .+$ [\r\n]) # match the rest of the line
   (?:(^ATOM.{9} # make sure the next line(s) is an atom record
   (?=\2).{3}    # make sure same atom name
   (?!\3).       # make sure different alternate location
   \w{3}         # don't care if it's the same residue type
   (?=\4).{6}    # make sure it's the same chain and residue
   .+$ [\r\n])+) # match the rest of the line
   (?{
      $altloc = $3;
   })
/xm;
...

$file_contents =~ s/$rx_altloc/$1/g;

Unfortunately, there are many cases where the corresponding alternate locations are grouped together rather than by atom type, as in the first example. I cannot simply remove all lines containing a different alternate id because...well, there are several issues with standardization. Because I need variable length, I cannot use lookbehind, so I decided to use lookahead. The following is a regex that correctly matches without lookaround, but fails with the lookahead.

my $rx_altloc_groupByAlt = qr/
   #(?=
   (^ATOM.{9}(.{3})(\w)\w{3}(.{6}).+$ [\r\n])
   ((?:^.+$ [\r\n])*)
   #)
   (ATOM.{9}\2(?!\3).\w{3}\4.+$ [\r\n])
   (?{
      $altloc = $3;
   })
/xm;

...

$file_contents =~ s/$rx_altloc_groupByAlt//g;

However, because it consumes all the in-between lines as part of the match, it only matches once, not globally, hence the need for lookaround.
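To make the limitation concrete: one workaround, which is not the single-pass global match being asked for, is to drop /g and repeat the substitution until it stops matching. Each pass removes one later alternate and puts the first line and the in-between lines back. A rough sketch, assuming every line ends in "\n" (the names $rx_later_altloc and strip_later_altlocs are invented here):

```perl
use strict;
use warnings;

# Same idea as the pattern above, but the in-between lines are matched lazily
# and captured so they can be put back by the substitution.
my $rx_later_altloc = qr/
    ^(ATOM.{9}(.{3})(.)\w{3}(.{6}).*\n)  # group 1: first line; groups 2-4: atom name, altloc, chain plus residue
    ((?:.*\n)*?)                          # group 5: as few in-between lines as possible
    ^ATOM.{9}\2(?!\3).\w{3}\4.*\n         # later line: same atom and residue, different altloc
/xm;

sub strip_later_altlocs {
    my ($text) = @_;
    # No g modifier: each pass deletes exactly one later alternate, then the
    # search restarts from the top of the string until nothing matches.
    1 while $text =~ s/$rx_later_altloc/$1$5/;
    return $text;
}
```

Every pass rescans from the start, so this does far more work than a hash-based filter; it only shows that the non-lookaround pattern is correct once it is allowed to run repeatedly.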

Sorry for the lengthy explanation. I'd appreciate any help / love to discuss possible options. Thanks!!

Replies are listed 'Best First'.
Re: Distant Global Regex Challenge by SuicideJunkie (Vicar) on Mar 13, 2012 at 21:20 UTC
What if you spin through the file once up front, building a hash? The keys are the important data; the values are arrays of the line numbers where that data was found. Filter out all the hash entries with a single line number, and you'll be left with the duplicates. Take all those line numbers and sort them. On the second pass, you can use the list of line numbers to drop the duplicates.
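A rough sketch of that two-pass idea, under two assumptions that are not spelled out above: the "important data" is the atom name plus chain and residue number, and the first line of each group is the one to keep. The sub name is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-pass filter over a list of lines (no trailing newlines needed).
sub drop_altloc_duplicates_two_pass {
    my (@lines) = @_;

    # Pass 1: record every line index at which each key occurs.
    my %seen_at;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ /^ATOM/ && length($lines[$i]) >= 26;
        # Assumed key: atom name (4 columns) plus chain and residue number.
        my $key = join '|', substr($lines[$i], 12, 4), substr($lines[$i], 21, 5);
        push @{ $seen_at{$key} }, $i;
    }

    # Keys seen more than once are the duplicates; mark all but the first.
    my %drop;
    for my $indexes (grep { @$_ > 1 } values %seen_at) {
        my ($first, @rest) = @$indexes;
        @drop{@rest} = ();
    }

    # Pass 2: keep every line whose index was not marked.
    return @lines[ grep { !exists $drop{$_} } 0 .. $#lines ];
}
```

Looking indexes up in a hash makes the sort step unnecessary; sorting the indexes would only matter if the second pass walked the file and the drop list side by side.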
Re^2: Distant Global Regex Challenge by muppetjones (Novice) on Mar 13, 2012 at 21:38 UTC
Right -- good solution. Like I said, there are easier ways to do it, but I'm curious whether there is a way to do it with a single, one-pass, global regex. (I've actually implemented something similar to what you suggested just for the time being. At each line, I check for an existing altloc-atom name combo. If it exists, I skip the line; otherwise I update the hash with the new info. This gives me a single pass, though without the elegance of a regex.) Thanks!
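The single-pass version described here might look roughly like the following. This is a guess at the shape of it, not the actual code; in particular the key (atom name plus chain and residue number) and the sub name are assumptions.

```perl
use strict;
use warnings;

# Hypothetical one-pass filter: keep the first line seen for each atom of each
# residue and skip any later line with the same key.
sub drop_altloc_duplicates_one_pass {
    my (@lines) = @_;
    my (%seen, @kept);
    for my $line (@lines) {
        if ($line =~ /^ATOM/ && length($line) >= 26) {
            # Assumed key: atom name (4 columns) plus chain and residue number.
            my $key = substr($line, 12, 4) . '|' . substr($line, 21, 5);
            next if $seen{$key}++;    # already have a version of this atom
        }
        push @kept, $line;            # non-ATOM lines pass through untouched
    }
    return @kept;
}
```

Because the key leaves out the altloc column, whichever version appears first in the file wins, which matches the "for simplicity, the first" rule from the original question.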
Re: Distant Global Regex Challenge by muppetjones (Novice) on Mar 13, 2012 at 17:38 UTC
Oops, thought I was logged on -- I posted this. Sorry.