Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, monks. I'm a fairly old hand at Perl regexes, but I've stumbled across a problem I can't seem to get past. Yes, there is an easier solution, but now that I've come across the problem, I'd like to figure it out just for the sake of solving it.

Here's the issue: I have the contents of a file in a string, and I want to remove lines that have nearly identical data, but sometimes these lines are separated by other lines. To complicate matters, I would like to be able to do a single global match to identify and remove all duplicate lines.

Here's a bit of the actual data. Specifically, this is a single amino acid in a Protein Data Bank (PDB) file, followed by a quick breakdown of the fields:

ATOM   1489  CA AARG A 181      21.615  11.671  -0.581  0.50 14.29
ATOM   1490  C   AARG A 181      21.176  11.705   0.880  0.50 13.13
ATOM   1491  O   AARG A 181      21.097  10.666   1.534  0.50 11.72
ATOM   1492  CB AARG A 181      20.905  10.524  -1.299  0.50 15.58
ATOM   1493  CG AARG A 181      19.464  10.823  -1.680  0.50 18.22
ATOM   1494  CD AARG A 181      19.399  11.628  -2.968  0.50 21.19
ATOM   1495  NE AARG A 181      20.181  10.979  -4.017  0.50 24.97
ATOM   1496  CZ AARG A 181      19.785  10.842  -5.278  0.50 25.56
ATOM   1497  NH1AARG A 181      18.606  11.311  -5.660  0.50 27.74
ATOM   1498  NH2AARG A 181      20.567  10.230  -6.156  0.50 25.96
ATOM   1499  N   BARG A 181      23.059  11.454  -0.580  0.50 14.86
ATOM   1500  CA BARG A 181      21.613  11.672  -0.589  0.50 14.84
ATOM   1501  C   BARG A 181      21.172  11.705   0.874  0.50 13.78
ATOM   1502  O   BARG A 181      21.092  10.664   1.525  0.50 13.73
ATOM   1503  CB BARG A 181      20.908  10.523  -1.319  0.50 18.29
ATOM   1504  CG BARG A 181      19.470  10.822  -1.731  0.50 22.11
ATOM   1505  CD BARG A 181      19.428  11.725  -2.959  0.50 23.55
ATOM   1506  NE BARG A 181      19.985  11.063  -4.138  0.50 24.53
ATOM   1507  CZ BARG A 181      19.322  10.200  -4.904  0.50 23.44
ATOM   1508  NH1BARG A 181      18.062   9.888  -4.628  0.50 21.30
ATOM   1509  NH2BARG A 181      19.926   9.642  -5.944  0.50 24.00
1      2     3  45   6 7        8

1 -- denotes record type (e.g., ATOM); 2 -- atom number; 3 -- atom name; 4 -- alternate location; 5 -- amino acid; 6 -- chain; 7 -- residue number; 8 -- extra stuff. The important part is #4 -- I want to keep only one version, i.e., all atoms without any alternate data and only one of the alternate versions (for simplicity, the first).
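
The field breakdown above can be sketched with unpack. This is only an illustration: it assumes the records follow the standard fixed-width PDB column layout (record in columns 1-6, serial in 7-11, atom name in 13-16, altloc in 17, and so on), and the helper name parse_atom and the hash keys are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split one fixed-width ATOM record into the numbered
# fields described above. Column widths assume the standard PDB layout.
sub parse_atom {
    my ($line) = @_;
    my %f;
    @f{qw(record serial name altloc resname chain resnum rest)} =
        unpack 'A6 A5 x A4 A1 A3 x A1 A4 A*', $line;

    # 'A' strips trailing blanks only, so trim both ends of every field.
    for (values %f) { s/^\s+//; s/\s+$//; }
    return \%f;
}

my $atom = parse_atom(
    'ATOM   1497  NH1AARG A 181      18.606  11.311  -5.660  0.50 27.74');
print "$atom->{name} / $atom->{altloc} / $atom->{resname}\n";
# prints "NH1 / A / ARG"
```

A record with no alternate location has a blank in column 17, so altloc comes back as an empty string.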

This example is fairly simple -- there are only two alternate versions (A and B) and every atom has an alternate location. It is fairly common for only a few atoms to have alternate versions and for those that do have alternates to have 2-3 different versions.

Here's my problem: I can easily identify matches, e.g., the CA atoms for both the A and B versions, but
1) I also have to match everything in between, so a single global regex won't work,
2) a lookbehind would work great...except that it's variable length, and
3) I can't seem to get the lookahead to work.

Here's a regex that works if the lines are grouped according to atom type (i.e., the two CA lines are consecutive). It matches any atom record (including those without a specific altloc -- this does happen) and checks the next line for an alternate version. I've broken it up and commented for this post, but it should still work. There are a couple of extra capture groups in there -- just ignore 'em.

my $rx_altloc = qr/
    ^(ATOM.{9}          # match only atom records
        (.{3})          # capture atom name (and any space included)
        (.)             # capture alternate location identifier
        \w{3}           # match the amino acid name
        (.{6})          # capture chain and residue number
        .+$ [\r\n])     # match the rest of the line
    (?:(^ATOM.{9}       # make sure the next line(s) is an atom record
        (?=\2).{3}      # make sure same atom name
        (?!\3).         # make sure different alternate location
        \w{3}           # don't care if it's the same residue type
        (?=\4).{6}      # make sure it's the same chain and residue
        .+$ [\r\n])+)   # match the rest of the line
    (?{ $altloc = $3; })
/xm;
...
$file_contents =~ s/$rx_altloc/$1/g;

Unfortunately, there are many cases where the corresponding alternate locations are grouped together rather than by atom type, as in the first example. I cannot simply remove all lines containing a different alternate id because...well, there are several issues with standardization. Because I need variable length, I cannot use lookbehind, so I decided to use lookahead. The following is a regex that correctly matches without lookaround, but fails with the lookahead.

my $rx_altloc_groupByAlt = qr/
    #(?=
    (^ATOM.{9}(.{3})(\w)\w{3}(.{6}).+$ [\r\n])
    ((?:^.+$ [\r\n])*)
    #)
    (ATOM.{9}\2(?!\3).\w{3}\4.+$ [\r\n])
    (?{ $altloc = $3; })
/xm;
...
$file_contents =~ s/$rx_altloc_groupByAlt//g;

However, because it matches all the in-between lines, it only matches once, not globally, hence the need for lookaround.
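
One possible way around the fixed-length lookbehind limit, offered as a sketch and not something tried in this post: reverse the order of the lines, so that "an earlier line has the same atom" becomes "a later line has the same atom", which a lookahead can express. The helper name strip_altlocs_reversed is made up, the input is assumed to end every line with "\n", and the columns are assumed to be standard fixed-width PDB.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first altloc version of each atom by
# reversing the line order so a lookahead can stand in for a lookbehind.
sub strip_altlocs_reversed {
    my ($text) = @_;    # assumes every line ends in "\n"
    my $rev = join '', reverse split /^/m, $text;

    # Delete a record if any later line (earlier in the original order) has
    # the same atom name and chain/residue but a different altloc character.
    $rev =~ s/
        ^ATOM.{9} (.{3}) (.) \w{3} (.{6}) .*\n
        (?= (?:.*\n)*? ATOM.{9} \1 (?!\2). \w{3} \3 )
    //xmg;

    # Restore the original line order.
    return join '', reverse split /^/m, $rev;
}
```

The lookahead may scan to the end of the string for every record, so this is quadratic in the worst case; it is meant to show that one global substitution can do the job, not to be fast.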

Sorry for the lengthy explanation. I'd appreciate any help / love to discuss possible options. Thanks!!

Replies are listed 'Best First'.
Re: Distant Global Regex Challenge
by SuicideJunkie (Vicar) on Mar 13, 2012 at 21:20 UTC

    What if you spin through the file once up front, making a hash? Keys are the important data. Values are an array of line numbers where that data was found.

    Filter out all the hash entries with a single line number, and you'll be left with the duplicates. Take all those line numbers, and sort.

    Second pass through, you can use the list of line numbers to drop the duplicates.
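
A minimal sketch of the two-pass idea, under some assumptions: the "important data" is taken to be the atom name plus chain and residue number, the first occurrence of each key is the one kept, and the helper name drop_duplicates_two_pass is made up. It marks line numbers in a hash instead of sorting them, which comes to the same thing.

```perl
use strict;
use warnings;

# Hypothetical helper for the two-pass approach. The key (atom name plus
# chain and residue number) is an assumed choice of "important data".
sub drop_duplicates_two_pass {
    my (@lines) = @_;

    # Pass 1: map each key to the line numbers where it occurs.
    my %seen_at;
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ /^ATOM.{9}(.{3}).\w{3}(.{6})/;
        push @{ $seen_at{"$1$2"} }, $i;
    }

    # Every occurrence after the first is a duplicate to drop.
    my %drop;
    for my $nums (values %seen_at) {
        my ($first, @rest) = @$nums;
        @drop{@rest} = ();
    }

    # Pass 2: keep only the lines not marked for removal.
    return map { $lines[$_] } grep { !exists $drop{$_} } 0 .. $#lines;
}
```

Lines that are not ATOM records never get a key, so they pass through untouched.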

      Right -- good solution. Like I said, there are easier ways to do it, but I'm curious if there is a way to do it with a single, one pass, global regex.

      (I've actually implemented something similar to what you suggested just for the time being. At each line, I check for an existing altloc-atom name combo. If it exists, I skip the line, otherwise I update the hash with the new info. This gives me a single pass, though without the elegance of a regex)

      Thanks!
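
The single-pass hash check described here can also be folded into one global substitution by using /e, though whether that counts as "a single regex" is debatable. This is a sketch, not the code actually in use: the key (atom name plus chain and residue number) and the helper name drop_duplicates_one_pass are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: one pass over the whole file contents, keeping the
# first record seen for each atom and deleting later alternates.
# The key (atom name plus chain and residue number) is an assumption.
sub drop_duplicates_one_pass {
    my ($text) = @_;
    my %seen;
    $text =~ s{^(ATOM.{9}(.{3}).\w{3}(.{6}).*\n)}
              { $seen{"$2$3"}++ ? '' : $1 }gme;
    return $text;
}
```

A final line with no trailing newline does not match the pattern and is left as it is.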

Re: Distant Global Regex Challenge
by muppetjones (Novice) on Mar 13, 2012 at 17:38 UTC
    Oops, thought I was logged on -- I posted this. Sorry.