Murcia has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a long HTML text. From this text I want to extract all database
identifiers (4 characters, beginning with a non-zero digit), e.g. 1TGS.
These IDs are introduced by the database name (PDB), e.g.
PDB ID 1TGS || PDB code ID 1TGS || PDB 1TGS.
These terms can occur multiple times in the text.

$string = "<html>Test text PDB code ID 1TGS 1O6S 1TGS bla bla \nPDB code 1ILW 1ILV";

I want to get all these IDs. This is not correct:

while ($string =~ /(?<=PDB).+?([1-9][A-Z0-9]{3})/g) { print $1, "\n"; }

But how do I get all the IDs? Thanks in advance! Yours, Murcia

Re: Regexp find all matches
by holli (Abbot) on Mar 03, 2005 at 10:05 UTC
    $string = qq { <html>Test text PDB code ID 1TGS 1O6S 1TGS bla bla PDB code 1ILW 1ILV PDB 2XXX };
    while ($string =~ /PDB (?:code )?(?:ID )?(([1-9][A-Z0-9]{3} ?)+)+/mg) {
        print $1, "\n";
    }


    holli, /regexed monk/
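
The loop above prints each run of IDs as one string. A minimal sketch of one way to get the individual IDs instead, assuming IDs are whitespace-separated and follow "PDB" with an optional "code" and/or "ID". The helper name extract_pdb_ids is hypothetical and not from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: return every ID found in a run introduced by "PDB".
sub extract_pdb_ids {
    my ($text) = @_;
    my @ids;
    # Outer match: "PDB", optional "code" / "ID", then a run of IDs.
    while ($text =~ /\bPDB\s+(?:code\s+)?(?:ID\s+)?((?:[1-9][A-Z0-9]{3}\b\s*)+)/g) {
        my $run = $1;
        # Inner match: pull the individual 4-character IDs out of the run.
        push @ids, $run =~ /\b([1-9][A-Z0-9]{3})\b/g;
    }
    return @ids;
}

my $string = "<html>Test text PDB code ID 1TGS 1O6S 1TGS bla bla \nPDB code 1ILW 1ILV";
print "$_\n" for extract_pdb_ids($string);
```

Repeated IDs are returned as often as they occur; if only distinct IDs are wanted, filtering through a hash (for example `my %seen; grep { !$seen{$_}++ } @ids`) is one option.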
Re: Regexp find all matches
by tphyahoo (Vicar) on Mar 03, 2005 at 12:43 UTC
    In the future, when posting for regex help, you may want to use the __DATA__ feature for your test data, explaining which data you want matched/substituted/whatever, and which data isn't working.

    For example, look at Celsius to Fahrenheit using s///.

    Then it is easier for your brother monks to help you with your problem.
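
As a rough illustration of the __DATA__ suggestion, here is a sketch that keeps the test cases at the end of the script and reports what each line yields. The first two data lines come from the question's string; the third is a made-up negative case, and the pattern is only one possible reading of the requirement:

```perl
use strict;
use warnings;

# Each line of the __DATA__ section below is one test case.
while (my $line = <DATA>) {
    chomp $line;
    # Collect the IDs of every run introduced by "PDB" on this line.
    my @ids;
    while ($line =~ /\bPDB\s+(?:code\s+)?(?:ID\s+)?((?:[1-9][A-Z0-9]{3}\b\s*)+)/g) {
        push @ids, split ' ', $1;
    }
    printf "%-45s => %s\n", $line, @ids ? "@ids" : "(no match)";
}

__DATA__
Test text PDB code ID 1TGS 1O6S 1TGS bla bla
PDB code 1ILW 1ILV
no identifier here 1998
```

Posting the data this way lets others run the script unchanged and see at a glance which lines behave as intended.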

Re: Regexp find all matches
by manav (Scribe) on Mar 03, 2005 at 10:13 UTC
    Is there any reason you are using look-behind? The requirement you state amounts to
    my (@arr) = $string =~ /[1-9][a-zA-Z0-9]{3}/g ;
    Or am I missing something here?

    Manav
      The reason is to avoid finding many false positives!
        Can you manually list what the true positives will be in this $string?

        $string = "<html>Test text PDB code ID 1TGS 1O6S 1TGS bla bla \nPDB code 1ILW 1ILV";
        Manav
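
To make the false-positive concern concrete, here is a small sketch comparing the unanchored pattern with one that requires the "PDB" prefix. The sample sentence is invented for illustration and is not from the thread, and the prefixed pattern here captures only the first ID after each "PDB":

```perl
use strict;
use warnings;

# Made-up sample text (an assumption for illustration only).
my $sample = "Published 2005, 1200 words. PDB ID 1TGS is discussed.";

# Unanchored pattern: any 4 characters starting with a digit 1-9.
my @loose = $sample =~ /[1-9][a-zA-Z0-9]{3}/g;

# Prefixed pattern: the ID must be introduced by "PDB", optional "code" / "ID".
my @strict = $sample =~ /\bPDB\s+(?:code\s+)?(?:ID\s+)?([1-9][A-Z0-9]{3})\b/g;

print "loose:  @loose\n";
print "strict: @strict\n";
```

On this sample the unanchored pattern also picks up the year and the word count, while the prefixed pattern returns only 1TGS.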