in reply to extract ids

This regex will work as well:
/molecule_idref="([^"]+)/
This will match everything in between the double quotes after molecule_idref=. If you are sure your id only contains numbers, then you had indeed better check for "digits" (\d+), as was already suggested.
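To see the difference between the two patterns, here is a minimal sketch. The sample string and its element names A and B are made up for illustration; only the molecule_idref attribute comes from the question.

```perl
use strict;
use warnings;

# Hypothetical sample input: one numeric id and one non-numeric id.
my $text = '<A molecule_idref="12"/><B molecule_idref="x7"/>';

# In list context, /g returns every captured group.
my @any_ids     = $text =~ /molecule_idref="([^"]+)/g;   # anything but a quote
my @numeric_ids = $text =~ /molecule_idref="(\d+)"/g;    # digits only

print "any:     @any_ids\n";      # any:     12 x7
print "numeric: @numeric_ids\n";  # numeric: 12
```

The \d+ version silently skips ids that are not purely numeric, which is what you want only if such ids really cannot occur.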

Note that when parsing XML files there is no guarantee that white-space, line endings, and so on will be where you expect them to be, so reading such files line by line, or relying on "start of line" anchors, may cause subtle errors. What would you have done if your tags did not start at the beginning of the line, or if a tag was broken over several lines?
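A small sketch of how this can go wrong. The input and the anchored pattern are assumptions chosen for illustration, not the code from the original question.

```perl
use strict;
use warnings;

# Hypothetical input: the second tag is indented and split over two lines.
my $xml = <<'END_XML';
<ComplexComponent1 molecule_idref="1"/>
    <ComplexComponent2
        molecule_idref="2"/>
END_XML

# Line by line with a start-of-line anchor: only the first tag is found.
my @by_line;
for my $line (split /\n/, $xml) {
    push @by_line, $1
        if $line =~ /^<ComplexComponent\d+ molecule_idref="([^"]+)/;
}

# Whole string, no anchor; \s+ accepts spaces as well as newlines.
my @whole = $xml =~ /<ComplexComponent\d+\s+molecule_idref="([^"]+)/g;

print "line by line: @by_line\n";   # line by line: 1
print "whole string: @whole\n";     # whole string: 1 2
```

The second pattern is more forgiving, but it is still a regex pretending to understand XML.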

Consider using an XML parser, such as XML::Simple, which will turn your XML into a nice Perl data structure.

For example:

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

my $xml;
{
    local $/;          # slurp mode: read all of DATA in one go
    $xml = <DATA>;
}

my $xs  = XML::Simple->new();
my $ref = $xs->XMLin($xml);   # parse the XML string into a hash reference
print Dumper($ref);

__DATA__
<xml><ComplexComponent1 molecule_idref="1"/>
<ComplexComponent2 molecule_idref="2"/><ComplexComponent3 molecule_idref="3"/>
<ComplexComponent4 molecule_idref="4"/><ComplexComponent5 molecule_idref="5"/>
</xml>
Will turn the mess in the __DATA__ section into:
$VAR1 = {
    'ComplexComponent3' => { 'molecule_idref' => '3' },
    'ComplexComponent5' => { 'molecule_idref' => '5' },
    'ComplexComponent1' => { 'molecule_idref' => '1' },
    'ComplexComponent4' => { 'molecule_idref' => '4' },
    'ComplexComponent2' => { 'molecule_idref' => '2' }
};
a nice hash-of-hashes which you can access in any "Perlish"-way.
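For instance, pulling the ids back out could look roughly like this. The hash is written out literally, in the same shape as the Dumper output, so the sketch runs even without XML::Simple installed; in real code $ref would come from XMLin.

```perl
use strict;
use warnings;

# Same shape as the Dumper output; typed in by hand for this sketch.
my $ref = {
    ComplexComponent1 => { molecule_idref => '1' },
    ComplexComponent2 => { molecule_idref => '2' },
    ComplexComponent3 => { molecule_idref => '3' },
    ComplexComponent4 => { molecule_idref => '4' },
    ComplexComponent5 => { molecule_idref => '5' },
};

# Hash key order is not guaranteed, so sort the keys for a stable result.
my @ids = map { $ref->{$_}{molecule_idref} } sort keys %$ref;
print "ids: @ids\n";    # ids: 1 2 3 4 5

# Direct lookup of a single component.
my $third = $ref->{ComplexComponent3}{molecule_idref};
print "ComplexComponent3 refers to molecule $third\n";
```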

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^2: extract ids
by snape (Pilgrim) on Sep 23, 2009 at 20:48 UTC
    Hi, Thanks a lot for answering my doubts. I would like to know how
    /molecule_idref="([^"]+)/
    will match everything between the quotes. I don't understand why "([^"]+) will match anything between the quotes. Thanks.
      [^"]
      in a regex means: any single character BUT a double quote.

      "([^"]+)
      therefore means: start with a double quote, then capture one or more characters that are not a double quote, and end the capture. In other words, the capture starts after the first double quote and ends just before the next double quote.
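A quick way to convince yourself is to compare the negated class with a greedy dot. This is only a sketch; the tag and its extra name attribute are invented for the comparison.

```perl
use strict;
use warnings;

# Hypothetical tag with a second attribute after molecule_idref.
my $tag = '<ComplexComponent molecule_idref="42" name="abc"/>';

# [^"]+ cannot step over a double quote, so it stops right before the next one.
my ($negated) = $tag =~ /molecule_idref="([^"]+)/;

# .+ is greedy and happily eats quotes, so it runs on to the LAST double quote.
my ($greedy) = $tag =~ /molecule_idref="(.+)"/;

print "negated class: $negated\n";   # negated class: 42
print "greedy dot:    $greedy\n";    # greedy dot:    42" name="abc
```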

      CountZero
