in reply to extract ids
    /molecule_idref="([^"]+)/

This will match everything between the double quotes after molecule_idref=. If you are sure your ids contain only digits, then you had indeed better check for "digits" (\d+), as was already suggested.
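As a minimal sketch of the two patterns side by side (the sample string and the variable names are made up for illustration, not taken from the original question):

```perl
use strict;
use warnings;

# Hypothetical sample input; only the attribute name comes from the thread.
my $text = '<ComplexComponent1 molecule_idref="1"/> '
         . '<ComplexComponent2 molecule_idref="abc-2"/>';

# In list context with /g, the match returns every captured group.
# Permissive pattern: anything up to the next double quote.
my @any_ids = $text =~ /molecule_idref="([^"]+)/g;

# Stricter pattern: digits only, and the closing quote must follow.
my @digit_ids = $text =~ /molecule_idref="(\d+)"/g;

print "any:    @any_ids\n";
print "digits: @digit_ids\n";
```

The permissive pattern picks up both values, while the digits-only pattern silently skips the non-numeric id, so which one is "right" depends on what your ids can look like.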
Note that when parsing XML files there is no guarantee that white-space, EOL, ... will be where you expect them to be, so reading such files line by line, or relying on your "start of line" anchors, may cause subtle errors. What would you have done if your tags did not start at the beginning of the line, or if a tag was broken over several lines?
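To make the problem concrete, here is a small sketch. The anchored pattern is only a guess at what a line-oriented approach might look like, and the sample XML is invented so that one tag is broken over two lines and another is indented:

```perl
use strict;
use warnings;

# Invented sample: first tag split over two lines, second tag indented.
my $xml = <<'END';
<xml><ComplexComponent1
    molecule_idref="1"/>
  <ComplexComponent2 molecule_idref="2"/>
</xml>
END

# Hypothetical line-by-line approach with a start-of-line anchor.
my @by_line;
for my $line (split /\n/, $xml) {
    push @by_line, $1
        if $line =~ /^<ComplexComponent\d+ molecule_idref="(\d+)"/;
}

# Matching against the whole string, allowing any white-space
# (including newlines) between the tag name and the attribute.
my @whole = $xml =~ /<ComplexComponent\d+\s+molecule_idref="(\d+)"/g;

printf "line by line: %d id(s)\n", scalar @by_line;
printf "whole string: %d id(s)\n", scalar @whole;
```

The line-oriented loop finds nothing here, and it does so without any warning. The whole-string match copes with this particular input, but it is still a regex guessing at XML structure, which is why a real parser is the safer choice.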
Consider using an XML-parser, such as XML::Simple which will turn your XML into a nice Perl-datastructure.
For example:

    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dumper;

    my $xml;
    {
        local $/;    # undef = slurp mode; '' would be paragraph mode
        $xml = <DATA>;
    }
    my $xs  = XML::Simple->new();
    my $ref = $xs->XMLin($xml);
    print Dumper($ref);

    __DATA__
    <xml><ComplexComponent1 molecule_idref="1"/>
    <ComplexComponent2 molecule_idref="2"/><ComplexComponent3 molecule_idref="3"/>
    <ComplexComponent4 molecule_idref="4"/><ComplexComponent5 molecule_idref="5"/>
    </xml>

Will turn the mess in the __DATA__ section into:

    $VAR1 = {
        'ComplexComponent3' => {'molecule_idref' => '3'},
        'ComplexComponent5' => {'molecule_idref' => '5'},
        'ComplexComponent1' => {'molecule_idref' => '1'},
        'ComplexComponent4' => {'molecule_idref' => '4'},
        'ComplexComponent2' => {'molecule_idref' => '2'}
    };

a nice hash-of-hashes which you can access in any "Perlish" way.
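A short sketch of what such "Perlish" access could look like. To keep it runnable without XML::Simple installed, the hash-of-hashes from the dump is written out literally here; in the real script $ref would come from XMLin:

```perl
use strict;
use warnings;

# Same structure as the Data::Dumper output, typed in by hand
# (an assumption for this sketch; normally XMLin returns it).
my $ref = {
    ComplexComponent1 => { molecule_idref => '1' },
    ComplexComponent2 => { molecule_idref => '2' },
    ComplexComponent3 => { molecule_idref => '3' },
    ComplexComponent4 => { molecule_idref => '4' },
    ComplexComponent5 => { molecule_idref => '5' },
};

# Direct lookup of a single id.
my $third = $ref->{ComplexComponent3}{molecule_idref};
print "ComplexComponent3 has id $third\n";

# All ids, ordered by element name.
my @ids = map { $ref->{$_}{molecule_idref} } sort keys %$ref;
print "ids: @ids\n";
```

Note that hash keys come back in no particular order, which is why the dump lists the components shuffled; sort the keys if the order matters to you.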
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^2: extract ids
by snape (Pilgrim) on Sep 23, 2009 at 20:48 UTC
by CountZero (Bishop) on Sep 24, 2009 at 18:16 UTC