in reply to Extract version attribute value from xml header line

AFAIK, a tag name may not start with the string "xml". So I doubt your document really qualifies as being an "XML file".

Anyway, matching the version attribute for a specifically controlled xml-ish file, where the tags are in a fixed order, can be as simple as

my($version, $int) = /<xml\sversion="((\d+)\.?\d*)"/;
where $version gets the value "24.0" and $int the value "24". (Note that I'm matching against the text in $_, not in $text, because the code is somewhat simpler this way; binding to a variable would only draw attention away from the important part of the code: the regex.) This can be reduced if you need only one of the two, thus
my($version) = /<xml\sversion="(\d+\.?\d*)"/;
for the floating point and
my($version) = /<xml\sversion="(\d+)\.?\d*"/;
for the integer representation.
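
If the line lives in a variable rather than in $_, the same pattern can be bound with =~. What follows is only a minimal sketch: it assumes the header line is held in $text, and the helper name extract_version is made up for illustration. Note that this simple pattern expects exactly one whitespace character after "<xml" and no whitespace around the "=".

```perl
use strict;
use warnings;

# Hypothetical helper: in list context it returns (version, integer part),
# or an empty list when the pattern does not match.
sub extract_version {
    my ($text) = @_;
    return $text =~ /<xml\sversion="((\d+)\.?\d*)"/;
}

my $text = '<xml version="24.0" IP="1.1.2.3">';
my ($version, $int) = extract_version($text);
print "version=$version int=$int\n" if defined $version;
```

Checking defined $version before using it guards against lines that do not match at all.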

p.s. If the tag layout isn't as fixed, that is, when attributes can move around, there are somewhat more complex ways to do it with regular expressions too, but I'll come back to that later when I have some more time to test it. Watch this space for updates.

update As promised, here's a more complex regular expression which can match several variations on this string, complete with some test cases.

#!/usr/bin/perl -w
foreach (
    '<xml version= "24.0" IP="1.1.2.3" baseVersion="beta_3" lastUpdate="22-Apr-06" >',
    '<xml IP="1.1.2.3" baseVersion="beta_3" version= "24.0" lastUpdate="22-Apr-06" >',
    q<<xml IP='1.1.2.3' baseVersion="beta_3" lastUpdate="22-Apr-06" version= '24.0'>>,
) {
    if(/<xml
        (?> \s+ [a-zA-Z][^\s\/=>'"]* \s* = \s* (?: " [^"]* " | ' [^']* ' ) )*?
        \s+ version \s* = \s* (?:"([^"]*)"|'([^']*)')
    /x) {
        print "Match '$+' in $_\n";
    } else {
        print "No match in $_\n";
    }
}
Result:
Match '24.0' in <xml version= "24.0" IP="1.1.2.3" baseVersion="beta_3" lastUpdate="22-Apr-06" >
Match '24.0' in <xml IP="1.1.2.3" baseVersion="beta_3" version= "24.0" lastUpdate="22-Apr-06" >
Match '24.0' in <xml IP='1.1.2.3' baseVersion="beta_3" lastUpdate="22-Apr-06" version= '24.0'>

I'm trying to match as many attribute="value" items as I can (single quotes are allowed too), each preceded by whitespace, but without consuming the "version" attribute yet; that is what the nongreedy quantifier (PATTERN*?) is for. I'm quite liberal in what I accept in an attribute name: I just exclude some obviously unacceptable characters. When finally matching the version attribute, I again accept either single or double quotes, and I use $+ to select the subpattern that actually matched.
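
To show the $+ trick in isolation, here is a small sketch on a bare attribute string. The helper name quoted_value is hypothetical; the alternation is the same one used at the end of the big regex.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the quoted value of a version attribute,
# or nothing when the attribute is not quoted as expected.
sub quoted_value {
    my ($attr) = @_;
    return unless $attr =~ /version=(?:"([^"]*)"|'([^']*)')/;
    # Only one of the two capture groups took part in the match;
    # $+ holds that one, whichever alternative it was.
    return $+;
}

for my $attr (q{version="24.0"}, q{version='24.0'}) {
    printf "%s gives '%s'\n", $attr, quoted_value($attr);
}
```

Both lines report the value 24.0, without the caller having to test whether $1 or $2 is defined.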

And in a regex of this complexity, use of /x is strongly advised, which results in whitespace (when not preceded by a backslash) being ignored, so I can show the subpatterns in logical groups.
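
As an illustration of what /x allows, this sketch stores the same pattern with qr//x and puts a comment on each logical group. The names $version_re and find_version are mine, not part of the original code. One caveat when commenting this way: the delimiter character and variable sigils must stay out of the comments, because Perl still sees them while parsing the pattern.

```perl
use strict;
use warnings;

# Same pattern as in the test script, precompiled and commented.
my $version_re = qr/
    <xml
    (?>                                # atomic group, one attribute per pass
        \s+ [a-zA-Z][^\s\/=>'"]*       # whitespace, then the attribute name
        \s* = \s*                      # equals sign, optional whitespace
        (?: " [^"]* " | ' [^']* ' )    # value in double or single quotes
    )*?                                # as few attributes as possible
    \s+ version \s* = \s*              # the attribute we are after
    (?: "([^"]*)" | '([^']*)' )        # capture its value, either quote style
/x;

# Hypothetical helper: returns the version string, or undef if none is found.
sub find_version {
    my ($line) = @_;
    return $line =~ $version_re ? $+ : undef;
}

my $found = find_version(q{<xml IP='1.1.2.3' version= '24.0'>});
print "found $found\n" if defined $found;
```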

Finally, I'm using the cut operator ((?>pattern)), also known as an atomic group, which has two effects: (1) it groups without capturing, just as (?:pattern) does, and (2) it prevents useless backtracking, which can always happen when you stack repetition quantifiers. You never know, and it doesn't hurt.
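
The difference between the two kinds of grouping is easiest to see on a toy pattern, unrelated to the XML case. In this sketch the atomic group refuses to give back any of the "a" characters it has already consumed, so the overall match fails where the ordinary group succeeds by backtracking.

```perl
use strict;
use warnings;

my $str = 'aaab';

# Ordinary group: a+ first takes "aaa", then backs off to "aa" so "ab" can match.
my $plain  = $str =~ /(?:a+)ab/ ? 'match' : 'no match';

# Atomic group: a+ takes "aaa" and keeps it, so the following "ab" cannot match.
my $atomic = $str =~ /(?>a+)ab/ ? 'match' : 'no match';

print "non-capturing group: $plain\n";
print "atomic group: $atomic\n";
```

In the version regex the group is followed by mandatory whitespace, which a quoted value can never supply by giving characters back, so making it atomic costs nothing there.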

Re^2: Extract version attribute value from xml header line
by just dave (Acolyte) on Apr 24, 2006 at 04:22 UTC
    Thanks a lot Bart, this is what I needed !

    You've helped me a lot,

    Dave