gcmandrake has asked for the wisdom of the Perl Monks concerning the following question:

Oh, great & powerful Perl monks, please allow this unworthy physicist to ask this question:

I have the task of refactoring some code which determines whether a process step is to be re-run or not. The check I am concerned with compares whether a script or package is of a newer version than the one that the step was previously built with. So, I need to extract a version string from a UN*X path. Of course, there are multiple version types, and a new change in the configuration management system complicates things. Incompatible types cause a step to be re-run (and, hopefully, pick up the new version path).

Here are some examples of the paths that I will need to deal with:

'/tool/a/r/V2/V2DepCheck/1.109.2.1/V2DepCheck.pm'
'/tool/a/r/p4/r/main/V2/V2DepCheck/169441/V2DepCheck.pm'
'/tool/a/r/p4/r/branches/bd32b/V2/V2DepCheck/175507/V2DepCheck.pm'
'/home/me/cvs/V2/V2DepCheck.pm'
'/tool/a/r/boost/1.36.0'
'/tool/a/r/cadence/itk/itkvd/v007'

The object is to extract the versions as strings, respectively:

1.109.2.1
169441
175507
undef
1.36.0
007

I've come up with a regex which works, but is hardly elegant. To extract a version, I use the following:

$path =~ m#/tool/a/r/(?:p4/r/main/|p4/r/branches/.+/)?.*/\D?([0-9.]+)\b#

Which works at least with the samples that I've fed it. If anyone has any suggestions for this poor soul, it would be most appreciated.
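A small harness can make "works with the samples" repeatable as new path types appear. The following is only a sketch: `extract_version` is a hypothetical wrapper name, the regex inside it is the one quoted above, and the test cases are the six sample paths with the versions listed as desired.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample paths from the question, each paired with the version it should yield.
my @cases = (
    [ '/tool/a/r/V2/V2DepCheck/1.109.2.1/V2DepCheck.pm',                  '1.109.2.1' ],
    [ '/tool/a/r/p4/r/main/V2/V2DepCheck/169441/V2DepCheck.pm',           '169441'    ],
    [ '/tool/a/r/p4/r/branches/bd32b/V2/V2DepCheck/175507/V2DepCheck.pm', '175507'    ],
    [ '/home/me/cvs/V2/V2DepCheck.pm',                                    undef       ],
    [ '/tool/a/r/boost/1.36.0',                                           '1.36.0'    ],
    [ '/tool/a/r/cadence/itk/itkvd/v007',                                 '007'       ],
);

# Hypothetical wrapper: returns the captured version, or undef on no match.
sub extract_version {
    my ($path) = @_;
    return $path =~ m#/tool/a/r/(?:p4/r/main/|p4/r/branches/.+/)?.*/\D?([0-9.]+)\b# ? $1 : undef;
}

# Compare each result with the expectation, treating undef as the string 'undef'.
for my $case (@cases) {
    my ( $path, $want ) = @$case;
    my $got      = extract_version($path);
    my $got_str  = defined $got  ? $got  : 'undef';
    my $want_str = defined $want ? $want : 'undef';
    printf "%-4s %s => %s\n", ( $got_str eq $want_str ? 'ok' : 'FAIL' ), $path, $got_str;
}
```

Adding a row to `@cases` whenever a new path convention shows up keeps the backward-compatibility requirement checkable.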

Your humble servant,
gcmandrake
physicist

Replies are listed 'Best First'.
Re: Using a regex to extract a version from a Un*x path
by ikegami (Patriarch) on Mar 17, 2010 at 20:45 UTC

    Are you asking for a generic solution? There isn't one. For example, you specified that "V2" is not a version when others (example) would consider it a version.

    Worry more about the working aspect (at which you succeed) rather than the elegance aspect (which isn't really attainable here).

      Sadly, you are correct, no generic solution is possible. I'm attempting to work in a flawed system (in which no one can agree on a standard) and I must be backward compatible.

      I probably should have phrased my request differently, to me elegant also means robust.

Re: Using a regex to extract a version from a Un*x path
by Marshall (Canon) on Mar 17, 2010 at 23:20 UTC
    I think you are going to have to decide just "how good" the regex really needs to be, this V2 kind of stuff could be tricky. It could be that something rather simple solves your problem or not...

    One trick is to anchor at the end of the string with $ so that you can work "backwards". Below, I just capture the last string consisting of digits and "." characters. I require a minimum of 2 characters, and I allow an optional "v" in front. You can make this case insensitive by adding the /i switch to the regex. v2 won't match because "2" is just one character, but if you had, say, V24, that would match. You have to decide whether this is "good enough" or not.

    I don't know how big your project is (how many people involved), but sometimes agreeing to use something easy to parse like: hey folks for version in path name use: verxxx, is a good way to go.

    Update: if this "tool" part of path is a key differentiator between "good paths" and "bad paths" add that like my commented out regex below.

    Of course, the easy answer is that if you are satisfied with your regex and it does what you want, just leave it alone! Perl's regex engine is so fast that I seriously doubt any slight imperfection will be noticeable in terms of performance. Oops, a goof in my first version: I saw that I was matching the 32 in "bd32b" on that line, so I added a "/" qualifier for the match. Still not "perfect", but I think the question here is whether it is "good enough" or not.

    I guess another update, brain isn't working great today... I got frustrated with the regex complications needed to ensure that only the last matching string on the line was matched. One easy way to deal with this is a Perl list slice. You can just match them all and then take the "last one" (below), with or without the "/" required in front. A list slice is a good tool for your toolbox, as are shortcuts like \d for digits.

    (my $version) = (m|(v?[\d\.]{2,})|g)[-1];
    (my $version) = (m|/(v?[\d\.]{2,})|g)[-1];
    #!/usr/bin/perl -w
    use strict;

    # qw() already quotes its words, so the paths carry no quote characters
    my @paths = qw(
        /tool/a/r/V2/V2DepCheck/1109.2.1/V2DepCheck.pm
        /tool/a/r/p4/r/main/V2/V2DepCheck/169441/V2DepCheck.pm
        /tool/a/r/p4/r/branches/bd32b/V2/V2DepCheck/175507/V2DepCheck.pm
        /home/me/cvs/V2/V2DepCheck.pm
        /tool/a/r/boost/1.36.0
        /tool/a/r/cadence/itk/itkvd/v007
    );

    foreach (@paths) {
        chomp;    # not needed here, use if reading from file
        #(my $version) = (m|tool/a/r.*?/(v?[\d.]{2,}).*$|i);
        # capture the first "/"-prefixed run of 2+ digits or dots (optional "v")
        (my $version) = (m|/(v?[\d\.]{2,}).*?$|);
        print defined($version) ? "$version\n" : "undefined\n";
    }
    __END__
    prints:
    1109.2.1
    169441
    175507
    undefined
    1.36.0
    v007
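The list-slice variant from the last update can be wrapped in a function so that it is easy to try on its own. This is a minimal sketch: `last_version` is a hypothetical name, and the pattern is the "/"-prefixed one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every "/"-prefixed candidate with m//g in list
# context, then keep only the last one via a list slice.
sub last_version {
    my ($path) = @_;
    my ($version) = ( $path =~ m|/(v?[\d.]{2,})|g )[-1];
    return $version;    # undef when no candidate was found
}

for my $path (
    '/tool/a/r/p4/r/branches/bd32b/V2/V2DepCheck/175507/V2DepCheck.pm',
    '/home/me/cvs/V2/V2DepCheck.pm',
    '/tool/a/r/cadence/itk/itkvd/v007',
    )
{
    my $v = last_version($path);
    print defined $v ? "$v\n" : "undefined\n";
}
```

Because the slice of an empty list is itself empty, a path with no candidate leaves `$version` undefined without any extra check. Note that the leading "v" is part of the capture here, as in the output above.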
Re: Using a regex to extract a version from a Un*x path
by JavaFan (Canon) on Mar 18, 2010 at 13:46 UTC
    Split on "/". Process the chunks backwards: for each chunk, see if it's parsed by version.pm. If it is, this is the version, otherwise, try the next chunk. If no chunk is parseable, you cannot determine the version.
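One possible reading of this suggestion, as a sketch only: `version_from_path` is a hypothetical name, and "parsed by version.pm" is interpreted here as passing `version::is_lax`, which is an assumption about what was meant.

```perl
use strict;
use warnings;
use version;    # core module; supplies version::is_lax()

# Hypothetical helper: walk the path components right to left and return the
# first one that version.pm's lax rules accept as a version string.
sub version_from_path {
    my ($path) = @_;
    for my $chunk ( reverse split m{/}, $path ) {
        return $chunk if version::is_lax($chunk);
    }
    return;    # no component looks like a version
}

for my $path (
    '/tool/a/r/V2/V2DepCheck/1.109.2.1/V2DepCheck.pm',
    '/home/me/cvs/V2/V2DepCheck.pm',
    '/tool/a/r/boost/1.36.0',
    )
{
    my $v = version_from_path($path);
    print defined $v ? "$v\n" : "undef\n";
}
```

Two caveats with this reading: a component such as "v007" is returned with its leading "v", and the lax rules also accept the literal string "undef", so a directory of that name would be mistaken for a version.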

      I hadn't thought it through trying to split on '/'. Interesting. I'll have to see where that goes. Thanks.

      Such an amazing bunch of replies. Thanks so much, all answers are very much appreciated.

        I, too, very much like JavaFan's approach of Re: Using a regex to extract a version from a Un*x path. However, since I've already composed this reply, you might as well see it.

        To avoid the confusion introduced by the presence of  'V2' in some paths, I depend on the presence of the magical  'V2DepCheck' sub-string. (JavaFan neatly avoids this issue by parsing right-to-left, but the regex approach I use must parse left-to-right.) Many more regexes are defined than in other approaches, but I find that it sometimes pays to be painfully explicit when the problem set is ill-defined and mutable, and maintenance may be an issue.

        Code:

        Output:

Re: Using a regex to extract a version from a Un*x path
by elTriberium (Friar) on Mar 17, 2010 at 22:51 UTC
    Given your input I would write the regex as follows:

    Match everything that:

    • Is separated on the left side by the "/" symbol
    • Is separated on the right side by the "/" symbol or end of string ($)
    • Contains only numbers and dots (".")

    It should be pretty straightforward to write such a regex and less complex than what you originally had.

    Edit: OK, sorry, my example wouldn't match this as a version string: "v007". In this case it gets a bit more complicated.
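The three rules above, plus an optional leading "v" to cover the "v007" case, could be written roughly as follows. This is a sketch under those assumptions; `slash_delimited_version` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the rules listed above:
#   "/" on the left, "/" or end of string on the right,
#   only digits and dots in between (an optional "v" is skipped, not captured).
sub slash_delimited_version {
    my ($path) = @_;
    return $path =~ m{ / v? ( [\d.]+ ) (?= / | \z ) }x ? $1 : undef;
}

for my $path (
    '/tool/a/r/p4/r/main/V2/V2DepCheck/169441/V2DepCheck.pm',
    '/tool/a/r/cadence/itk/itkvd/v007',
    '/home/me/cvs/V2/V2DepCheck.pm',
    )
{
    my $v = slash_delimited_version($path);
    print defined $v ? "$v\n" : "undef\n";
}
```

The lookahead keeps the right-hand "/" out of the match. The character class is deliberately loose, so a component made only of dots, such as "..", would also be accepted.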

Re: Using a regex to extract a version from a Un*x path
by Anonymous Monk on Mar 18, 2010 at 06:32 UTC
    I've come up with a regex which works, but is hardly elegant.

    Don't use a regex, use a function.

    my $version = get_version( $path );

    sub get_version {
        # complicated logic here
        my($p) = @_;
        my $v;
        $v = get_version_home_csv($p);
        return $v if $v;
        $v = get_version_custom1($p);
        return $v if $v;
        ...
    }
      I am curious as to what get_version_home_csv() and get_version_custom1() do? i.e. show some code.
        Which part of "complicated logic" do you not understand? :) Every time a new version type is added, or there is a new change in the configuration management system, you add a function tailored specifically to that type of path. A single regular expression is the wrong way to deal with this; you need a factory pattern.

      Good idea. I originally organized the different regexes into separate functions, but I later felt that having them in one place would make support easier. I think it depends on how many corner cases I need to support.
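One way to keep per-type functions while still having them "in one place" is an ordered list of small extractors. The sketch below is illustrative only: the three extractors and their patterns are hypothetical, modelled on the sample paths in the question, and are not the contents of `get_version_home_csv` or `get_version_custom1`, which were never shown.

```perl
use strict;
use warnings;

# Each extractor handles one path convention and returns a version or undef.
# The patterns are assumptions derived from the sample paths only.
my @extractors = (
    # p4 tree: numeric directory directly above the file
    sub { $_[0] =~ m{^/tool/a/r/p4/r/(?:main|branches/[^/]+)/.+/(\d+)/[^/]+\z} ? $1 : undef },
    # dotted version directory, optionally followed by a file name
    sub { $_[0] =~ m{^/tool/a/r/.+/(\d+(?:\.\d+)+)(?:/[^/]+)?\z} ? $1 : undef },
    # trailing "vNNN" directory
    sub { $_[0] =~ m{/v(\d+)\z} ? $1 : undef },
);

# Try each extractor in order; the first defined result wins.
sub get_version {
    my ($path) = @_;
    for my $extract (@extractors) {
        my $v = $extract->($path);
        return $v if defined $v;
    }
    return;    # unknown path type
}

my $v = get_version('/tool/a/r/boost/1.36.0');
print defined $v ? "$v\n" : "undef\n";
```

Supporting a new version type then means appending one entry to `@extractors`. The loop tests `defined $v` rather than truth so that a version of "0" would not be skipped.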

Re: Using a regex to extract a version from a Un*x path
by se@n (Initiate) on Mar 21, 2010 at 01:06 UTC

    First, you need an accurate definition from the UNIX documentation. If you look at only 5 examples and extrapolate, you might make a bad assumption. You probably do not need the regex to match the whole path. I'm seeing numeric values (\d+) after a \/, optionally preceded by a v (v?). No other item in the path conforms to these rules, so you don't have to worry about anything else. Try:

    if($path =~ /\/(v?\d+)/) { $version = $1 } else { $version = undef }

      A specification is an evil word around here (as is a plan). My main problem is that there is a constant churn in the types of versions. Users can (and do) create their own types of versions. I thought that I had it captured with the regex described earlier, but now I'm leaning back toward factory methods. So it goes. Thanks for the suggestion.