PerlMonks  

How to remove everything after last occurrence of a string?

by ovedpo15 (Pilgrim)
on Jun 06, 2022 at 15:40 UTC ( [id://11144451]=perlquestion )

ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I have a problem with the following code:
    my ($path,$version);
    $path = "/a/b/version1/c/d";
    $version = "version1";
    $path =~ s/$version.*/$version/s;
    print($path."\n");
    $path = "/some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/a/b/c";
    $version = "3.5.2+tcl-tk-8.5.18+sqlite-3.10.0";
    $path =~ s/$version.*/$version/s;
    print($path."\n");
Basically I have a path and I need to remove everything after the version.
In the first case the version is version1, so it returns /a/b/version1.
But in the second case, the version is 3.5.2+tcl-tk-8.5.18+sqlite-3.10.0 and it still returns /some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/a/b/c. I do understand why it happens: there are special characters in $version and the regex engine treats them as metacharacters.
After some research, I came across quotemeta. But it escapes the path:
    my ($path,$version);
    $path = "/a/b/version1/c/d";
    $version = quotemeta("version1");
    $path =~ s/$version.*/$version/s;
    print($path."\n");
    $path = "/some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/a/b/c";
    $version = quotemeta("3.5.2+tcl-tk-8.5.18+sqlite-3.10.0");
    $path =~ s/$version.*/$version/s;
    print($path."\n");
Result:
    /a/b/version1
    /some/path/3\.5\.2\+tcl\-tk\-8\.5\.18\+sqlite\-3\.10\.0
How can this be achieved without escaping the special chars? How do I remove everything after the most nested $version?

Replies are listed 'Best First'.
Re: How to remove everything after last occurrence of a string?
by Fletch (Bishop) on Jun 06, 2022 at 16:22 UTC

    Something about this raises the "X/Y problem" hackles and makes me want to mumble something about using File::Spec and splitdir to break things up into a list of directory components instead (or Path::Tiny and doing . . . something). Or it could just be Monday and I'm making arbitrary connections that aren't there.

    That aside the previous quotemeta answer(s) obviously fixes the X issue.
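For what it's worth, the File::Spec idea could look roughly like the sketch below. It assumes Unix-style paths (hence File::Spec::Unix) and that the version is always a whole directory component, which is a stricter condition than the substring match in the original question. The helper name truncate_at_component is made up for illustration.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# Hypothetical helper: keep the path up to and including the last
# directory component that is exactly equal to $version.
sub truncate_at_component {
    my ($path, $version) = @_;
    my @dirs = File::Spec::Unix->splitdir($path);

    # Scan from the end so the deepest matching component wins.
    my ($last) = grep { $dirs[$_] eq $version } reverse 0 .. $#dirs;

    # No component matched: hand the path back unchanged.
    return $path unless defined $last;

    return File::Spec::Unix->catdir(@dirs[0 .. $last]);
}

print truncate_at_component('/a/b/version1/c/d', 'version1'), "\n";
```

Because the comparison is a plain eq, nothing in $version needs escaping. Note that catdir also canonicalises the result (for example, it collapses doubled slashes), which may or may not be wanted.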

    The cake is a lie.

      Something about this raises the "X/Y problem" hackles and makes me want to mumble something about using File::Spec and splitdir to break things up into a list of directory components instead (or Path::Tiny and doing . . . something).

      definitely x/y, look at the post history

Re: How to remove everything after last occurrence of a string?
by kcott (Archbishop) on Jun 07, 2022 at 10:43 UTC

    G'day ovedpo15,

    I think your biggest problem here is that you've immediately reached for a regex solution. When I first saw the question title, before even looking at the content, my first thought was rindex and substr.

    Perl's string handling functions (and operators) are, in my experience, substantially faster than achieving the same functionality with regexes. There are times when a regex is appropriate; however, often it's not the best solution.

    So, rather than making guesses and assumptions, let's Benchmark.

        #!/usr/bin/env perl

        use 5.010;
        use strict;
        use warnings;

        use constant {
            PATH => 0,
            VERSION => 1,
            WANT => 2,
        };

        use Benchmark 'cmpthese';
        use Test::More;

        my @tests = (
            [qw{
                /a/b/version1/c/d
                version1
                /a/b/version1
            }],
            [qw{
                /some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/a/b/c
                3.5.2+tcl-tk-8.5.18+sqlite-3.10.0
                /some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0
            }],
        );

        plan tests => 2*@tests;

        for my $test (@tests) {
            is _regex(@$test[PATH, VERSION]), $test->[WANT];
            is _rindex(@$test[PATH, VERSION]), $test->[WANT];
        }

        cmpthese 0 => {
            regex0 => sub { _regex(@{$tests[0]}[PATH, VERSION]); },
            regex1 => sub { _regex(@{$tests[1]}[PATH, VERSION]); },
            rindex0 => sub { _rindex(@{$tests[0]}[PATH, VERSION]); },
            rindex1 => sub { _rindex(@{$tests[1]}[PATH, VERSION]); },
        };

        sub _regex {
            my ($path, $version) = @_;
            $path =~ s/.*\Q$version\E\K.*//s;
            return $path;
        }

        sub _rindex {
            my ($path, $version) = @_;
            $path = substr $path, 0, length($version) + rindex $path, $version;
            return $path;
        }

    The Test::More code is just to ensure the functions are producing correct results; which they are. The output from that is identical in all runs, so I'll just post it once:

        1..4
        ok 1
        ok 2
        ok 3
        ok 4

    I ran the benchmark five times. Here are the sections of output that relate to that:

                     Rate  regex1  regex0 rindex0 rindex1
        regex1   809214/s      --     -7%    -69%    -75%
        regex0   866627/s      7%      --    -67%    -73%
        rindex0 2632175/s    225%    204%      --    -19%
        rindex1 3257343/s    303%    276%     24%      --

                     Rate  regex1  regex0 rindex0 rindex1
        regex1   825777/s      --     -5%    -68%    -75%
        regex0   870952/s      5%      --    -66%    -73%
        rindex0 2579956/s    212%    196%      --    -21%
        rindex1 3261289/s    295%    274%     26%      --

                     Rate  regex1  regex0 rindex0 rindex1
        regex1   807841/s      --     -8%    -69%    -75%
        regex0   880657/s      9%      --    -66%    -73%
        rindex0 2625422/s    225%    198%      --    -20%
        rindex1 3265104/s    304%    271%     24%      --

                     Rate  regex1  regex0 rindex0 rindex1
        regex1   807626/s      --     -7%    -69%    -75%
        regex0   873101/s      8%      --    -66%    -73%
        rindex0 2567429/s    218%    194%      --    -21%
        rindex1 3255180/s    303%    273%     27%      --

                     Rate  regex1  regex0 rindex0 rindex1
        regex1   827447/s      --     -6%    -68%    -75%
        regex0   877972/s      6%      --    -66%    -73%
        rindex0 2579110/s    212%    194%      --    -21%
        rindex1 3260240/s    294%    271%     26%      --

    Do you still want to go with a regex solution? :-)

    — Ken

      Do you still want to go with a regex solution? :-)

      No, because the rindex approach is likely to be more maintainable than the regex solution.

      While execution time can sometimes be critical, mostly it doesn't matter at all. It is generally much more important for code to be correct and maintainable than fast. If fast is a side effect of correct and maintainable code (often it is) then so much the better, but fast comes way down the list during the first stages of designing a coding solution.

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

        G'day GrandFather,

        ++ Thanks for your input.

        I did possibly end up (implicitly) suggesting that speed was the all-important factor.

        I have seen innumerable cases where regexes have been used to test exact matches (/^some_string$/), test if strings start with some token (/^some_token/), and so on. As I said originally, "Perl's string handling functions (and operators) are, in my experience, substantially faster than achieving the same functionality with regexes."; as such, reaching for a regex first has become something of an annoyance for me.
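To make that concrete, here is a minimal sketch of the string-operator equivalents of those two regex idioms. The helper names is_exactly and starts_with are made up for illustration. One caveat: /^some_string$/ also matches "some_string\n" because $ can match before a trailing newline, so eq is slightly stricter than the regex it replaces.

```perl
use strict;
use warnings;

# Exact match: plain string equality instead of /^some_string$/.
sub is_exactly {
    my ($str, $want) = @_;
    return $str eq $want;
}

# Prefix test: compare the leading characters instead of /^some_token/.
sub starts_with {
    my ($str, $token) = @_;
    return substr($str, 0, length $token) eq $token;
}

print "exact\n"  if is_exactly('some_string', 'some_string');
print "prefix\n" if starts_with('some_token and more', 'some_token');
```

Neither helper needs quotemeta, since the arguments are never interpreted as patterns.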

        The OP had asked "How can this be achieved without escaping the special chars?" and I rather thought that was implicit in my "rindex" code. Perhaps I should have highlighted that.

        I also answered the OP's title question, "How to remove everything after last occurrence of a string?". Again, I didn't highlight that.

        I hadn't really considered the maintainability aspect but, I agree, the "rindex" code is easily understandable and works in all versions of Perl5; that's not to say that I shy away from regexes (see "Syntax-highlight Non-Perl Code for HTML"). Furthermore, if those maintaining the code are expected to have a solid grounding in regexes, then I'd say that neither solution is particularly complex and both are equally maintainable (a YMMV situation).

        The OP may have a very specific reason for choosing a regex solution; however, if not, why not choose an alternative that's three times faster?

        — Ken

        Do you still want to go with a regex solution? :-)

        Yes (originally "No"), because (in this case at least) the rindex solution is wrong.

        How can one say any result is incorrect if the OPer has specified no clear set of requirements for results? I admit this is tricky, but one can say the use of s/// in the OPed example code implies that the string operand should be unchanged if no version substring match is found. _rindex() in the code here fails to do this.

        It's easy enough to define an rindex-based function that handles the no-match case (and it might even be a bit faster). But the argument seems to be that one should avoid using and learning about regexes because they are a bit arcane (indeed, regexes are the most counter-intuitive programming construct I know) and may vary a bit from language to language. This argument can be extended to languages themselves: We should not use Perl because it's not Python; not use Python because it's not C++; not use C++ because it's not...
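A minimal sketch of such a function, assuming the desired no-match behaviour is "return the path unchanged" (as s/// would do). The name cut_after_last is made up for illustration. In the benchmarked _rindex(), a failed rindex returns -1, so the substr keeps at most the first length($version) - 1 characters of the path.

```perl
use strict;
use warnings;

# Hypothetical helper: remove everything after the last occurrence of
# $version; leave $path alone if $version does not occur at all.
sub cut_after_last {
    my ($path, $version) = @_;
    my $pos = rindex($path, $version);

    # rindex returns -1 when there is no match.
    return $path if $pos < 0;

    return substr($path, 0, $pos + length($version));
}

print cut_after_last('/a/b/version1/c/d', 'version1'), "\n";
print cut_after_last('/no/match/in/this/path', 'version1'), "\n";
```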

        To answer a use-case such as that described in the OP, I tend to reach first for a regex solution because it most clearly represents and achieves the required operation, not because it is the fastest (although sometimes it is). Implementing the required operation in terms of index/rindex, substr, etc., is possible, but may have its own pitfalls and drawbacks in terms of basic correctness, readability and maintainability.

        These are all my own very personal preferences; others may differ.

        Update: Rats... Trashed the thrust of the entire post by getting the very first word wrong. Oh, well...


        Give a man a fish:  <%-{-{-{-<

Re: How to remove everything after last occurrence of a string?
by hippo (Bishop) on Jun 06, 2022 at 15:50 UTC
Re: How to remove everything after last occurrence of a string?
by tybalt89 (Monsignor) on Jun 06, 2022 at 16:11 UTC
        #!/usr/bin/perl

        use strict; # https://perlmonks.org/?node_id=11144451
        use warnings;

        my ($path,$version);
        $path = "/a/b/version1/c/d";
        $version = "version1";
        $path =~ s/.*\Q$version\E\K.*//s;
        print($path."\n");
        $path = "/some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/a/b/c";
        $version = "3.5.2+tcl-tk-8.5.18+sqlite-3.10.0";
        $path =~ s/.*\Q$version\E\K.*//s;
        print($path."\n");

    Outputs:

        /a/b/version1
        /some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0
Re: How to remove everything after last occurrence of a string?
by hv (Prior) on Jun 06, 2022 at 16:48 UTC

    There are many possible approaches to this. One is to quotemeta() only inside the pattern:

        $version = "3.5.2+tcl-tk-8.5.18+sqlite-3.10.0";
        $path =~ s/\Q$version\E.*/$version/s;

    Another is to replace with what the pattern actually matched:

        $version = quotemeta("3.5.2+tcl-tk-8.5.18+sqlite-3.10.0");
        $path =~ s/($version).*/$1/s;

    Another is to use index() to search for a fixed string, rather than a regexp:

        $version = "3.5.2+tcl-tk-8.5.18+sqlite-3.10.0";
        my $pos = index($path, $version);
        # cut everything beyond the match, if there was a match
        substr($path, $pos + length($version)) = '' if $pos >= 0;

    My guess is that tybalt89's solution with \K will be the fastest, but not necessarily the easiest to understand and maintain.

      Your index() solution does not do exactly what the regex solution does. That may or may not make a difference in this application, but I think worth pointing out. The regex is "greedy" and will wind up matching the last occurrence of $version. Your index() code will find only the first occurrence of $version.

      Try tybalt's code with:

          $path = "/some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/a/b/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/c";
          # and you will see that only the last "/c" is deleted.
          /some/path/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0/a/b/3.5.2+tcl-tk-8.5.18+sqlite-3.10.0

      As far as maintainability and understandability go, I would not use the \K and would prefer the more common way. Instead of:

          $path =~ s/.*\Q$version\E\K.*//s;

      I probably would have coded:

          $path =~ s/(.*\Q$version\E).*/$1/s;

      I think \K is specific to Perl or at least I know that it does not exist in some other regex dialects that I use.

      From the use case presented, the speed of execution doesn't matter at all. I would opt for simplicity and avoid uncommon things like \K. I could code a faster, better version of your index() approach in ASM and it would run like a "super rocket" but to absolutely no effect whatsoever upon total program execution time. And I think this could miss use cases involving wide characters, which the regex will handle as part of Perl (the one-byte-per-character assumption, although extremely useful for many things, does have some limitations).
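On the wide-character point: as far as I know, index, rindex, substr and length all count characters rather than bytes on Perl strings (unless "use bytes" is in effect), so the fixed-string approach should cope with them as well. A small check under that assumption, with a made-up path and a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: cut everything after the last occurrence of $version.
sub cut_after_last_occurrence {
    my ($path, $version) = @_;
    my $pos = rindex($path, $version);
    return $path if $pos < 0;    # no match: leave the path alone
    return substr($path, 0, $pos + length($version));
}

# Made-up path containing characters above U+00FF, written with escapes
# so that the source file encoding does not matter.
my $path = "/p\x{e4}th/\x{4e2d}\x{6587}/v1/\x{263a}/x";
my $cut  = cut_after_last_occurrence($path, 'v1');

# Prints the number of characters kept, not the number of bytes.
print length($cut), "\n";
```

The position returned by rindex and the offset taken by substr are in the same units, so the cut lands right after the version either way.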

      I don't know why the /s regex modifier is used and the rationale behind that could be a bit obscure? Normally "." matches anything except \n. /s allows "." to include the "\n". I would not expect to see an \n in a path name. I'm not sure that this makes any difference at all, but again, some of these small things can matter depending upon the circumstances.
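To see what /s changes in practice, here is a small sketch with a contrived path that contains a newline. The helper name and its third argument are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the capture-group substitution with or
# without the /s modifier, depending on the third argument.
sub cut_after_last_regex {
    my ($path, $version, $dot_matches_newline) = @_;
    if ($dot_matches_newline) {
        $path =~ s/(.*\Q$version\E).*/$1/s;    # "." also matches "\n"
    }
    else {
        $path =~ s/(.*\Q$version\E).*/$1/;     # "." stops at "\n"
    }
    return $path;
}

my $odd_path = "/a/v1/b\nc";
my $without_s = cut_after_last_regex($odd_path, 'v1', 0);
my $with_s    = cut_after_last_regex($odd_path, 'v1', 1);
```

Without /s the trailing .* stops at the newline, so $without_s is "/a/v1\nc"; with /s everything after the version is removed and $with_s is "/a/v1". For paths that never contain a newline the two behave the same.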

        Your index() solution does not do exactly what the regex solution does.

        This shortcoming can be addressed by the use of rindex:

            Win8 Strawberry 5.8.9.5 (32) Mon 06/06/2022 20:19:26
            C:\@Work\Perl\monks
            >perl
            use strict;
            use warnings;

            my $path = "/some/3.5.2+tcl/path/3.5.2+tcl/a/b/c";
            print "\$path before: '$path' \n";

            my $version = "3.5.2+tcl";
            my $pos = rindex($path, $version);
            # cut everything beyond the match, if there was a match
            substr($path, $pos + length($version)) = '' if $pos >= 0;
            print "\$path after: '$path' \n";
            ^Z
            $path before: '/some/3.5.2+tcl/path/3.5.2+tcl/a/b/c'
            $path after: '/some/3.5.2+tcl/path/3.5.2+tcl'

        Personally, I have no objection to \K on the grounds of understandability, readability or maintainability; indeed, it seems desirable on these grounds. One must be aware, however, that it was only introduced with Perl version 5.10. That release dates from 2007, so it is well over a decade old now, but one still occasionally sees situations in which \K is not available. AFAIK, your regex approach and the rindex approach will work with any version of Perl 5.


        Give a man a fish:  <%-{-{-{-<

        Your index() solution does not do exactly what the regex solution does.

        Thanks for bringing that up - I had intended to comment on it, but forgot. I think only tybalt89's regexp solution matches the last occurrence, due to its inclusion of a leading /.*/ in the pattern. The pattern in the original post and in my two regexp-based solutions will match the first occurrence, same as the index() solution.
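A quick side-by-side of the two behaviours, using the path from the rindex example above. The helper names are made up for illustration; the second one uses the capture-group form rather than \K so that it also runs on older perls.

```perl
use strict;
use warnings;

# No leading .* : the match starts at the first occurrence of $version.
sub cut_after_first {
    my ($path, $version) = @_;
    $path =~ s/\Q$version\E.*/$version/s;
    return $path;
}

# Leading greedy .* : the match backs off to the last occurrence.
sub cut_after_last_match {
    my ($path, $version) = @_;
    $path =~ s/(.*\Q$version\E).*/$1/s;
    return $path;
}

my $path    = "/some/3.5.2+tcl/path/3.5.2+tcl/a/b/c";
my $version = "3.5.2+tcl";

print cut_after_first($path, $version), "\n";
print cut_after_last_match($path, $version), "\n";
```

The first call yields /some/3.5.2+tcl and the second /some/3.5.2+tcl/path/3.5.2+tcl, so which pattern is right depends on whether the first or the last occurrence is the one that matters.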

Node Type: perlquestion [id://11144451]
Approved by marto
Front-paged by Corion