TheMartianGeek has asked for the wisdom of the Perl Monks concerning the following question:

Is there a function similar to substr() except that instead of taking a specific string length, it takes the number of characters until an instance of a certain other character? Something like $var = substr($fullstring, 2, (however many characters are between the character at offset 2 and the next space character)) is what I'm looking for. (Of course, it might not necessarily be a space character; it could be a newline, tab, letter "A", etc.) The only ways I've found to do that involved several steps and were kind of annoying to write and use.

Re: Substring consisting of all the characters until "character X"?
by ELISHEVA (Prior) on Mar 07, 2011 at 09:00 UTC

    My guess is that there isn't such a subroutine ready-made in Perl because that kind of fancy substring extraction is usually handled by a regular expression. Regular expressions are more capable of handling the huge variety in methods for ending a string: a single character or end of string, a set of terminal characters (space or X or Z, whichever comes first), first occurrence of a single character or a maximum number of total characters, a terminal string rather than a single terminal character, and many, many more. Some examples:

    # from chr 2 to right before first space or to the end of $str
    # if no space is found
    #  - ^.{2} = skip past first two characters
    #  - \S = not whitespace, \s = whitespace
    #  - (\S*) captures zero or more non-whitespace characters
    #  - ($str =~ /^.{2}(\S*)/) is a list containing one string,
    #    i.e. ($1) where $1 = what was captured by (\S*)
    printf "substr(2, first ' ' or end): %s\n", ($str =~ /^.{2}(\S*)/);

    # from chr 2 to lesser of 5 characters or first space
    # \S = not whitespace, \s = whitespace
    printf "substr(2, first ' ' or 5 chars): %s\n"
        , ($str =~ /^.{2}(\S{0,5})/);

    # from chr 3 to first X or end of $str
    printf "substr(3, first 'X' or end): %s\n"
        , ($str =~ /^.{3}([^X]*)/);

    # from chr 3 to lesser of first X or 5 chars
    printf "substr(3, first 'X' or 5 chars): %s\n"
        , ($str =~ /^.{3}([^X]{0,5})/);

    # from chr 3 to first occurrence of two or more A's or to the end if
    # no doubled A's are found
    printf "substr(3,two or more A's or end): %s\n"
        , ($str =~ /^.{3}(.*?)(?:AA|$)/);

    # from chr 10 to lesser of 5 chars or first of run of 2 or more A's
    printf "substr(10,two or more A's or 5 chars): %s\n"
        , ($str =~ /^.{10}((?:[^A]|A(?!A)){0,5})/);

    # from chr 10 to lesser of 5 chars or first of run of 2 or more X's
    printf "substr(10,two or more X's or 5 chars): %s\n"
        , ($str =~ /^.{10}((?:[^X]|X(?!X)){0,5})/);

    # from chr 3 to first occurrence of two or more X's or to the end if
    # no doubled X's are found
    printf "substr(3,two or more X's or end): %s\n"
        , ($str =~ /^.{3}(.*?)(?:XX|$)/);

    # outputs
    substr(2, first ' ' or end): XCDEFDGHIXTAAGRAAAAAA
    substr(2, first ' ' or 5 chars): XCDEF
    substr(3, first 'X' or end): CDEFDGHI
    substr(3, first 'X' or 5 chars): CDEFD
    substr(3,two or more A's or end): CDEFDGHIXT
    substr(10,two or more A's or 5 chars): IXT
    substr(10,two or more X's or 5 chars): IXTAA
    substr(3,two or more X's or end): CDEFDGHIXTAAGRAAAAAA theEnd

    I grant you the syntax of those regular expressions above is somewhat arcane and cryptic. They aren't as obvious to the untrained eye as substr_chr($str,3,'A'). However, they give you much more flexibility to roll your own string endings with just a few keystrokes.
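
For anyone who prefers the call style of the hypothetical substr_chr($str,3,'A') mentioned above, one of these regexes can be wrapped in a small sub. This is only a sketch: substr_chr is not a Perl built-in, and its name, its argument order, the sample string, and the choice to return everything up to the end of the string when the terminator is missing are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl built-in): return the characters of $str
# starting at $offset, up to but not including the first occurrence of $stop.
# If $stop never occurs, return everything to the end of the string.
# Returns undef if $offset is past the end of the string.
sub substr_chr {
    my ($str, $offset, $stop) = @_;
    # \Q...\E quotes regex metacharacters in $stop; /s lets . match newlines
    my ($found) = $str =~ /\A.{$offset}(.*?)(?:\Q$stop\E|\z)/s;
    return $found;
}

# Sample string made up for this example
print substr_chr("ABXCDEFDGHIXTAAGRAAAAAA", 3, 'X'),  "\n";   # CDEFDGHI
print substr_chr("ABXCDEFDGHIXTAAGRAAAAAA", 3, 'AA'), "\n";   # CDEFDGHIXT
```

Because the terminator goes through \Q...\E, it can be a single character or a longer literal string, and characters such as "." or "*" are treated literally rather than as regex syntax.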

    Have you had a chance to study perlretut and perlre? If not, consider doing so. If you extract strings based on characters or other textual considerations on a regular basis, you will find regexes a very powerful tool in your toolkit.

    Update: fixed typos in output labels

Re: Substring consisting of all the characters until "character X"?
by Ratazong (Monsignor) on Mar 07, 2011 at 07:26 UTC

    my $str = "ABXCDEFDGXTGRAAAAAA";
    my $start = 5;
    my $char = "X";
    my $end = index($str, $char, $start);
    my $result = substr($str, $start, $end-$start);
    print $result, "\n";
    What is so complicated/annoying about the code above? If you don't like it, merge lines 4 and 5 into one line - or move them to a sub...
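
A minimal sketch of the "move them to a sub" option, assuming a helper name (substr_to_char) that is made up here. It also guards against index() returning -1 when the character is not found, which is an addition to the code in this post.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the index/substr pair.
# Returns the text from $start up to (not including) the next $char,
# or undef when $char does not occur at or after $start.
sub substr_to_char {
    my ($str, $start, $char) = @_;
    my $end = index($str, $char, $start);
    return if $end < 0;                    # index returns -1 on no match
    return substr($str, $start, $end - $start);
}

my $result = substr_to_char("ABXCDEFDGXTGRAAAAAA", 5, "X");
print defined $result ? $result : "(not found)", "\n";   # EFDG
```

Returning undef on "not found" is a design choice; returning the rest of the string would be just as reasonable, depending on what the caller expects.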

    HTH, Rata
Re: Substring consisting of all the characters until "character X"?
by bart (Canon) on Mar 07, 2011 at 11:49 UTC
    Wow. You're using Perl and you're still not aware of one of the traditional selling points of Perl: regular expressions.
    ($substring) = $fullstring =~ /^.{2}(.*?)X/s;
    That's everything after the second character, up to the first "X".

      I've been looking for something like this. I even understand most of it. But could you (or someone) please explain why ($substring) needs to be in brackets? I've looked in perlre, perlrequick and perlretut and can't see this construct, although it's always possible that it's there & I've missed it.

      Regards,

      John Davies

        That has nothing to do with regexes (and that's why you didn't find it in perlre) but everything to do with the context of the assignment.

        Perl has 2 main contexts: scalar context, and list context. A function may behave differently depending on the context. (Actually there are 3 contexts: void context is the third, but it's often treated as a special case of scalar context.) For example, an array returns the array items in list context, and the number of items in scalar context. Example:

        @array = ('a', 'b', 'c');
        $x = @array;          # scalar context => 3
        @y = @array;          # list context => ('a', 'b', 'c')
        ($v, $w) = @array;    # list context so $v => 'a', $w => 'b'

        When you put parentheses around the assignees on the left of the assignment, you get list context. The result is flattened to a list (individual items) and the items on the left get assigned the value of the item at their own position in the list. If there are too few items, the rest gets assigned undef; if there are too many, the remainder is ignored.

        And that is what's happening here: the regex is called in list context so it returns the captured items (the value for the patterns in parens) and from that list the value of $1 is assigned to $substring.
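
To make the difference concrete, here is a small sketch that runs the same match in both contexts. The sample string and the variable name $matched are made up for illustration.

```perl
use strict;
use warnings;

my $fullstring = "ABCDEFXGH";   # sample data, made up for illustration

# List context (parentheses on the left): the match returns its captures,
# so the value of $1 lands in $substring.
my ($substring) = $fullstring =~ /^.{2}(.*?)X/s;

# Scalar context (no parentheses): the match returns true or false.
my $matched = $fullstring =~ /^.{2}(.*?)X/s;

print "list context:   '$substring'\n";   # 'CDEF'
print "scalar context: '$matched'\n";     # '1'
```

Dropping the parentheses therefore stores the success flag in the variable instead of the captured text.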

        Official docs: "Context" in perldata — see also wantarray for making your own functions behave differently depending on context; and scalar to force scalar context on a function call.
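
A short sketch of wantarray and scalar in use. The sub name (words) and its behaviour are invented for this example; only wantarray, scalar, and split are real built-ins.

```perl
use strict;
use warnings;

# Hypothetical sub: returns the words in list context
# and the number of words in scalar context.
sub words {
    my ($text) = @_;
    my @words = split ' ', $text;
    return wantarray ? @words : scalar @words;
}

my @list   = words("until character X");   # list context
my $count  = words("until character X");   # scalar context
my $forced = scalar( words("a b") );       # scalar() forces scalar context

print "@list\n";     # until character X
print "$count\n";    # 3
print "$forced\n";   # 2
```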

        For regexes, the docs on context are in perlop (because the // is considered a kind of quotes, and quotes are treated as operators.)

        The parentheses provide list context, in which a pattern match returns the captured parts.  In scalar context, the return value is just true/false depending on whether the regex matched.

        See perlop.

Re: Substring consisting of all the characters until "character X"?
by jwkrahn (Abbot) on Mar 07, 2011 at 07:16 UTC
    however many characters are between the character at offset 2 and the next space character
    my ( $var ) = substr( $fullstring, 2 ) =~ /([^ ]+)/;
Re: Substring consisting of all the characters until "character X"?
by JavaFan (Canon) on Mar 07, 2011 at 10:13 UTC
    Something like:
    my $i = 2;
    substr($fullstring, $i, index($fullstring, ' ', $i) - $i);
    should do.

    Or use a regexp.

      You should first test that the return value from index is greater than $i or you may get unexpected results.
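
As a sketch of what those unexpected results look like: index() returns -1 when the character is not found, which makes the length argument negative, and substr() treats a negative length as "leave that many characters off the end". The sample string, the variable names, and the fallback to "rest of the string" are assumptions made for this example.

```perl
use strict;
use warnings;

my $fullstring = "ABCDEFGH";   # made-up sample with no space in it
my $i = 2;

my $pos = index($fullstring, ' ', $i);     # -1: no space found

# Unchecked: the length becomes -1 - 2 = -3, so substr silently drops
# the last three characters instead of reporting a problem.
my $unchecked = substr($fullstring, $i, $pos - $i);
print "unchecked: $unchecked\n";           # CDE

# Checked: fall back to the rest of the string when nothing is found.
my $checked = $pos < 0 ? substr($fullstring, $i)
                       : substr($fullstring, $i, $pos - $i);
print "checked:   $checked\n";             # CDEFGH
```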

      substr($fullstring, $i, index($fullstring, ' ', $i) - $i);

      I'll go with that (with the necessary check to make sure that the return value of index() is greater than $i). If I've learned one thing about regular expressions, it's that they start out easy enough to understand but get complicated and confusing very quickly.