Incognito has asked for the wisdom of the Perl Monks concerning the following question:

Intro

My JavaScript (and VB) friends have brought me a problem, and I am having trouble determining what the best regular expression for it would be...

They were discussing the need to parse through a string that contained sets of search parameters separated by spaces. Sounds easy enough... "Just split on the spaces in the string and put it into an array!" we'd say.

Of course, if this were the case, they'd be doing that right now. The problem is that to allow the user to have spaces in any of these search parameters, we must allow them to quote the words with the quote character (").

The original regular expressions we designed for this used a temporary array to hold the values of the quoted strings, but someone suggested just escaping every backslash (\) as (\\) within the quoted strings, and then replacing every space within those quoted strings with (\s)... This sounds much better...

The problem is I can't think of a good regex to look through the string and replace/escape only those spaces within quotes with (\s)... Once we do this, doing a split on spaces will give us the array with all desired atoms.

Sample String

SearchAtom1 "Quoted Search Atom2" SearchAtom3 "" SearchAtom5Since4WasEmpty "Search Atom 6"

Assumptions

  • The quoted strings cannot themselves contain quote characters (i.e., there will be an even number of quotes in the string).
  • The quoted strings could be empty, or contain just spaces.
  • Each atom is separated by one or more space characters.
  • The string would already have any leading and trailing spaces removed.
  • Any multiple spaces found within a quoted string would be untouched.
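
The first and fourth assumptions are easy to test before parsing. The following is only a sketch: `checkAssumptions` is a hypothetical helper that does not appear in the original code, and it checks nothing beyond the quote count and leading/trailing spaces.

```javascript
// Hypothetical helper (not from the original post): reports whether a
// string satisfies two of the assumptions listed above.
function checkAssumptions(strSource) {
  // Count the quote characters; match() returns null when there are none.
  var quoteCount = (strSource.match(/"/g) || []).length;
  return {
    // An even number of quotes means every quoted string is closed.
    balancedQuotes: quoteCount % 2 === 0,
    // True when the string neither starts nor ends with a space.
    trimmed: !/^ | $/.test(strSource)
  };
}

console.log(checkAssumptions('SearchAtom1 "Quoted Search Atom2" SearchAtom3'));
```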

    Old JavaScript Solution

    This was the JavaScript version we originally wrote
    <SCRIPT LANGUAGE="JavaScript">
    function ParseStringRegExp (strSource) {
      var intMatchCounter = 0;
      var aryStoredValues = new Array();
      var strUniqueID;
      var strWorking = strSource;
      var aryQSMatch;

      // Iterate through each quoted string and replace with UniqueID
      while (aryQSMatch = strWorking.match (/"[^"]*"/)) {
        strUniqueID = "__" + intMatchCounter + "__";               // Generate the UniqueID
        strWorking = strWorking.replace (/"[^"]*"/, strUniqueID);  // Replace the value with UniqueID
        aryStoredValues[intMatchCounter++] = aryQSMatch[0];        // Store removed value into array
      }

      // Split the modified string by spaces
      var aryOutput = strWorking.split (/\s+/);

      // Go through array and replace UniqueIDs with original values.
      for (i = 0; i < aryOutput.length; i++) {
        if (aryQSMatchResults = aryOutput[i].match(/__(\d+)__/)) {
          aryOutput[i] = aryStoredValues[aryQSMatchResults[1]];    // Do replacement here
        }
      }
      return (aryOutput);
    }

    var strSource = 'SearchAtom1 "Quoted Search Atom2" SearchAtom3 "" SearchAtom5Since4WasEmpty "Search Atom 6"';
    var aryOutput = ParseStringRegExp (strSource);
    alert (aryOutput);
    </SCRIPT>

    Perl Solution Wanted

    Basically, the (simple) Perl regular expression(s) for this problem are all we care about at this point, as they will eliminate the need for the temporary aryStoredValues and aryQSMatchResults. We will then take this Perl and convert it to JavaScript... This basically means writing a regex to escape all backslashes and spaces within each quoted string with an appropriate substitute... Here's the resulting function we wish to write in JavaScript, with the help of our Perl friends:
    <SCRIPT LANGUAGE="JavaScript">
    function ParseStringRegExp (strSource) {
      var intMatchCounter = 0;
      var aryStoredValues = new Array();
      var strUniqueID;
      var strWorking = strSource;
      var aryQSMatch;

      // Escape all backslashes and spaces.
      // ***
      // what's a good regex(es) we can write for this section?
      // ***

      // Split the modified string by spaces.
      var aryOutput = strWorking.split (/\s+/);

      // Go through array and replace UniqueIDs with original values.
      for (i = 0; i < aryOutput.length; i++) {
        // ***
        // Do the unescaping back to regular slashes and spaces for each
        // array value here for aryOutput[i].
        // ***
      }
      return (aryOutput);
    }

    var strSource = 'SearchAtom1 "Quoted Search Atom2" SearchAtom3 "" SearchAtom5Since4WasEmpty "Search Atom 6"';
    var aryOutput = ParseStringRegExp (strSource);
    alert (aryOutput);
    </SCRIPT>
    I'm sure that some of you can think of a good regex for this, but we have to keep in mind that these regexes need to be converted to the 'lesser' languages of JavaScript (and also Visual Basic - PUKE - the VB implementation of regular expressions is SO LAME), so it must be simple, rather than pretty and obfuscated :)
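
For reference, the escape-then-split idea described in the question could be sketched roughly as follows. This is one possible reading, not a solution posted in the thread: `parseByEscaping` is a hypothetical name, the sketch relies on `replace` accepting a callback function, and it splits on the plain space character only (rather than `\s+`) so that it matches the stated assumption that atoms are separated by spaces.

```javascript
// Hypothetical sketch of the escape-then-split approach.
function parseByEscaping(strSource) {
  // Step 1: inside each quoted string, escape backslashes first,
  // then turn every space into the two characters "\s".
  var strWorking = strSource.replace(/"[^"]*"/g, function (quoted) {
    return quoted.replace(/\\/g, "\\\\").replace(/ /g, "\\s");
  });

  // Step 2: no bare spaces remain inside quotes, so splitting is safe.
  var aryOutput = strWorking.split(/ +/);

  // Step 3: undo the escaping, again only inside quoted strings.
  // Reading escape pairs left to right keeps "\\s" from being misread.
  for (var i = 0; i < aryOutput.length; i++) {
    aryOutput[i] = aryOutput[i].replace(/"[^"]*"/g, function (quoted) {
      return quoted.replace(/\\(.)/g, function (whole, ch) {
        return ch === "s" ? " " : ch;
      });
    });
  }
  return aryOutput;
}

var strSample = 'SearchAtom1 "Quoted Search Atom2" SearchAtom3 "" SearchAtom5Since4WasEmpty "Search Atom 6"';
console.log(parseByEscaping(strSample));
// [ 'SearchAtom1', '"Quoted Search Atom2"', 'SearchAtom3', '""',
//   'SearchAtom5Since4WasEmpty', '"Search Atom 6"' ]
```

Backslashes have to be escaped before spaces; otherwise a literal backslash followed by "s" in the user's input would be indistinguishable from an escaped space when unescaping.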
  • Replies are listed 'Best First'.
    Re: Regex for escaping spaces in strings when there are quotes
    by japhy (Canon) on Nov 29, 2001 at 01:09 UTC
      So you want to split 'ABC DEF "GHI JKL" MNO' into the list ('ABC', 'DEF', '"GHI JKL"', 'MNO')? If so, here's a Perl regex:
      @parts = $string =~ m{"[^"]*"|\S+}g;
      If you then want to remove the quotes from the fields with them, that's your job.
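
Since the thread's goal is a JavaScript port, here is a minimal JavaScript sketch of that last step, using the same pattern. `parseSearchAtoms` is a hypothetical name, and the sketch assumes quotes only ever wrap a whole atom, as the question states.

```javascript
// Hypothetical helper: match atoms with the pattern above, then strip
// the surrounding quotes from atoms that are entirely quoted.
function parseSearchAtoms(strSource) {
  // match() with the g flag returns null when nothing matches
  // (for example, on an empty string), so fall back to an empty array.
  var aryMatches = strSource.match(/"[^"]*"|\S+/g) || [];
  for (var i = 0; i < aryMatches.length; i++) {
    if (/^"[^"]*"$/.test(aryMatches[i])) {
      aryMatches[i] = aryMatches[i].slice(1, -1);  // drop first and last character
    }
  }
  return aryMatches;
}

console.log(parseSearchAtoms('ABC DEF "GHI JKL" MNO'));
// [ 'ABC', 'DEF', 'GHI JKL', 'MNO' ]
```

An empty quoted string ("") comes back as an empty string, so its position in the result is preserved.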

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Kudos to you!!! You get my ++ any day... This is exactly the solution we've needed, and it was such a simple regex... I don't know why I didn't think of it.

        For anyone that's interested, here's the new JavaScript code:

        <SCRIPT LANGUAGE="JavaScript">
        function ParseStringRegExp (strSource) {
          // Split the modified string by spaces or quoted strings.
          var aryOutput = strSource.match (/"[^"]*"|\S+/g);
          return (aryOutput);
        }
        </SCRIPT>
        No extra arrays needed, and the function is basically a one-liner... The VB implementation is just as simple, a single Regex, but of course, we have to iterate through the Matches collection and store each value into an array for the return value. If anyone wants to see how ugly Regexes are in VBScript, here you are:
        <%
        Function ParseStringRegExp (strSource)
          Dim objRegex
          Dim colMatches
          Dim objMatch

          Set objRegex = New RegExp                      ' Create a regular expression object.
          objRegex.Pattern = """[^""]*""|\S+"            ' Set pattern.
          objRegex.Global = True                         ' Set global applicability.
          Set colMatches = objRegex.Execute(strSource)   ' Execute search.

          Dim aryOutput()
          ReDim aryOutput(colMatches.Count - 1)          ' Prepare the output array.
          Dim lngMatchCounter

          ' Grab the colMatches collection and store each value into an array.
          lngMatchCounter = 0
          For Each objMatch In colMatches
            aryOutput(lngMatchCounter) = objMatch.Value
            lngMatchCounter = lngMatchCounter + 1
          Next

          ParseStringRegExp = aryOutput
        End Function
        %>
    Re: Regex for escaping spaces in strings when there are quotes
    by IlyaM (Parson) on Nov 29, 2001 at 00:20 UTC
      Take a look at Text::ParseWords. This is a module for parsing such quoted strings.
        This is not a possibility, since we are ultimately doing a conversion to JavaScript regular expressions. There is a set of regular expressions out there...
          Well, I'm not sure whether JavaScript has regexps compatible with Perl's, but if it does, you can use Text::Balanced to autogenerate the required regexp. Use this module to generate the regexp and then just insert it into your JavaScript.