TacoVendor has asked for the wisdom of the Perl Monks concerning the following question:

I am just a little less than familiar with regexes. I have used them a bit, but only in simple contexts, e.g. clearing whitespace from a string, things like that.

I have looked around, but don't seem to find the answer to this:

I have a line that I am reading into a variable that comes in the format of: "aaaaaa 1, bbbbbb 2". The a's and b's are unique identifiers that are always letters. The 1 and 2 will always be numbers. I am splitting the string into 4 separate variables that end up like this: $1="aaaaaa" $2="1" $3="bbbbbb" $4="2". My script throws each portion of the line into its relevant variable when it comes across either whitespace or a comma. I now have data that is coming in like this: "aaaaa aaa 1, bbbbb bbb 2".

Is there some type of regex function that will allow me to remove whitespace only if it is found between letters? Something that will take "abcdef ghi 1, jklmno pqr 2" and make it become "abcdefghi 1, jklmnopqr 2" so that the rest of my script can continue running as it has until now?

Sorry this is long-winded for such a simple question, I just wanted to make sure my question was understood properly.

Re: Regex Question
by sauoq (Abbot) on Jan 10, 2003 at 16:28 UTC
    s/([a-z])\s+([a-z])/$1$2/g

    Maybe you can have capital letters too, in which case you need to add a /i to that.

    Another way that avoids capturing altogether: s/(?<=[a-z])\s+(?=[a-z])//g
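A minimal sketch comparing the two substitutions above on the sample line from the question. The variable names are made up for illustration. Both forms give the same result on that sample, but they differ when single letters are separated by whitespace, because the capturing form consumes the letter that the next match would need.

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = 'abcdef ghi 1, jklmno pqr 2';

# Capturing form: the letters on both sides are matched, then put back.
(my $captured = $line) =~ s/([a-z])\s+([a-z])/$1$2/gi;

# Lookaround form: only the whitespace itself is matched and removed.
(my $lookaround = $line) =~ s/(?<=[a-z])\s+(?=[a-z])//gi;

print "$captured\n";      # abcdefghi 1, jklmnopqr 2
print "$lookaround\n";    # abcdefghi 1, jklmnopqr 2

# Edge case: single letters separated by whitespace.
(my $edge_captured   = 'a b c 1') =~ s/([a-z])\s+([a-z])/$1$2/gi;
(my $edge_lookaround = 'a b c 1') =~ s/(?<=[a-z])\s+(?=[a-z])//gi;

print "$edge_captured\n";      # ab c 1
print "$edge_lookaround\n";    # abc 1
```

If identifiers can contain one-letter words, the lookaround form is probably the safer choice.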

    Finally, I don't know how your current regex is working, but you can almost certainly do it in there rather than adding a new regex to first remove the whitespace. Something like this

    $ perl -nle 'print "$1|$2" while /\s*([a-z\s]*[a-z])\s+(\d+)/g'
    foo bar 1, baz qux 2, whatever 3, and some garbage.
    foo bar|1
    baz qux|2
    whatever|3
    perhaps?
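The same idea could be sketched inside a script rather than a one-liner. The array @pairs and the variables $id and $num are hypothetical names, and the whitespace stripping is added here to match what the question asks for. Note that the captures are copied before the inner substitution runs, since a successful match would reset $1 and $2.

```perl
use strict;
use warnings;

my $line = 'foo bar 1, baz qux 2, whatever 3, and some garbage.';
my @pairs;

# Walk the line one "identifier number" pair at a time.
while ($line =~ /\s*([a-z\s]*[a-z])\s+(\d+)/g) {
    # Save the captures first: the substitution below would clobber them.
    my ($id, $num) = ($1, $2);
    $id =~ s/\s+//g;              # drop whitespace inside the identifier
    push @pairs, [$id, $num];
}

print "$_->[0]|$_->[1]\n" for @pairs;
# foobar|1
# bazqux|2
# whatever|3
```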
    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Regex Question
by Zaxo (Archbishop) on Jan 10, 2003 at 16:32 UTC

    Sure, what you want are lookbehind and lookahead assertions: s/(?<=[[:alpha:]])\s+(?=[[:alpha:]])//g; I used the POSIX character class for alphabetics there. You may want to try Unicode properties instead if your perl is up to it and you have the need.
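A possible sketch of the Unicode variant, using the \p{L} (letter) property in place of [[:alpha:]]. The sample string is invented for illustration, and the sketch assumes the data has already been decoded into characters (here via `use utf8` for the literal).

```perl
use strict;
use warnings;
use utf8;                               # the literal below contains non-ASCII letters
binmode STDOUT, ':encoding(UTF-8)';     # encode characters on output

# Hypothetical input with accented letters in the identifiers.
my $line = 'señor gómez 1, café olé 2';

# Remove whitespace only when a letter stands on both sides of it.
(my $joined = $line) =~ s/(?<=\p{L})\s+(?=\p{L})//g;

print "$joined\n";    # señorgómez 1, caféolé 2
```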

    After Compline,
    Zaxo

Re: Regex Question
by schumi (Hermit) on Jan 10, 2003 at 16:35 UTC
    I'd go about this in two steps: First remove any whitespace between letters, and then split the rest up into variables. Like so:

    s/([a-z])(\s+)([a-z])/$1$3/ig

    will remove any whitespace as long as it is between letters. Then, let's fill the variables:

    m/([a-z]+)\s(\d+),?\s?([a-z]+)\s(\d+)/i

    (Note the placement of the parentheses.) This should put stuff into your variables in the way you want it. The \d+ allows numbers with more than one digit.
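The two steps could be put together roughly as follows. This is only a sketch: the variable names are hypothetical, and it uses \s+ and \d+ on the assumption that the data may contain runs of whitespace and multi-digit numbers.

```perl
use strict;
use warnings;

my $line = 'abcdef ghi 1, jklmno pqr 2';

# Step 1: remove whitespace that sits between two letters.
(my $squeezed = $line) =~ s/(?<=[a-z])\s+(?=[a-z])//gi;

# Step 2: pull the four fields out of the cleaned-up line.
my @fields = $squeezed =~ /^([a-z]+)\s+(\d+),\s*([a-z]+)\s+(\d+)$/i;
die "unexpected line format: $line\n" unless @fields == 4;

my ($id1, $num1, $id2, $num2) = @fields;
print "$id1 $num1 $id2 $num2\n";    # abcdefghi 1 jklmnopqr 2
```

Assigning the match in list context avoids relying on $1 through $4 after the fact, and the check on the field count catches lines that do not fit the expected shape.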

    --cs

    There are nights when the wolves are silent and only the moon howls. - George Carlin

Re: Regex Question
by pike (Monk) on Jan 10, 2003 at 16:58 UTC
    Just for fun, since all the better ones are already taken: eliminate the whitespace and do the split in one shot:

    @x = map {tr/ //d;$_} split /(?<=[[:alpha:]])[\s,]+(?=\d)|(?<=\d)[\s,]+(?=[[:alpha:]])/, $string;
    First, split the string wherever there is a letter followed by spaces or commas followed by a digit, or the other way round, then eliminate all remaining spaces. The array @x will contain the strings and digits.
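For illustration, the one-shot split applied to the sample line from the question. This sketch copies each piece before stripping spaces rather than modifying $_ in place, which is a stylistic choice and not part of the original suggestion.

```perl
use strict;
use warnings;

my $string = 'abcdef ghi 1, jklmno pqr 2';

# Split at letter/digit boundaries separated by whitespace or commas,
# then delete any spaces left inside each piece.
my @x = map { (my $piece = $_) =~ tr/ //d; $piece }
        split /(?<=[[:alpha:]])[\s,]+(?=\d)|(?<=\d)[\s,]+(?=[[:alpha:]])/, $string;

print join('|', @x), "\n";    # abcdefghi|1|jklmnopqr|2
```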

    pike