winter67uk has asked for the wisdom of the Perl Monks concerning the following question:

Hello All(-knowing),

I have a long string, something like this:

AAA+XXXX+1234++here?'s some text+eol1'BBB+XXXX+1234++here?'s some text+eol2'CCC+XXXX+1234++here?'s some text+eol3'etc.

I would like to parse the string by inserting a new line (\n) after each apostrophe (the line terminator). Unfortunately some apostrophes are escaped by a question mark, meaning they are part of a text field. I tried this:

perl -ne "s/[^?]'/\'\n/g" && print" input.txt

...which doesn't work. It loses the last character from each line because the substitution matches two characters, not one. Can someone suggest a simple one-liner that doesn't drop the last character in each line?

Thanks.

Replies are listed 'Best First'.
Re: Simple Substitution
by dragonchild (Archbishop) on Jan 13, 2005 at 17:54 UTC
    There are several options for your regex.
    • You could capture the character before the apostrophe and put it back in the replacement
      s/([^?])'/$1\'\n/g
    • You could use negative lookbehind
      s/(?<!\?)'/\'\n/g

    Also, instead of -ne, I'd use -pe and get rid of the print statement. I'd also look at making it -pi.bak -e to do in-place editing with a backup file.
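As a rough sketch of the two options above, both forms can be compared in a small script. The helper names split_capture and split_lookbehind are made up for illustration, and the sample is a shortened version of the string in the question.

```perl
use strict;
use warnings;

# Option 1: capture the character before the apostrophe and put it back.
sub split_capture {
    my ($text) = @_;
    $text =~ s/([^?])'/$1'\n/g;
    return $text;
}

# Option 2: negative lookbehind; only the apostrophe itself is matched.
sub split_lookbehind {
    my ($text) = @_;
    $text =~ s/(?<!\?)'/'\n/g;
    return $text;
}

# Shortened, made-up variant of the sample string from the question.
my $sample = "AAA+here?'s text+eol1'BBB+here?'s text+eol2'";
print split_capture($sample);
print split_lookbehind($sample);
```

On this input the two subs should print the same two lines. They differ only on edge cases such as an apostrophe at the very start of the string or two apostrophes in a row.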

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      Dragonchild (and all who followed) - thank you. This line did the trick for me:

      perl -pe "s/([^?])'/$1'\n/g" input.txt

      Things you folks taught me:

    • grouping in a regex - see perlrequick
    • the special variable $1 - ditto
    • the -p switch
    • the -i switch
    • check code before posting. I had two errors in mine (extra double quote and unnecessary escape)

      Cheers - Winter

      Final Update: this is what I used in the end...

      Parses a string in a file. The string uses an apostrophe as the line terminator. Ignores apostrophes preceded by the escape character, "?". Clever enough not to parse the same file more than once.

      perl -p -i.bak -e "s/([^?])'([^\n])/$1'\n$2/g" input.txt

    • perl: Invokes the Perl interpreter.
    • -p: p switch is 'assume loop like -n but print line also'.
    • -i.bak: i switch is edit in place, .bak is the extension of the backup file.
    • -e: e switch is 'one line of program'.
    • "...": Use double quotes for Windows OS.
    • s/.../.../g: Substitute. Match 1st and substitute 2nd. g means global.
    • (...)...(...): Groups - whatever is matched between the parentheses winds up in $1, $2 etc.
    • ([^?])'([^\n]): Match any apostrophe that is neither preceded by the escape character "?" nor followed by a newline "\n" - three characters total.
    • $1'\n$2: Replace each match with the value of group one (see above), followed by an apostrophe, a newline (\n) and the value of group two - four characters total.
    • input.txt: File to parse.
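The "not parsing the same file more than once" property can be checked with a small sketch that applies the same substitution twice. The wrapper add_newlines and the shortened sample string are made up for illustration. The real one-liner works line by line through -p, while this applies the regex to one string.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the one-liner above.
sub add_newlines {
    my ($text) = @_;
    # An apostrophe not preceded by "?" and not already followed by "\n".
    $text =~ s/([^?])'([^\n])/$1'\n$2/g;
    return $text;
}

my $once  = add_newlines("AAA+here?'s text+eol1'BBB+here?'s text+eol2'CCC");
my $twice = add_newlines($once);

print $once, "\n";
print $once eq $twice
    ? "second pass changed nothing\n"
    : "second pass changed the text\n";
```

Because the pattern needs a character after the apostrophe, an apostrophe that is the very last character of the input is left without a newline.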
Re: Simple Substitution
by TedYoung (Deacon) on Jan 13, 2005 at 17:54 UTC

    Well, the version you have is very close. Try:

    s/([^?]')/$1\n/g;

    The $1 is replaced by the contents of the first group. Groups are designated by () in the match portion. So, it matches one non-? and one ' and replaces that with what it found and a \n.

    This may be good enough, but keep in mind that it won't work if a ' is at the beginning of the string. A more complete solution would be:

    s/(?<!\?)'/'\n/g;

    Note that this is untested, but should work! :-)
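The point about an apostrophe at the beginning of the string can be seen in a short sketch. It assumes the lookbehind is placed before the apostrophe. The sub names and the input are invented for the example.

```perl
use strict;
use warnings;

# Capture-group form: needs one non-"?" character before the apostrophe.
sub with_capture {
    my ($text) = @_;
    $text =~ s/([^?]')/$1\n/g;
    return $text;
}

# Lookbehind form: only asserts that the previous character is not "?".
sub with_lookbehind {
    my ($text) = @_;
    $text =~ s/(?<!\?)'/'\n/g;
    return $text;
}

my $input = "'AAA+eol1'";   # made-up input that starts with a terminator
print with_capture($input), "---\n", with_lookbehind($input);
```

The capture form leaves the leading apostrophe alone because nothing precedes it. The lookbehind form has no such requirement and breaks the line there too.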

    Ted Young

    ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
Re: Simple Substitution
by friedo (Prior) on Jan 13, 2005 at 17:58 UTC
    I would use a capture group, like this:

    s/([^?]\')/$1\n/g;
Re: Simple Substitution
by ww (Archbishop) on Jan 13, 2005 at 18:38 UTC
    If your sample data is representative, the line terminator appears to be eol[0-9]' (or, maybe, +eol\d+').

    Looking from that angle, might it be easier to substitute on the (apparently unambiguous) +eol\d'?
    update: belated example:

    #!/usr/bin/perl -w
    $foo = <DATA>;
    $foo =~ s/\+eol\d'/\n/g;
    print $foo;
    print "\n\n Done\n";
    __DATA__
    AAA+XXXX+1234++here?'s some text+eol1'BBB+XXXX+1234++here?'s some text+eol2'CCC+XXXX+1234++here?'s some text+eol3'etc.
    # OUTPUT:
    #
    # AAA+XXXX+1234++here?'s some text
    # BBB+XXXX+1234++here?'s some text
    # CCC+XXXX+1234++here?'s some text
    # etc.
    # 
    #  Done
    

    The following notion is mine, and may not be wise (CORRECTIVE comments welcome!):
    It's usually worthwhile to match against the largest possible chunk of data, to minimize ambiguity for the regex.

Re: Simple Substitution
by holli (Abbot) on Jan 13, 2005 at 17:55 UTC
    $_ = "AAA+XXXX+1234++here?'s some text+eol1'BBB+XXXX+1234++here?'s some text+eol2'CCC+XXXX+1234++here?'s some text+eol3'";
    s/(?<!\?)'/\n/g;
    print;

    holli, regexed monk
Re: Simple Substitution
by ambrus (Abbot) on Jan 13, 2005 at 20:22 UTC

    I think you might be having a shell quoting problem. The backslashes are swallowed by the shell if this is under sh. Try this:

    perl -wne 's/([^?])\47/$1\47\n/g; print' filename

    This solution won't work if you have two consecutive (unescaped) apostrophes, or if the file has no newlines and is too long to read in memory.
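One possible way around both caveats, offered as a sketch and not as the method suggested above, is to set Perl's input record separator $/ to an apostrophe so the input is read one apostrophe-terminated chunk at a time. The sub name split_stream and the in-memory demo data are assumptions for illustration. Like the other regexes in this thread, it treats any apostrophe directly after "?" as escaped.

```perl
use strict;
use warnings;

# Copy records from $in to $out, adding "\n" after each unescaped apostrophe.
# Reads one apostrophe-terminated chunk at a time instead of whole lines.
sub split_stream {
    my ($in, $out) = @_;
    local $/ = "'";                      # input record separator: apostrophe
    while (my $chunk = <$in>) {
        print {$out} $chunk;
        # A chunk ending in ?' stopped at an escaped apostrophe: no newline.
        print {$out} "\n" if $chunk =~ /(?<!\?)'\z/;
    }
    return;
}

# Demo on an in-memory string; a real run would open the input file instead.
my $data = "AAA+here?'s text+eol1'BBB+eol2'";
open my $in, '<', \$data or die "cannot open in-memory handle: $!";
split_stream($in, \*STDOUT);
close $in;
```

Only one chunk is held in memory at a time, and two apostrophes in a row simply arrive as two separate chunks.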

      Thanks ambrus, well done for spotting the unnecessary double quote after the substitution - s/.../.../g". The Windows command environment requires double quotes, but I mistakenly included an extra one in the code I posted.