arunkumarzz has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, could anyone help me match a pattern in Perl? Below is my current regex.

 s/((\n)([^0-9])+(-)*(Aa-Zz)*)|((\n)(\d{3})(-)*(Aa-Zz)*)/$2$3/g

We have a file with multiple newline characters in the final field, and we need to replace those newline characters with spaces. My current regex deletes the first digit of the number and also replaces the hyphen.

Edit 1: My data looks like something below.

Current data:
99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human 98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human




In the above data the delimiter is '~' and there are 4 fields. The last field is a CLOB in the database, and I need to remove the newlines (\n) in the 4th field. I hope that makes my issue clear.


Desired output:

99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human




Re: perl regex to match newline followed by number and text
by GrandFather (Saint) on May 31, 2019 at 10:24 UTC

    Care to show us some sample data and how you'd like the result to look?

    Note that you can use (?:...) to group stuff without capturing which can clean things up quite a bit. As a hint, I've added a little white space to your regex to make the groupings more obvious and added a digit at the start of each capture group. Maybe those numbers are not quite what you expect?

    s/((\n) ([^0-9])+ (-)* (Aa-Zz)*) | ((\n) (\d{3}) (-)* (Aa-Zz)*)/$2$3/gx;
    # 12    3         4    5           67    8       9    0
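    To illustrate the hint about group numbering, here is a minimal sketch using toy patterns (not the original regex). It assumes nothing beyond core Perl: capturing groups are numbered by their opening parenthesis across both alternatives, so when the second alternative matches, the low-numbered groups are undefined.

```perl
use strict;
use warnings;

# Capturing groups are numbered by opening parenthesis, left to right,
# across both alternatives; (?:...) groups are skipped in the count.
my $capturing     = qr/(a)(b)|(c)(d)/;
my $non_capturing = qr/(?:a)(b)|(?:c)(d)/;

if ('cd' =~ $capturing) {
    # The second alternative matched, so groups 1 and 2 are undef
    # and the matched text lives in groups 3 and 4.
    print "group 1 is ", (defined $1 ? "set" : "undef"), "\n";
    print "groups 3 and 4 hold: $3 $4\n";
}

if ('cd' =~ $non_capturing) {
    # Only two capturing groups remain, so the 'd' is in group 2.
    print "group 2 holds: $2\n";
}
```

    In the original substitution the second alternative's newline and digits land in groups 7 and 8, so a replacement of $2$3 is empty whenever that alternative matches.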

    Update:

    Maybe what you want to achieve is something like this:

    use strict;
    use warnings;

    my $wholeBallOfWax = do {local $/; <DATA>};
    my @records = split /(?<=\n)(?=\d+-)/, $wholeBallOfWax;

    s/\n+$/\n/s for @records;
    print join "---\n", @records;

    __DATA__
    1-12 last non-blank field
    2-10 data more data
    3-21 stuff more stuff Lots of stuff so much stuff there is no following empty field
    4-73 Sneeky record with a blank field in the middle!
    5-00 Last record

    Which prints:

    1-12 last non-blank field
    ---
    2-10 data more data
    ---
    3-21 stuff more stuff Lots of stuff so much stuff there is no following empty field
    ---
    4-73 Sneeky record with a blank field in the middle!
    ---
    5-00 Last record

    In the split regex there is a look behind ((?<=\n)) which matches a new line before the current search point, and a look ahead ((?=\d+-)) which matches one or more digits followed by a hyphen. Neither match "consumes" the string that was matched so the split doesn't drop any characters.
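    A small sketch of the difference between a zero-width split and a consuming split may help here. The sample string is made up for illustration; only the split pattern comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical sample text: three records, the first spanning two lines.
my $text = "1-a\nmore\n2-b\n3-c\n";

# Zero-width split: the lookbehind and lookahead only assert context,
# so every character of $text survives in one of the pieces.
my @kept = split /(?<=\n)(?=\d+-)/, $text;

# Consuming split: the newline, digits and hyphen that match are discarded.
my @lost = split /\n\d+-/, $text;

print "zero-width pieces: ", scalar(@kept), "\n";
print "rejoined equals original: ",
    (join('', @kept) eq $text ? "yes" : "no"), "\n";
print "consuming pieces: ", scalar(@lost), "\n";
```

    With the consuming pattern the record numbers are lost, which is why the lookaround form is used when the separator is also part of the data.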

    As an aside, the do {local $/; <DATA>} bit suspends end of line detection and reads everything from <DATA> into $wholeBallOfWax (although maybe that was obvious?).
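    For readers who have not met the slurp idiom, here is a minimal sketch. It reads from an in-memory filehandle (an assumption made so the example needs no external file) instead of <DATA>; the variable names are illustrative.

```perl
use strict;
use warnings;

my $data = "line one\nline two\nline three\n";

# Open an in-memory filehandle so the sketch needs no external file.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# local $/ sets the input record separator to undef inside this block only,
# so a single <$fh> returns everything that is left to read.
my $everything = do { local $/; <$fh> };
close $fh;

# Count the newlines that survived in the slurped string.
my $newlines = () = $everything =~ /\n/g;
print "read ", length($everything), " characters containing $newlines newlines\n";
```

    Because $/ is localised, line-by-line reads elsewhere in the program keep their normal behaviour once the do block exits.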

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      Hello Grandfather,

      The example you have provided is a little bit different from my issue. I have updated my question with some sample data. Could you please have a look at it?

      Thanks in advance
        #!/usr/bin/perl
        use strict;

        my $record;
        while (<DATA>){
          s/\n/ /;
          if (/^\d+~/){
            $record =~ s/ +$//; # trim trailing spaces
            printf "%s\n",$record if ($record);
            $record = $_;
          } else {
            $record .= $_;
          }
        }
        $record =~ s/ +$//;
        printf "%s\n",$record if ($record);

        __DATA__
        99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
        98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human
        poj
        I see your Edit 1.
        Can you enclose the data in code tags so that we can see the new lines?
        The better the problem is described, the better the result will be. Your regex doesn't make much sense to me.

         s/((\n)([^0-9])+(-)*(Aa-Zz)*)|((\n)(\d{3})(-)*(Aa-Zz)*)/$2$3/g
        My brain hurts.

Re: perl regex to match newline followed by number and text
by AnomalousMonk (Archbishop) on May 31, 2019 at 20:05 UTC
    s/((\n)([^0-9])+(-)*(Aa-Zz)*)|((\n)(\d{3})(-)*(Aa-Zz)*)/$2$3/g

    Note that the  Aa-Zz regex subexpression in the quoted regex matches a literal  'Aa-Zz' sequence of these five characters. This subexpression within a (capturing!) group with a  * quantifier means that this sequence may be matched zero or more times.

    Perhaps what was meant was a  [a-zA-Z] character class, in which case  [a-zA-Z]* would have been appropriate (or perhaps better [a-zA-Z]+) since the capturing group seems completely unneeded. (But there are many other problems with the original regex, so going back to the beginning and starting from scratch seems the best course; see other suggestions in this thread.)
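    The difference can be checked with a short sketch. The test strings are chosen purely for illustration and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Outside square brackets, Aa-Zz is just five literal characters.
my $literal = qr/^(?:Aa-Zz)*$/;

# Inside square brackets, a-z and A-Z are character ranges.
my $class = qr/^[a-zA-Z]+$/;

for my $s ('Human', 'Aa-Zz', 'Aa-ZzAa-Zz') {
    printf "%-12s literal: %-3s  class: %s\n", $s,
        ($s =~ $literal ? 'yes' : 'no'),
        ($s =~ $class   ? 'yes' : 'no');
}
```

    An ordinary word such as 'Human' satisfies only the character class, while the literal form accepts nothing but repetitions of the exact text 'Aa-Zz'.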


    Give a man a fish:  <%-{-{-{-<

      Sorry, my regex might work in Oracle, but it's different in Perl! Thanks for your response.
Re: perl regex to match newline followed by number and text
by hippo (Archbishop) on May 31, 2019 at 15:27 UTC

    Per your data set as it stands just now, here is an SSCCE:

    use strict;
    use warnings;
    use Test::More tests => 1;

    my @have = (
        '99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human',
        '98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human'
    );
    my @want = (
        '99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human',
        '98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human'
    );

    for (@have) {
        s/\n/ /sg;
    }
    is_deeply (\@have, \@want);

    Since you haven't said what your input record separator is I have created the array by hand. See also How to ask better questions using Test::More and sample data.

Re: perl regex to match newline followed by number and text
by GrandFather (Saint) on May 31, 2019 at 22:50 UTC

    So something more like:

    use strict;
    use warnings;

    my $wholeBallOfWax = do {local $/; <DATA>};
    my @records = split /\s*(?<=\n)(?=\d+~)/, $wholeBallOfWax;

    s/\n+/ /gs for @records;
    s/\s+\z//gs for @records;
    print join "\n", @records;

    __DATA__
    99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
    98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human
    97~Grand~Father~Mobilenum: 2734-567 , from Mars Ape

    Prints:

    99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
    98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human
    97~Grand~Father~Mobilenum: 2734-567 , from Mars Ape

    where the only substantive changes from the code I suggested earlier were to replace the hyphen with a tilde and to replace internal newlines with spaces.

    Update: or, per hippo's suggestion:

    use strict;
    use warnings;
    use Test::More tests => 1;

    my @want = (
        '99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human',
        '98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human',
        '97~Grand~Father~Mobilenum: 2734-567 , from Mars Ape'
    );

    my $wholeBallOfWax = do {local $/; <DATA>};
    my @records = split /\s*(?<=\n)(?=\d+~)/, $wholeBallOfWax;

    s/\n+/ /gs for @records;
    s/\s+\z//gs for @records;
    is_deeply (\@records, \@want);

    __DATA__
    99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
    98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human
    97~Grand~Father~Mobilenum: 2734-567 , from Mars Ape
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: perl regex to match newline followed by number and text
by Marshall (Canon) on Jun 01, 2019 at 04:47 UTC
    I am having trouble understanding the problem statement. Your data was not enclosed in <code>..</code> tags, and that is a problem. I don't really understand what removing the newlines in the 4th field means.

    This is just a wild guess on my part - I guess that extra new lines meant spacers between these records - but maybe not?:

    use strict;
    use warnings;

    my $line;
    while (<DATA>)
    {
        chomp;
        if ( (/:$/../^\s*$/) =~ /^\d+$/)   #exclude endpoint.
        {
            s/\s,\s/,/;
            $line .= " $_";
        }
        elsif (defined $line)
        {
            $line =~ s/^\s*//;
            print "$line\n";
            $line = undef;
        }
    }
    print "$line\n" if defined $line;  # just to be sure
                                       # all output is done

    =prints
    99~Arun~Kumar~Mobilenum: 1234-567,from Earth Human
    98~Mahesh~Babu~Mobilenum: 5678-901,from Earth Human
    98~Mahesh~Babu~Mobilenbbb: 5678-901,from Earth Human
    =cut

    __DATA__
    99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
    98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human
    98~Mahesh~Babu~Mobilenbbb: 5678-901 , from Earth Human
    I guess something very, very simple like this is possible? 3 input lines to one output line?
    use strict;
    use warnings;

    my $input = do {local $/; <DATA>};
    my @lines = $input =~ m/(.*\n.*\n.*\n)/g;

    foreach my $line (@lines)
    {
        $line =~ s/\n/ /g;
        $line =~ s/ , /,/g;
        print "$line\n";
    }

    =Prints
    99~Arun~Kumar~Mobilenum: 1234-567,from Earth Human
    98~Mahesh~Babu~Mobilenum: 5678-901,from Earth Human
    98~Mahesh~Babu~Mobilenbbb: 5678-901,from Earth Human
    =cut

    __DATA__
    99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
    98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human
    98~Mahesh~Babu~Mobilenbbb: 5678-901 , from Earth Human
      I guess something very, very simple like this is possible? 3 input lines to one output line?
      Good idea, but why then still use a regex?
      while (<DATA>) {
          chomp;
          print;
          print $. % 3 ? " " : "\n";
      }
      __DATA__
      99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
      98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human
      Output
      99~Arun~Kumar~Mobilenum: 1234-567 , from Earth Human
      98~Mahesh~Babu~Mobilenum: 5678-901 , from Earth Human


      holli

      You can lead your users to water, but alas, you cannot drown them.
        The number of newlines is not fixed; there might be one, two, three or more newlines in the last field.
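
When the number of newlines per record varies, one possible approach is to join every run of newlines unless the next line starts a new record. The sketch below is only an illustration under stated assumptions: a record is assumed to start with digits followed by '~' (as in the split patterns earlier in the thread), the sub name join_records is hypothetical, and the line breaks in the sample string are invented because the original layout is not shown.

```perl
use strict;
use warnings;

# Join continuation lines into their record.
# Assumption: a new record always starts with digits followed by '~'.
sub join_records {
    my ($text) = @_;

    # Replace each whitespace run containing a newline with one space,
    # unless a new record or the end of the text follows. The atomic
    # group (?>...) stops the engine from matching only part of the run.
    $text =~ s/(?>\s*\n\s*)(?!\d+~|\z)/ /g;

    # Whatever newline runs remain are record boundaries: squeeze to one.
    $text =~ s/\s*\n\s*/\n/g;

    return $text;
}

# Hypothetical sample with one, two and three newlines inside the last field.
my $sample = "99~Arun~Kumar~Mobilenum:\n1234-567\n\n, from Earth\nHuman\n\n"
           . "98~Mahesh~Babu~Mobilenum: 5678-901 , from\n\n\nEarth Human\n";

print join_records($sample);
```

A continuation line that itself begins with digits and a tilde would be mistaken for a new record, so this only works if the data never does that.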