dipit has asked for the wisdom of the Perl Monks concerning the following question:

eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles=

The above is a single row, and I want to split it into values on whitespace. The first key, eab12345, has no value and is followed by whitespace; similarly, id=00000 is followed by whitespace. But the gecos field contains spaces within its value, so if I split on whitespace I lose part of that value. For example, gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000) contains whitespace in its value. I cannot split on whitespace alone, or the pieces of that value get mixed in with other fields. Is there any way I can get all the "key=value" pairs into my array? (Values can be empty or anything else; an empty value is followed directly by whitespace, e.g. admgroups= .) Please help, guys!


Replies are listed 'Best First'.
Re: Splitting a long row with multiple delimiters.
by hippo (Archbishop) on Jan 19, 2018 at 13:46 UTC

    TIMTOWTDI, but here's one with split:

    use strict;
    use warnings;
    use Test::More tests => 1;

    my $in = 'eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles= ';

    my %rec;
    my $key;
    for my $term (split /=/, $in) {
        my ($value, $newkey) = ($term =~ /(.*?) (\S+)$/);
        $rec{$key} = $value if defined $key;
        $key = $newkey;
    }

    is ($rec{gecos}, 'AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000)');

    Add more tests if you want to narrow the spec.

      Thank you, buddy! This solution almost works, but I found a problem with the first value, and after that every value is attached to the wrong key. Here, eab12345 should be a key with a null value, since it has no value, but it comes out as the value paired with "id". Output:

      KEY = id, VALUE = eab12345
      KEY = pgrp, VALUE = 00000
      KEY = groups, VALUE = abcdefgh
      KEY = home, VALUE = abcdefgh
      KEY = shell, VALUE = /home/eab12345
      KEY = gecos, VALUE = /usr/bin/ksh
      KEY = auditclasses, VALUE = AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000)
      Use of uninitialized value $key in concatenation (.) or string at sample.pl line 63.
      Use of uninitialized value $value in concatenation (.) or string at sample.pl line 63.
      KEY = , VALUE =
      KEY = su, VALUE = true
      KEY = rlogin, VALUE = true
      KEY = daemon, VALUE = true
      KEY = admin, VALUE = true

        Is it safe to assume that keys not followed by = sign are null? If so, how would you define a key?

        For example: key1=abcd key2 key3=123

        What is actually expected output? This one?

        key1=abcd key2 key3=123
        Or this one?
        key1=abcd key2=null key3=123
        How do you recognize null keys inside the string? Should key2 be treated as key, or is it actually part of value to be stored in key1?
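The ambiguity in that question can be made concrete with a small sketch. It assumes that keys match \w+ and that values never contain "=", and it splits the row only where whitespace is directly followed by something that looks like "key=". The helper name split_pairs is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a row wherever whitespace is followed by "word=".
# Assumes keys match \w+ and values never contain "=".
sub split_pairs {
    my ($row) = @_;
    return split /\s+(?=\w+=)/, $row;
}

my @fields = split_pairs('key1=abcd key2 key3=123');
print "[$_]\n" for @fields;
```

With these assumptions the bare word key2 cannot be told apart from a second word in key1's value, so it stays attached to key1=abcd. On the row from the question the same split keeps the gecos value in one piece, but a bare key in the middle of the row would be swallowed in the same way.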

        Since this looks nothing like the output obtained from my code above, please provide an SSCCE. Thanks.

Re: Splitting a long row with multiple delimiters.
by haukex (Archbishop) on Jan 19, 2018 at 13:53 UTC

    A single sample input is usually not enough to reliably design a regex (see also). <update> Please use <code> tags when posting sample input. Also, can't you get this data in a more parseable format? </update> I have made the following assumptions:

    • Keys always match \w+
    • Values may not contain =
    • Keys are always preceded by whitespace
    use warnings;
    use strict;

    my $str = q{eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles= };

    my $REGEX = qr{
        (?|
            # treat beginning of string as a key only
            \A \s* (?<key> \w+ ) \s*
        |
            # otherwise, a normal key=value pair
            (?<= \s )   # key must be preceded by a space
            (?<key> \w+ )
            \s* = \s*
            (?<value>
                # a value may not look like another key=value
                (?: (?! \s* \w+ = ) [^=] )*
            )
            \s*
        )
    }msx;

    pos($str)=undef;
    while ( $str =~ /\G$REGEX/gc ) {
        print "<", $+{key}, "> = <", $+{value}//'undef', ">\n";
    }
    die "failed to parse at pos ".pos($str)
        unless pos($str)==length($str);

      Thank you for your efforts! But I am getting <undef> as the value for every key.

      Also, please explain in more detail how it works; I am unable to understand the regex!

      <eab12345> = <undef>
      <id> = <undef>
      <pgrp> = <undef>
      <groups> = <undef>
      <home> = <undef>
      <shell> = <undef>
      <gecos> = <undef>
      <auditclasses> = <undef>
      <login> = <undef>
      <su> = <undef>
      <rlogin> = <undef>
      <daemon> = <undef>
      <admin> = <undef>
      <sugroups> = <undef>
      <admgroups> = <undef>
      <tpath> = <undef>
      <ttys> = <undef>
      <expires> = <undef>
      <auth1> = <undef>
      <auth2> = <undef>
      <umask> = <undef>
      contd........................
        But getting <undef> as value

        Is your Perl 5.12 or earlier, that is, more than six years old? There apparently was a bug with (?| ) that was fixed in v5.14 (this appears to be the commit). You should consider upgrading.
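For anyone who cannot upgrade, here is a minimal sketch that sidesteps (?| ) and named captures altogether. It is not the code from the reply above; parse_row is a hypothetical helper, and it keeps the same assumptions (keys match \w+, values contain no "=", keys are preceded by whitespace). The shortened input row is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [key, value] pairs.
# Uses only numbered captures, lookahead and \G, no branch reset.
sub parse_row {
    my ($str) = @_;
    my @pairs;

    # A leading bare word is taken as a key with no value.
    if ( $str =~ /\G\s*(\w+)(?=\s|\z)/gc ) {
        push @pairs, [ $1, undef ];
    }

    # Each remaining pair: key, "=", then everything up to the next
    # " key=" or the end of the string.
    while ( $str =~ /\G\s*(\w+)=((?:(?!\s+\w+=)[^=])*)/gc ) {
        my ( $key, $value ) = ( $1, $2 );
        $value =~ s/\s+\z//;    # drop trailing whitespace, if any
        push @pairs, [ $key, $value ];
    }
    return @pairs;
}

my @pairs = parse_row(
    'eab12345 id=00000 gecos=AB/C XYZ RTYUI, LMNOP (AS 00000) admgroups= tpath=nosak'
);
for my $pair (@pairs) {
    my ( $key, $value ) = @$pair;
    printf "<%s> = <%s>\n", $key, defined $value ? $value : 'undef';
}
```

Handling the bare first word in its own step means the main loop needs only one alternative, which is why the branch reset is no longer required. Parsing simply stops at the first position that does not fit the assumptions, so a caller may still want to compare the final pos against the string length as the original code does.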

        i am unable to understand regex!

        Are you familiar with concepts such as non-capturing groups (?: ) and other basics like \s* etc.? If not, you probably want to read perlretut first. In fact, as far as I can tell even the advanced regex features I used are explained there (with more details on each in perlre):

Re: Splitting a long row with multiple delimiters.
by 1nickt (Canon) on Jan 19, 2018 at 14:14 UTC

    This does what you *seem* to want, with your single line of input (however, you should try to get the data in CSV format so you can use a real parser -- using a regexp for this is flakey).

    (Update: I treated the first word as special, not as a key, since it has not only no value but also no succeeding separator. I see from your later comments that it is meant to be a key as well. There is no way I can think of that would allow you to differentiate between a key with no separator or value, and a word in a multi-word value. You will have to get your data produced differently if you expect there to be both "bare" keys and multi-word values, in a string with no distinct separators between pairs.)

    (Update 2: I see from your later responses that in fact the first value should be treated as a key, even though it is not followed by the key-value separator. I've updated the code to include it in the data hash, but with the empty string as the value rather than undef.)

    use strict; use warnings; use feature 'say';
    use Data::Dumper; $Data::Dumper::Sortkeys = $Data::Dumper::Indent = 1;

    chomp( my $input = <DATA> );

    my ( $first, $txt ) = split / /, $input, 2;

    my %pairs = $txt =~ /
        ( \w+ )             # capture key
        =                   # separator
        (                   # capture value
            (?:             # group (but don't additionally capture)
                (?!\w+=)    # (must not be followed by the next key-separator)
                .+?         # at least one character (even if just the whitespace), non-greedy
            )+              # at least one of those
        )                   # end value
    /msxg;

    $pairs{ $first } = '';

    # trim trailing whitespace from values
    $_ =~ s/ $// for values %pairs;

    say Dumper \%pairs;

    __END__
    eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles=
    Output:
    $VAR1 = {
      'SYSTEM' => 'AD',
      'account_locked' => 'false',
      'admgroups' => '',
      'admin' => 'false',
      'auditclasses' => 'general,files,TCPIP',
      'auth1' => 'SYSTEM',
      'auth2' => 'NONE',
      'core' => '000000',
      'cpu' => '-1',
      'daemon' => 'true',
      'data' => '-1',
      'default_roles' => '',
      'dictionlist' => '/abc/def/ghi/jkl',
      'eab12345' => '',
      'expires' => '0',
      'fsize' => '-1',
      'gecos' => 'AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000)',
      'groups' => 'abcdefgh',
      'histexpire' => '13',
      'histsize' => '8',
      'home' => '/home/eab12345',
      'host_last_login' => '0.000.000.000',
      'host_last_unsuccessful_login' => '0.000.000.000',
      'id' => '00000',
      'login' => 'true',
      'loginretries' => '5',
      'logintimes' => '',
      'maxage' => '13',
      'maxexpired' => '0',
      'maxrepeats' => '2',
      'minage' => '0',
      'minalpha' => '1',
      'mindiff' => '1',
      'minlen' => '8',
      'minother' => '1',
      'nofiles' => '2000',
      'pgrp' => 'abcdefgh',
      'pwdchecks' => '',
      'pwdwarntime' => '5',
      'registry' => 'AD',
      'rlogin' => 'true',
      'roles' => '',
      'rss' => '65536',
      'shell' => '/usr/bin/ksh',
      'stack' => '65536',
      'su' => 'true',
      'sugroups' => 'ALL',
      'time_last_login' => '1512632113',
      'time_last_unsuccessful_login' => '1505304923',
      'tpath' => 'nosak',
      'tty_last_login' => 'ssh',
      'tty_last_unsuccessful_login' => 'ssh',
      'ttys' => 'ALL',
      'umask' => '00',
      'unsuccessful_login_count' => '0'
    }

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Splitting a long row with multiple delimiters.
by jahero (Pilgrim) on Jan 19, 2018 at 13:54 UTC

    Update: actually, this does not work. The regex in question might be visually pleasing, but it is plain wrong. Shame on me...

    Second possible approach. Be advised, one = sign out of place, and it does not work anymore.

    use feature qw/say/;
    use strict;
    use warnings;
    use Data::Dumper;

    while (my $line=<DATA>) {
        chomp $line;
        my %record = ($line =~ /([^=]+)=([^=]+) ?/g);
        say Dumper \%record;
    }

    __DATA__
    eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP *CONTRACTOR* (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles=

Re: Splitting a long row with multiple delimiters.
by soonix (Chancellor) on Jan 19, 2018 at 15:28 UTC
    That format is not only ugly, but also problematic when the gecos field contains an '=' sign (if it is the same as a /etc/passwd gecos field, I think it would be allowed). If you have the possibility, request that your input format be improved.
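If the input format cannot be changed, one way to reduce (not remove) the risk from an '=' inside gecos is to split only before attribute names that are known in advance. The sketch below assumes such a list exists; the short list, the helper name split_known and the sample row are all made up for illustration.

```perl
use strict;
use warnings;

# Assumption: the set of attribute names is known in advance.
# This short list is only illustrative.
my @known_keys = qw(id pgrp home shell gecos login);
my $key_re     = join '|', map { quotemeta($_) } @known_keys;

# Hypothetical helper: split only where whitespace is followed by a
# known key and "=", so "dept=QA" inside gecos is left alone.
sub split_known {
    my ($row) = @_;
    return split /\s+(?=(?:$key_re)=)/, $row;
}

my @fields = split_known('u1 id=7 gecos=Jane Doe (dept=QA) shell=/bin/ksh');
print "[$_]\n" for @fields;
```

This is still a heuristic: a gecos value that happens to contain a space followed by a known key and '=' would be split in the wrong place, so a better input format remains the safer fix.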
Re: Splitting a long row with multiple delimiters.
by karlgoethebier (Abbot) on Jan 19, 2018 at 14:06 UTC
    "...all the "key=values" in my array...help..."

    In a hurry, untested and without any warranty:

    my @pairs = grep {/(.+=.*)/} split / /, $line;

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'