Splitting a long row with multiple delimiters.

dipit has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Splitting a long row with multiple delimiters. by hippo (Archbishop) on Jan 19, 2018 at 13:46 UTC
TIMTOWTDI, but here's one with split:

    use strict;
    use warnings;
    use Test::More tests => 1;

    my $in = 'eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles= ';

    my %rec;
    my $key;
    # Each chunk between '=' signs holds the previous key's value
    # followed by the next key as its last word.
    for my $term (split /=/, $in) {
        my ($value, $newkey) = ($term =~ /(.*) (\S+)$/);
        $rec{$key} = $value if defined $key;
        $key = $newkey;
    }

    is ($rec{gecos}, 'AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000)');

Add more tests if you want to narrow the spec.
Re^2: Splitting a long row with multiple delimiters. by dipit (Sexton) on Jan 19, 2018 at 14:19 UTC
Thank you, buddy! This solution almost worked, but I found a problem with the first value and, as a result, with all the values after it. Here, eab12345 should be null as it has no value, but it comes out as a key-value pair with "id". Output:

    KEY = id, VALUE = eab12345
    KEY = pgrp, VALUE = 00000
    KEY = groups, VALUE = abcdefgh
    KEY = home, VALUE = abcdefgh
    KEY = shell, VALUE = /home/eab12345
    KEY = gecos, VALUE = /usr/bin/ksh
    KEY = auditclasses, VALUE = AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000)
    Use of uninitialized value $key in concatenation (.) or string at sample.pl line 63.
    Use of uninitialized value $value in concatenation (.) or string at sample.pl line 63.
    KEY = , VALUE =
    KEY = su, VALUE = true
    KEY = rlogin, VALUE = true
    KEY = daemon, VALUE = true
    KEY = admin, VALUE = true
Re^3: Splitting a long row with multiple delimiters. by jahero (Pilgrim) on Jan 19, 2018 at 14:34 UTC
Is it safe to assume that keys not followed by an `=` sign are null? If so, how would you define a key? For example:

`key1=abcd key2 key3=123`

What is the expected output? This one?

`key1=abcd key2 key3=123`

Or this one?

`key1=abcd key2=null key3=123`

How do you recognize null keys inside the string? Should `key2` be treated as a key, or is it actually part of the value to be stored in `key1`?
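To make the two readings concrete, here is a minimal sketch of both. The sub names `parse_bare_as_key` and `parse_bare_as_value` are made up for illustration, and the input is the three-field example from the question above, not real data.

```perl
use strict;
use warnings;

# Reading 1: a bare word is a key of its own with no value (undef).
sub parse_bare_as_key {
    my ($line) = @_;
    my %rec;
    for my $token ( split ' ', $line ) {
        my ( $key, $value ) = split /=/, $token, 2;
        $rec{$key} = $value;    # stays undef when the token has no '='
    }
    return \%rec;
}

# Reading 2: a bare word continues the value of the preceding key.
sub parse_bare_as_value {
    my ($line) = @_;
    my ( %rec, $last );
    for my $token ( split ' ', $line ) {
        if ( $token =~ /\A(\w+)=(.*)\z/s ) {
            $last = $1;
            $rec{$last} = $2;
        }
        elsif ( defined $last ) {
            $rec{$last} .= " $token";
        }
    }
    return \%rec;
}

my $line     = 'key1=abcd key2 key3=123';
my $as_key   = parse_bare_as_key($line);
my $as_value = parse_bare_as_value($line);

printf "as key:   key1=%s key2=%s\n", $as_key->{key1}, $as_key->{key2} // 'undef';
printf "as value: key1=%s\n", $as_value->{key1};
```

Both readings are defensible for the same input, which is why the format needs a rule before any regex can be called correct.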
Re^4: Splitting a long row with multiple delimiters. by dipit (Sexton) on Jan 19, 2018 at 14:52 UTC
Re^5: Splitting a long row with multiple delimiters. by poj (Abbot) on Jan 19, 2018 at 15:13 UTC
Some notes below your chosen depth have not been shown here
Re^3: Splitting a long row with multiple delimiters. by hippo (Archbishop) on Jan 19, 2018 at 14:42 UTC
Since this looks nothing like the output obtained from my code above, please provide an SSCCE. Thanks.
Re^4: Splitting a long row with multiple delimiters. by dipit (Sexton) on Jan 19, 2018 at 14:50 UTC
Re^5: Splitting a long row with multiple delimiters. by hippo (Archbishop) on Jan 19, 2018 at 15:46 UTC
Re^5: Splitting a long row with multiple delimiters. by soonix (Chancellor) on Jan 19, 2018 at 15:33 UTC
Re: Splitting a long row with multiple delimiters. by haukex (Archbishop) on Jan 19, 2018 at 13:53 UTC
A single sample input is usually not enough to reliably design a regex (see also).

`<update>` Please use `<code>` tags when posting sample input. Also, can't you get this data in a more parseable format? `</update>`

I have made the following assumptions:

- Keys always match `\w+`
- Values may not contain `=`
- Keys are always preceded by whitespace

    use warnings;
    use strict;

    my $str = q{eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles= };

    my $REGEX = qr{
        (?|
            # treat beginning of string as a key only
            \A \s* (?<key> \w+ ) \s*
        |
            # otherwise, a normal key=value pair
            (?<= \s )   # key must be preceded by a space
            (?<key> \w+ ) \s* = \s*
            (?<value>
                # a value may not look like another key=value
                (?: (?! \s* \w+ = ) [^=] )*
            )
            \s*
        )
    }msx;

    pos($str)=undef;
    while ( $str =~ /\G$REGEX/gc ) {
        print "<", $+{key}, "> = <", $+{value}//'undef', ">\n";
    }
    die "failed to parse at pos ".pos($str) unless pos($str)==length($str);
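For readers who find the single large regex hard to follow, the same three assumptions can also be expressed with `split` and a lookahead. This is only a sketch under those assumptions, not the code from the post above; the sub name `parse_record` and the shortened sample line are invented for illustration.

```perl
use strict;
use warnings;

# Split one record into key/value pairs, assuming keys match \w+,
# values never contain '=', and every key is preceded by whitespace.
sub parse_record {
    my ($line) = @_;
    $line =~ s/\A\s+|\s+\z//g;                  # trim both ends
    my @fields = split /\s+(?=\w+=)/, $line;    # break only before "key="
    my %rec;
    for my $field (@fields) {
        # key is everything up to the first '=', value is the rest;
        # a bare leading word has no '=' and gets undef as its value
        my ( $key, $value ) = $field =~ /\A([^=]*)(?:=(.*))?\z/s;
        $rec{$key} = $value;
    }
    return \%rec;
}

my $rec = parse_record(
    'eab12345 id=00000 gecos=AB/C XYZ, LMNOP (AS 00000) admgroups= tpath=nosak roles= '
);
for my $key ( sort keys %$rec ) {
    printf "%-10s => %s\n", $key,
        defined $rec->{$key} ? "'$rec->{$key}'" : 'undef';
}
```

Splitting only where whitespace is followed by `\w+=` keeps multi-word values such as the gecos field in one piece, while empty values come back as empty strings and the leading bare word comes back as undef.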
Re^2: Splitting a long row with multiple delimiters. by dipit (Sexton) on Jan 19, 2018 at 15:10 UTC
Thank you for your efforts! But getting <undef> as value:

    <eab12345> = <undef>
    <id> = <undef>
    <pgrp> = <undef>
    <groups> = <undef>
    <home> = <undef>
    <shell> = <undef>
    <gecos> = <undef>
    <auditclasses> = <undef>
    <login> = <undef>
    <su> = <undef>
    <rlogin> = <undef>
    <daemon> = <undef>
    <admin> = <undef>
    <sugroups> = <undef>
    <admgroups> = <undef>
    <tpath> = <undef>
    <ttys> = <undef>
    <expires> = <undef>
    <auth1> = <undef>
    <auth2> = <undef>
    <umask> = <undef>
    contd........................

Also, please explain more about how it is working, i am unable to understand regex!
Re^3: Splitting a long row with multiple delimiters. by haukex (Archbishop) on Jan 21, 2018 at 13:35 UTC
"But getting <undef> as value"

Is your Perl 5.12 or earlier, that is, more than six years old? There apparently was a bug with `(?| )` that was fixed in v5.14 (this appears to be the commit). You should consider upgrading.

"i am unable to understand regex!"

Are you familiar with concepts such as non-capturing groups `(?: )` and other basics like `\s*` etc.? If not, you probably want to read perlretut first. In fact, as far as I can tell even the advanced regex features I used are explained there (with more details on each in perlre):

- `(?| )` - Alternative capture group numbering
- `(?<name> )` - Named backreferences (see also %+)
- `(?<= )` and `(?! )` - Looking ahead and looking behind
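Each of these features can be tried in isolation before reading the full pattern. The following is a small illustrative sketch; the subs `year_of`, `split_pair` and `bare_words` and their inputs are made up for the demonstration and are not part of the solution above.

```perl
use strict;
use warnings;

# (?| ) branch reset: both alternatives store their capture in group 1.
sub year_of {
    my ($date) = @_;
    return $date =~ m{\A(?| (\d{4})-\d\d-\d\d | \d\d/\d\d/(\d{4}) )\z}x
        ? $1
        : undef;
}

# (?<name> ) named capture, read back through the %+ hash.
sub split_pair {
    my ($pair) = @_;
    return $pair =~ /\A(?<key>\w+)=(?<value>.*)\z/s
        ? ( $+{key}, $+{value} )
        : ();
}

# (?<= ) lookbehind and (?! ) negative lookahead check the surroundings
# without consuming them: words preceded by whitespace and not followed by '='.
sub bare_words {
    my ($text) = @_;
    return $text =~ /(?<=\s)(\w+)\b(?!=)/g;
}

print year_of('2018-01-19'), " ", year_of('19/01/2018'), "\n";
print join( ' => ', split_pair('id=00000') ), "\n";
print join( ',', bare_words('a=1 foo b=2 bar') ), "\n";
```

The `\b` in `bare_words` matters: without it the engine could backtrack to a shorter word and sidestep the `(?!=)` check.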
Re: Splitting a long row with multiple delimiters. by 1nickt (Canon) on Jan 19, 2018 at 14:14 UTC
This does what you seem to want, with your single line of input (however, you should try to get the data in CSV format so you can use a real parser -- using a regexp for this is flaky).

(Update: I treated the first word as special, not as a key, since it has not only no value but no succeeding separator. I see from your later comments that this format matches a key as well. There is no way I can think of that would allow you to differentiate between a key with no separator or value, and a word in a multi-word value. You will have to get your data produced differently if you expect there to be both "bare" keys and multi-word values, in a string with no distinct separators between pairs.)

(Update 2: I see from your later responses that in fact the first value should be treated as a key, even though it is not followed by the key-value separator. I've updated the code to include it in the data hash, but with the empty string as the value rather than undef.)

    use strict; use warnings; use feature 'say';
    use Data::Dumper; $Data::Dumper::Sortkeys = $Data::Dumper::Indent = 1;

    chomp( my $input = <DATA> );

    my ( $first, $txt ) = split / /, $input, 2;

    my %pairs = $txt =~ /
        ( \w+ )         # capture key
        =               # separator
        (               # capture value
          (?:           # group (but don't additionally capture)
            (?!\w+=)    # (must not be followed by the next key-separator)
            .+?         # at least one character (even if just the whitespace), non-greedy
          )+            # at least one of those
        )               # end value
    /msxg;

    $pairs{ $first } = '';

    # trim trailing whitespace from values
    $_ =~ s/ $// for values %pairs;

    say Dumper \%pairs;

    __END__
    eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles= 

Output:

    $VAR1 = {
      'SYSTEM' => 'AD',
      'account_locked' => 'false',
      'admgroups' => '',
      'admin' => 'false',
      'auditclasses' => 'general,files,TCPIP',
      'auth1' => 'SYSTEM',
      'auth2' => 'NONE',
      'core' => '000000',
      'cpu' => '-1',
      'daemon' => 'true',
      'data' => '-1',
      'default_roles' => '',
      'dictionlist' => '/abc/def/ghi/jkl',
      'eab12345' => '',
      'expires' => '0',
      'fsize' => '-1',
      'gecos' => 'AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000)',
      'groups' => 'abcdefgh',
      'histexpire' => '13',
      'histsize' => '8',
      'home' => '/home/eab12345',
      'host_last_login' => '0.000.000.000',
      'host_last_unsuccessful_login' => '0.000.000.000',
      'id' => '00000',
      'login' => 'true',
      'loginretries' => '5',
      'logintimes' => '',
      'maxage' => '13',
      'maxexpired' => '0',
      'maxrepeats' => '2',
      'minage' => '0',
      'minalpha' => '1',
      'mindiff' => '1',
      'minlen' => '8',
      'minother' => '1',
      'nofiles' => '2000',
      'pgrp' => 'abcdefgh',
      'pwdchecks' => '',
      'pwdwarntime' => '5',
      'registry' => 'AD',
      'rlogin' => 'true',
      'roles' => '',
      'rss' => '65536',
      'shell' => '/usr/bin/ksh',
      'stack' => '65536',
      'su' => 'true',
      'sugroups' => 'ALL',
      'time_last_login' => '1512632113',
      'time_last_unsuccessful_login' => '1505304923',
      'tpath' => 'nosak',
      'tty_last_login' => 'ssh',
      'tty_last_unsuccessful_login' => 'ssh',
      'ttys' => 'ALL',
      'umask' => '00',
      'unsuccessful_login_count' => '0'
    }

Hope this helps!

The way forward always starts with a minimal test.
Re: Splitting a long row with multiple delimiters. by jahero (Pilgrim) on Jan 19, 2018 at 13:54 UTC
Update: actually, this does not work. The regex in question might be visually pleasing, but it is plain wrong. Shame on me...

Second possible approach. Be advised: one `=` sign out of place, and it does not work anymore.

    use feature qw/say/;
    use strict;
    use warnings;
    use Data::Dumper;

    while (my $line=<DATA>) {
        chomp $line;
        my %record = ($line =~ /([^=]+)=([^=]+) ?/g);
        say Dumper \%record;
    }

    __DATA__
    eab12345 id=00000 pgrp=abcdefgh groups=abcdefgh home=/home/eab12345 shell=/usr/bin/ksh gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000) auditclasses=general,files,TCPIP login=true su=true rlogin=true daemon=true admin=false sugroups=ALL admgroups= tpath=nosak ttys=ALL expires=0 auth1=SYSTEM auth2=NONE umask=00 registry=AD SYSTEM=AD logintimes= loginretries=5 pwdwarntime=5 account_locked=false minage=0 maxage=13 maxexpired=0 minalpha=1 minother=1 mindiff=1 maxrepeats=2 minlen=8 histexpire=13 histsize=8 pwdchecks= dictionlist=/abc/def/ghi/jkl default_roles= fsize=-1 cpu=-1 data=-1 stack=65536 core=000000 rss=65536 nofiles=2000 time_last_login=1512632113 time_last_unsuccessful_login=1505304923 tty_last_login=ssh tty_last_unsuccessful_login=ssh host_last_login=0.000.000.000 host_last_unsuccessful_login=0.000.000.000 unsuccessful_login_count=0 roles= 
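A short, made-up record shows why the pattern misbehaves: `[^=]+` is greedy, so each "value" also swallows the following key, and the pairs drift out of alignment. This sketch wraps the same regex in a hypothetical sub `greedy_pairs` purely for demonstration.

```perl
use strict;
use warnings;

# The pattern from the post above, applied to a short made-up record.
sub greedy_pairs {
    my ($line) = @_;
    my %record = ( $line =~ /([^=]+)=([^=]+) ?/g );
    return \%record;
}

my $record = greedy_pairs('a=1 b=2 c=3 d=4');
printf "<%s> => <%s>\n", $_, $record->{$_} for sort keys %$record;
```

With this input the hash ends up holding `a` with the value `1 b` and `2 c` with the value `3 d`, rather than four separate pairs.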
Re: Splitting a long row with multiple delimiters. by soonix (Chancellor) on Jan 19, 2018 at 15:28 UTC
That format is not only ugly, but also problematic when the gecos field contains an '=' sign (if it is the same as a /etc/passwd gecos field, I think it would be allowed). If you have the possibility, request that your input format be improved.
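The problem is easy to reproduce. In the sketch below, the record and the sub name `naive_pairs` are invented for illustration; the gecos text is meant to be the whole string "Jane Doe dept=42", yet any parser that looks for `word=` has no way to know that.

```perl
use strict;
use warnings;

# Hypothetical helper: split a record before every "word=" sequence.
sub naive_pairs {
    my ($line) = @_;
    my %rec;
    for my $field ( split /\s+(?=\w+=)/, $line ) {
        my ( $key, $value ) = $field =~ /\A(\w+)=(.*)\z/s or next;
        $rec{$key} = $value;
    }
    return \%rec;
}

# Intended gecos value: "Jane Doe dept=42" (made-up data).
my $rec = naive_pairs('shell=/usr/bin/ksh gecos=Jane Doe dept=42 login=true');
printf "%s => '%s'\n", $_, $rec->{$_} for sort keys %$rec;
```

The gecos value is cut off at "Jane Doe" and a spurious `dept` key appears, with nothing in the input to signal the error.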
Re: Splitting a long row with multiple delimiters. by karlgoethebier (Abbot) on Jan 19, 2018 at 14:06 UTC
"...all the "key=values" in my array...help..."

In a hurry, untested and without any warranty:

`my @pairs = grep {/(.+=.*)/} split / /, $line;`

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

`perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'`
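As a quick check of what this one-liner keeps, here is a sketch run against a shortened version of the sample record (the shortening is mine, and the capture parentheses are dropped because `grep` only needs a true or false result).

```perl
use strict;
use warnings;

my $line  = 'eab12345 id=00000 gecos=AB/C/Y0000/ABC/XYZ RTYUI, LMNOP CONTRACTOR (AS 00000) admgroups= login=true';

# Keep only the space-separated tokens that contain "something=".
my @pairs = grep { /.+=.*/ } split / /, $line;

print "$_\n" for @pairs;
```

Every token containing `=` survives, including the empty `admgroups=`, but the bare first word and the rest of the multi-word gecos value are dropped, so this only fits data whose values never contain spaces.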