Parsing a line of text items

mikkoi has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing a line of text items by philipbailey (Curate) on Mar 30, 2021 at 12:23 UTC
I often use Text::ParseWords for this problem. It has the advantage of being a core module. `use strict; use warnings; use feature "say"; use Text::ParseWords; my $args = '23 45.67 "John Marcus" Surname'; my @parsed = parse_line('\s+', 0, $args); say for @parsed;` Output: `23 45.67 John Marcus Surname`
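As a small illustration of the `parse_line` call above, the sketch below varies the second argument (the "keep" flag) and also runs padded input through `shellwords`, which the same module exports. The padded sample string is made up for this example.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line shellwords);

my $line = '23 45.67 "John Marcus" Surname';

# keep = 0 strips the quotes; keep = 1 leaves them in the field.
my @stripped = parse_line('\s+', 0, $line);
my @kept     = parse_line('\s+', 1, $line);

# shellwords() copes with leading, trailing and repeated whitespace,
# so this made-up padded variant yields the same four items.
my @padded = shellwords(qq{  23   45.67 "John Marcus"  Surname  });

print join('|', @stripped), "\n";   # 23|45.67|John Marcus|Surname
print join('|', @kept),     "\n";   # 23|45.67|"John Marcus"|Surname
print join('|', @padded),   "\n";   # 23|45.67|John Marcus|Surname
```

Joining with `|` only makes the field boundaries visible in the output; it is not part of the parsing.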
Re^2: Parsing a line of text items by mikkoi (Beadle) on Mar 31, 2021 at 11:26 UTC
Thanks. I had no idea Text-ParseWords existed. This is the ideal solution. And it is in the core! I also tested Text-CSV and while good, it left some problems, especially the possible multiple whitespace between words.
Re^3: Parsing a line of text items by Bod (Parson) on Apr 04, 2021 at 22:12 UTC
> I had no idea Text-ParseWords existed
Likewise... Perhaps more time is needed studying the list of core modules.
Re^2: Parsing a line of text items by LanX (Saint) on Mar 30, 2021 at 15:58 UTC
Oh, Text::ParseWords is pretty cool, thanks for sharing. :) (So many core modules need more attention.)
> It has the advantage of being a core module.
Indeed. `C:\Strawberry\perl\bin>corelist Text::ParseWords Data for 2021-01-23 Text::ParseWords was first released with perl 5` Though it exports a lot by default: `our @EXPORT = qw(shellwords quotewords nested_quotewords parse_line);` And you can tell the documentation is old; it could have more examples. Cheers Rolf (addicted to the Perl Programming Language :) Wikisyntax for the Monastery)
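On the two points above (default exports and sparse examples), here is a minimal sketch. It assumes only standard Exporter behaviour, where an explicit import list replaces the default one, and the sample strings are invented for illustration.

```perl
use strict;
use warnings;
# An explicit import list replaces the default @EXPORT list.
use Text::ParseWords qw(quotewords nested_quotewords);

# quotewords() accepts any delimiter regex, here a comma.
my @fields = quotewords(',', 0, 'a,"b,c",d');
print join('|', @fields), "\n";   # a|b,c|d

# nested_quotewords() returns one array reference per input line.
my @rows = nested_quotewords('\s+', 0, 'a "b c"', 'd e');
printf "%d rows, first row has %d fields\n", scalar @rows, scalar @{ $rows[0] };
```

Writing `use Text::ParseWords ();` and calling `Text::ParseWords::parse_line(...)` by its full name would avoid importing anything at all.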
Re: Parsing a line of text items by hippo (Archbishop) on Mar 30, 2021 at 11:25 UTC
A Text::CSV solution: `use strict; use warnings; use Text::CSV; use Test::More tests => 2; my $in = '23 45.67 "John Marcus" Surname'; my $want = [23, 45.67, 'John Marcus', 'Surname']; my $csv = Text::CSV->new ({sep_char => ' '}); ok $csv->parse ($in), 'Parsing'; is_deeply [$csv->fields], $want, 'Fields match';` You will probably want to extend the tests to better reflect your real-world requirements. 🦛
Re: Parsing a line of text items by choroba (Cardinal) on Mar 30, 2021 at 12:11 UTC
Use glob. But make sure the input doesn't contain *, ?, and {}. `#!/usr/bin/perl use warnings; use strict; use feature qw{ say }; sub parse_args { my ($input) = @_; return [glob $input] } use Test::More tests => 1; is_deeply parse_args('23 45.67 "John Marcus" Surname'), [23, 45.67, 'John Marcus', 'Surname'];` `map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
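One way to act on that caveat is to refuse input containing glob metacharacters before handing it to `glob`. The sketch below is an assumption rather than part of the post: `parse_args_checked` is a hypothetical name, and the character class also rejects `[`, `]`, `~` and backslash, which is stricter than the warning above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array ref of items, or undef when the
# input contains characters that glob could treat as a pattern.
sub parse_args_checked {
    my ($input) = @_;
    return if $input =~ /[*?\[\]{}~\\]/;   # refuse wildcard-like input
    return [ glob $input ];
}

my $items = parse_args_checked('23 45.67 "John Marcus" Surname');
print join('|', @$items), "\n";

# A wildcard would be expanded against the file system, so it is rejected.
print "rejected\n" unless defined parse_args_checked('*.txt "John Marcus"');
```

Rejecting backslashes means escaped quotes are not supported by this guard; whether that matters depends on the real input.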
Re: Parsing a line of text items (updated) by AnomalousMonk (Archbishop) on Mar 30, 2021 at 16:19 UTC
A Text::CSV (or Text::CSV_XS for speed) solution seems very appropriate, but if you need to roll your own, maybe something like: `Win8 Strawberry 5.30.3.1 (64) Tue 03/30/2021 11:53:39 C:\@Work\Perl\monks >perl -Mstrict -Mwarnings use 5.010; # needs (?|...) branch reset my $rx_dq_body = qr{ [^\\"]* (?: \\. [^\\"]* )* }xms; my $rx_unquoted = qr{ \S+ }xms; for my $args ( '', ' ', '23 45.67 "John Marcus O\"Ddly" Surname', '"only \"quoted\" thing"', 'no quoted stuff', ) { my $got_parsed_args = my @parsed_args = $args =~ m{ \G \s* (?| " ($rx_dq_body) " | ($rx_unquoted)) }xmsg; print ">$args< -> "; if ($got_parsed_args) { printf "%s \n", join ' ', map ">$_<", @parsed_args; } else { print "nada \n"; } } ^Z >< -> nada > < -> nada >23 45.67 "John Marcus O\"Ddly" Surname< -> >23< >45.67< >John Marcus O\"Ddly< >Surname< >"only \"quoted\" thing"< -> >only \"quoted\" thing< >no quoted stuff< -> >no< >quoted< >stuff<` This needs Perl version 5.10+ for the `(?|...)` "branch reset" operator, but modification for pre-5.10 Perls is simple; let me know if you need it. The `$rx_dq_body` regex to match a double-quoted body supports embedded escaped double-quotes (and any other escaped character). You can play with this regex to get exactly what you want/need. Of course, lots of tests should be done to verify this (or any other solution) really does what you want. Update: For some reason, I included a `\G \s*` group in the regex above. It is entirely unnecessary although it does no harm AFAICT. The match regex `m{ (?| " ($rx_dq_body) " | ($rx_unquoted)) }xmsg` should be exactly equivalent. Give a man a fish: `<%-{-{-{-<`
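The pre-5.10 modification is not shown in the post. One possible shape, offered only as a guess at what was meant, replaces the branch reset with two ordinary capture groups and picks whichever one matched. The sub name `parse_items` is hypothetical; the two `qr` patterns are the ones from the post.

```perl
use strict;
use warnings;

my $rx_dq_body  = qr{ [^\\"]* (?: \\. [^\\"]* )* }xms;
my $rx_unquoted = qr{ \S+ }xms;

# Hypothetical helper: same matching as above, without (?|...).
sub parse_items {
    my ($args) = @_;
    my @items;
    while ($args =~ m{ \G \s* (?: " ($rx_dq_body) " | ($rx_unquoted) ) }xmsg) {
        # $1 is set for a quoted item, $2 for an unquoted one.
        push @items, defined $1 ? $1 : $2;
    }
    return @items;
}

print join('|', parse_items('23 45.67 "John Marcus O\"Ddly" Surname')), "\n";
```

As in the original, backslash escapes inside quotes are left in place; adding `s/\\(.)/$1/g` on the quoted branch would remove them if that is the behaviour wanted.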
Re^2: Parsing a line of text items by LanX (Saint) on Mar 30, 2021 at 18:09 UTC
I can understand the challenge to hack it by yourself ... :) But I think the suggested Text::ParseWords is core and offers everything I expect from parsing a command line. It also has tests, is customizable, and the source is well structured and documented. So if I "wanna roll my own" and need to make special adjustments (e.g. paired `{quotes}`), I can take the code as a base. `DB<94> use Text::ParseWords qw/shellwords/ DB<96> x shellwords(q{this is 'an example' "with different quoting and \" escaping" including\ escaped\ whitespace}) 0 'this' 1 'is' 2 'an example' 3 'with different quoting and " escaping' 4 'including escaped whitespace' DB<97>` In case larger files need to be parsed I'll consider a dependency on Text::CSV, but this really looks good. Cheers Rolf
Re^3: Parsing a line of text items by AnomalousMonk (Archbishop) on Apr 01, 2021 at 07:01 UTC
I would tend to agree that an approach using a reliable, common module like Text::ParseWords (of which I had not previously been aware -- thanks, philipbailey++) or Text::CSV is usually best. But I wanted to give an example of a "pure" regex approach. As an aside, I think it's worth emphasizing again that whatever approach is taken, a thorough suite of tests for the final code is advisable even if the approach is based on well-tested modules.
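To make the testing advice concrete, a table-driven Test::More script is one common shape. The cases below are illustrative assumptions, not requirements taken from the thread, and `shellwords` merely stands in for whichever parser is finally chosen.

```perl
use strict;
use warnings;
use Test::More;
use Text::ParseWords qw(shellwords);

# Each case: [ description, input line, expected fields ].
my @cases = (
    [ 'original example', '23 45.67 "John Marcus" Surname',
                          [ '23', '45.67', 'John Marcus', 'Surname' ] ],
    [ 'repeated blanks',  '23   45.67',       [ '23', '45.67' ] ],
    [ 'padded line',      '  padded  line  ', [ 'padded', 'line' ] ],
    [ 'escaped quote',    '"say \"hi\" now"', [ 'say "hi" now' ] ],
    [ 'single quotes',    q{'an example' x},  [ 'an example', 'x' ] ],
    [ 'empty line',       '',                 [] ],
);

for my $case (@cases) {
    my ($name, $line, $want) = @$case;
    is_deeply [ shellwords($line) ], $want, $name;
}

done_testing;
```

New edge cases found in real data can then be added as one more row in `@cases`.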
Re^4: Parsing a line of text items by LanX (Saint) on Apr 01, 2021 at 10:08 UTC
Re: Parsing a line of text items by LanX (Saint) on Mar 30, 2021 at 11:21 UTC
Update: scratch it, this doesn't work. It could, but it takes too much effort to figure out. Better use the Text::CSV approach. Maybe: `DB<45> p $_ 23 45.67 "John Marcus" Surname 23 45.67 "John Marcus" Surname DB<46> say $2 while /(?:^|("|\s+))(.*?)\1/g 45.67 John Marcus Surname 45.67 John Marcus Surname DB<47>` Here are dragons, no guarantee whatsoever. Edit: as expected, it only works if the input ends with whitespace, and I had problems using `(?:$|\1)` at the end. Cheers Rolf
Re^2: Parsing a line of text items by LanX (Saint) on Mar 30, 2021 at 12:20 UTC
Just for fun: this seems to work if the input is surrounded by exactly one whitespace character, but don't try to escape double quotes. $1 is the whitespace, $2 the optional double quote, $3 the enclosed text. `DB<89> p "'$_'" ' 23 45.67 "John Marcus" Surname 23 45.67 "John Marcus" Surname extra ' DB<90> say "'$3'" while /\s*(\s)("?)(.*?)\2(?=\1)/g '23' '45.67' 'John Marcus' 'Surname' '23' '45.67' 'John Marcus' 'Surname' 'extra' DB<91>` For testing I'd suggest automatically generating strings from random input; that way you can cover a large set of cases. NB: here are still dragons. Cheers Rolf
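The random-input idea can be sketched as a round-trip check: build random fields, join them into a line, parse the line, and compare. Everything here is an assumption for illustration, including the helper names `random_field`, `random_case` and `fuzz`, the alphabet (no quotes or backslashes), and the use of `shellwords` as the parser under test.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Build one random field from letters, digits, dots and spaces.
sub random_field {
    my @chars = ('a' .. 'z', 0 .. 9, '.', ' ');
    my $field = join '', map { $chars[ rand @chars ] } 1 .. 1 + int rand 8;
    $field =~ s/^\s+|\s+$//g;                  # no leading/trailing blanks
    return length $field ? $field : 'x';       # never return an empty field
}

# Join random fields into a line; fields containing blanks get quoted,
# and each field is followed by one to three spaces.
sub random_case {
    my @fields = map { random_field() } 1 .. 1 + int rand 5;
    my $line = join '', map {
        my $sep = ' ' x (1 + int rand 3);
        (/\s/ ? qq{"$_"} : $_) . $sep;
    } @fields;
    return ($line, \@fields);
}

# Parse many random lines and count the ones that do not round-trip.
sub fuzz {
    my ($rounds) = @_;
    my $failures = 0;
    for (1 .. $rounds) {
        my ($line, $want) = random_case();
        my @got = shellwords($line);
        $failures++ unless join("\0", @got) eq join("\0", @$want);
    }
    return $failures;
}

srand 42;   # fixed seed so a failing run can be repeated
print "failures: ", fuzz(200), "\n";
```

Extending the alphabet to quotes and backslashes would need matching escaping in `random_case`, which is where most of the dragons are likely to live.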
