Parse for a list in a long string

vitoco has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract lists of items from a long text string with a given format. The format is pretty simple, but the number of items in each list is variable, and so is the number of lists in the string. Of course, the string also contains many other things that must be discarded.

I tried a single regular expression to capture the items into an array, but I only get the first or the last element of each identified list...

This is my test code:

#!perl

use strict;
use warnings;

while (<DATA>) {
  chomp;
  s!\s+! !g;
  my $txt = $_;
  print "$_\n";
  my @items = ();
  print "FOUND: @items\n" if (@items = ($txt =~ m!\btest \w+(?:(?: is)? \w+)?(?: ?, ?(\w+)(?:(?: is)? \w+)?)+!ig));
}

__DATA__
this line has nothing, nothing, nothing...
1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22,  33 is three,44,55 is  the best, and  this  is not a test 111, 222, 333 as random  words to finish
this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage

In this example, each list starts with the string "test", the elements are delimited by commas, each element may be followed by an optional "is" and another word (which must be discarded), and the first element of the list is not important and must be ignored. The given data has 3 lines; only the 2nd one contains lists (two of them), while the 1st and 3rd contain none. The expected result is:

FOUND: 11 22 33 44 55 222 333

What I got is:

FOUND: 55 333

If I remove the last plus sign, I get:

FOUND: 11 222

If I remove the "g" modifier, I get only one list (with one item):

FOUND: 55

What am I missing?

Thanks!!!


Re: Parse for a list in a long string by choroba (Cardinal) on Jun 02, 2015 at 17:17 UTC
You can compose a complex regex from simple ones. You don't need to use just one regex to do all the work, either; you can progress in steps:

#! /usr/bin/perl
use warnings;
use strict;

my $element = qr/(?: \w+ (?: \s+ is \s+ \w+ )? )/x;

while (<DATA>) {
    chomp;
    s/\s+/ /g;
    while (/\b test\s+ ( $element (?: , \s* $element )* )/xg) {
        my $match = $1;
        my @elements = split /\s*,\s*/, $match;
        shift @elements;
        s/\s+ is \s+ \w+//x for @elements;
        print "($_)" for @elements;
        print "\n";
    }
}

__DATA__
this line has nothing, nothing, nothing...
1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish
this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage

Re: Parse for a list in a long string by AnomalousMonk (Archbishop) on Jun 02, 2015 at 17:52 UTC
Another approach, also factoring the regexes, but this one needs Perl version 5.10+ for the `\K` regex operator.

c:\@Work\Perl\monks>perl -wMstrict -le
"use 5.010;
 ;;
 use Test::More 'no_plan';
 ;;
 my $s =
   '1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,' .
   '55 is the best, and this is not a test 111, 222, 333 as random ' .
   'words to finish'
   ;
 print qq{[[$s]]};
 ;;
 my @expected = (11, 22, 33, 44, 55, 222, 333);
 ;;
 my $is = qr{ \s+ is \b }xms;
 my $word = qr{ \s+ [[:alpha:]]+ \b }xms;
 ;;
 my $sep = qr{ (?: $is $word?)? \s* , \s* }xms;
 ;;
 my $extract = qr{ (?: (?: \G (?<! \A)) | test \s+ \d+) $sep \K \d+ }xms;
 ;;
 my @got;
 ;;
 @got = $s =~ m{ $extract }xmsg;
 is_deeply \@got, \@expected, qq{(@got)};
 ;;
 @got = 'this line has nothing, nothing, nothing...' =~ m{ $extract }xmsg;
 is_deeply \@got, [], 'empty';
 ;;
 @got = 'this should be a test, with nothing returned 444, 777, 999 is junk' =~ m{ $extract }xmsg;
 is_deeply \@got, [], 'also empty';
"
[[1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish]]
ok 1 - (11 22 33 44 55 222 333)
ok 2 - empty
ok 3 - also empty
1..3

Re: Parse for a list in a long string by Anonymous Monk on Jun 02, 2015 at 19:35 UTC
It's really a double loop.

#!/usr/bin/perl
# http://perlmonks.org/?node_id=1128809
use strict;
use warnings;

while(<DATA>) {
    my @items;
    s/\s+/ /g; # simplify
    push @items, $1 =~ /, (\w+)/g while /\btest \w+((?:, \w+(?: is \w+)?)+)/g;
    @items and print "FOUND: @items\n";
}

__DATA__
this line has nothing, nothing, nothing...
1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish
this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage

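For readers who find the statement-modifier form dense, the same double loop can be written out as two explicit `while` loops. This is only a sketch: it reuses the two regexes from the reply above unchanged, while the helper name `extract_items` and the sample string are made up for illustration. Like the reply's regexes, it expects a comma followed by exactly one space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the items of every
# "test" list found in $text, skipping the first element of each list.
sub extract_items {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # collapse runs of whitespace, as in the reply
    my @items;

    # Outer loop: one iteration per list. $1 holds everything after the
    # first (ignored) element, e.g. ", 222, 333".
    while ($text =~ /\btest \w+((?:, \w+(?: is \w+)?)+)/g) {
        my $tail = $1;     # copy it: the inner match overwrites $1

        # Inner loop: one iteration per item inside that tail.
        while ($tail =~ /, (\w+)/g) {
            push @items, $1;
        }
    }
    return @items;
}

# Made-up sample input containing two lists.
my @found = extract_items('not a test 111, 222, 333 as random words, test a, b is bee, c');
print "FOUND: @found\n";   # FOUND: 222 333 b c
```

The outer match runs in scalar context with `/g`, so each pass resumes where the previous list ended; the inner loop works on a copy of the captured text and does not disturb that position.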
Re: Parse for a list in a long string by vitoco (Hermit) on Jun 02, 2015 at 20:39 UTC
Thanks for all the responses and ideas. I was aware that this was a double-loop problem, but I tried it with a single regexp anyway and ran into the fact that when a quantifier is applied to a capturing group, only the last match for that group ends up in the result... I was trying to bypass that! This line solved my problem in a very simple way: `push @items, $1 =~ /, ?(\w+)/g while /\btest \w+(?:(?: is)? \w+)?((?: ?, ?\w+(?:(?: is)? \w+)?)+)/ig;`
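
The behaviour described here (a capture group under a quantifier keeps only its last repetition) can be seen in isolation with a much smaller pattern. The sketch below is not code from the thread: the helper names `last_item_only` and `all_items`, the simplified patterns, and the sample string are all assumptions chosen to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; neither appears in the thread.

# One regex, with the capture group inside the quantified group: each
# repetition overwrites $1, so only the last item is returned.
sub last_item_only {
    my ($str) = @_;
    my ($last) = $str =~ /\btest \w+(?:,(\w+))+/;
    return $last;
}

# Two steps: capture the whole repeated run once, then scan that run
# with a second match in list context under /g.
sub all_items {
    my ($str) = @_;
    my ($run) = $str =~ /\btest \w+((?:,\w+)+)/ or return;
    return $run =~ /,(\w+)/g;
}

my $list = 'test 00,11,22,33';
print "single regex: ", last_item_only($list), "\n";   # single regex: 33
print "two steps: @{[ all_items($list) ]}\n";          # two steps: 11 22 33
```

Moving the parentheses so that they surround the whole quantified group, rather than sitting inside it, is the same change that turns the original single regex into the working one-liner above.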

