vitoco has asked for the wisdom of the Perl Monks concerning the following question:
I want to get a list of items from a long text string with a given format. The format is pretty simple, but the number of items in each list is variable, and so is the number of lists in the same string. Of course, there are many other things in the string that must be discarded.
I tried a single regular expression to capture the items into an array, but I can get only the first or the last element of each identified list...
This is my test code:
#!perl
use strict;
use warnings;
while (<DATA>) {
    chomp;
    s!\s+! !g;
    my $txt = $_;
    print "$_\n";
    my @items = ();
    print "FOUND: @items\n" if (@items = ($txt =~ m!\btest \w+(?:(?: is)? \w+)?(?: ?, ?(\w+)(?:(?: is)? \w+)?)+!ig));
}
__DATA__
this line has nothing, nothing, nothing...
1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish
this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage
In this example, each list starts with the string "test", the elements are delimited by commas, each element may be followed by an optional "is" and another word (which must be discarded), and the first element of the list is not important and must be ignored. The given data has 3 lines; only the 2nd one contains lists (two of them), while the 1st and 3rd have none. The expected result is:
FOUND: 11 22 33 44 55 222 333
What I got is:
FOUND: 55 333
If I remove the last plus sign, I get:
FOUND: 11 222
If I remove the "g" modifier, I get only one list (with one item):
FOUND: 55
What am I missing?
Thanks!!!
Re: Parse for a list in a long string
by choroba (Cardinal) on Jun 02, 2015 at 17:17 UTC
You can compose a complex regex from simple ones. You don't need to use just one regex to do all the work, either; you can progress in steps:
#! /usr/bin/perl
use warnings;
use strict;
my $element = qr/(?: \w+ (?: \s+ is \s+ \w+ )? )/x;
while (<DATA>) {
    chomp;
    s/\s+/ /g;
    while (/\b test\s+ ( $element (?: , \s* $element )* )/xg) {
        my $match = $1;
        my @elements = split /\s*,\s*/, $match;
        shift @elements;
        s/\s+ is \s+ \w+//x for @elements;
        print "($_)" for @elements;
        print "\n";
    }
}
__DATA__
this line has nothing, nothing, nothing...
1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish
this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage
Re: Parse for a list in a long string
by AnomalousMonk (Archbishop) on Jun 02, 2015 at 17:52 UTC
Another approach, also factoring the regexes, but this one needs Perl version 5.10+ for the \K regex operator.
c:\@Work\Perl\monks>perl -wMstrict -le
"use 5.010;
;;
use Test::More 'no_plan';
;;
my $s = '1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,'
. '55 is the best, and this is not a test 111, 222, 333 as random '
. 'words to finish'
;
print qq{[[$s]]};
;;
my @expected = (11, 22, 33, 44, 55, 222, 333);
;;
my $is = qr{ \s+ is \b }xms;
my $word = qr{ \s+ [[:alpha:]]+ \b }xms;
;;
my $sep = qr{ (?: $is $word?)? \s* , \s* }xms;
;;
my $extract = qr{
(?: (?: \G (?<! \A)) | test \s+ \d+) $sep
\K \d+
}xms;
;;
my @got;
;;
@got = $s =~ m{ $extract }xmsg;
is_deeply \@got, \@expected, qq{(@got)};
;;
@got = 'this line has nothing, nothing, nothing...' =~ m{ $extract }xmsg;
is_deeply \@got, [], 'empty';
;;
@got = 'this should be a test, with nothing returned 444, 777, 999 is junk'
=~ m{ $extract }xmsg;
is_deeply \@got, [], 'also empty';
"
[[1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish]]
ok 1 - (11 22 33 44 55 222 333)
ok 2 - empty
ok 3 - also empty
1..3
Give a man a fish: <%-(-(-(-<
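For readers unfamiliar with the two constructs used above, here is a minimal sketch of the same idea on a smaller pattern. It is not AnomalousMonk's exact regex: the sample string and the sub name list_items are made up for illustration, and (?!\A) is used as an assumed equivalent of the (?<! \A) guard. \G lets each /g iteration resume exactly where the previous match ended, and \K drops everything matched so far from the returned text, so only the item itself comes back.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper: return every item after the first one in each
# "test X,Y,Z" list found in the string.
sub list_items {
    my ($s) = @_;
    return $s =~ m{
        (?: \G (?!\A)          # resume right after the previous item
                               # (but never at the very start of the string)
          | \b test \s+ \w+    # or open a new list and skip its first element
        )
        \s* , \s*              # separator between elements
        \K \w+                 # forget everything so far, keep only the item
    }xg;
}

# Hypothetical sample, not the thread's test data.
my @items = list_items('skip 1,2 test 00,11,22 then 7,8');
print "@items\n";    # prints "11 22"
```

The "1,2" and "7,8" runs are ignored because neither starts with "test" nor sits directly after a previous match, which is exactly what \G enforces.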
Re: Parse for a list in a long string
by Anonymous Monk on Jun 02, 2015 at 19:35 UTC
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1128809
use strict;
use warnings;
while(<DATA>)
{
    my @items;
    s/\s+/ /g; # simplify
    push @items, $1 =~ /, *(\w+)/g while /\btest \w+((?:, *\w+(?: is \w+)?)+)/g;
    @items and print "FOUND: @items\n";
}
__DATA__
this line has nothing, nothing, nothing...
1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish
this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage
Re: Parse for a list in a long string
by vitoco (Hermit) on Jun 02, 2015 at 20:39 UTC
Thanks for all the responses and ideas.
I was aware that this was a double-loop problem, but I tried it with a single regexp anyway, and ran into the fact that when a quantifier is applied to a capturing group, only the last match for that group is kept in the result... I was trying to bypass that!
This line solved my problem in a very simple way:
push @items, $1 =~ /, ?(\w+)/g while /\btest \w+(?:(?: is)? \w+)?((?: ?, ?\w+(?:(?: is)? \w+)?)+)/ig;
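To make the difference concrete, here is a small sketch comparing the two behaviours. It uses simplified patterns and a made-up sample string rather than the exact regexes above, and the sub names are hypothetical. With one regex, each repetition of the quantified group overwrites $1, so only the last item survives; capturing the whole run once and rescanning it with a second /g match returns every item.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one regex, capture group inside a quantified group.
# Every repetition overwrites $1, so only the final item is returned.
sub items_single_regex {
    my ($txt) = @_;
    return $txt =~ /\btest \w+(?:,(\w+)(?: is \w+)?)+/;
}

# Hypothetical helper: capture the whole run of ",item" pieces once,
# then scan that captured text with a second /g match.
sub items_two_step {
    my ($txt) = @_;
    my @items;
    while ($txt =~ /\btest \w+((?:,\w+(?: is \w+)?)+)/g) {
        my $run = $1;                  # e.g. ",11 is one,22,33"
        push @items, $run =~ /,(\w+)/g;
    }
    return @items;
}

# Made-up sample string, not the thread's test data.
my $txt = 'a test 00,11 is one,22,33 end';
my @single   = items_single_regex($txt);
my @two_step = items_two_step($txt);
print "single:   @single\n";      # prints "single:   33"
print "two-step: @two_step\n";    # prints "two-step: 11 22 33"
```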