Parse for a list in a long string

by vitoco (Hermit)
on Jun 02, 2015 at 17:00 UTC

vitoco has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract a list of items from a long text string with a given format. The format is pretty simple, but the number of items in each list is variable, and so is the number of lists in the same string. Of course, there are many other things in the string that must be discarded.

I tried a single regular expression to capture the items into an array, but I can get only the first or the last element of each identified list...

This is my test code:

    #!perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        s!\s+! !g;
        my $txt = $_;
        print "$_\n";
        my @items = ();
        print "FOUND: @items\n"
            if (@items = ($txt =~ m!\btest \w+(?:(?: is)? \w+)?(?: ?, ?(\w+)(?:(?: is)? \w+)?)+!ig));
    }

    __DATA__
    this line has nothing, nothing, nothing...
    1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish
    this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage

In this example, each list starts with the string "test", the elements are delimited by commas, each element may be followed by an optional "is" and another word (which must be discarded), and the first element of the list is not important and must be ignored. The given data has 3 lines; only the 2nd one contains lists (two of them), while the 1st and 3rd have none. The expected result is:

FOUND: 11 22 33 44 55 222 333

What I got is:

FOUND: 55 333

If I remove the last plus sign, I get:

FOUND: 11 222

If I remove the "g" modifier, I get only one list (with one item):

FOUND: 55

What am I missing?

Thanks!!!

Replies are listed 'Best First'.
Re: Parse for a list in a long string
by choroba (Cardinal) on Jun 02, 2015 at 17:17 UTC
    You can compose a complex regex from simpler ones. You don't need to use just one regex to do all the work, either; you can proceed in steps:

        #! /usr/bin/perl
        use warnings;
        use strict;

        my $element = qr/(?: \w+ (?: \s+ is \s+ \w+ )? )/x;

        while (<DATA>) {
            chomp;
            s/\s+/ /g;
            while (/\b test\s+ ( $element (?: , \s* $element )* )/xg) {
                my $match = $1;
                my @elements = split /\s*,\s*/, $match;
                shift @elements;
                s/\s+ is \s+ \w+//x for @elements;
                print "($_)" for @elements;
                print "\n";
            }
        }

        __DATA__
        this line has nothing, nothing, nothing...
        1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish
        this should be a test, but nothing must be returned 4444, 7777, 9999 is garbage

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Parse for a list in a long string
by AnomalousMonk (Archbishop) on Jun 02, 2015 at 17:52 UTC

    Another approach, also factoring the regexes, but this one needs Perl version 5.10+ for the  \K regex operator.

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use 5.010;
         ;;
         use Test::More 'no_plan';
         ;;
         my $s = '1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,' .
                 '55 is the best, and this is not a test 111, 222, 333 as random ' .
                 'words to finish'
                 ;
         print qq{[[$s]]};
         ;;
         my @expected = (11, 22, 33, 44, 55, 222, 333);
         ;;
         my $is   = qr{ \s+ is \b }xms;
         my $word = qr{ \s+ [[:alpha:]]+ \b }xms;
         ;;
         my $sep = qr{ (?: $is $word?)? \s* , \s* }xms;
         ;;
         my $extract = qr{ (?: (?: \G (?<! \A)) | test \s+ \d+) $sep \K \d+ }xms;
         ;;
         my @got;
         ;;
         @got = $s =~ m{ $extract }xmsg;
         is_deeply \@got, \@expected, qq{(@got)};
         ;;
         @got = 'this line has nothing, nothing, nothing...' =~ m{ $extract }xmsg;
         is_deeply \@got, [], 'empty';
         ;;
         @got = 'this should be a test, with nothing returned 444, 777, 999 is junk' =~ m{ $extract }xmsg;
         is_deeply \@got, [], 'also empty';
        "
        [[1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is the best, and this is not a test 111, 222, 333 as random words to finish]]
        ok 1 - (11 22 33 44 55 222 333)
        ok 2 - empty
        ok 3 - also empty
        1..3


    Give a man a fish:  <%-(-(-(-<
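
The \G and \K combination in the reply above can be hard to read at full size, so here is a stripped-down sketch of the same idea. The input string and variable names are made up for illustration, and the sketch writes the "not at start of string" guard as `\G(?!\A)` rather than the reply's `\G (?<! \A)`; treat it as an assumption-laden illustration, not as the reply's exact pattern. Each match either starts fresh at "test" plus the first (ignored) element, or continues exactly where the previous match ended; \K then drops everything matched so far, so only the digits are returned.

```perl
#!/usr/bin/perl
use 5.010;    # \K needs Perl 5.10 or later
use strict;
use warnings;

# Hypothetical input: one list introduced by "test", preceded by noise.
my $s = 'x 9, 8 test 0, 1, 2 and 3';

# Alternative 1: \G(?!\A) continues from the end of the previous match,
#                but refuses to match at the very start of the string.
# Alternative 2: "test" plus the first element, which is to be ignored.
# Then a comma separator, \K to discard what was matched so far,
# and finally the digits that are actually returned.
my @got = $s =~ /(?:\G(?!\A)|\btest\s+\d+)\s*,\s*\K\d+/g;

print "got: @got\n";
```

The numbers before "test" and the trailing "3" are not returned, because the chain of \G-anchored matches can only begin at a "test" and breaks at the first element that is not preceded by a comma.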

Re: Parse for a list in a long string
by Anonymous Monk on Jun 02, 2015 at 19:35 UTC

    It's really a double loop.

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1128809 use strict; use warnings; while(<DATA>) { my @items; s/\s+/ /g; # simplify push @items, $1 =~ /, *(\w+)/g while /\btest \w+((?:, *\w+(?: is \w+ +)?)+)/g; @items and print "FOUND: @items\n"; } __DATA__ this line has nothing, nothing, nothing... 1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,55 is + the best, and this is not a test 111, 222, 333 as random words to + finish this should be a test, but nothing must be returned 4444, 7777, 9999 i +s garbage
Re: Parse for a list in a long string
by vitoco (Hermit) on Jun 02, 2015 at 20:39 UTC

    Thanks for all the responses and ideas.

    I was aware that this was a double loop problem, but I tried it with a single regexp anyway, and ran into the fact that if I add a quantifier to a capturing group, I get only the last match for that group in the result... I was trying to bypass that!
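
A minimal sketch of that behaviour, using a made-up input string rather than the thread's test data: a capture group inside a repeated group is overwritten on every repetition, so only the last value survives, whereas capturing the whole repeated run and scanning it with a second /g match returns every element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative input (hypothetical, not from the thread's test data).
my $text = 'test 0,1,2,3';

# The capture group sits inside a repeated group, so each repetition
# overwrites $1; only the value from the final repetition survives.
my @last_only = $text =~ /test \w+(?:,(\w+))+/;
print "quantified capture: @last_only\n";

# Capture the whole repeated run once, then scan that run with a
# second /g match in list context to collect every element.
my @all;
if ($text =~ /test \w+((?:,\w+)+)/) {
    my $run = $1;
    @all = $run =~ /,(\w+)/g;
}
print "two-step capture:   @all\n";
```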

    This line solved my problem in a very simple way:

    push @items, $1 =~ /, ?(\w+)/g while /\btest \w+(?:(?: is)? \w+)?((?: ?, ?\w+(?:(?: is)? \w+)?)+)/ig;
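
For anyone who wants to try that line in context, here is a sketch that wraps the same two patterns in a small subroutine. The name `extract_items` is a hypothetical helper, not something from the thread, and the captured run is copied into a lexical before the inner match purely as a defensive choice; the one-liner above matches against `$1` directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# extract_items is a hypothetical helper name, not from the thread.
# For each list introduced by "test", it returns every element
# after the first one, dropping any trailing "is <word>".
sub extract_items {
    my ($line) = @_;
    $line =~ s/\s+/ /g;    # collapse whitespace runs to single spaces
    my @items;
    # Outer loop: one iteration per list; $1 is the run of elements
    # that follows the first (ignored) element.
    while ($line =~ /\btest \w+(?:(?: is)? \w+)?((?: ?, ?\w+(?:(?: is)? \w+)?)+)/ig) {
        my $run = $1;
        # Inner match: the word right after each comma is an item.
        push @items, $run =~ /, ?(\w+)/g;
    }
    return @items;
}

# Second line of the thread's test data.
my $line = '1 , 2, 3, 4 is four, 5, 6 test 00,11 is one,22, 33 is three,44,'
         . '55 is the best, and this is not a test 111, 222, 333 as random '
         . 'words to finish';

my @found = extract_items($line);
print "FOUND: @found\n" if @found;
```

On the other two lines of the test data the outer pattern never matches, so the subroutine returns an empty list and nothing is printed.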
