Parse::RecDescent and csv-like data

rkg has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I am trying to get up to speed on Parse::RecDescent. The FAQ offers an idiom (credited to Merlyn) for matching CSV-like data:

 CSVLine: QuotedText(s Comma) { do something}

I'm trying to write a rule to match a term. A term is one or more words, separated by underscores. Neither terms nor words can contain whitespace. These are valid terms: apple, apple_pear, apple_plum_grape. These are not valid terms: _, _apple, apple_, apple__pear. Can anyone give me guidance on writing the rule for term and word? Do I need a <skip> directive to indicate a term cannot contain whitespace? Here's what I have, which does not work. Many thanks!

my $g = q(

    Word: /^[A-Z]+$/i
    {return $item[1]}

    TermSep: '_'

    Term: Word(s TermSep) { use Data::Dumper; Dumper($item[1]) }

    Line: Term
    {$item[1]}

  );

rkg

Replies are listed 'Best First'.
Re: Parse::RecDescent and csv-like data by Elian (Parson) on Aug 21, 2003 at 13:45 UTC
You don't have to get that fancy. Something like `word: /[a-zA-Z]+(_[a-zA-Z]+)*/` should be sufficient.
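To try the single-token pattern outside a grammar, it can be checked against the examples from the question. This is only a minimal sketch: the `\A` and `\z` anchors are added here so the whole string has to match, and `is_term` is a hypothetical helper name, not something from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the reply above, anchored with \A and \z (an addition for
# standalone testing) and using a non-capturing group.
my $term_re = qr/\A[a-zA-Z]+(?:_[a-zA-Z]+)*\z/;

# Hypothetical helper: true if the whole string is a valid term.
sub is_term {
    my ($candidate) = @_;
    return $candidate =~ $term_re ? 1 : 0;
}

# Valid and invalid examples taken from the original question.
for my $candidate (qw(apple apple_pear apple_plum_grape _ _apple apple_ apple__pear)) {
    printf "%-18s %s\n", $candidate, is_term($candidate) ? 'valid' : 'invalid';
}
```

The first three candidates should be reported as valid and the remaining four as invalid.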
Re: Parse::RecDescent and csv-like data by dreadpiratepeter (Priest) on Aug 21, 2003 at 13:47 UTC
Wouldn't `Term: /[A-Z]|[A-Z][A-Z_]*[A-Z]/i {return \split(/_/,$item[1])}` give you what you want? I could be wrong, it's off the top of my head, but it should return a term as a list of its parts (which seems to be what you want). -pete "Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."
Re: Parse::RecDescent and csv-like data by Abigail-II (Bishop) on Aug 21, 2003 at 14:51 UTC
Instead of using Parse::RecDescent, may I suggest a Regexp::Common solution? Parsing lists is one of its options.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Regexp::Common;

    my $re = $RE{list}{-sep => '_'}{-pat => '(?:(?!_)\w)+'};

    while (<DATA>) {
        chomp;
        print "'$_' ", /^$re$/ ? "matched\n" : "did not match\n";
    }

    __DATA__
    apple_pear
    apple_plum_grape
    _
    apple
    _apple
    apple_
    apple__pear

Output:

    'apple_pear' matched
    'apple_plum_grape' matched
    '_' did not match
    'apple' did not match
    '_apple' did not match
    'apple_' did not match
    'apple__pear' did not match

Abigail
Re: Parse::RecDescent and csv-like data by gjb (Vicar) on Aug 21, 2003 at 14:18 UTC
Although I agree that it might be better to consider 'apple_pear' as one token rather than two (as suggested by Elian and dreadpiratepeter above), the code below should do what you want:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Parse::RecDescent;

    my $text = <<EOI;
    apple_pear
    cherry
    munchy_nice_apricot
    mint_
    _banana
    raspberry__pie
    EOI

    # Skip only spaces and tabs, so newlines stay significant.
    $Parse::RecDescent::skip = '[ \t]*';

    my $grammar = q(
        { use Data::Dumper }

        data: line(s)
        line: term endofline
        term: word(s /_/) ...endofline
            { print Dumper(\%item); }
        word: /[a-z]+/
        endofline: /[\n\r]+/
    );

    my $parser = Parse::RecDescent->new($grammar);
    if ($parser->data($text)) {
        print "ok\n";
    }

Hope this helps, -gjb-

Update: changed the code to suit rkg's requirement that '_apple', 'apple__juice' and 'apple_' should not be accepted. The grammar happens to be more elegant now and I learned about separator patterns.

Update 2: removed an unused token from the grammar.
Re: Re: Parse::RecDescent and csv-like data by rkg (Hermit) on Aug 21, 2003 at 16:05 UTC
Thanks for your help. Your code, I think, accepts apple__pear (two underscores), _apple (leading underscore), and pear_ (trailing underscore) as valid; in my desired grammar, they shouldn't be. Any thoughts? rkg

P.S. And yes, I like your approach of treating a term as multiple tokens rather than one. This is part of a larger grammar, and I need the flexibility of Parse::RecDescent.
Re: Parse::RecDescent and csv-like data by rkg (Hermit) on Aug 21, 2003 at 17:15 UTC
Well, this ugly setup (below) does what I want. If anyone would like to help me clean it up, I'd appreciate learning how!

    Word1: /^[A-Z0-9\/:"'.-]+$/i {$item[1]}
    Word2: /[A-Z0-9\/:"'.-]+/i {$item[1]}
    Term: Word1 | (<skip: ''> Word2 ('_' Word2)(s))

(Why am I going to all this hassle, you may ask? Words are nested into terms, terms into superterms, and so on, each with a unique delimiter. So the simple one-token regexp doesn't quite work.) rkg
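One possible way to tidy this up, offered only as a sketch: combine the `<skip: ''>` directive with the separated-repetition form `Word(s /_/)` used earlier in the thread, and let a top-level rule require end of input. The simplified `Word` pattern, the `/\z/` end-of-input check, and the `is_term` helper are assumptions for illustration, and how far the skip setting reaches into subrules may depend on the Parse::RecDescent version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parse::RecDescent;

# Sketch only. Rule names follow the thread; the patterns are simplified.
#   Line: one term, then end of input (assumed requirement).
#   Term: <skip: ''> disables whitespace skipping for the rest of the
#         production; Word(s /_/) is one or more words separated by '_'.
#         The directive counts as item 1, so the word list is $item[2].
my $grammar = q{
    Line: Term /\z/               { $return = $item[1] }
    Term: <skip: ''> Word(s /_/)  { $return = $item[2] }
    Word: /[A-Za-z]+/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Hypothetical helper: true if the whole input parses as a single term.
sub is_term {
    my ($input) = @_;
    return defined $parser->Line($input) ? 1 : 0;
}

for my $input (qw(apple apple_pear apple_plum_grape _ _apple apple_ apple__pear)) {
    printf "%-18s %s\n", $input, is_term($input) ? 'valid' : 'invalid';
}
```

The same shape could be repeated one level up (a superterm as terms separated by a different delimiter), which is where keeping words as separate tokens pays off.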