Ovid has asked for the wisdom of the Perl Monks concerning the following question:

I have been asked to take a bunch of financial data that is being ftp'd to one of our servers, parse it, stuff it in a database and then build dynamic pages to serve quotes to customers that are no less than 15 minutes old. The data files sent to our server are in CSV format. No quote marks (") exist (and therefore no problems with commas in quotes), so using split on the data should be fine.

That turned out to be overly optimistic. As it turns out, each line of the file represents one type of quote, and the format, while consistent for each quote type, varies from type to type. In other words, one line may have five fields and the next line may have eight. As a result, I felt that using Parse::RecDescent would be a good choice. Unfortunately, I do not know Parse::RecDescent. What follows is my first, simplistic attempt to deal with this problem (eventually, everything after the __DATA__ token will be read from a file and the parser rules will be put there).

#!/usr/bin/perl
use Parse::RecDescent;
use strict;
use Data::Dumper;

$::RD_ERRORS = 1; # Make sure the parser dies when it encounters an error
$::RD_WARN   = 1; # Enable warnings. This will warn on unused rules &c.
$::RD_HINT   = 1; # Give out hints to help fix problems.

# Create and compile the source file
my $rules;
my $parser = Parse::RecDescent->new( q(
    get_type      : type { $item{ type } }
    type          : /^[^,]+/
    comma         : ","
    date          : /\d\d\/{2}\d\d/
    start_date    : date
    end_date      : date
    time          : /\d\d:{2}\d\d/
    rate          : /\d+\.\d{4}/
    start_rate    : rate
    end_rate      : rate
    change        : rate
    whitespace    : /\s*/
    G017RATEBRKRL : type comma rate comma start_date comma end_date comma time
                    { return \%item }
    G017CP111_D   : type comma start_rate comma end_rate comma change comma date comma time
                    { \%item }
    G017RPAGO_N   : type comma rate comma whitespace comma whitespace comma date comma time
                    { \%item }
    G017ONFD      : type comma rate comma rate comma rate comma rate comma rate comma rate comma date comma time
                    { \@item }
    G017PDFF      : type comma rate comma rate comma rate comma rate comma date comma time
                    { \@item }
) );

while ( chomp( my $quote_data = <DATA> ) ) {
    next if $quote_data !~ /\S/;
    my $quote_type = $parser->get_type( $quote_data );
    next if ! defined $quote_type;
    $quote_type =~ s/\W/_/g;
    print "* $quote_type : $quote_data *\n";
    if ( defined $quote_type ) {
        my $data = $parser->$quote_type( $quote_data ); # <-- this doesn't work :(
        if ( defined $data ) {
            print Dumper $data;
        }
        else {
            print "\$data is undefined for $quote_type\n";
        }
    }
}
__DATA__
G017RATEBRKRL,4.2500,10/2/01,10/05/01,16:40:57
G017CP111 D,2.3800,2.3300,0.0001,10/05/01,16:40:55
G017RPAGO/N,2.4300, , ,10/05/01,16:40:58
G017ONFD,2.3125,2.3750,2.4375,2.3750,2.4375,2.2500,10/05/01,16:40:56
G017PDFF,2.5000,2.7500,2.2500,2.5000,10/05/01,16:40:56

The intent is to loop through the data, get the type (that worked fine), and then return a reference to the data structure. Eventually, I intend to provide handlers to automatically add the data to the database, depending upon which type is encountered.

Unfortunately, nothing is returning any data. What am I overlooking?

Another problem is more of a style issue (I think). I don't like all of those 'comma' rules in there. Is this how it's done in Parse::RecDescent or am I totally missing something?

I've been reading through a RecDescent tutorial, but don't seem to be able to parse more than simple data with this module. Further, I think that I'm probably taking the wrong approach to this, so any suggestions as to other approaches would be useful (though I'd prefer to stick with Parse::RecDescent, as it would be very useful).

Cheers,
Ovid

Vote for paco!

Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Replies are listed 'Best First'.
Re: A Slough of ParseRecDescent Woes
by merlyn (Sage) on Oct 09, 2001 at 03:13 UTC
    Your grammar can look something more like this:
    file: line(s) /\z/
    line: "G017RATEBRKRL" comma rate comma start_date comma end_date comma time newline { ... }
    line: "G017CP111 D" comma start_rate comma end_rate comma change comma date comma time newline { ... }
    line: "G017RPAGO/N" comma rate comma whitespace comma whitespace comma date comma time newline { ... }
    and define your subcomponents as you have. That should work nicer. The different alternatives will be tried one after the other, failing after matching the first word. To speed it up a bit, put a <commit> right after that first word.

    -- Randal L. Schwartz, Perl hacker

Re: A Slough of ParseRecDescent Woes
by Masem (Monsignor) on Oct 09, 2001 at 02:28 UTC
    I'm pretty sure (without sitting down and trying it myself) that you need to add a startrule, as that is where the parser will look first. (That is, the grammar starts with startrule, does whatever it can from that, and then returns to startrule until EOF. If it can't do something within the definition of the grammar, it reports an incorrect parse.) Fortunately, this is easy: each of your lines is going to be in one of the stock formats, so add this to your grammar:
    startrule : G017RATEBRKRL | G017CP111_D | G017RPAGO_N | G017ONFD | G017PDFF
    Also note that you probably need to have the unique identifier in front of each line in that line's grammar; that is, for the first type, you'll need:
    G017RATEBRKRL_Key : "G017RATEBRKRL"
    G017RATEBRKRL : G017RATEBRKRL_Key type comma rate comma start_date comma end_date comma time { return \%item }

    As for the commas, I don't think there's a way to get rid of them easily. Punctuation typically has to be specified in lex grammars, and that's what you're doing here.

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    It's not what you know, but knowing how to find it if you don't know that's important

Re: A Slough of ParseRecDescent Woes
by runrig (Abbot) on Oct 09, 2001 at 04:03 UTC
    I've never really played with Parse::RecDescent until now, but here's what I came up with. Your date and time regexes had problems, you're inconsistent about returning an array reference or a hash reference, and my first 'line' rule (below) has an example of how to label the 'type', if that's the sort of result you want. Also you don't have start and end dates on every rule, so I leave that up to you to fix if necessary. I sort of stole merlyn's idea of how to organize the top level rules and ran with that :)
    use Parse::RecDescent;
    use strict;
    use warnings;
    use Data::Dumper;

    # Make sure the parser dies when it encounters an error
    $::RD_ERRORS = 1;
    # Enable warnings. This will warn on unused rules &c.
    $::RD_WARN = 1;
    # Give out hints to help fix problems.
    $::RD_HINT = 1;

    # Create and compile the source file
    my $parser = Parse::RecDescent->new( q(
        comma      : ","
        date       : /\b\d{1,2}\/\d{1,2}\/\d{1,2}\b/
        start_date : date
        end_date   : date
        time       : /\b\d\d:\d\d:\d\d\b/
        rate       : /\b\d+\.\d{4}\b/
        rates      : rate comma { $item{rate} }
        start_rate : rate
        end_rate   : rate
        change     : rate
        whitespace : /\s*/
        lines : line /\z/ { $item{line} }
        line : "G017RATEBRKRL" comma rate comma start_date comma end_date comma time
               { $item{type} = $item[0]; \%item }
        line : "G017CP111 D" comma start_rate comma end_rate comma change comma date comma time
               { \%item }
        line : "G017RPAGO/N" comma rate comma whitespace comma whitespace comma date comma time
               { \%item }
        line : "G017ONFD" comma rates(6) date comma time
               { \%item }
        line : "G017PDFF" comma rates(4) date comma time
               { \%item }
    ) );

    while ( my $quote_data = <DATA> ) {
        next if $quote_data !~ /\S/;
        my $result = $parser->lines( $quote_data );
        if ( defined $result ) {
            print Dumper $result;
        }
        else {
            print "Failed!\n";
        }
    }
    All that being said, I'm not sure I'd actually use Parse::RecDescent for this problem. I might just quickly get the first field, and use that as a key to a hash of subroutines which use regexes to parse the data and return the results. I'd consider how much you care about efficiency in this routine anyway.
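
    The hash-of-subroutines idea could look roughly like the sketch below. It is only an illustration, not tested against the real feed: the names %parse_for and parse_quote are made up, the field names are guesses from the sample data in the question, and only two quote types are filled in.

```perl
use strict;
use warnings;

# Building-block patterns, assumed from the sample data in the question.
my $RATE = qr/\d+\.\d{4}/;
my $DATE = qr{\d{1,2}/\d{1,2}/\d{2}};
my $TIME = qr/\d\d:\d\d:\d\d/;

# Hypothetical dispatch table: quote type => parser for the rest of the line.
my %parse_for = (
    'G017RATEBRKRL' => sub {
        my ($rest) = @_;
        my @f = $rest =~ /^($RATE),($DATE),($DATE),($TIME)$/ or return;
        return { rate => $f[0], start_date => $f[1], end_date => $f[2], time => $f[3] };
    },
    'G017CP111 D' => sub {
        my ($rest) = @_;
        my @f = $rest =~ /^($RATE),($RATE),($RATE),($DATE),($TIME)$/ or return;
        return {
            start_rate => $f[0], end_rate => $f[1], change => $f[2],
            date       => $f[3], time     => $f[4],
        };
    },
);

# Returns a hash reference on success, undef for unknown or malformed lines.
sub parse_quote {
    my ($line) = @_;
    chomp $line;
    my ( $type, $rest ) = split /,/, $line, 2;   # peel off the first field only
    return unless defined $rest;
    my $handler = $parse_for{$type} or return;
    my $data    = $handler->($rest) or return;
    return { type => $type, %$data };
}
```

    Adding a new quote type is then one more entry in the hash, and the same table could later hold the database handlers as well.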
Re: A Slough of ParseRecDescent Woes
by belden (Friar) on Oct 09, 2001 at 04:55 UTC
    Hi Ovid,

    You wrote:

    As it turns out, each line of the file represents one type of quote, and the format, while consistent for each quote type, varies from type to type. In other words, one line may have five fields and the next line may have eight.

    It seems to me that by the time you've done something like

    while(<DATA>) { chomp; my @line = split(',',$_); }

    then you can be fairly certain of a few things:
    $line[0] is the request type
    $line[-2] is the date
    $line[-1] is the timestamp
    (And after you verify those with your regexes then
    you'll know it for sure).

    It seems to me that anything from @line[1..(@line-3)]
    is the data you are interested in. This is one way to deal
    with variable-length fields in your data:

    #!/usr/bin/perl
    while(<DATA>) {
        chomp;
        /^$/ and next;
        my @line = split(/,/,$_);
        for(@line[1..(@line-3)]) {
            /\d+\.[\d]{4}/ or next;
            printf("%s\t%s\n",$line[0],$_);
        }
    }
    exit;

    I haven't done any of the nifty stuff you are doing-
    building references etc.- because I still have a hard time
    grokking them...

    Tentatively,
    blyman

Re: A Slough of ParseRecDescent Woes
by BrentDax (Hermit) on Oct 09, 2001 at 08:19 UTC
    Unfortunately, nothing is returning any data. What am I overlooking?

    This is a clue; either something is wrong with your actions or the grammar isn't parsing the data correctly. Try adding | <error> clauses to the end of each top-level rule. This will tell you if there's a parsing error, and possibly what the error is. If this doesn't show anything, look hard at the actions. You may want to explicitly set the $return variable in the actions.
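
    As a rough sketch of both suggestions (an <error> alternative plus an explicit $return), here is a cut-down grammar for a single quote type. The field positions are assumed from the sample data in the question, and parse_pdff is just a made-up wrapper name:

```perl
use strict;
use warnings;
use Parse::RecDescent;

$::RD_ERRORS = 1;    # report fatal parse errors, as in the original script

# Sketch only: one quote type, with the layout assumed from the sample data.
my $grammar = q{
    line : "G017PDFF" "," rate "," rate "," rate "," rate "," date "," time
             { $return = {
                   type  => $item[1],
                   rates => [ @item[3, 5, 7, 9] ],
                   date  => $item[11],
                   time  => $item[13],
               } }
         | <error>

    rate : /\d+\.\d{4}/
    date : /\d{1,2}\/\d{1,2}\/\d{2}/
    time : /\d\d:\d\d:\d\d/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Hypothetical wrapper: returns a hash reference, or undef if the line
# does not parse (in which case <error> reports what was expected).
sub parse_pdff {
    my ($text) = @_;
    return $parser->line($text);
}
```

    If the first production fails, the <error> alternative is tried, prints a diagnostic to STDERR and makes the rule return undef, which should show which field the grammar choked on.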

    By the way, Parse::RecDescent is kind of overkill for something like this. P::RD is made for much more complicated things (I'm working on using it to parse a subset of Perl) than several stock quote formats; this may be why it seems so clumsy. However, that doesn't mean "don't use it"--it just means "look at alternatives".

    =cut
    --Brent Dax
    There is no sig.

Re: A Slough of ParseRecDescent Woes
by toma (Vicar) on Oct 09, 2001 at 10:37 UTC
    Somewhat OT answer

    You need to parse "a bunch" of data. There seems to be real-time value in your data. Is it possible that a solution with a quick execution time is needed? This idea has led me to an off-topic answer, since it is really about how to make a quick program of the type that you describe.

    One of my favorite things about perl is the speed of the regular expression engine. It can parse lines very quickly by anchoring a match at the beginning of a line. The rest of the line can be parsed using a fast regular expression.

    if (/^G017RATEBRKRL,([^,]+),([^,]+),([^,]+),(.*)/) {
        $col[3]=$1;
        $col[5]=$2;
        $col[1]=$3;
        $col[4]=$4;
    }
    elsif (/^G017CP111 D,([^,]+),([^,]+),([^,]+),(.*)/) {
        # etc...
    }
    The negated character classes run quickly because they only need to look for commas.
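
    For illustration, the same anchored, negated-character-class matching can be wrapped in a function that returns named fields. This is a sketch under assumptions: parse_line is a made-up name, the field names are guessed from the sample data, only two quote types are shown, and named captures need perl 5.10 or later.

```perl
use strict;
use warnings;

# Returns a hash reference of fields, or undef if the line is not recognised.
sub parse_line {
    my ($line) = @_;
    my %fields;
    if ( $line =~ /^G017RATEBRKRL,(?<rate>[^,]+),(?<start_date>[^,]+),(?<end_date>[^,]+),(?<time>[^,\s]+)/ ) {
        %fields = ( type => 'G017RATEBRKRL', %+ );   # %+ holds the named captures
    }
    elsif ( $line =~ /^G017CP111 D,(?<start_rate>[^,]+),(?<end_rate>[^,]+),(?<change>[^,]+),(?<date>[^,]+),(?<time>[^,\s]+)/ ) {
        %fields = ( type => 'G017CP111 D', %+ );
    }
    else {
        return;    # unknown type or too few fields
    }
    return \%fields;
}
```

    Each pattern is anchored with ^ and starts with a literal type name, so a line of the wrong type is rejected after comparing only the first few characters.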

    Another perl speedup has to do with minimizing the number of copy operations needed to load a database with DBI. It should be possible to go from $1, $2, etc, into a data structure that can be directly loaded into the database, without being copied again.

    It is great to write programs that have wonderful abstractions in them. Sometimes it is even better to write programs that are wickedly fast.

    It should work perfectly the first time! - toma

Re: A Slough of ParseRecDescent Woes
by ehdonhon (Curate) on Oct 09, 2001 at 05:57 UTC
    Hello Ovid,

    Maybe I'm not totally understanding the problem here, but are you committed to Parse::RecDescent for any reason other than it may do what you want?

    If all you need to do is read in CSV files where each file contains its own arbitrary format, perhaps the AnyData module would be sufficient for your needs???

      While I agree that using Parse::RecDescent may be overkill, particularly in this day and age of XML-based file formats, it's still a good idea, if you want to be a jack-of-all-trades, to understand grammar parsing and be able to do it; at some point, if you are building custom applications, you'll undoubtedly come across a grammar-based format that no other parsing method would easily work with. My impression from Ovid's request was that while there were many other ways, possibly simpler, to do this, learning P::RD was a subgoal of this solution.

      -----------------------------------------------------
      Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      It's not what you know, but knowing how to find it if you don't know that's important