Statement Parsing and Rendering

PerlSufi has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I have a file of about 50,000 lines that contains statements. Each statement looks something like this:

200~020000000123~0509112013~0610102013~07JOHN SMITH~08131 MAIN ST~09SOMEWHERE TX 77777~12SOMEWHERE~13TX~1477777~15
R011~197~24bobsmithsemail@gmail.com~251`
501~0211.150%~03.030547%`
500~0109112013~020001~03PERSONAL LOAN~05Balance Forward~0694188~0770~087~182~21Original Balance~22138000~24608`
300~01BOBS FRIEND~021`
530~0109182013~0209182013~0312184-~04609~051237~0610338-~0783850~08Payments by Check~23K~25P`
510~0109182013~0209182013~03Mail Transaction~041`
539`
530~0110082013~0210082013~0312200-~05512~0611688-~0772162~08Payments by Check~23K~25P`
510~0110082013~0210082013~03Mail Transaction~041`
539`
599~0110102013~02Ending Balance~0372162~10Total Aggregate Amount Paid From Open~12Total Interest Paid From Open~136063~141218~15
65838`
570~0112183~0311042013~0411042013~0512162~0712162~0844~1112183`
540~012~0224384`
550~01Interest Paid~026063~031218~06609~071749~0865838~091218`
690~010004~032219~04600`
701~01PERSONAL LOAN~0272162`

Each 200~ marks the beginning of a statement and each 701~ is the last line of that statement. The two digits following each ~ are not needed; they indicate certain data types and are consistent throughout the whole file. I need pretty much all of these fields for rendering to PostScript later on, but first I need help extracting them in an intelligible way. Below is my slightly altered code so that it can be run by any helpers here:

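#!/usr/bin/perl
use strict;
use warnings;
use IO::File;
use Data::Dumper;

# Input file name comes from the command line.
my $infile = $ARGV[0];
my $in_fh = IO::File->new($infile, "<")
    or die "Can't open $infile: $!.\n";
# Slurp the whole file before closing the handle.
my $chunk = do { local $/; <$in_fh> };
close $in_fh;

# Records can wrap across physical lines, so drop every line break first.
$chunk =~ s/[\f\r\n]//g;
# Each record ends with a backtick.
my @statement = split '`', $chunk;

foreach my $line (@statement) {

        next unless length $line;
        # Split on "~" plus the two-digit field tag.
        my @fields = split '~\d\d', $line;
        next unless scalar(@fields);

        # The first element is the three-digit record type.
        my $fieldno = shift @fields;
        print Dumper(@fields);

        }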

Running that code should dump every field I need, stripped of the ~ and the two digits. I actually do not need the 12 digits after 200~. I have parsed that out previously in my script for later use, so I shifted.
I guess my question is: what is an intelligent way to store every statement like this for rendering later on?
Any insight is greatly appreciated.

Re: Statement Parsing and Rendering by GotToBTru (Prior) on Oct 24, 2013 at 22:12 UTC
In my $job I am constantly working with delimited records and multiple record types per transaction. If I am going to read the field data into variables, one of the most important things to me is to make the meaning or purpose of each field clear, either through variable names or through the position in a data structure. I would suggest an HoA where each record is stored as an array, keyed using the record type. There is a repeating section, 530-510-539, which could be a HoAoA.
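The HoA/HoAoA layout suggested above could be sketched roughly as follows. This is only an illustration: the records are copied from the sample in the question, and the `%repeats` set (which record types may occur more than once per statement) is an assumption, not something the file format is known to guarantee.

```perl
use strict;
use warnings;

# Sample records from the question, already split on the backtick.
my @records = (
    '501~0211.150%~03.030547%',
    '530~0109182013~0209182013~0312184-~04609~051237~0610338-~0783850~08Payments by Check~23K~25P',
    '510~0109182013~0209182013~03Mail Transaction~041',
    '539',
    '530~0110082013~0210082013~0312200-~05512~0611688-~0772162~08Payments by Check~23K~25P',
    '510~0110082013~0210082013~03Mail Transaction~041',
    '539',
    '701~01PERSONAL LOAN~0272162',
);

# Assumption: only these record types repeat within one statement.
my %repeats = map { $_ => 1 } qw(530 510 539);

my %statement;
for my $record (@records) {
    # First element is the record type, the rest are the field values.
    my ($type, @fields) = split /~\d\d/, $record;
    if ($repeats{$type}) {
        push @{ $statement{$type} }, \@fields;   # HoAoA: one inner array per occurrence
    }
    else {
        $statement{$type} = \@fields;            # HoA: one array per record type
    }
}

print "Rate: $statement{501}[0]\n";
print "Second payment amount: $statement{530}[1][2]\n";
```

Positions in the inner arrays carry the meaning here, so this layout only stays readable if every occurrence of a record type has the same fields in the same order.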
Re^2: Statement Parsing and Rendering by PerlSufi (Friar) on Oct 24, 2013 at 23:42 UTC
Thanks, GotToBTru. Good idea.
Re: Statement Parsing and Rendering by boftx (Deacon) on Oct 25, 2013 at 00:41 UTC
Taking the HoA a bit further, I would use the record types as keys in a dispatch table consisting of code references to subroutines that know how to parse an individual record type into a meaningful structure, and of course the start and stop record handlers would know how to initialize the structure and assemble the pieces into a whole for storage, etc., respectively. The answer to the question "Can we do this?" is always an emphatic "Yes!" Just give me enough time and money.
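A minimal sketch of such a dispatch table might look like the following. The handler bodies and the field names (`name`, `rate1`, `transactions`, `end_balance`) are illustrative guesses at what the fields mean, and `process_record` is a hypothetical helper; only the record types and the sample data come from the thread.

```perl
use strict;
use warnings;

my @statements;   # completed statements
my $current;      # statement currently being assembled

# Dispatch table: record type => handler that receives the field values.
my %handler = (
    '200' => sub { $current = { name => $_[3], transactions => [] } },   # start record
    '501' => sub { @{$current}{qw(rate1 rate2)} = @_[0, 1] },
    '530' => sub { push @{ $current->{transactions} }, { date => $_[0], amount => $_[2] } },
    '701' => sub {                                                        # stop record
        $current->{end_balance} = $_[1];
        push @statements, $current;
        undef $current;
    },
);

sub process_record {
    my ($record) = @_;
    my ($type, @fields) = split /~\d\d/, $record;
    return unless defined $type && length $type;
    my $code = $handler{$type} or return;   # record types without a handler are skipped
    $code->(@fields);
}

process_record($_) for (
    '200~020000000123~0509112013~0610102013~07JOHN SMITH~08131 MAIN ST~09SOMEWHERE TX 77777~12SOMEWHERE~13TX~1477777~15R011~197~24bobsmithsemail@gmail.com~251',
    '501~0211.150%~03.030547%',
    '530~0109182013~0209182013~0312184-~04609~051237~0610338-~0783850~08Payments by Check~23K~25P',
    '510~0109182013~0209182013~03Mail Transaction~041',
    '539',
    '530~0110082013~0210082013~0312200-~05512~0611688-~0772162~08Payments by Check~23K~25P',
    '701~01PERSONAL LOAN~0272162',
);

printf "%s: %d transactions, ending balance %s\n",
    $statements[0]{name}, scalar @{ $statements[0]{transactions} }, $statements[0]{end_balance};
```

Adding support for another record type then means adding one entry to `%handler`, without touching the loop that reads the file.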
Re: Statement Parsing and Rendering by hdb (Monsignor) on Oct 25, 2013 at 09:54 UTC
Based on the assumption that each `$fieldno` will be followed by a fixed number of fields, I would propose a data dictionary that contains field names and whether or not there could be multiple entries. Based on this, your code to read the data could be rather short and all intelligence is put into the data dictionary. However, I have already seen that, for example, 530 can have 9 or 10 fields following it. Anyways, here is some code that explains my thoughts (the data dictionary is incomplete):

use strict;
use warnings;
use Data::Dumper;

my %desc = (
    200 => {
        type  => 'single',
        names => [ 'f1', 'f2', 'f3', 'name', 'street', 'place1', 'place2', 'f8', 'f9', 'f10', 'f11', 'email', 'f13' ]
    },
    501 => {
        type  => 'single',
        names => [ 'rate1', 'rate2' ]
    },
    530 => {
        type  => 'multiple',
        names => [ 'f1', 'f2', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9' ]
    }
);

my %data;

$/ = "`\n";
foreach my $line (<DATA>) {
    chomp $line;
    my @fields = split '~\d\d', $line;
    next unless length $line;
    next unless scalar(@fields);

    my $fieldno = shift @fields;
    if( exists $desc{$fieldno} ) {
        if( $desc{$fieldno}{type} eq 'single' ) {
            @{$data{$fieldno}}{ @{ $desc{$fieldno}{names} } } = @fields;
        }
        else {
            push @{ $data{$fieldno} }, {};
            @{ $data{$fieldno}[-1] }{ @{$desc{$fieldno}{names}} } = @fields;
        }
    }
    else {
        print "Unknown data $fieldno\n";
    }
}

print Dumper \%data;

__DATA__
200~020000000123~0509112013~0610102013~07JOHN SMITH~08131 MAIN ST~09SOMEWHERE TX 77777~12SOMEWHERE~13TX~1477777~15R011~197~24bobsmithsemail@gmail.com~251`
501~0211.150%~03.030547%`
500~0109112013~020001~03PERSONAL LOAN~05Balance Forward~0694188~0770~087~182~21Original Balance~22138000~24608`
300~01BOBS FRIEND~021`
530~0109182013~0209182013~0312184-~04609~051237~0610338-~0783850~08Payments by Check~23K~25P`
510~0109182013~0209182013~03Mail Transaction~041`
539`
530~0110082013~0210082013~0312200-~05512~0611688-~0772162~08Payments by Check~23K~25P`
510~0110082013~0210082013~03Mail Transaction~041`
539`
599~0110102013~02Ending Balance~0372162~10Total Aggregate Amount Paid From Open~12Total Interest Paid From Open~136063~141218~1565838`
570~0112183~0311042013~0411042013~0512162~0712162~0844~1112183`
540~012~0224384`
550~01Interest Paid~026063~031218~06609~071749~0865838~091218`
690~010004~032219~04600`
701~01PERSONAL LOAN~0272162`
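The varying field count noted above (530 with 9 or 10 fields) is a problem for purely positional names: when a tag such as ~04 is absent, every later value shifts one slot to the left. One possible workaround, sketched here as an assumption rather than a confirmed property of the format, is to keep the two-digit tag and use it as the hash key. `parse_record` is a hypothetical helper, and the sketch assumes a tag never repeats within a record and that values never contain a `~`.

```perl
use strict;
use warnings;

# Parse one record into its type and a hash of { two-digit tag => value }.
sub parse_record {
    my ($record) = @_;
    my ($type, $rest) = $record =~ /\A(\d{3})(.*)\z/s
        or return;
    # Each field is "~", a two-digit tag, then everything up to the next "~".
    my %field = $rest =~ /~(\d\d)([^~]*)/g;
    return ($type, \%field);
}

my ($type_a, $first)  = parse_record(
    '530~0109182013~0209182013~0312184-~04609~051237~0610338-~0783850~08Payments by Check~23K~25P');
my ($type_b, $second) = parse_record(
    '530~0110082013~0210082013~0312200-~05512~0611688-~0772162~08Payments by Check~23K~25P');

# Tag 08 holds the description in both records, whether or not tag 04 is present.
print "$first->{'08'} / $second->{'08'}\n";
print "Second record has no tag 04\n" unless exists $second->{'04'};
```

A data dictionary could then map tags to names per record type (for example `530 => { '08' => 'description' }`) instead of relying on position.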
Re^2: Statement Parsing and Rendering by PerlSufi (Friar) on Oct 25, 2013 at 13:56 UTC
Awesome, thanks hdb++. I'll give that a whirl and try to post any other problems I encounter.
Re^2: Statement Parsing and Rendering by PerlSufi (Friar) on Oct 25, 2013 at 14:34 UTC
hdb: I could only get that to work by changing it to:

my $in_fh = IO::File->new($infile, "<")
    or die "Can't open $infile: $!.\n";
my $chunk = do { local $/; <$in_fh> };    # slurp the whole file before closing
close $in_fh;

$chunk =~ s/[\f\r\n]//g;
my @statement = split '`', $chunk;

## %desc stuff here

my %data;

$/ = "`\n";
foreach my $line (@statement) {
    ..
}

instead of:

my %data;

$/ = "`\n";
foreach my $line (<DATA>) {
    ..
}

Is %data set to the file variable in your version? I think shifting the first field may be a mistake of mine, too. Doing this doesn't parse the 200~ fields. My output was:

Unknown data 020000000123
Unknown data 500
Unknown data 300
Unknown data 510
Unknown data 539
Unknown data 510
Unknown data 539
Unknown data 599
Unknown data 570
Unknown data 540
Unknown data 550
Unknown data 690
Unknown data 701
Unknown data 200~
$VAR1 = {
          '501' => {
                     'rate1' => '11.150%',
                     'rate2' => '.030547%'
                   },
          '530' => [
                     {
                       'f8' => '83850',
                       'f6' => '1237',
                       'f1' => '09182013',
                       'f9' => 'Payments by Check',
                       'f5' => '609',
                       'f2' => '09182013',
                       'f7' => '10338-',
                       'f4' => '12184-'
                     },
                     {
                       'f8' => 'Payments by Check',
                       'f6' => '11688-',
                       'f1' => '10082013',
                       'f9' => 'K',
                       'f5' => '512',
                       'f2' => '10082013',
                       'f7' => '72162',
                       'f4' => '12200-'
                     }
                   ]
        };
Re: Statement Parsing and Rendering by PerlSufi (Friar) on Oct 25, 2013 at 19:47 UTC
UPDATE. Here is my new code:

my $in_fh = IO::File->new($infile, "<")
    or die "Can't open $infile: $!.\n";
my $chunk = do { local $/; <$in_fh> };    # slurp the whole file before closing
close $in_fh;

$chunk =~ s/[\f\r\n]//g;
my @statement = split '`', $chunk;

my %desc = (
    # Key quoted because a bare 020 would be parsed as the octal literal 16.
    '020' => { type => 'single', names => [ 'account' ] },   ## didn't know how else to handle this
    200 => { type => 'single',
             names => [ 'f1', 'f2', 'f3', 'name', 'street', 'place1', 'place2',
                        'f8', 'f9', 'f10', 'f11', 'email', 'f13' ] },
    300 => { type => 'single', names => [ 'name_1' ] },
    500 => { type => 'multiple',
             names => [ 'period_start', 'loan_id', 'description', 'loan_type', 'beginning_balance_desc',
                        'beginning_balance', 'loan_type_num', 'loan_branch', 'mail_code', 'loan_code',
                        'note_num', 'late_payment_warn_maxfee' ] },
    510 => { type => 'single',
             names => [ 'transaction_date', 'posting_date', 'transaction_desc', 'transaction_desc_continued' ] },
    501 => { type => 'single', names => [ 'rate1', 'rate2' ] },
    530 => { type => 'multiple',
             names => [ 'transaction_date', 'post_date', 'transaction_amount', 'late_fee',
                        'interest', 'balance_change', 'new_balance', 'transaction_desc' ] },
    539 => { type => 'single', names => [ 'end_loan' ] },
    540 => { type => 'single', names => [ 'deposit_count', 'deposit_amount' ] },
    541 => { type => 'single', names => [ 'withdraw_count', 'withdraw_amount' ] },
    550 => { type => 'multiple',
             names => [ 'interest_desc', 'total_int_YTD', 'total_fees_YTD', 'loan_fee', 'interest_fee' ] },
    551 => { type => 'single', names => [ 'do_nothing' ] },
    250 => { type => 'single', names => [ 'do_nothing' ] },
    334 => { type => 'single', names => [ 'do_nothing' ] },
    330 => { type => 'single', names => [ 'do_nothing' ] },
    333 => { type => 'single', names => [ 'do_nothing' ] },
    343 => { type => 'single', names => [ 'do_nothing' ] },
    344 => { type => 'single', names => [ 'do_nothing' ] },
    340 => { type => 'single', names => [ 'do_nothing' ] },
    412 => { type => 'single', names => [ 'do_nothing' ] },
    570 => { type => 'single', names => [ 'amount_due', 'due_date' ] },
    599 => { type => 'multiple', names => [ 'period_end_date', 'end_balance_desc', 'end_balance' ] },
    690 => { type => 'single', names => [ 'additional_info' ] },
    701 => { type => 'single', names => [ 'loan', 'end_balance' ] },
);

my %data;

#$/ = "`\n";   #would not run through the file if I left this on
foreach my $line (@statement) {
    chomp $line;
    my @fields = split '~\d\d', $line;
    next unless length $line;
    next unless scalar(@fields);

    my $fieldno = shift @fields;
    if( exists $desc{$fieldno} ) {
        if( $desc{$fieldno}{type} eq 'single' ) {
            @{$data{$fieldno}}{ @{ $desc{$fieldno}{names} } } = @fields;
        }
        else {
            push @{ $data{$fieldno} }, {};
            @{ $data{$fieldno}[-1] }{ @{$desc{$fieldno}{names}} } = @fields;
        }
    }
    else {
        print "Unknown data $fieldno\n";
    }
}

print Dumper \%data;
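With a single `%data` hash for the whole file, each 'single' record overwrites the one from the previous statement. One way to keep statements apart, sketched here under the assumption stated in the question that 200 always opens and 701 always closes a statement, is to reset the hash at each 200 and store a copy at each 701. The names `%desc_demo`, `%data_demo`, `$chunk_demo` and `@all_statements` are hypothetical, the dictionary is cut down to three record types, and the input simply repeats the sample records twice to stand in for a file with two statements.

```perl
use strict;
use warnings;

# Abbreviated dictionary in the same shape as %desc above.
my %desc_demo = (
    200 => { type => 'single',   names => [ 'account' ] },
    530 => { type => 'multiple', names => [ 'transaction_date', 'post_date', 'transaction_amount' ] },
    701 => { type => 'single',   names => [ 'loan', 'end_balance' ] },
);

# Stand-in for the slurped, newline-stripped file: the sample records, twice.
my @one_statement = (
    '200~020000000123~0509112013~0610102013~07JOHN SMITH',
    '530~0109182013~0209182013~0312184-~04609~051237~0610338-~0783850~08Payments by Check~23K~25P',
    '530~0110082013~0210082013~0312200-~05512~0611688-~0772162~08Payments by Check~23K~25P',
    '701~01PERSONAL LOAN~0272162',
);
my $chunk_demo = join('`', (@one_statement) x 2) . '`';

my @all_statements;
my %data_demo;
for my $line (split /`/, $chunk_demo) {
    next unless length $line;
    my ($fieldno, @fields) = split /~\d\d/, $line;
    %data_demo = () if $fieldno eq '200';        # a 200 record opens a fresh statement
    if (my $d = $desc_demo{$fieldno}) {
        my %rec;
        @rec{ @{ $d->{names} } } = @fields;      # values beyond the named ones are dropped
        if ($d->{type} eq 'single') { $data_demo{$fieldno} = \%rec }
        else                        { push @{ $data_demo{$fieldno} }, \%rec }
    }
    if ($fieldno eq '701') {                     # a 701 record closes the statement
        push @all_statements, { %data_demo };    # store a shallow copy
    }
}

printf "%d statements, %d transactions in the first\n",
    scalar @all_statements, scalar @{ $all_statements[0]{530} };
```

Each element of `@all_statements` can then be handed to the rendering step on its own.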