Pharazon has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of data that I need to parse into a csv file. The data looks like the following:
06 01720168-00000000257980 123 S Somewhere HWY 192 172016-8 Company NATURAL GAS CO., INC. Business P O BOX 1547 123 Road Dr. Town ST 12345 SUITE# 1234 Town, ST 12345 6/23/2014 $257.98 Business 6/23/2014 123 S Road HWY 123 172016-8 6/09/2014 Town ST 12345 $257.98 02CS 4/30 5/28 3117.0 3259.0 142.0 Meter # C204508 142.0 232.99 Pipe Replacement Pgm SNR Comm 3.27 RESEARCH & DEVELOPMENT TARIFF .03 3.00% Rate Increase County Co Sc Tax on 236.29 7.09 6.00% State Tax on 243.38 14.60 Current Charges 257.98 Previous Amount Due 351.60 Payment Received 5/22 351.60CR Total Amount Due 257.98 1-877-123-4567 66.0 28 142.0 8:00am to 4:00pm 58.6 30 203.0 70.3 28 174.0

The file has one record per 58 lines. I can handle the fine-grained parsing that will need to happen on the lines to pull out the variables, but what I am having trouble wrapping my head around is a method for grabbing 58 lines at a time and then performing the necessary processing on each iteration. For example, once I have the 58 lines read in, I know that lines 8-11, positions 1-39, contain the return address, which I will be putting as "return1","return2","return3","return4" in the CSV file. The same thing is going to happen on every record, for each of the other pieces I need to parse.

I thought about just using a counter and resetting it after every 58 lines while looping through the entire file, but that didn't seem like the best solution. As I'm by no means an expert at Perl, I wanted to check here with you guys to see if anyone has a better place to start or some ideas on how to make this a bit cleaner and more efficient.

If you need any other information please let me know.

UPDATE: I found a control character other than newline in the data at the beginning of each record. I was able to use local $/ = "\014"; to pull the data into a variable one record at a time (I had been looking for newlines or double newlines, not this character; octal 014 is the form feed). I then split each record into an array, one line per element, with my @lines = split /\n/, $record;

So now I believe I can pass the array to a subroutine, perform the checks and changes I need on each of the lines, write the record out to a CSV file, and then move on to the next record.
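A minimal sketch of that record-at-a-time loop, assuming the form feed really does delimit every record (parse_record is a hypothetical name for the per-record subroutine):

#!/usr/bin/perl
use strict;
use warnings;

local $/ = "\014";                  # records are delimited by the form feed

while (my $record = <>) {
    chomp $record;                  # chomp now strips the trailing \014
    my @lines = split /\n/, $record;
    # hypothetical: pull the fields out of this record's lines
    # my %fields = parse_record(@lines);
}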

Replies are listed 'Best First'.
Re: Parsing a Formatted Text File
by choroba (Cardinal) on Mar 23, 2015 at 14:49 UTC
    Yes, reading 58 lines into an array is the best solution I can think of. The code doesn't seem to be ugly (but ugliness is in the eye of the beholder):
    #!/usr/bin/perl
    use warnings;
    use strict;

    while (not eof) {
        my @lines = map scalar <>, 1 .. 58;
        # Process the record.
    }

    Update: Simplified as per Re^2: Parsing a Formatted Text File. Thanks, the unnamed one.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      No need for the push.

      my @lines = map scalar <>, 1 .. 58;

      Not so much ugly as nit-picky.

      There are templating modules to help you produce formatted reports like this one; are there ones to help you read them?

      Dum Spiro Spero
      So after pushing the lines into the array, would it make sense to then call a subroutine to pull the needed data out of the array and write it out to the CSV file, since each array would equate to one record, which will equate to one line in the CSV file?
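      A minimal sketch of that flow, assuming a hypothetical parse_record subroutine and using the Text::CSV module for the output, so quoting and embedded commas are handled for you:

      use Text::CSV;

      my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
          or die Text::CSV->error_diag;
      open my $out, '>', 'records.csv' or die "Can't write records.csv: $!";

      while (not eof) {
          my @lines  = map scalar <>, 1 .. 58;
          my @fields = parse_record(@lines);    # hypothetical: one record in, one CSV row out
          $csv->print($out, \@fields);
      }
      close $out;

      One CSV line per record falls out naturally, since each pass of the loop reads exactly one record.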
Re: Parsing a Formatted Text File
by GotToBTru (Prior) on Mar 23, 2015 at 14:51 UTC

    One of the keys in processing files like this is knowing how confident you can be in the location of any piece of data. The most likely thing to mess this up would be data that exceeds its usual field size and causes an extra line in the output, or missing data that results in a shorter document than you expect. You have very few fields with tags to help you identify them, so position is going to be how you identify what you're seeing at any given location in the document. I would use regexes and code very defensively, making sure dates look like dates, phone numbers like phone numbers, prices like prices.
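    A minimal sketch of that defensive style; the patterns here are assumptions based on the sample record, not a tested spec:

    my %looks_like = (
        date  => qr{^\d{1,2}/\d{1,2}/\d{4}$},       # e.g. 6/23/2014
        price => qr{^\$?[\d,]+\.\d{2}(?:CR)?$},     # e.g. $257.98, 351.60CR
        phone => qr{^1-\d{3}-\d{3}-\d{4}$},         # e.g. 1-877-123-4567
    );

    sub checked {
        my ($type, $value) = @_;
        warn "line $.: '$value' doesn't look like a $type\n"
            unless $value =~ $looks_like{$type};
        return $value;
    }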

    Dum Spiro Spero

      Huge fan of defensive coding.

      I saw the "skip 58 lines" concept above and shuddered. More war stories than I can count stem from inheriting that kind of blind faith logic.

      Give me something -- anything I can rely on -- in the data itself, and the shakes subside.

      I know it works. It works a lot. But it also fails a lot. Line counting for me is a tool of last resort unless I can be convinced the data format is solid.

      And I'm a hard sell in that department. Too many war wounds.

      :: shudder ::

      I would say that overall the positioning will be fairly good, because I know our client has a program that is outputting these files. But I am absolutely prepped to regex the crap out of the fields that I can, because that formatting is human controlled, and on the other end of the process that formatting will matter.
Re: Parsing a Formatted Text File
by hdb (Monsignor) on Mar 23, 2015 at 15:26 UTC

    If you have a treatment for each line of the report, you could also put the treatments into an array of subs and then use the % (modulus) operator to call them one by one:

    use strict;
    use warnings;
    use Data::Dumper;

    my @subs;
    $subs[ 8 ]  = sub { $_[0]->{'return1'} = substr $_[1], 0, 39 };
    $subs[ 9 ]  = sub { $_[0]->{'return2'} = substr $_[1], 0, 39 };
    $subs[ 10 ] = sub { $_[0]->{'return3'} = substr $_[1], 0, 39 };
    $subs[ 11 ] = sub { $_[0]->{'return4'} = substr $_[1], 0, 39 };

    my @csv;
    while (<DATA>) {
        chomp;
        push @csv, {} if 1 == $. % 58;
        $subs[ $. % 58 ]( $csv[-1], $_ ) if defined $subs[ $. % 58 ];
    }
    print Dumper \@csv;

    __DATA__
    06 01720168-00000000257980
    123 S Somewhere HWY 192
    172016-8
    Company
    NATURAL GAS CO., INC.
    Business
    P O BOX 1547
    123 Road Dr.
    Town ST 12345
    SUITE# 1234
    Town, ST 12345

    Be aware that the sub for the 58th line of each record would be $subs[0], since 58 % 58 is 0. If there is no action for a line, no sub needs to be specified.

    This approach only works if each line really gets its own independent treatment. If there are interactions between lines, then this approach probably will not fit...

    Update: fixed (removed) link to mod operator

Re: Parsing a Formatted Text File
by roboticus (Chancellor) on Mar 23, 2015 at 16:17 UTC

    Pharazon:

    I'd suggest that rather than using a fixed number of lines, you find features in the file that you can detect and verify. Then you can parse the file without relying on the number of lines. That way, if someone gets a letter with an unexpectedly large number of line-item details, you won't lose the data.

    For example, if the line that looks like an account number is the first interesting line in the record, you could write a routine that bundles up a package of lines, splitting on the lines that carry an account number, something like this:

    # 'state' requires perl 5.10+ (use feature 'state';)
    sub get_record {
        my $FH = shift;
        state $previous_account;
        my @record;
        while (my $line = <$FH>) {
            if ($line =~ /^\s{1,10}\d{8}-\d{14}\s*$/) {
                # Found an account number
                if (!defined $previous_account) {
                    # It's the first, so just store it and continue
                    $previous_account = $line;
                }
                else {
                    # Add account number to start of record
                    unshift @record, $previous_account;
                    # Save current account for next record
                    $previous_account = $line;
                    return @record;
                }
            }
            else {
                push @record, $line;
            }
        }
        # Be sure to return the record when the file ends, too
        unshift @record, $previous_account;
        return @record;
    }
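    A possible driver for it (the filename is hypothetical); testing eof yourself keeps you from calling get_record again once the last record has been returned:

    open my $fh, '<', 'statements.txt' or die "Can't open statements.txt: $!";
    until (eof $fh) {
        my @record = get_record($fh);
        # process one statement's worth of lines here
    }
    close $fh;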

    If the file is from a mainframe, then there may also be a page-control character that you could use to split the file apart.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Parsing a Formatted Text File
by RichardK (Parson) on Mar 23, 2015 at 15:28 UTC

    Is there anything in your file you could use as a record separator? Then you just have to read all lines up to the next separator.

      Using UltraEdit I can see a page-break character at the beginning of each record after the first, but it is not a newline character.
Re: Parsing a Formatted Text File (unformat)
by Anonymous Monk on Mar 23, 2015 at 22:43 UTC