Pharazon has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of data that I need to parse into a csv file. The data looks like the following:
06 01720168-00000000257980 123 S Somewhere HWY 192 172016-8 Company NATURAL GAS CO., INC. Business P O BOX 1547 123 Road Dr. Town ST 12345 SUITE# 1234 Town, ST 12345 6/23/2014 $257.98 Business 6/23/2014 123 S Road HWY 123 172016-8 6/09/2014 Town ST 12345 $257.98 02CS 4/30 5/28 3117.0 3259.0 142.0 Meter # C204508 142.0 232.99 Pipe Replacement Pgm SNR Comm 3.27 RESEARCH & DEVELOPMENT TARIFF .03 3.00% Rate Increase County Co Sc Tax on 236.29 7.09 6.00% State Tax on 243.38 14.60 Current Charges 257.98 Previous Amount Due 351.60 Payment Received 5/22 351.60CR Total Amount Due 257.98 1-877-123-4567 66.0 28 142.0 8:00am to 4:00pm 58.6 30 203.0 70.3 28 174.0

The file has one record per 58 lines. I can handle the fine-grained parsing that will need to happen on the lines to pull out the variables, but what I am having trouble wrapping my head around is a method for grabbing 58 lines at a time and then performing the necessary processing on each iteration. For example, once I have the 58 lines read in, I know that lines 8-11, positions 1-39, contain the return address, which I will be putting as "return1","return2","return3","return4" in the CSV file. The same thing is going to happen on every record, for each of the other pieces I need to parse.

I thought about just using a counter and resetting it after every 58 lines while looping through the entire file, but that didn't seem like the best solution. As I'm by no means an expert at Perl, I wanted to check here with you guys to see if anyone has a better place to start or some ideas on how to make this a bit cleaner and more efficient.

If you need any other information please let me know.

UPDATE: I found a control character other than newline in the data at the beginning of each record. I was able to use local $/ = "\014"; to pull the data into a variable one record at a time (I had been looking for newlines or double newlines, not this character; octal 014 is the form feed). I then split each record into an array, one line per element, with my @lines = split /\n/, $record;

So now I believe I can pass the array to a subroutine, perform the checks and changes I need on each of the lines, write the record out to a CSV file, and then move on to the next record.
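A minimal sketch of that record-at-a-time loop, assuming the form feed really does delimit every record (parse_record is a hypothetical name for the per-record subroutine):

#!/usr/bin/perl
use strict;
use warnings;

local $/ = "\014";                  # records are delimited by the form feed

while (my $record = <>) {
    chomp $record;                  # chomp now strips the trailing \014
    my @lines = split /\n/, $record;
    # hypothetical: pull the fields out of this record's lines
    # my %fields = parse_record(@lines);
}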

Replies are listed 'Best First'.
Re: Parsing a Formatted Text File
by choroba (Cardinal) on Mar 23, 2015 at 14:49 UTC
    Yes, reading 58 lines into an array is the best solution I can think of. The code doesn't seem to be ugly (but ugliness is in the eye of the beholder):
    #!/usr/bin/perl
    use warnings;
    use strict;

    while (not eof) {
        my @lines = map scalar <>, 1 .. 58;
        # Process the record.
    }

    Update: Simplified as per Re^2: Parsing a Formatted Text File. Thanks, the unnamed one.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      No need for the push.

      my @lines = map scalar <>, 1 .. 58;

      Not so much ugly as nit-picky.

      There are templating modules to help you produce formatted reports like this one; are there ones to help you read them?

      Dum Spiro Spero
      So after pushing the lines into the array, would it make sense to then call a subroutine to pull the needed data out of the array and write it out to the CSV file, since each array would equate to one record, which will equate to one line in the CSV file?
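      A minimal sketch of that flow, assuming a hypothetical parse_record subroutine and using the Text::CSV module for the output, so quoting and embedded commas are handled for you:

      use Text::CSV;

      my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
          or die Text::CSV->error_diag;
      open my $out, '>', 'records.csv' or die "Can't write records.csv: $!";

      while (not eof) {
          my @lines  = map scalar <>, 1 .. 58;
          my @fields = parse_record(@lines);    # hypothetical: one record in, one CSV row out
          $csv->print($out, \@fields);
      }
      close $out;

      One CSV line per record falls out naturally, since each pass of the loop reads exactly one record.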
Re: Parsing a Formatted Text File
by GotToBTru (Prior) on Mar 23, 2015 at 14:51 UTC

    One of the keys in processing files like this is knowing how confident you can be in the location of any piece of data. The most likely thing to mess this up would be data that exceeds its usual field size and causes an extra line in the output, or missing data that results in a shorter document than you expect. You have very few fields with tags to help you identify them, so position is going to be how you identify what you're seeing at any given location in the document. I would use regexes and code very defensively, making sure dates look like dates, phone numbers like phone numbers, prices like prices.
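    A minimal sketch of that defensive style; the patterns here are assumptions based on the sample record, not a tested spec:

    my %looks_like = (
        date  => qr{^\d{1,2}/\d{1,2}/\d{4}$},       # e.g. 6/23/2014
        price => qr{^\$?[\d,]+\.\d{2}(?:CR)?$},     # e.g. $257.98, 351.60CR
        phone => qr{^1-\d{3}-\d{3}-\d{4}$},         # e.g. 1-877-123-4567
    );

    sub checked {
        my ($type, $value) = @_;
        warn "line $.: '$value' doesn't look like a $type\n"
            unless $value =~ $looks_like{$type};
        return $value;
    }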

    Dum Spiro Spero

      Huge fan of defensive coding.

      I saw the "skip 58 lines" concept above and shuddered. More war stories than I can count stem from inheriting that kind of blind faith logic.

      Give me something -- anything I can rely on -- in the data itself, and the shakes subside.

      I know it works. It works a lot. But it also fails a lot. Line counting for me is a tool of last resort unless I can be convinced the data format is solid.

      And I'm a hard sell in that department. Too many war wounds.

      :: shudder ::

      I would say that overall the positioning will be fairly good, because I know our client has a program that is outputting these files. But I am absolutely prepped to regex the crap out of the fields that I can, because that formatting is human controlled, and on the other end of the process that formatting will matter.
Re: Parsing a Formatted Text File
by hdb (Monsignor) on Mar 23, 2015 at 15:26 UTC

    If you have a treatment for each line of the report, you could also put the treatments into an array of subs and then use the % (modulus) operator to call them one by one:

    use strict;
    use warnings;
    use Data::Dumper;

    my @subs;
    $subs[ 8 ]  = sub { $_[0]->{'return1'} = substr $_[1], 0, 39 };
    $subs[ 9 ]  = sub { $_[0]->{'return2'} = substr $_[1], 0, 39 };
    $subs[ 10 ] = sub { $_[0]->{'return3'} = substr $_[1], 0, 39 };
    $subs[ 11 ] = sub { $_[0]->{'return4'} = substr $_[1], 0, 39 };

    my @csv;
    while (<DATA>) {
        chomp;
        push @csv, {} if 1 == $. % 58;
        $subs[ $. % 58 ]( $csv[-1], $_ ) if defined $subs[ $. % 58 ];
    }
    print Dumper \@csv;

    __DATA__
    06 01720168-00000000257980
    123 S Somewhere HWY 192
    172016-8
    Company
    NATURAL GAS CO., INC.
    Business
    P O BOX 1547
    123 Road Dr.
    Town ST 12345
    SUITE# 1234
    Town, ST 12345

    Be aware that the sub for the 58th line of each record would be $subs[0], since 58 % 58 is 0. If there is no action for a line, no sub needs to be specified.

    This approach only works if each line really gets its own independent treatment. If there are interactions between lines, then this approach probably will not fit...

    Update: fixed (removed) link to mod operator

Re: Parsing a Formatted Text File
by roboticus (Chancellor) on Mar 23, 2015 at 16:17 UTC

    Pharazon:

    I'd suggest that rather than using a fixed number of lines, you find features in the file that you can detect and verify. Then you can parse the file without relying on the number of lines. That way, if someone gets a letter with an unexpectedly large number of line-item details, you won't lose the data.

    For example, if the line that looks like an account number is the first interesting line in the record, you could write a routine that bundles up a package of lines, splitting on the lines that carry an account number, something like this:

    # 'state' requires perl 5.10+ (use feature 'state';)
    sub get_record {
        my $FH = shift;
        state $previous_account;
        my @record;
        while (my $line = <$FH>) {
            if ($line =~ /^\s{1,10}\d{8}-\d{14}\s*$/) {
                # Found an account number
                if (!defined $previous_account) {
                    # It's the first, so just store it and continue
                    $previous_account = $line;
                }
                else {
                    # Add account number to start of record
                    unshift @record, $previous_account;
                    # Save current account for next record
                    $previous_account = $line;
                    return @record;
                }
            }
            else {
                push @record, $line;
            }
        }
        # Be sure to return the record when the file ends, too
        unshift @record, $previous_account;
        return @record;
    }
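    A possible driver for it (the filename is hypothetical); testing eof yourself keeps you from calling get_record again once the last record has been returned:

    open my $fh, '<', 'statements.txt' or die "Can't open statements.txt: $!";
    until (eof $fh) {
        my @record = get_record($fh);
        # process one statement's worth of lines here
    }
    close $fh;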

    If the file is from a mainframe, then there may also be a page-control character that you could use to split the file apart.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Parsing a Formatted Text File
by RichardK (Parson) on Mar 23, 2015 at 15:28 UTC

    Is there anything in your file you could use as a record separator? Then you just have to read all lines up to the next separator.

      Using UltraEdit I can see a page-break character at the beginning of each record after the first, but it is not a newline character.
Re: Parsing a Formatted Text File (unformat)
by Anonymous Monk on Mar 23, 2015 at 22:43 UTC