comment on

I have had a look at the data description and other information you supplied and have come up with some ideas. Firstly, here is the data file I am using (spw579580.inp), hopefully in a form pretty close to what you have although full of gibberish.

^L


            Mary Clary
            345 High Road

            Sometown, MA, 38776
                                    #23 23232#


                                    #23 23232#
                  08 Apr 1997   11 Mar 2002

message to include 1
message to include 2


Beginning of account detail information for a/c #23 23232#
fgbdfgehbetr 1
67jufgjtykutyruk 2
rytjrytjrtyjryt 3
fghmfhmjryyuy 4
hethe5tyhdtghetyh 5
EOE
^L

            Bill V. Jones
            Bill Jones
            1214 Reno Heights

            Burbank, CA, 43523
                                    #444-7777#


                                    #444-7777#
                  22 Nov 2002   13 Aug 2005

message line 1
message line 2


Beginning of account detail information for a/c #444-7777#
This is data line 1
and here is data line 2
and 3 here
oh, and number 4
and a speech
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums chang'd to merry meetings,
Our dreadful marches to delightful measures.
Grim-visag'd war hath smooth'd his wrinkled front;
And now, -- instead of mounting barbed steeds
To fright the souls of fearful adversaries, --
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, -- that am not shap'd for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
Cheated of feature by dissembling nature,
Deform'd, unfinish'd, sent before my time
Into this breathing world scarce half made up,
And that so lamely and unfashionable
That dogs bark at me as I halt by them; --
Why, I, in this weak piping time of peace,
Have no delight to pass away the time,
Unless to spy my shadow in the sun,
And descant on mine own deformity:
And therefore, -- since I cannot prove a lover,
To entertain these fair well-spoken days, --
I am determined to prove a villain,
And hate the idle pleasures of these days.
Plots have I laid, inductions dangerous,
By drunken prophecies, libels, and dreams,
To set my brother Clarence and the king
In deadly hate the one against the other:
And if King Edward be as true and just
As I am subtle, false, and treacherous,
This day should Clarence closely be mew'd up, --
About a prophecy which says that G
Of Edward's heirs the murderer shall be.
Dive, thoughts, down to my soul: -- here Clarence comes.
EOE
^L


            Fred Bloggs
            23 Corporation St.

            New York, NY, 12345
                                    #1234567#


                                    #1234567#
                  13 Mar 1998   24 Oct 2001

message line 1
message line 2


Beginning of account detail information for a/c #1234567#
This is data line 1
and here is data line 2
and 3 here
EOE
[download]

To start with I give below two skeleton scripts which both do the same thing, namely process records held in a file one at a time. The first is more or less the same as my first post and takes the approach of reading the whole input file into memory (slurping) then pulling out the records using a regular expression match. If your data set is too large this approach will not be possible so I have given a second example that reads the file a record at a time, thus saving memory. You will see in the second script that you can set the input record separator so that one read from the filehandle pulls in an entire record. Here's the slurp version

use strict;
use warnings;

# Initialise file names and open files, one
# for reading, two for writing.
#
my $inputFile  = q{spw579580.inp};
my $headerFile = q{spw579580.hdrs};
my $detailFile = q{spw579580.dets};

open my $inputFH, q{<}, $inputFile
   or die qq{open: $inputFile: $!\n};
open my $headerFH, q{>}, $headerFile
   or die qq{open: $headerFile: $!\n};
open my $detailFH, q{>}, $detailFile
   or die qq{open: $detailFile: $!\n};

# Set up start and end sentinels.
#
my $startSentinel = q{^L};
my $endSentinel   = q{EOE};

# Compile regex to pull out records. Note the
# \Q ... \E to quote regex metacharacters if your
# record start and stop sentinels contain them.
#
my $rxExtractRecord = qr
   {(?xms)
      \Q$startSentinel\E\n
      (.*?)
      (?=\Q$endSentinel\E\n|\z)
   };

# Slurp whole file into string.
#
my $completeFile;
{
    local $/;
    $completeFile = <$inputFH>
}

# Do a global match against compiled regex to
# pull out records and put them in an array. They
# will have the start and end record sentinels
# removed by this process.
#
my @records = $completeFile =~ m{$rxExtractRecord}g;

# Process each record in a loop.
#
foreach my $record (@records)
{
    # We now have the record in a scalar
    # variable $record ready for processing.
    #

    # FURTHER PROCESSING GOES HERE ...
}

# Close open filehandles.
#
close $inputFH
   or die qq{close: $inputFile: $!\n};
close $headerFH
   or die qq{close: $headerFile: $!\n};
close $detailFH
   or die qq{close: $detailFile: $!\n};
[download]

and here is the record-by-record

use strict;
use warnings;

# Initialise file names and open files, one
# for reading, two for writing.
#
my $inputFile  = q{spw579580.inp};
my $headerFile = q{spw579580.hdrs};
my $detailFile = q{spw579580.dets};

open my $inputFH, q{<}, $inputFile
   or die qq{open: $inputFile: $!\n};
open my $headerFH, q{>}, $headerFile
   or die qq{open: $headerFile: $!\n};
open my $detailFH, q{>}, $detailFile
   or die qq{open: $detailFile: $!\n};

# In a code block localise $/, the input record
# separator, to your end of record string.
#
{
    local $/ = qq{EOE\n};

    # Read each record in a while loop. Strip the
    # start and end record sentinels off.
    #
    while (defined(my $record = <$inputFH>))
    {
        $record =~ s{\A\^L\n}{};
        $record =~ s{EOE\n\z}{};

        # We now have the record in a scalar
        # variable $record ready for processing.
        #

        # FURTHER PROCESSING GOES HERE ...
    }
}

# Close open filehandles.
#
close $inputFH
   or die qq{close: $inputFile: $!\n};
close $headerFH
   or die qq{close: $headerFile: $!\n};
close $detailFH
   or die qq{close: $detailFile: $!\n};
[download]

Note the use strict; and use warnings; at the top of each script; get into the habit of using these as the first forces you to pre-declare your variables with my or,rarely, our, thus catching typos etc., and the second gives warnings about possible problems like using a variable that is undefined.

Once we have a record we have to do three things: 1) separate the header from the detail; 2) count the detail lines then print to output file (spw579580.dets), keeping a count of how many lines have been written to the file; 3) process the header information to form a single formatted line (including offset and length info from the detail section) and print to output file (spw579580.hdrs).

Before we can do these tasks we need to initialise the count of lines written to the details file so add a line to the script like

my $detailsLinesWritten = 0;
[download]

just after the file is opened.

So, the first task. Both the above scripts strip off the ^L and the EOE sentinels so we have 16 lines of header followed by the details block. We can use the three-argument form of split to break the record up on newline boundaries and specify a maximum of 17 items, the 17th being our details block. We assign the items to an array, like this

my @items = split m{\n}, $record, 17;
[download]

We can manipulate the array using pop to pop one element off the right-hand end to get our details block and count how many lines by counting newlines, like this

my $details = pop @items;
my $detailsLineCt = ($details =~ tr/\n//);
[download]

As an aside, read up on push, pop, unshift, shift and splice for messing around with arrays.

Printing the details block and counting the lines written is as simple as

print $detailFH $details;
$detailsLinesWritten += $detailsLineCt;
[download]

Note that there is no comma between the filehandle and the thing to be printed. Wow, we've done tasks 1) and 2) already. Task 3) has got a bit more to it though.

A lot of the data seems to be marooned in the middle or at the end of lines so a subroutine to strip leading and trailing spaces would be useful. Something like this at the end of the script

sub stripSpaces
{
    my $toStrip = shift;
    return
         $toStrip =~ m{\A\s*(.+?)\s*\z}
       ? $1
       : q{};
}
[download]

Array subscripts are zero based and I have already stripped off the ^L which you had as line 1 so your line 2 is in $items[0]. Note the $ sigil is used, not the @ when accessing a single element of an array. We need to build up the output fields ready to assemble the line of data for the headers file. Those fields that are unchanging you can set up before the record processing loop, either with your companies text or an empty string, e.g.

my $fld2  = q{Some company text};
my $fld3  = q{Some other company text};
...
my $fld5  = q{};
...
my $fld14 = q{};
[download]

However, those fields that do change will need to be re-initialised each time around the loop so just about the first piece of code after the # FURTHER PROCESSING GOES HERE ... comment should do this. You can initialise each field one at a time

my $fld1  = q{};
my $fld6  = q{};
...
my $fld15 = q{};
[download]

or you can do it in one fell swoop

my ($fld1, $fld6, $fld7, $fld8, $fld11, $fld12, $fld15)
   = (q{}) x 7;
[download]

Field 1 is the a/c no. which was in either your line 8 or 11. Don't forget though that I have lost the ^L and that my @items array has zero-based subscripts; thus I can find the a/c no. in either $items[6] or $items[9]. We also have to strip off any leading or trailing spaces using the subroutine we declared so the code to populate field 1 becomes

$fld1 = stripSpaces($items[6]);
[download]

Most of the other fields you can populate the same way; the tricky ones are the two dates and the logical somersaults (quite simple ones) for the names. Let's do the dates first. We can pull each date out of line 12 in turn with a regular expression. Once we have done that we can transform the date from dd MMM yyyy to ddMMMyy. Something like this (not tested)

# Pull out two dates from line 12 with a global match of
# 2 digits, a space, three letters, a space 4 digits. The
# round brackets allow you to capture what matches inside
# them.
#
my ($startPeriod, $endPeriod) =
   $items[10] =~ m{(\d\d\s[A-Za-z]{3}\s\d{4})}g;

# Transform date by capturing (round brackets) the day,
# month and last 2 digits of the year in $1, $2 and $3
# then concatenating them; the 'e' flag after the
# regular expression tells the regex engine to execute
# the code to compute the substituting string. The '.'
# is the string concatenation operator.
#
($fld11 = $startPeriod)
   =~ s{(\d\d)\s([A-Za-z]{3})\s\d\d(\d\d)}{$1 . $2 . $3}e;
($fld12 = $endPeriod)
   =~ s{(\d\d)\s([A-Za-z]{3})\s\d\d(\d\d)}{$1 . $2 . $3}e;
[download]

We need to test whether there is a name on line 3 before we can decide what to do with fields 6 and 15. If there is nothing on line 3, stripSpaces() will return an empty string which is FALSE in boolean tests, so

my $line3 = stripSpaces($items[1]);
my $line4 = stripSpaces($items[2]);
if($line3)
{
    $fld6  = $line3;
    $fld15 = $line4;
}
else
{
    $fld6  = $line4;
}
[download]

I've now shown you how to populate all of the fields and we have the start line and line count as well so all that remains is to construct the header line and print it to file. The obvious function to use is pack; it takes a template string and a list of items and packs the items into a string by applying the template. Consider this code snippet

my $str1 = q{abc};
my $str2 = q{zyxwvut};
my $template = q{A5A5};
my $packed = pack $template, $str1, $str2;
print qq{>$packed<\n};
[download]

prints

>abc  zyxwv<
[download]

The 'A' template letter packs the string and pads with spaces, or truncates if appropriate. The 'a' letter pads with nulls which is not what we want. There's a whole heap of possible templates so it is worth reading this function up. We can construct our template like this (the x string multiplication operator comes in handy here)

my $hdrTemplate
   = q{A18}
   . q{A40} x 9
   . q{A7}  x 2
   . q{A10) x 2
   . q{A40}
   . q{A8}
   . q{A6};
[download]

and as it is not something that changes we should place the code before the record processing loop. Putting the header together and printing it can be done towards the end of the record processing loop just before the details block is written and the line offset updated.

my $headerStr = pack $hdrTemplate
   , $fld1
   , $fld2
   , $fld3
   , $fld4
   , $fld5
   , $fld6
   , $fld7
   , $fld8
   , $fld9
   , $fld10
   , $fld11
   , $fld12
   , $fld13
   , $fld14
   , $fld15
   , $detailsLinesWritten
   , $detailsLineCt;
print $headerFH qq{$headerStr\n};
[download]

Note that you don't have to do anything for the pack to convert the numbers to strings.

I think that just about covers everything. You should be able to put all of this together to get something working but if anything is not clear or if it looks like there is a jigsaw piece missing, please ask.

Cheers,

JohnGG

In reply to Re: File read and re-ordering by johngg
in thread File read and re-ordering by KarmicGrief

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.