I have had a look at the data description and other information you supplied and have come up with some ideas. Firstly, here is the data file I am using (spw579580.inp), hopefully in a form pretty close to what you have although full of gibberish.
To start with I give below two skeleton scripts which both do the same thing, namely process records held in a file one at a time. The first is more or less the same as my first post and takes the approach of reading the whole input file into memory (slurping) then pulling out the records using a regular expression match. If your data set is too large this approach will not be possible so I have given a second example that reads the file a record at a time, thus saving memory. You will see in the second script that you can set the input record separator so that one read from the filehandle pulls in an entire record. Here's the slurp version
and here is the record-by-record
Note the use strict; and use warnings; at the top of each script; get into the habit of using these as the first forces you to pre-declare your variables with my or,rarely, our, thus catching typos etc., and the second gives warnings about possible problems like using a variable that is undefined.
Once we have a record we have to do three things: 1) separate the header from the detail; 2) count the detail lines then print to output file (spw579580.dets), keeping a count of how many lines have been written to the file; 3) process the header information to form a single formatted line (including offset and length info from the detail section) and print to output file (spw579580.hdrs).
Before we can do these tasks we need to initialise the count of lines written to the details file so add a line to the script like
my $detailsLinesWritten = 0;
just after the file is opened.
So, the first task. Both the above scripts strip off the ^L and the EOE sentinels so we have 16 lines of header followed by the details block. We can use the three-argument form of split to break the record up on newline boundaries and specify a maximum of 17 items, the 17th being our details block. We assign the items to an array, like this
my @items = split m{\n}, $record, 17;
We can manipulate the array using pop to pop one element off the right-hand end to get our details block and count how many lines by counting newlines, like this
my $details = pop @items;
my $detailsLineCt = ($details =~ tr/\n//);
As an aside, read up on push, pop, unshift, shift and splice for messing around with arrays.
Printing the details block and counting the lines written is as simple as
print $detailFH $details;
$detailsLinesWritten += $detailsLineCt;
Note that there is no comma between the filehandle and the thing to be printed. Wow, we've done tasks 1) and 2) already. Task 3) has got a bit more to it though.
A lot of the data seems to be marooned in the middle or at the end of lines so a subroutine to strip leading and trailing spaces would be useful. Something like this at the end of the script
sub stripSpaces
{
my $toStrip = shift;
return
$toStrip =~ m{\A\s*(.+?)\s*\z}
? $1
: q{};
}
Array subscripts are zero based and I have already stripped off the ^L which you had as line 1 so your line 2 is in $items[0]. Note the $ sigil is used, not the @ when accessing a single element of an array. We need to build up the output fields ready to assemble the line of data for the headers file. Those fields that are unchanging you can set up before the record processing loop, either with your companies text or an empty string, e.g.
my $fld2 = q{Some company text};
my $fld3 = q{Some other company text};
...
my $fld5 = q{};
...
my $fld14 = q{};
However, those fields that do change will need to be re-initialised each time around the loop so just about the first piece of code after the # FURTHER PROCESSING GOES HERE ... comment should do this. You can initialise each field one at a time
my $fld1 = q{};
my $fld6 = q{};
...
my $fld15 = q{};
or you can do it in one fell swoop
my ($fld1, $fld6, $fld7, $fld8, $fld11, $fld12, $fld15)
= (q{}) x 7;
Field 1 is the a/c no. which was in either your line 8 or 11. Don't forget though that I have lost the ^L and that my @items array has zero-based subscripts; thus I can find the a/c no. in either $items[6] or $items[9]. We also have to strip off any leading or trailing spaces using the subroutine we declared so the code to populate field 1 becomes
$fld1 = stripSpaces($items[6]);
Most of the other fields you can populate the same way; the tricky ones are the two dates and the logical somersaults (quite simple ones) for the names. Let's do the dates first. We can pull each date out of line 12 in turn with a regular expression. Once we have done that we can transform the date from dd MMM yyyy to ddMMMyy. Something like this (not tested)
# Pull out two dates from line 12 with a global match of
# 2 digits, a space, three letters, a space 4 digits. The
# round brackets allow you to capture what matches inside
# them.
#
my ($startPeriod, $endPeriod) =
$items[10] =~ m{(\d\d\s[A-Za-z]{3}\s\d{4})}g;
# Transform date by capturing (round brackets) the day,
# month and last 2 digits of the year in $1, $2 and $3
# then concatenating them; the 'e' flag after the
# regular expression tells the regex engine to execute
# the code to compute the substituting string. The '.'
# is the string concatenation operator.
#
($fld11 = $startPeriod)
=~ s{(\d\d)\s([A-Za-z]{3})\s\d\d(\d\d)}{$1 . $2 . $3}e;
($fld12 = $endPeriod)
=~ s{(\d\d)\s([A-Za-z]{3})\s\d\d(\d\d)}{$1 . $2 . $3}e;
We need to test whether there is a name on line 3 before we can decide what to do with fields 6 and 15. If there is nothing on line 3, stripSpaces() will return an empty string which is FALSE in boolean tests, so
my $line3 = stripSpaces($items[1]);
my $line4 = stripSpaces($items[2]);
if($line3)
{
$fld6 = $line3;
$fld15 = $line4;
}
else
{
$fld6 = $line4;
}
I've now shown you how to populate all of the fields and we have the start line and line count as well so all that remains is to construct the header line and print it to file. The obvious function to use is pack; it takes a template string and a list of items and packs the items into a string by applying the template. Consider this code snippet
my $str1 = q{abc};
my $str2 = q{zyxwvut};
my $template = q{A5A5};
my $packed = pack $template, $str1, $str2;
print qq{>$packed<\n};
prints
>abc zyxwv<
The 'A' template letter packs the string and pads with spaces, or truncates if appropriate. The 'a' letter pads with nulls which is not what we want. There's a whole heap of possible templates so it is worth reading this function up. We can construct our template like this (the x string multiplication operator comes in handy here)
my $hdrTemplate
= q{A18}
. q{A40} x 9
. q{A7} x 2
. q{A10) x 2
. q{A40}
. q{A8}
. q{A6};
and as it is not something that changes we should place the code before the record processing loop. Putting the header together and printing it can be done towards the end of the record processing loop just before the details block is written and the line offset updated.
my $headerStr = pack $hdrTemplate
, $fld1
, $fld2
, $fld3
, $fld4
, $fld5
, $fld6
, $fld7
, $fld8
, $fld9
, $fld10
, $fld11
, $fld12
, $fld13
, $fld14
, $fld15
, $detailsLinesWritten
, $detailsLineCt;
print $headerFH qq{$headerStr\n};
Note that you don't have to do anything for the pack to convert the numbers to strings.
I think that just about covers everything. You should be able to put all of this together to get something working but if anything is not clear or if it looks like there is a jigsaw piece missing, please ask. Cheers, JohnGG |