File read and re-ordering

KarmicGrief has asked for the wisdom of the Perl Monks concerning the following question:

Re: File read and re-ordering by davorg (Chancellor) on Oct 20, 2006 at 14:02 UTC
You're not being very clear, but it sounds like you want to do something like this:

    while (<INPUT>) {
        # look for start of record
        next unless /START_RECORD_MARKER/;

        # output the next 17 lines into a file
        my @seventeen;
        push @seventeen, scalar <INPUT> for 1 .. 17;
        # reformat the contents of @seventeen in some way
        open OUTPUT, '>', 'first_part_of_record.txt' or die $!;
        print OUTPUT @seventeen;
        close OUTPUT;

        # output the rest of the record into another file
        my @rest;
        while (<INPUT>) {
            last if /END_RECORD_MARKER/;
            push @rest, $_;
        }
        # reformat @rest in some way
        open OUTPUT, '>', 'next_part_of_record.txt' or die $!;
        print OUTPUT @rest;
        close OUTPUT;
    }

This overwrites the two files each time round the loop, so you'll need to add some extra processing to create unique names for them.

And I'm not sure what you mean about "assignments". I haven't used any variable assignments at all - so I'm probably not doing what you want.

Update: Changed the logic a bit.

-- <http://dave.org.uk>
"The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
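One way to get the unique file names davorg mentions is to number the records and build the names from a counter. This is only a minimal sketch: the helper `record_file_names` and the `record_NNNN_...` naming pattern are assumptions made for illustration, not anything from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: build the pair of output file names for
# record number $n, zero-padded so the files sort in record order.
sub record_file_names {
    my ($n) = @_;
    return (
        sprintf('record_%04d_first.txt', $n),
        sprintf('record_%04d_rest.txt',  $n),
    );
}

# In the real loop the counter would be bumped once per record,
# right after the start-of-record marker has been seen.
my $recordCount = 0;
for (1 .. 3) {
    my ($firstName, $restName) = record_file_names(++$recordCount);
    print "$firstName $restName\n";
}
```

The two names would then replace the fixed 'first_part_of_record.txt' and 'next_part_of_record.txt' in the `open` calls.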
Re: File read and re-ordering by liverpole (Monsignor) on Oct 20, 2006 at 13:42 UTC
Hi KarmicGrief,

Please show us what code you've got already, and give us specifics about exactly what the output should look like.

I'm afraid your question is too vague to be able to give you better help without more details on what you've already tried, and where exactly you're stuck.

s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
Re: File read and re-ordering by johngg (Canon) on Oct 20, 2006 at 16:04 UTC
Like davorg and shmem, I am not sure what it is you are trying to do. Perhaps you could supply a data sample as well as the code you have tried so far and your expected output.

I have had a stab at my best guess, making the assumption that your data, although possibly hundreds of records, is still small enough to fit in memory. Thus I am "slurping" the entire data file into memory. If your file is too large this may not be an option. I have annotated the code with (hopefully) explanatory comments. My example data uses just three header lines to illustrate the code in order to save space. Here it is:

    use strict;
    use warnings;

    # Set up start and end record sentinels.
    #
    my $startSentinel = q{start record};
    my $endSentinel   = q{end record};

    # Compile regex to pull out records. Note the
    # \Q ... \E to quote regex metacharacters if your
    # record start and stop sentinels contain them.
    #
    my $rxExtractRecord = qr{(?xms)
        \Q$startSentinel\E\n
        (.*?)
        (?=\Q$endSentinel\E\n|\z)
    };

    # Slurp file into string; I'm using the <DATA>
    # filehandle but you would open your file and
    # slurp that.
    #
    my $completeFile;
    {
        local $/;
        $completeFile = <DATA>
    }

    # Do a global match against compiled regex to
    # pull out records and put them in an array.
    #
    my @records = $completeFile =~ m{$rxExtractRecord}g;

    # Process each record in a loop.
    #
    foreach my $record (@records)
    {
        # Split record up into 4 items on newline. I
        # have used 4 here as I have only put three
        # headers in my data for brevity. Get the data
        # part by pop'ing the last item off the @items
        # array so that @items only contains the
        # headers
        #
        my @items = split m{\n}, $record, 4;
        my $data  = pop @items;

        # This is where your specification becomes a
        # bit vague. Perhaps you would pull the name out
        # of the hdr1:... line and open two files, e.g.
        # fred.hdr and fred.dat and do a printf of the
        # headers to the first and a print of the data
        # to the second. You would open, print and close
        # the files in this loop. Perhaps your mention
        # of assignment means you want to transform the
        # headers in some way before doing a printf.
        #
        ..... do something we can't guess here .....
    }

    __END__
    start record
    hdr1:fred
    hdr2:2002-10-15
    hdr3:head honcho
    and here is some date about fred
    telling us all what a great guy he is
    end record
    start record
    hdr1:pete
    hdr2:2005-03-22
    hdr3:bottle washer
    not much data for pete
    end record
    start record
    hdr1:mary
    hdr2:2004-01-31
    hdr3:personal assistant
    mary is a great asset to the company
    and will go far
    end record

Hopefully this and the other responses will get you started.

Cheers,
JohnGG

Update: Used variables to hold start and end sentinels.
Re: File read and re-ordering by shmem (Chancellor) on Oct 20, 2006 at 15:27 UTC
Let's see if I can do something while I read your specification.

    # "What I essentially need is a way to say..."
    my $start_tag     = 'BEGIN';
    my $end_constant  = 'END';
    my $headerformat  = join('; ', ("%s=%s") x 17)."\n"; # XXX ? format for printf() ?
    my $headerfile    = 'header000';
    my $detailsection = 'detail000';

    open(IN, '<', $file) or die "Can't read '$file': $!\n";

    RECORD:
    while(defined($_ = <IN>)) {
        my %out;                # we'll capture the "record content" for printf() here
        if (/$start_tag/) {     # "...start at this character..."
            $. = 0;             # reset line counter
            while(<IN>) {       # "...read each line for the next 17..."
                                # XXX the specs aren't clear here.
                my ($key, $value) = split;   # so I'll just split
                $out{$key} = $value;         # "...assign to values to do
                                             #  a printf statement..."
                if($. == 17) {  # done with reading the record. do "the printf()"
                    open HEADER, '>', $headerfile
                        or die "Can't write to '$headerfile': $!\n";
                    printf HEADER $headerformat, map { $_,$out{$_} } keys %out;
                    close HEADER;
                    $headerfile++;    # string increment: header000 -> header001
                    $. = 0;
                    open DETAIL, '>', $detailsection
                        or die "Can't write to '$detailsection': $!\n";
                                      # "starting at the 18th line of that record..."
                    while(<IN>) {     # "...read until the end constant..."
                        print DETAIL; # "...and print that straight to
                                      #  a different file..."
                        last if /$end_constant/;
                    }
                    close DETAIL;
                    $detailsection++;
                    next RECORD;      # "...and then start the entire loop again."
                }
            }
        }
    }

Poo. That might work, but it's butt ugly and hard to read. Let's refactor that a bit.

    open(my $fh, '<', $file) or die "Can't read '$file': $!\n";
    while(<$fh>) {
        if (/$start_tag/) {
            write_record($fh,$headerfile);
            $headerfile++;    # one header file per record
        }
    }

    sub write_record {
        my ($fh, $outfile) = @_;
        my %out;
        $. = 0;
        while(<$fh>) {
            my ($key, $value) = split;
            $out{$key} = $value;
            if ($. == 17) {
                open my $header, '>', $outfile
                    or die "Can't write '$outfile': $!\n";
                printf $header $headerformat, map { $_,$out{$_} } keys %out;
                close $header;
                write_detail($fh,$detailsection);
                $detailsection++;
                return;
            }
        }
    }

    sub write_detail {
        my ($fh, $detailfile) = @_;
        open my $detail, '>', $detailfile
            or die "Can't write to '$detailfile': $!\n";
        while(<$fh>) {
            last if /$end_constant/;
            print $detail;
        }
        close $detail;
    }

Does that make sense to you? If it doesn't, write better specifications ;-)

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: File read and re-ordering by johngg (Canon) on Oct 23, 2006 at 10:13 UTC
Right, looking at your updated post there are still a few imponderables.

Is the start of record tag the literal caret followed by uppercase L or are we seeing a representation of a <control-L>?

Because of line-wrap it is difficult to tell what lines your header data is on. It looks like name, address and country/ZIP are on lines 3, 4 and 5 but are you certain that the address will always fit this pattern? What characters are you likely to find and have to allow for in the data? It looks like you have a/c no. on lines 6 and 8; are the lines identical and what is the format of the number? Did you mean end-period-date rather than end-period-value? What is the date format?

Most importantly, is that all of the data you wish to extract from the header? What widths are you going to lay down for each field? By specifying fixed length records you imply that there will be no field separators in your header file; is this what you want or do you want separators to aid legibility?

As for the proposed output files, it looks like you intend to have a single header file containing a line of fixed-width data for each customer plus a file of variable length data for each customer. How do you intend to associate the header info with the relevant data file? I assume that a/c no. would be unique so that could form part of the data file name.

Answering these questions may help you towards a solution and help us to help you.

Cheers,
JohnGG
Re^2: File read and re-ordering by KarmicGrief (Initiate) on Oct 23, 2006 at 14:08 UTC
Sorry about the formatting, being newbish sucks. I will try to answer the questions here to give a better idea of what is happening.

The record starts with a literal caret L, which I am counting as line 1. The lines are:

line 1: literal ^L
line 2: always blank
line 3: blank, or a name
line 4: always a name
line 5: always an address
line 6: always blank
line 7: city state zip (which would need to be split out into line 7, line 7a and line 7b)
line 8: an a/c # (in the format #-#)
line 9: blank
line 10: blank
line 11: an a/c # (in the same format as before)
line 12: the 2 date fields, beginning period and ending period (format is for example 01 Oct 2006 31 Oct 2006), which I will need split into line 12 and line 12a
line 13: blank
line 14: a message line
line 15: a message line
line 16: blank
line 17: blank
line 18: begins the details of the account
line 19 onward: a variable number of detail lines, finally ending with an (EOE)

Then the next record begins again with the ^L.

That is all the data I need to pull for the header and the detail files. The widths on the header file vary depending on the field; it could be a 12 character field or a 40 character field. There is no need for field delimiters since there is a process already in place to read the exact field positions. The a/c number is what would be used to associate the detail with the header file, and yes, the header file needs to be one line for each customer.

Does this help?
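The split of line 7 into city, state and zip could be sketched as below. The post does not show the exact layout of that line, so this assumes whitespace-separated fields where the last token is the zip, the one before it is a single-token state, and everything else is the city; the helper name `split_city_state_zip` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper. ASSUMPTION: the line looks like
# "<city words> <state> <zip>", separated by whitespace.
sub split_city_state_zip {
    my ($line) = @_;
    $line =~ s/\A\s+|\s+\z//g;            # trim both ends
    my @parts = split /\s+/, $line;
    return (q{}, q{}, q{}) if @parts < 3; # not enough pieces to split
    my $zip   = pop @parts;               # last token
    my $state = pop @parts;               # token before the zip
    my $city  = join q{ }, @parts;        # whatever is left
    return ($city, $state, $zip);
}

my ($city, $state, $zip) = split_city_state_zip('   New York NY 10001  ');
print "city=$city state=$state zip=$zip\n";
```

If the real data has multi-word states or a comma after the city, the splitting rule would need to change accordingly.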
Re^3: File read and re-ordering by johngg (Canon) on Oct 23, 2006 at 15:05 UTC
Re. formatting, have a read of the link shmem gave you and just have a play around to see what works, trying things out on your private scratchpad which you can find on your home node. <p> and <code> ... </code> tags are your friends.

So, further questions.

Do you want to capture the possible name on line 3 or will you always use the one on line 4?

Are the a/c nos. on lines 8 and 11 the same and do you want to capture just one of them?

Are the dates ddMMMyyyy or dd MMM yyyy? It looks like the latter.

Is information on line 18 significant or is it just a marker with the meat starting on line 19 et seq.?

How many output fields, what order, what widths and what pad character? What is your policy on truncating data that is too wide?

I think that given answers to the above I can (without writing your whole application for you :-) make some suggestions and code pointers on how you can proceed.

Cheers,
JohnGG
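To make the width, pad character and truncation questions concrete, here is a small sketch of a fixed-width field helper. The name `fixed_width` and the choice to left-justify and to cut over-long values on the right are assumptions for illustration only; the thread does not say which policy is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: force $value into exactly $width characters,
# left-justified, padded on the right with $pad (a space by default)
# and truncated on the right if it is too long.
sub fixed_width {
    my ($value, $width, $pad) = @_;
    $value = q{} unless defined $value;
    $pad   = q{ } unless defined $pad && length $pad;
    return substr($value . ($pad x $width), 0, $width);
}

print '>', fixed_width('abc',     5),      "<\n";   # padded with spaces
print '>', fixed_width('zyxwvut', 5),      "<\n";   # truncated to 5
print '>', fixed_width('42',      6, '.'), "<\n";   # custom pad character
```

With a space as the pad character, `sprintf('%-5.5s', $value)` gives the same padding and truncation.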
Re^4: File read and re-ordering by KarmicGrief (Initiate) on Oct 23, 2006 at 16:13 UTC
Re^5: File read and re-ordering by johngg (Canon) on Oct 23, 2006 at 18:55 UTC
Re: File read and re-ordering by shmem (Chancellor) on Oct 20, 2006 at 18:53 UTC
"well crap, I just updated and the formatting is off so it's just garbled, anyone tell me how to format this so it's legible?"

Writeup Formatting Tips is for you ;-)

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: File read and re-ordering by johngg (Canon) on Oct 25, 2006 at 22:50 UTC
I have had a look at the data description and other information you supplied and have come up with some ideas. Firstly, here is the data file I am using (spw579580.inp), hopefully in a form pretty close to what you have although full of gibberish.

Read more... (3 kB)

To start with I give below two skeleton scripts which both do the same thing, namely process records held in a file one at a time. The first is more or less the same as my first post and takes the approach of reading the whole input file into memory (slurping) then pulling out the records using a regular expression match. If your data set is too large this approach will not be possible so I have given a second example that reads the file a record at a time, thus saving memory. You will see in the second script that you can set the input record separator so that one read from the filehandle pulls in an entire record. Here's the slurp version

Read more... (2 kB)

and here is the record-by-record

Read more... (1345 Bytes)

Note the `use strict;` and `use warnings;` at the top of each script; get into the habit of using these as the first forces you to pre-declare your variables with `my` or, rarely, `our`, thus catching typos etc., and the second gives warnings about possible problems like using a variable that is undefined.

Once we have a record we have to do three things:

1) separate the header from the detail;
2) count the detail lines then print to output file (spw579580.dets), keeping a count of how many lines have been written to the file;
3) process the header information to form a single formatted line (including offset and length info from the detail section) and print to output file (spw579580.hdrs).

Before we can do these tasks we need to initialise the count of lines written to the details file so add a line to the script like

    my $detailsLinesWritten = 0;

just after the file is opened.

So, the first task. Both the above scripts strip off the `^L` and the `EOE` sentinels so we have 16 lines of header followed by the details block. We can use the three-argument form of `split` to break the record up on newline boundaries and specify a maximum of 17 items, the 17th being our details block. We assign the items to an array, like this

    my @items = split m{\n}, $record, 17;

We can manipulate the array using `pop` to pop one element off the right-hand end to get our details block and count how many lines by counting newlines, like this

    my $details       = pop @items;
    my $detailsLineCt = ($details =~ tr/\n//);

As an aside, read up on `push`, `pop`, `unshift`, `shift` and `splice` for messing around with arrays. Printing the details block and counting the lines written is as simple as

    print $detailFH $details;
    $detailsLinesWritten += $detailsLineCt;

Note that there is no comma between the filehandle and the thing to be printed.

Wow, we've done tasks 1) and 2) already. Task 3) has got a bit more to it though. A lot of the data seems to be marooned in the middle or at the end of lines so a subroutine to strip leading and trailing spaces would be useful. Something like this at the end of the script

    sub stripSpaces
    {
        my $toStrip = shift;
        return $toStrip =~ m{\A\s*(.+?)\s*\z} ? $1 : q{};
    }

Array subscripts are zero based and I have already stripped off the `^L` which you had as line 1 so your line 2 is in `$items[0]`. Note the `$` sigil is used, not the `@`, when accessing a single element of an array.

We need to build up the output fields ready to assemble the line of data for the headers file. Those fields that are unchanging you can set up before the record processing loop, either with your company's text or an empty string, e.g.

    my $fld2 = q{Some company text};
    my $fld3 = q{Some other company text};
    ...
    my $fld5 = q{};
    ...
    my $fld14 = q{};

However, those fields that do change will need to be re-initialised each time around the loop so just about the first piece of code after the `# FURTHER PROCESSING GOES HERE ...` comment should do this. You can initialise each field one at a time

    my $fld1 = q{};
    my $fld6 = q{};
    ...
    my $fld15 = q{};

or you can do it in one fell swoop

    my ($fld1, $fld6, $fld7, $fld8, $fld11, $fld12, $fld15) = (q{}) x 7;

Field 1 is the a/c no. which was in either your line 8 or 11. Don't forget though that I have lost the `^L` and that my `@items` array has zero-based subscripts; thus I can find the a/c no. in either `$items[6]` or `$items[9]`. We also have to strip off any leading or trailing spaces using the subroutine we declared so the code to populate field 1 becomes

    $fld1 = stripSpaces($items[6]);

Most of the other fields you can populate the same way; the tricky ones are the two dates and the logical somersaults (quite simple ones) for the names. Let's do the dates first. We can pull each date out of line 12 in turn with a regular expression. Once we have done that we can transform the date from dd MMM yyyy to ddMMMyy. Something like this (not tested)

    # Pull out two dates from line 12 with a global match of
    # 2 digits, a space, three letters, a space, 4 digits. The
    # round brackets allow you to capture what matches inside
    # them.
    #
    my ($startPeriod, $endPeriod) =
        $items[10] =~ m{(\d\d\s[A-Za-z]{3}\s\d{4})}g;

    # Transform date by capturing (round brackets) the day,
    # month and last 2 digits of the year in $1, $2 and $3
    # then concatenating them; the 'e' flag after the
    # regular expression tells the regex engine to execute
    # the code to compute the substituting string. The '.'
    # is the string concatenation operator.
    #
    ($fld11 = $startPeriod) =~
        s{(\d\d)\s([A-Za-z]{3})\s\d\d(\d\d)}{$1 . $2 . $3}e;
    ($fld12 = $endPeriod) =~
        s{(\d\d)\s([A-Za-z]{3})\s\d\d(\d\d)}{$1 . $2 . $3}e;

We need to test whether there is a name on line 3 before we can decide what to do with fields 6 and 15. If there is nothing on line 3, `stripSpaces()` will return an empty string which is FALSE in boolean tests, so

    my $line3 = stripSpaces($items[1]);
    my $line4 = stripSpaces($items[2]);
    if($line3)
    {
        $fld6  = $line3;
        $fld15 = $line4;
    }
    else
    {
        $fld6 = $line4;
    }

I've now shown you how to populate all of the fields and we have the start line and line count as well so all that remains is to construct the header line and print it to file. The obvious function to use is `pack`; it takes a template string and a list of items and packs the items into a string by applying the template. Consider this code snippet

    my $str1     = q{abc};
    my $str2     = q{zyxwvut};
    my $template = q{A5A5};
    my $packed   = pack $template, $str1, $str2;
    print qq{>$packed<\n};

prints

    >abc  zyxwv<

The 'A' template letter packs the string and pads with spaces, or truncates if appropriate. The 'a' letter pads with nulls which is not what we want. There's a whole heap of possible templates so it is worth reading this function up. We can construct our template like this (the `x` string multiplication operator comes in handy here)

    my $hdrTemplate =
          q{A18}
        . q{A40} x 9
        . q{A7}  x 2
        . q{A10} x 2
        . q{A40}
        . q{A8}
        . q{A6};

and as it is not something that changes we should place the code before the record processing loop. Putting the header together and printing it can be done towards the end of the record processing loop just before the details block is written and the line offset updated.

    my $headerStr = pack $hdrTemplate
        , $fld1
        , $fld2
        , $fld3
        , $fld4
        , $fld5
        , $fld6
        , $fld7
        , $fld8
        , $fld9
        , $fld10
        , $fld11
        , $fld12
        , $fld13
        , $fld14
        , $fld15
        , $detailsLinesWritten
        , $detailsLineCt;
    print $headerFH qq{$headerStr\n};

Note that you don't have to do anything for the `pack` to convert the numbers to strings.

I think that just about covers everything. You should be able to put all of this together to get something working but if anything is not clear or if it looks like there is a jigsaw piece missing, please ask.

Cheers,
JohnGG
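Since the record-by-record script is collapsed above, here is a minimal sketch of how the pieces might fit together. It is not JohnGG's script: the sub name `build_output`, the sample data, the use of only three header lines, and the exact sentinel strings (a literal "^L" line and "(EOE)" followed by a newline) are all assumptions made for illustration. It reads from an in-memory string and returns the header and details text rather than writing files.

```perl
use strict;
use warnings;

# Tiny stand-in for the real input: 3 header lines per record
# instead of 16. ASSUMPTION: records start with a literal "^L"
# line and end with "(EOE)" plus a newline.
my $sample = join q{},
    "^L\n", "acct 1-1\n", "Fred\n", "Some St\n",
    "detail a\n", "detail b\n", "(EOE)\n",
    "^L\n", "acct 2-2\n", "Mary\n", "Other Rd\n",
    "detail c\n", "(EOE)\n";

# Hypothetical sub: returns (header text, details text).
sub build_output {
    my ($text, $headerLineCt, $template) = @_;
    open my $inFH, '<', \$text
        or die "Can't open in-memory file: $!\n";
    local $/ = "(EOE)\n";       # one read pulls in one whole record
    my ($headers, $details, $detailsLinesWritten) = (q{}, q{}, 0);
    while (my $record = <$inFH>) {
        chomp $record;                     # removes the "(EOE)\n"
        $record =~ s{\A\^L\n}{};           # drop the start sentinel
        next unless length $record;
        # Header lines first, everything else is the details block.
        my @items         = split m{\n}, $record, $headerLineCt + 1;
        my $block         = pop @items;
        my $detailsLineCt = ($block =~ tr/\n//);
        # Fixed-width header line: fields, then offset and line count.
        $headers .= pack($template, @items,
                         $detailsLinesWritten, $detailsLineCt) . "\n";
        $details .= $block;
        $detailsLinesWritten += $detailsLineCt;
    }
    close $inFH;
    return ($headers, $details);
}

my ($headers, $details) = build_output($sample, 3, q{A10A8A10A6A6});
print $headers;
print $details;
```

In the real script the header items would first go through `stripSpaces()` and the date and name handling described above, and the two strings would be printed to the .hdrs and .dets files instead of being returned.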