Knoperl has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

I humbly seek your assistance and beg your patience. I have a huge file with more than a few hundred thousand records, which is the output of a tape archive of some sort. I have no access to the archive, so I cannot rerun it to reconfigure the output; I am stuck with what I have. The file is broken into multi-line columns of data. A new record is indicated by an entry in the NodeName column, which is never more than one line long, unlike the other fields. BackupDate always stretches over 2 lines, while FileName and PathName can be 1 or more lines long. Please see a small example snippet:
Table
----------------------------------------------
NodeName  FileName  PathName  BackupDate
BD3101    bananaswi \breakfa  2007-03-06
          ithapple  st\fruit  14:02:31.000000
          s.gif     s\tree\
TP4223    chocolate \sweet\d  2006-02-28
          caramelfu esserts\  21:16:41.000000
          dge.gif   hersheys\
EO2123    tofuwith  \organic\ 2007-07-16
          peas.gif  vegetable 13:55:06.000000
                    s\legumes\
---------------------------------------------------
The desired output should have each record on a single line, kept in straight columns. The total width of the output can be huge, as long as each field keeps the same maximum column width throughout and no record wraps onto another line. The output should be printed in the following order: NodeName FileName PathName BackupDate. Using the example above, the output I would expect is:

-----------------------------------------------------------
BD3101 bananaswithapples.gif \breakfast\fruits\tree\ 2007-03-06 14:02:31.000000
TP4223 chocolatecarmelfudge.gif \sweet\desserts\hersheys\ 2006-02-28 21:16:41.000000
EO2123 tofuwithpeas.gif \organic\vegetables\legumes\ 2007-07-16 13:55:06.000000


The way I tackled this was to declare the position delimited variables (based upon the above table):
Column Position of Table
NodeName   = 1->10
FileName   = 11->20
PathName   = 21->30
BackupDate = 30->End of Line
Here is what I have written so far but I really do need help:
#!/usr/bin/perl -w
use strict;
{
my $input =$ARGV[0];        #returns filename from command line
my $nodename;
my $filename;
my $pathname;
my $backupdate;
my $textline;
my $nochar ="";
my $charposition;
my $nextrecord;
chomp $input;               #strip the carriage return
open (DATAFILE, "$input")|| die ("Can not open $input:!\n");   #access the file
while (my $textline=<DATAFILE>) {
    chomp $textline;
    foreach my $textline {
        next if ($textline =~ /($charposition = m/({0}/)!= $nochar;
        else {
            $nodename = m/({0,9}/;
            $filename = m/({10,19}/;
            $pathname = m/({20,29}/;
            $backupdate = m/({30).*/;
            printf ("%s%s%s%s\n",$nodename, $filename, $backupdate, $pathname);
        }
    }
    close (DATAFILE);
}
Thank you so very much again to all the PerlMonks, who have been real lifesavers!

Replies are listed 'Best First'.
Re: MultiLine Tables into Variables
by BrowserUk (Patriarch) on Aug 14, 2007 at 19:15 UTC

    Going by the spelling error "bananaswiithapples.gif", I assume that you typed the sample in by hand rather than C&Ping from the real data. At least I hope you did, because your records are inconsistent.

    In the first record, the third field is wrapped at 8 chars. In the second record, the first two lines of that third field are wrapped at 8 chars and the last line at 9. In the third record, the first two lines of the third field are wrapped at 9 and the last line extends to 10.

    There are similar inconsistencies in the wrapping of the second field/first record. I've adjusted the data to fit my assumption and I apologise if it is wrong.

    This determines the output formatting by finding the maximum width of each field. It assumes that you have enough memory to accumulate all the records in memory. Otherwise it would be necessary to do two passes through the file:

    #! perl -slw
    use strict;

    ## The A template takes care of trailing spaces
    my $inTempl = 'A8 x1 A9 x1 A8 A*';

    my @headers = split ' ', <DATA>; ## Are these headers used?

    my( @output, $accum );
    while( my $line = <DATA> ) {
        chomp $line;
        my @bits = unpack $inTempl, $line;
        s[^\s*][]g for @bits; ## Trim leading spaces

        if( $line =~ m[^\S] ) { ## Start of a new record.
            ## Add to the list
            push @output, $accum if $accum;
            ## Start a new accumulation
            $accum = \@bits;
        }
        else { ## Append to the accumulators
            $accum->[ $_ ] .= $bits[ $_ ] for 0 .. $#bits;
        }
    }
    push @output, $accum; ## Don't forget the last record.

    ## Determine the output field widths
    my @w = ( 0 ) x 4;
    for my $ref ( @output ) {
        for my $i ( 0 .. 3 ) {
            my $len = length( $ref->[ $i ]||'' );
            $w[ $i ] = $len if $w[ $i ] < $len;
        }
    }

    ## Build an output template with a extra space between fields
    my $outTempl = join ' ', map 'A' . ($_+1), @w;

    ## And output
    print pack $outTempl, @$_ for @output;

    __DATA__
    NodeName FileName  PathName BackupDate
    BD3101   bananaswi \breakfa 2007-03-06
             ithapple  st\fruit 14:02:31.000000
             s.gif     s\tree\
    TP4223   chocolate \sweet\d 2006-02-28
             caramelfu esserts\ 21:16:41.000000
             dge.gif   hershey\
    EO2123   tofuwith  \organic 2007-07-16
             peas.gif  \vegetab 13:55:06.000000
                       les\legu
                       mes\

    Produces

    C:\test>632548
    BD3101 bananaswiithapples.gif    \breakfast\fruits\tree\      2007-03-0614:02:31.000000
    TP4223 chocolatecaramelfudge.gif \sweet\desserts\hershey\     2006-02-2821:16:41.000000
    EO2123 tofuwithpeas.gif          \organic\vegetables\legumes\ 2007-07-1613:55:06.000000

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: MultiLine Tables into Variables
by dwm042 (Priest) on Aug 14, 2007 at 18:52 UTC
    The code below is an example and a fast one, but there are essentially just two points of note:

    1) fixed fields can be easily parsed with unpack.
    2) printf can format your output. Just be sure to give your fields enough room.
    #!/usr/bin/perl
    use warnings;
    use strict;
    package main;

    my $node_name = "";
    my $host_name = "";
    my $path_name = "";
    my $backup_date = "";

    while(<DATA>) {
        my ($node, $host, $path, $backup) = unpack('A9A10A10A10',$_);
        if ( $node =~ /^\w+/ ) {
            if ( $host_name =~ /\w+/ ) {
                printf "%-9s %-25s %-28s %-17s\n", $node_name, $host_name, $path_name, $backup_date;
                $node_name = "";
                $host_name = "";
                $path_name = "";
                $backup_date = "";
            }
            $node_name = $node;
            $host_name = $host;
            $path_name = $path;
            $backup_date = $backup;
            $backup_date =~ s/^\s+//;
        }
        else {
            $host_name .= $host;
            $path_name .= $path;
            $backup_date .= $backup;
        }
    }
    printf "%-9s %-25s %-28s %-17s\n", $node_name, $host_name, $path_name, $backup_date;
    __DATA__
    BD3101   bananaswi \breakfa  2007-03-06
             ithapple  st\fruit  14:02:31.000000
             s.gif     s\tree\
    TP4223   chocolate \sweet\d  2006-02-28
             caramelfu esserts\  21:16:41.000000
             dge.gif   hersheys\
    EO2123   tofuwith  \organic\ 2007-07-16
             peas.gif  vegetable 13:55:06.000000
                       s\legumes\
    And output is:

    C:\Code>perl unpack.pl
    BD3101    bananaswiithapples.gif    \breakfast\fruits\tree\      2007-14:02:31.0
    TP4223    chocolatecaramelfudge.gif \sweet\desserts\hersheys\    2006-21:16:41.0
    EO2123    tofuwithpeas.gif          \organic\vegetables\legumes\ 2007-13:55:06.0
Re: MultiLine Tables into Variables
by moritz (Cardinal) on Aug 14, 2007 at 17:58 UTC
    You should really do the reading with unpack, and append the values you read to the previously read values unless there is an entry in NodeName.

    To write the output I'd recommend Text::Table.

      Dear Moritz,
      My understanding of unpack from perlpacktut is that it does a great job for single line position delimited records. I did not see anything for dealing with multi-line records. I do appreciate your pointing me in that direction and I am going to do some research into it.

      My understanding is that perlpack can be very strict about how the template matches, and that is going to be a problem: as you can see from my example above, I have no idea whether some of the fields stretch 2, 3 or 4 lines inside the column.

      If you or anyone else can provide anything more I would greatly appreciate it.
        Well, it doesn't do all the magic for you, but quite a bit:

        #!/usr/bin/perl
        use warnings;
        use strict;

        my (@nodename, @filename, @pathname, @backupdate);
        use Data::Dumper;

        {
            # discard heading line
            my $tmp = <DATA>;
        }

        while (my $line = <DATA>){
            chomp $line;
            my ($nn, $fn, $pn, $bd) = unpack('A8xA9xA8xA15', $line);
            if ($nn =~ m/\S/){
                push @nodename, $nn;
                push @filename, $fn;
                push @pathname, $pn;
                push @backupdate, $bd;
            } else {
                $nodename[-1] .= $nn;
                $filename[-1] .= $fn;
                $pathname[-1] .= $pn;
                $backupdate[-1] .= $bd;
            }
        }

        print Dumper([\@nodename, \@filename, \@pathname, \@backupdate]);

        #1234567890123456789012345678901234567890123
        __DATA__
        NodeName FileName  PathName BackupDate
        BD3101   bananaswi \breakfa 2007-03-06
                 ithapple  st\fruit 14:02:31.000000
                 s.gif     s\tree\
        TP4223   chocolate \sweet\d 2006-02-28
                 caramelfu esserts\ 21:16:41.000000
                 dge.gif   hersheys\
        EO2123   tofuwith  \organic\ 2007-07-16
                 peas.gif  vegetable 13:55:06.000000
                           s\legumes\

        Actually, the data would be better stored in a two-dimensional array.

        Note that all lines that don't have the Backup Date field need to be padded with whitespace at the end of the line so that they are long enough. If that's not the case, you'd have to pad them manually before using unpack.
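
        A minimal sketch of that padding step, assuming the 'A8xA9xA8xA15' template from the code above (43 columns in total). The helper names split_fields and read_records are made up for illustration, and the records are kept as an array of array refs rather than four parallel arrays. The padding matters because an x skip that runs past the end of a short line makes unpack die:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Template from the reply above. LINE_WIDTH is an assumption derived from it:
# 8 + 1 + 9 + 1 + 8 + 1 + 15 = 43 columns.
my $TEMPLATE   = 'A8xA9xA8xA15';
my $LINE_WIDTH = 43;

# Pad the line with trailing spaces to the full width, then unpack it.
sub split_fields {
    my ($line) = @_;
    chomp $line;
    my $padded = sprintf( '%-*s', $LINE_WIDTH, $line );
    return unpack( $TEMPLATE, $padded );
}

# Build one array ref per record: [ NodeName, FileName, PathName, BackupDate ].
sub read_records {
    my (@lines) = @_;
    my @records;
    for my $line (@lines) {
        my @fields = split_fields($line);
        if ( $fields[0] =~ /\S/ ) {
            # An entry in NodeName starts a new record.
            push @records, [@fields];
        }
        elsif (@records) {
            # Continuation line: glue the wrapped pieces back together.
            $records[-1][$_] .= $fields[$_] for 1 .. 2;
            # Joining date and time with a space is a choice of this sketch.
            $records[-1][3] .= " $fields[3]" if length $fields[3];
        }
    }
    return @records;
}

my @records = read_records(
    'BD3101   bananaswi \breakfa 2007-03-06',
    '         ithapple  st\fruit 14:02:31.000000',
    '         s.gif     s\tree\\',
);
print join( '|', @$_ ), "\n" for @records;
```

        The third sample line is only 26 characters long, so it is the one that would trip up the template without the sprintf padding.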

Re: MultiLine Tables into Variables
by SuicideJunkie (Vicar) on Aug 14, 2007 at 18:34 UTC

    It sounds like you know the column widths, and looks like the first column will be all whitespace unless you are starting a new record.

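    my ($nodename, $filename, $pathname, $backupdate);
    while (my $textline=<DATAFILE>) {
        chomp $textline;
        #Pad short lines so every substr below stays inside the string
        $textline = sprintf('%-30s', $textline);
        if ( substr($textline,0,10) =~ /\S/) {
            #New record detected - Print out old one, and prep for new
            printf ("%s %s %s %s\n",$nodename, $filename, $pathname, $backupdate) if defined $nodename;
            ($nodename = substr($textline, 0, 10)) =~ s/\s+$//;
            $filename = '';
            $pathname = '';
            $backupdate = '';
        }
        #The third argument of substr is a length, not an end position
        $filename .= substr($textline, 10,10);
        $pathname .= substr($textline, 20,10);
        $backupdate .= substr($textline, 30);
        #Drop the column padding so the wrapped pieces join up
        s/\s+$// for $filename, $pathname, $backupdate;
    }
    #Don't forget the last record
    printf ("%s %s %s %s\n",$nodename, $filename, $pathname, $backupdate) if defined $nodename;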
Re: MultiLine Tables into Variables
by thezip (Vicar) on Aug 14, 2007 at 19:37 UTC

    With this, I throw my code into the ring:

    Update: My apologies -- I missed the part about the huge datafile. This method stores everything in memory, so that could be a problem...

    #!/perl/bin/perl -w
    use strict;
    use Data::Dumper;

    my $spec = 'A9A10A10A15';
    my $hash = {};
    my $nodename = "";
    my $out = {};

    $_ = <DATA>; # skip the header line

    while (<DATA>) {
        #stuff everything into a hash of array refs
        my @arr = unpack($spec, $_);
        @arr = map { s/^\s+//; $_ } @arr; # remove any leading spaces
        if ($arr[0]) {
            $nodename = shift(@arr);
        }
        else {
            shift(@arr);
        }
        push(@{$hash->{$nodename}}, \@arr);
    }

    print Dumper($hash); # contents may be viewed in $VAR1 below

    for my $key (keys %$hash) {
        my $rows = $hash->{$key};
        for (my $rownum = 0; $rownum <= $#$rows; $rownum++) {
            my $cols = $rows->[$rownum];
            for (my $col = 0; $col <= $#$cols; $col++) {
                # include a space between the date/time strings
                my $space = ($rownum == 0 && $col == 2) ? ' ' : '';
                $out->{$key}->[$col] .= $cols->[$col] . $space;
            }
        }
    }

    for my $key (keys %$out) {
        printf "%-7s %-25s %-29s %-30s\n", $key, @{$out->{$key}};
    }

    __DATA__
    NodeName FileName  PathName  BackupDate
    BD3101   bananaswi \breakfa  2007-03-06
             ithapple  st\fruit  14:02:31.000000
             s.gif     s\tree\
    TP4223   chocolate \sweet\d  2006-02-28
             caramelfu esserts\  21:16:41.000000
             dge.gif   hersheys\
    EO2123   tofuwith  \organic\ 2007-07-16
             peas.gif  vegetable 13:55:06.000000
                       s\legumes\

    __OUTPUT__
    $VAR1 = {
              'TP4223' => [
                            [
                              'chocolate',
                              '\\sweet\\d',
                              '2006-02-28'
                            ],
                            [
                              'caramelfu',
                              'esserts\\',
                              '21:16:41.000000'
                            ],
                            [
                              'dge.gif',
                              'hersheys\\',
                              ''
                            ]
                          ],
              'BD3101' => [
                            [
                              'bananaswi',
                              '\\breakfa',
                              '2007-03-06'
                            ],
    ... etc ...

    TP4223  chocolatecaramelfudge.gif \sweet\desserts\hersheys\     2006-02-28 21:16:41.000000
    BD3101  bananaswiithapples.gif    \breakfast\fruits\tree\       2007-03-06 14:02:31.000000
    EO2123  tofuwithpeas.gif          \organic\vegetables\legumes\  2007-07-16 13:55:06.000000

    Where do you want *them* to go today?
Re: MultiLine Tables into Variables
by FunkyMonk (Bishop) on Aug 14, 2007 at 22:27 UTC
    You're going to have to do two passes if it's a huge file (as in huge = too big for memory). The first pass does most of the work:

    • it keeps track of how wide each field is
    • writes a temp file with one record per line

    The temp file looks something like:

    BD3101|bananaswiithapples.gif|\breakfast\fruits\tree\|2007-03-06 14:02:31.000000
    TP4223|chocolatecaramelfudge.gif|\sweet\desserts\hersheys\|2006-02-28 21:16:41.000000
    EO2123|tofuwithpeas.gif|\organic\vegetables\legumes\|2007-07-16 13:55:06.000000

    The second pass processes this temp file and produces the formatted output:

    BD3101 bananaswiithapples.gif    \breakfast\fruits\tree\      2007-03-06 14:02:31.000000
    TP4223 chocolatecaramelfudge.gif \sweet\desserts\hersheys\    2006-02-28 21:16:41.000000
    EO2123 tofuwithpeas.gif          \organic\vegetables\legumes\ 2007-07-16 13:55:06.000000

    The code that follows doesn't use files at all (I'll leave that to you - it's trivial) and produces the output above:

    my ( @in, @out, @temp_file );
    my @lengths = (0) x 4;

    pass1();
    pass2();

    sub pass1 {
        while ( <DATA> ) {
            my @in = unpack "A9A10A9A*", $_;
            if ( $in[0] ) {
                write_to_temp( @out ) if $out[0];
                @out = @in;
                next;
            }
            $out[$_] .= $in[$_] for 0 .. 3;
        }
        write_to_temp( @out );
    }

    sub pass2 {
        my $format = join " ", ( map "%-${_}s", @lengths ), "\n";
        for ( @temp_file ) {
            chomp;
            my @f = split /\|/;
            printf $format, @f;
        }
    }

    sub write_to_temp {
        s/\s+/ /g, s/^\s+//, s/\s+$// for $_[3];
        length $_[$_] > $lengths[$_] and $lengths[$_] = length $_[$_] for 0 .. 3;
        push @temp_file, join( "|", @_ ) . "\n";
    }

    PS I've assumed BrowserUk's comment about mistyped sample data to be true.
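
    As a rough sketch of the file-based variant, the two passes could be wired to a real temp file along these lines. It assumes the same 'A9A10A9A*' column layout as the code above, uses File::Temp from the core distribution, and assumes that no field contains a '|' character. The sub reformat and the sample rows are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumed layout: NodeName 9, FileName 10, PathName 9, rest = BackupDate.
my $TEMPLATE = 'A9A10A9A*';

# Pass 1: join wrapped rows into one '|'-separated record per line of $tmp
# and return the widest length seen for each of the four fields.
sub pass1 {
    my ( $in, $tmp ) = @_;
    my @lengths = (0) x 4;
    my @rec;
    my $flush = sub {
        return unless @rec;
        # Collapse the whitespace between the date and the time.
        for ( $rec[3] ) { s/\s+/ /g; s/^ //; s/ $//; }
        for my $i ( 0 .. 3 ) {
            $lengths[$i] = length( $rec[$i] ) if length( $rec[$i] ) > $lengths[$i];
        }
        print {$tmp} join( '|', @rec ), "\n";
    };
    while ( my $line = <$in> ) {
        chomp $line;
        next unless $line =~ /\S/;
        my @f = unpack $TEMPLATE, $line;
        $f[$_] //= '' for 0 .. 3;    # guard against very short lines
        if ( length $f[0] ) {        # a NodeName starts a new record
            $flush->();
            @rec = @f;
        }
        elsif (@rec) {
            $rec[$_] .= $f[$_] for 1 .. 3;
        }
    }
    $flush->();
    return @lengths;
}

# Pass 2: re-read the temp file and print every record padded to the
# widths collected in pass 1.
sub pass2 {
    my ( $tmp, $out, @lengths ) = @_;
    my $format = join( ' ', map "%-${_}s", @lengths ) . "\n";
    seek( $tmp, 0, 0 ) or die "Cannot rewind temp file: $!";
    while ( my $rec = <$tmp> ) {
        chomp $rec;
        printf {$out} $format, split( /\|/, $rec, -1 );
    }
}

# Hypothetical driver: read from $in, write aligned records to $out.
sub reformat {
    my ( $in, $out ) = @_;
    my $tmp     = tempfile();    # anonymous temp file, cleaned up automatically
    my @lengths = pass1( $in, $tmp );
    pass2( $tmp, $out, @lengths );
}

my $sample = <<'END';
BD3101   bananaswi \breakfa  2007-03-06
         ithapple  st\fruit  14:02:31.000000
         s.gif     s\tree\
TP4223   chocolate \sweet\d  2006-02-28
         caramelfu esserts\  21:16:41.000000
         dge.gif   hershey\
END
open my $in, '<', \$sample or die "Cannot read sample: $!";
reformat( $in, \*STDOUT );
```

    Only the four maximum widths and one record are held in memory at any time, so the size of the input should not matter.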

Re: MultiLine Tables into Variables
by perlofwisdom (Pilgrim) on Aug 14, 2007 at 19:51 UTC
    Yet another solution (boy, you've got to be quick around here :))

    #!/usr/bin/perl -w
    use strict;
    my $input ='junk_input.txt';  # input filename (hard-coded here)
    my $nodename;
    my $filename;
    my $pathname;
    my $backupdate;
    my $textline;
    my $nochar ="";
    my $charposition;
    my $nextrecord;
    chomp $input;                 #strip the carriage return
    my %len = ('NODE',0,'FILE',0,'PATH',0,'DATE',0);
    my @textline = ();

    open (DATAFILE, "$input")|| die ("Can not open $input: $!\n");  # access the file
    while (my $textline=<DATAFILE>) {
        chomp $textline;
        ################################
        # If continuation of last line
        ################################
        if (substr($textline,0,8) =~ /^\s/) {
            # Append contents to previous values
            $nodename   .= substr($textline,0,8);
            $filename   .= substr($textline,9,9)      if (length($textline) >= 10);
            $pathname   .= substr($textline,19,10)    if (length($textline) >= 20);
            $backupdate .= ' ' . substr($textline,29) if (length($textline) >= 30);
        }
        else {
            ################################
            # If new line
            ################################
            # Save previous line (there is none before the first record)
            push @textline, "$nodename|$filename|$pathname|$backupdate" if defined $nodename;
            # Save new values
            $nodename   = substr($textline,0,8);
            $filename   = substr($textline,9,9)       if (length($textline) >= 10);
            $pathname   = substr($textline,19,10)     if (length($textline) >= 20);
            $backupdate = substr($textline,29)        if (length($textline) >= 30);
        }
        # Remove unwanted spaces at beginning or end, depending on column
        $nodename   =~ s/\s{1,}$//g;
        $filename   =~ s/\s{1,}$//g;
        $pathname   =~ s/\s{1,}$//g;
        $backupdate =~ s/^\s{1,}//g;
        # Save longest column length (used later for formatting output)
        $len{NODE} = length($nodename)   if (length($nodename)   > $len{NODE});
        $len{FILE} = length($filename)   if (length($filename)   > $len{FILE});
        $len{PATH} = length($pathname)   if (length($pathname)   > $len{PATH});
        $len{DATE} = length($backupdate) if (length($backupdate) > $len{DATE});
    }
    push @textline, "$nodename|$filename|$pathname|$backupdate";  # Save last line of input file
    close (DATAFILE);

    for my $textline (@textline) {
        ($nodename,$filename,$pathname,$backupdate) = split(/\|/,$textline);  # Separate columns
        # Format column widths
        $nodename   .= ' ' x ($len{NODE} - length($nodename));
        $filename   .= ' ' x ($len{FILE} - length($filename));
        $pathname   .= ' ' x ($len{PATH} - length($pathname));
        $backupdate .= ' ' x ($len{DATE} - length($backupdate));
        print "$nodename $filename $pathname $backupdate\n";
    }