Re: MultiLine Tables into Variables
by BrowserUk (Patriarch) on Aug 14, 2007 at 19:15 UTC
Going by the spelling error "bananaswiithapples.gif", I assume that you typed the sample in by hand rather than C&Ping from the real data. At least I hope you did, because your records are inconsistent.
In the first record, the third field is wrapped at 8 chars. In the second record, the first two lines of that third field are wrapped at 8 chars and the last line at 9. In the third record, the first two lines of the third field are wrapped at 9 and the last line extends to 10.
There are similar inconsistencies in the wrapping of the second field of the first record. I've adjusted the data to fit my assumption, and I apologise if it is wrong.
This determines the output formatting by finding the maximum width of each field. It assumes that you have enough memory to accumulate all the records. Otherwise it would be necessary to do two passes through the file:
#! perl -slw
use strict;

## The A template takes care of trailing spaces
my $inTempl = 'A8 x1 A9 x1 A8 A*';
my @headers = split ' ', <DATA>; ## Are these headers used?

my( @output, $accum );
while( my $line = <DATA> ) {
    chomp $line;
    my @bits = unpack $inTempl, $line;
    s[^\s*][]g for @bits; ## Trim leading spaces
    if( $line =~ m[^\S] ) { ## Start of a new record.
        ## Add to the list
        push @output, $accum if $accum;
        ## Start a new accumulation
        $accum = \@bits;
    }
    else {
        ## Append to the accumulators
        $accum->[ $_ ] .= $bits[ $_ ] for 0 .. $#bits;
    }
}
push @output, $accum; ## Don't forget the last record.

## Determine the output field widths
my @w = ( 0 ) x 4;
for my $ref ( @output ) {
    for my $i ( 0 .. 3 ) {
        my $len = length( $ref->[ $i ]||'' );
        $w[ $i ] = $len if $w[ $i ] < $len;
    }
}

## Build an output template with an extra space between fields
my $outTempl = join ' ', map 'A' . ($_+1), @w;

## And output
print pack $outTempl, @$_ for @output;
__DATA__
NodeName FileName  PathName BackupDate
BD3101   bananaswi \breakfa 2007-03-06
         ithapple  st\fruit 14:02:31.000000
         s.gif     s\tree\
TP4223   chocolate \sweet\d 2006-02-28
         caramelfu esserts\ 21:16:41.000000
         dge.gif   hershey\
EO2123   tofuwith  \organic 2007-07-16
         peas.gif  \vegetab 13:55:06.000000
                   les\legu
                   mes\
Produces:
C:\test>632548
BD3101 bananaswiithapples.gif \breakfast\fruits\tree\ 2007-03-0614:02:31.000000
TP4223 chocolatecaramelfudge.gif \sweet\desserts\hershey\ 2006-02-2821:16:41.000000
EO2123 tofuwithpeas.gif \organic\vegetables\legumes\ 2007-07-1613:55:06.000000
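In the output above the date and time run together (2007-03-0614:02:31.000000) because continuation pieces are appended with no separator. A minimal sketch of one way around that, assuming the same input template and assuming that only the last field wants a space between its pieces; merge_bits is a hypothetical helper and is not part of the script above:

```perl
use strict;
use warnings;

## Hypothetical helper: append one continuation line's fields to a record.
## Assumption: only the last field (the date/time) wants a space between pieces.
sub merge_bits {
    my( $accum, @bits ) = @_;
    for my $i ( 0 .. $#bits ) {
        next unless length $bits[ $i ];
        $accum->[ $i ] .= ( $i == $#bits ? ' ' : '' ) . $bits[ $i ];
    }
    return $accum;
}

my $inTempl = 'A8 x1 A9 x1 A8 A*';
my @first   = unpack $inTempl, 'BD3101   bananaswi \breakfa 2007-03-06';
my @second  = unpack $inTempl, '         ithapple  st\fruit 14:02:31.000000';
s[^\s+][] for @first, @second;    ## Trim leading spaces

my $rec = merge_bits( \@first, @second );
print join( '|', @$rec ), "\n";
```

In the main loop this would stand in for the one-line append in the else branch.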
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: MultiLine Tables into Variables
by dwm042 (Priest) on Aug 14, 2007 at 18:52 UTC
The code below is a quick example, but there are essentially just two points of note:
1) fixed fields can be easily parsed with unpack.
2) printf can format your output. Just be sure to give your fields enough room.
#!/usr/bin/perl
use warnings;
use strict;
package main;

my $node_name = "";
my $host_name = "";
my $path_name = "";
my $backup_date = "";

while(<DATA>) {
    my ($node, $host, $path, $backup) = unpack('A9A10A10A10',$_);
    if ( $node =~ /^\w+/ ) {
        if ( $host_name =~ /\w+/ ) {
            printf "%-9s %-25s %-28s %-17s\n",
                $node_name, $host_name, $path_name, $backup_date;
            $node_name = "";
            $host_name = "";
            $path_name = "";
            $backup_date = "";
        }
        $node_name = $node;
        $host_name = $host;
        $path_name = $path;
        $backup_date = $backup;
        $backup_date =~ s/^\s+//;
    }
    else {
        $host_name .= $host;
        $path_name .= $path;
        $backup_date .= $backup;
    }
}
printf "%-9s %-25s %-28s %-17s\n",
    $node_name, $host_name, $path_name, $backup_date;
__DATA__
BD3101 bananaswi \breakfa 2007-03-06
ithapple st\fruit 14:02:31.000000
s.gif s\tree\
TP4223 chocolate \sweet\d 2006-02-28
caramelfu esserts\ 21:16:41.000000
dge.gif hersheys\
EO2123 tofuwith \organic\ 2007-07-16
peas.gif vegetable 13:55:06.000000
s\legumes\
And the output is:
C:\Code>perl unpack.pl
BD3101 bananaswiithapples.gif \breakfast\fruits\tree\ 2007-14:02:31.0
TP4223 chocolatecaramelfudge.gif \sweet\desserts\hersheys\ 2006-21:16:41.0
EO2123 tofuwithpeas.gif \organic\vegetables\legumes\ 2007-13:55:06.0
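The date column in the output above is cut short because the last field is unpacked with A10. A sketch of one possible fix, assuming the last column runs to the end of the line: use A* for it and trim leading whitespace on every piece. The sample lines, their column offsets and the parse_line helper are assumptions made for illustration only:

```perl
use strict;
use warnings;

# Assumed layout: three fixed columns of 9, 10 and 10 characters,
# then a date/time column that runs to the end of the line.
my $template = 'A9 A10 A10 A*';

# Hypothetical helper: split one line into its four trimmed fields.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my @fields = unpack $template, $line;
    s/^\s+// for @fields;    # 'A' strips trailing whitespace only
    return @fields;
}

my @first = parse_line("BD3101   bananaswi \\breakfa      2007-03-06\n");
my @next  = parse_line("         ithapple  st\\fruit      14:02:31.000000\n");

# Join the pieces; only the date/time gets a separating space
printf "%-9s %-25s %-28s %s\n",
    $first[0], $first[1] . $next[1], $first[2] . $next[2], "$first[3] $next[3]";
```

With A* the date no longer depends on guessing the width of the last column.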
Re: MultiLine Tables into Variables
by moritz (Cardinal) on Aug 14, 2007 at 17:58 UTC
|
| [reply] |
|
|
Dear Moritz,
My understanding of unpack from perlpacktut is that it does a great job for single-line, position-delimited records. I did not see anything about dealing with multi-line records. I do appreciate your pointing me in that direction, and I am going to do some research into it.
My understanding is that unpack can be very strict about how the template matches, and that is going to be a problem: as you can see from my example above, I have no idea whether a field stretches over 2, 3 or 4 lines inside its column.
If you or anyone else can provide anything more I would greatly appreciate it.
| [reply] |
|
|
Well, it doesn't do all the magic for you, but quite a bit:
#!/usr/bin/perl
use warnings;
use strict;
my (@nodename, @filename, @pathname, @backupdate);
use Data::Dumper;
{
    # discard heading line
    my $tmp = <DATA>;
}
while (my $line = <DATA>){
    chomp $line;
    my ($nn, $fn, $pn, $bd) = unpack('A8xA9xA8xA15', $line);
    if ($nn =~ m/\S/){
        push @nodename, $nn;
        push @filename, $fn;
        push @pathname, $pn;
        push @backupdate, $bd;
    } else {
        $nodename[-1] .= $nn;
        $filename[-1] .= $fn;
        $pathname[-1] .= $pn;
        $backupdate[-1] .= $bd;
    }
}
print Dumper([\@nodename, \@filename, \@pathname, \@backupdate]);
#1234567890123456789012345678901234567890123
__DATA__
NodeName FileName PathName BackupDate
BD3101 bananaswi \breakfa 2007-03-06
ithapple st\fruit 14:02:31.000000
s.gif s\tree\
TP4223 chocolate \sweet\d 2006-02-28
caramelfu esserts\ 21:16:41.000000
dge.gif hersheys\
EO2123 tofuwith \organic\ 2007-07-16
peas.gif vegetable 13:55:06.000000
s\legumes\
Actually the data would be better stored in a two-dimensional array.
Note that all lines that don't have the Backup Date field need to be padded with whitespace at the end of the line to be long enough for the template. If that's not the case, you'd have to pad them yourself before using unpack.
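A minimal sketch of that padding step, assuming the template from the code above (8+1+9+1+8+1+15 = 43 characters). The unpack_padded helper is a name made up for this example:

```perl
use strict;
use warnings;

# Assumed template from the post above: 8+1+9+1+8+1+15 = 43 characters.
my $template = 'A8xA9xA8xA15';
my $width    = 43;

# Hypothetical helper: pad a line to the template width, then unpack it.
sub unpack_padded {
    my ($line) = @_;
    chomp $line;
    # sprintf pads short lines with spaces; longer lines are left alone
    return unpack $template, sprintf( "%-${width}s", $line );
}

# A continuation line that stops right after the path column
my @fields = unpack_padded("         s.gif     s\\tree\\\n");
print join( '|', @fields ), "\n";
```

Without the padding, the x in the template would try to skip past the end of a short line and unpack would die.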
Re: MultiLine Tables into Variables
by SuicideJunkie (Vicar) on Aug 14, 2007 at 18:34 UTC
It sounds like you know the column widths, and it looks like the first column will be all whitespace unless you are starting a new record.
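my ($nodename, $filename, $pathname, $backupdate) = ('') x 4;
while (my $textline=<DATAFILE>)
{
    chomp $textline;
    # Pad short lines so that every substr offset exists
    $textline = sprintf '%-30s', $textline;
    if ( substr($textline,0,9) =~ /\S/)
    {
        #New record detected - Print out old one, and prep for new
        printf("%s%s%s%s\n",$nodename,
            $filename, $pathname, $backupdate) if length $nodename;
        $nodename = substr($textline, 0, 9);
        $filename = '';
        $pathname = '';
        $backupdate = '';
    }
    # The third argument of substr is a length
    $filename .= substr($textline, 10,10);
    $pathname .= substr($textline, 20,10);
    $backupdate .= substr($textline, 30);
}
# Print the final record
printf("%s%s%s%s\n",$nodename,
    $filename, $pathname, $backupdate) if length $nodename;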
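One thing to watch with substr in code like this: its third argument is a length, not an end offset. A small sketch of the column slicing, assuming hypothetical 10-character columns (the real widths would come from your data); columns is a helper invented for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: cut a line into 10-character columns.
# substr( $string, $offset, $length ): the third argument is a length.
sub columns {
    my ($line) = @_;
    $line = sprintf '%-30s', $line;    # pad so every offset exists
    my @cols = (
        substr( $line, 0,  10 ),
        substr( $line, 10, 10 ),
        substr( $line, 20, 10 ),
        substr( $line, 30 ),           # everything that is left
    );
    s/\s+$// for @cols;                # drop the column padding
    return @cols;
}

print join( '|', columns('BD3101    bananaswi \breakfa  2007-03-06') ), "\n";
```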
Re: MultiLine Tables into Variables
by thezip (Vicar) on Aug 14, 2007 at 19:37 UTC
With this, I throw my code into the ring:
Update: My apologies -- I missed the part about the huge datafile. This method stores everything in memory, so that could be a problem...
#!/perl/bin/perl -w
use strict;
use Data::Dumper;

my $spec = 'A9A10A10A15';
my $hash = {};
my $nodename = "";
my $out = {};

$_ = <DATA>; # skip the header line
while (<DATA>) { #stuff everything into a hash of array refs
    my @arr = unpack($spec, $_);
    @arr = map { s/^\s+//; $_ } @arr; # remove any leading spaces
    if ($arr[0]) {
        $nodename = shift(@arr);
    }
    else {
        shift(@arr);
    }
    push(@{$hash->{$nodename}}, \@arr);
}
print Dumper($hash); # contents may be viewed in $VAR1 below

for my $key (keys %$hash) {
    my $rows = $hash->{$key};
    for (my $rownum = 0; $rownum <= $#$rows; $rownum++) {
        my $cols = $rows->[$rownum];
        for (my $col = 0; $col <= $#$cols; $col++) {
            # include a space between the date/time strings
            my $space = ($rownum == 0 && $col == 2) ? ' ' : '';
            $out->{$key}->[$col] .= $cols->[$col] . $space;
        }
    }
}
for my $key (keys %$out) {
    printf "%-7s %-25s %-29s %-30s\n", $key, @{$out->{$key}};
}
__DATA__
NodeName FileName PathName BackupDate
BD3101 bananaswi \breakfa 2007-03-06
ithapple st\fruit 14:02:31.000000
s.gif s\tree\
TP4223 chocolate \sweet\d 2006-02-28
caramelfu esserts\ 21:16:41.000000
dge.gif hersheys\
EO2123 tofuwith \organic\ 2007-07-16
peas.gif vegetable 13:55:06.000000
s\legumes\
__OUTPUT__
$VAR1 = {
          'TP4223' => [
                        [
                          'chocolate',
                          '\\sweet\\d',
                          '2006-02-28'
                        ],
                        [
                          'caramelfu',
                          'esserts\\',
                          '21:16:41.000000'
                        ],
                        [
                          'dge.gif',
                          'hersheys\\',
                          ''
                        ]
                      ],
          'BD3101' => [
                        [
                          'bananaswi',
                          '\\breakfa',
                          '2007-03-06'
                        ],
... etc ...
TP4223 chocolatecaramelfudge.gif \sweet\desserts\hersheys\ 2006-02-28 21:16:41.000000
BD3101 bananaswiithapples.gif \breakfast\fruits\tree\ 2007-03-06 14:02:31.000000
EO2123 tofuwithpeas.gif \organic\vegetables\legumes\ 2007-07-16 13:55:06.000000
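Because the records are keyed in a hash, they come out in hash order rather than file order (TP4223 is printed before BD3101 above). If the original order matters, one option is to remember each key the first time it is seen. A rough sketch, where @order, %rows and add_row are illustrative names and the sample rows are trimmed to two columns:

```perl
use strict;
use warnings;

# Sketch: keep records in input order by remembering each key once.
# @order and %rows are illustrative names, not from the script above.
my ( @order, %rows );

sub add_row {
    my ( $nodename, @fields ) = @_;
    push @order, $nodename unless exists $rows{$nodename};
    push @{ $rows{$nodename} }, \@fields;
    return;
}

add_row( 'TP4223', 'chocolate', '\sweet\d' );
add_row( 'TP4223', 'caramelfu', 'esserts\\' );
add_row( 'BD3101', 'bananaswi', '\breakfa' );

# Walk the keys in the order they first appeared
for my $key (@order) {
    my @joined;
    for my $row ( @{ $rows{$key} } ) {
        $joined[$_] .= $row->[$_] for 0 .. $#$row;
    }
    print join( ' ', $key, @joined ), "\n";
}
```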
Where do you want *them* to go today?
Re: MultiLine Tables into Variables
by FunkyMonk (Bishop) on Aug 14, 2007 at 22:27 UTC
BD3101|bananaswiithapples.gif|\breakfast\fruits\tree\|2007-03-06 14:02:31.000000
TP4223|chocolatecaramelfudge.gif|\sweet\desserts\hersheys\|2006-02-28 21:16:41.000000
EO2123|tofuwithpeas.gif|\organic\vegetables\legumes\|2007-07-16 13:55:06.000000
The second pass processes this temp file and produces the formatted output:
BD3101 bananaswiithapples.gif \breakfast\fruits\tree\ 2007-03-06 14:02:31.000000
TP4223 chocolatecaramelfudge.gif \sweet\desserts\hersheys\ 2006-02-28 21:16:41.000000
EO2123 tofuwithpeas.gif \organic\vegetables\legumes\ 2007-07-16 13:55:06.000000
The code that follows doesn't use files at all (I'll leave that to you - it's trivial) and produces the output above:
my ( @in, @out, @temp_file );
my @lengths = (0) x 4;

pass1();
pass2();

sub pass1 {
    while ( <DATA> ) {
        my @in = unpack "A9A10A9A*", $_;
        if ( $in[0] ) {
            write_to_temp( @out )
                if $out[0];
            @out = @in;
            next;
        }
        $out[$_] .= $in[$_] for 0 .. 3;
    }
    write_to_temp( @out );
}

sub pass2 {
    my $format = join " ", ( map "%-${_}s", @lengths ), "\n";
    for ( @temp_file ) {
        chomp;
        my @f = split /\|/;
        printf $format, @f;
    }
}

sub write_to_temp {
    s/\s+/ /g, s/^\s+//, s/\s+$//
        for $_[3];
    length $_[$_] > $lengths[$_]
        and $lengths[$_] = length $_[$_]
        for 0 .. 3;
    push @temp_file, join( "|", @_ ) . "\n";
}
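For what it's worth, here is a rough sketch of how the same two passes might look with a real temporary file, using the core File::Temp module. It assumes the records have already been merged into one array ref each; two_pass is a name made up for this example:

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Sketch of the two-pass idea with a real temporary file.
# Assumption: records arrive already merged, one array ref per record.
sub two_pass {
    my @records = @_;
    my @lengths = (0) x 4;

    # Pass 1: write pipe-delimited records, tracking the widest fields
    my ( $fh, $filename ) = tempfile( UNLINK => 1 );
    for my $rec (@records) {
        for my $i ( 0 .. 3 ) {
            my $len = length $rec->[$i];
            $lengths[$i] = $len if $len > $lengths[$i];
        }
        print {$fh} join( '|', @$rec ), "\n";
    }

    # Pass 2: rewind and format each record with the widths found
    seek $fh, 0, 0 or die "Cannot rewind $filename: $!";
    my $format = join( ' ', map { "%-${_}s" } @lengths ) . "\n";
    my @lines;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @lines, sprintf( $format, split( /\|/, $line, -1 ) );
    }
    close $fh;
    return @lines;
}

print two_pass(
    [ 'BD3101', 'bananaswiithapples.gif', '\breakfast\fruits\tree\\', '2007-03-06 14:02:31.000000' ],
    [ 'EO2123', 'tofuwithpeas.gif', '\organic\vegetables\legumes\\', '2007-07-16 13:55:06.000000' ],
);
```

Only the field widths stay in memory between the passes, which is the point of the exercise when the data file is huge.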
PS I've assumed BrowserUk's comment about mistyped sample data to be true.
Re: MultiLine Tables into Variables
by perlofwisdom (Pilgrim) on Aug 14, 2007 at 19:51 UTC
Yet another solution (boy, you've got to be quick around here :))
#!/usr/bin/perl -w
use strict;

my $input = 'junk_input.txt';    # input file name (hard-coded here)
my ($nodename, $filename, $pathname, $backupdate);
my %len = ('NODE',0,'FILE',0,'PATH',0,'DATE',0);
my @textline = ();

open (DATAFILE, $input) || die ("Can not open $input: $!\n");    # access the file
while (my $textline=<DATAFILE>) {
    chomp $textline;
    ################################
    # If continuation of last line
    ################################
    if (substr($textline,0,8) =~ /^\s/) {
        # Append contents to previous values
        $nodename   .= substr($textline,0,8);
        $filename   .= substr($textline,9,9)      if (length($textline) >= 10);
        $pathname   .= substr($textline,19,10)    if (length($textline) >= 20);
        $backupdate .= ' ' . substr($textline,29) if (length($textline) >= 30);
    } else {
        ################################
        # If new line
        ################################
        # Save previous line (there is none before the first record)
        push @textline, "$nodename|$filename|$pathname|$backupdate" if defined $nodename;
        # Save new values, clearing any field the new line does not reach
        $nodename   = substr($textline,0,8);
        $filename   = length($textline) >= 10 ? substr($textline,9,9)   : '';
        $pathname   = length($textline) >= 20 ? substr($textline,19,10) : '';
        $backupdate = length($textline) >= 30 ? substr($textline,29)    : '';
    }
    # Remove unwanted spaces at beginning or end, depending on column
    $nodename   =~ s/\s{1,}$//g;
    $filename   =~ s/\s{1,}$//g;
    $pathname   =~ s/\s{1,}$//g;
    $backupdate =~ s/^\s{1,}//g;
    # Save longest column length (used later for formatting output)
    $len{NODE} = length($nodename)   if (length($nodename)   > $len{NODE});
    $len{FILE} = length($filename)   if (length($filename)   > $len{FILE});
    $len{PATH} = length($pathname)   if (length($pathname)   > $len{PATH});
    $len{DATE} = length($backupdate) if (length($backupdate) > $len{DATE});
}
# Save last line of input file
push @textline, "$nodename|$filename|$pathname|$backupdate" if defined $nodename;
close (DATAFILE);

for my $textline (@textline) {
    # Separate columns (the -1 keeps trailing empty fields)
    ($nodename,$filename,$pathname,$backupdate) = split(/\|/,$textline,-1);
    # Format column widths
    $nodename   .= ' ' x ($len{NODE} - length($nodename));
    $filename   .= ' ' x ($len{FILE} - length($filename));
    $pathname   .= ' ' x ($len{PATH} - length($pathname));
    $backupdate .= ' ' x ($len{DATE} - length($backupdate));
    print "$nodename $filename $pathname $backupdate\n";
}
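As an aside, the manual padding in that last loop could also be left to sprintf, whose '%-*s' conversion takes the field width from the argument list. A small sketch; format_row is a hypothetical helper and the widths are made-up values standing in for the ones collected in %len:

```perl
use strict;
use warnings;

# Sketch: let sprintf do the padding. '%-*s' takes the width from the
# argument list, so no manual "' ' x ..." arithmetic is needed.
# format_row is a hypothetical helper, not part of the script above.
sub format_row {
    my ( $widths, @fields ) = @_;
    return join ' ',
        map { sprintf '%-*s', $widths->[$_], $fields[$_] } 0 .. $#fields;
}

my @widths = ( 6, 25, 28, 26 );    # assumed widths, for illustration only
print format_row( \@widths, 'BD3101', 'bananaswiithapples.gif',
    '\breakfast\fruits\tree\\', '2007-03-06 14:02:31.000000' ), "\n";
```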