http://qs1969.pair.com?node_id=205612

pelp has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks:

I need some help with data munging. I come from a C++ background (3+ years) and have little experience with Perl (about 2 months). At work today I was given the task of converting 1300+ logs to a new format, and the program is due tomorrow (I know... how nice?)!

Currently a log is defined in this format and needs to be converted to a CSV file in this format

As of right now, I'm stuck on how to extract each token from the old format (such as the values of author, number and version number) and convert it to the new format. I've tried to find some sample code on the net, including this site, but was unable to.

Please bless me with some Perl wisdom to get me started on this. I will greatly appreciate any help that I get.

Thanks ahead.

Re: Converting logs to CSV format (desperate help)
by jarich (Curate) on Oct 16, 2002 at 06:12 UTC
    G'day pelp,
    Assuming you want a CSV without the pretty spacing that your html table produced, and assuming that your current records are separated by two newlines, ie:
        Author : tom jones
        Number : abc123
        Version Number : 17
        Feature : nothing was changed
        File Name : house.doc
        Modification Date : 05/16/2002
        Paragraph Number Requirement Number Last Modified
        BCBLUE-BC-191.a SMAPSFS-VPU-1232 17
        BCBLUE-BC-232.g SMAPSFS-VPU-2342 17

        Author : fred jones
        Number : abc124
        Version Number : 18
        Feature : nothing much was changed
        File Name : house.doc
        Modification Date : 05/18/2002
        Paragraph Number Requirement Number Last Modified
        BCBLUE-BC-191.a SMAPSFS-VPU-1232 18
        BCBLUE-BC-232.g SMAPSFS-VPU-2342 18
    And your input is kinda well formed etc, then the following code:
        use strict;

        $/ = "";    # paragraph mode.

        print "File Name,Author,Date (MM/DD/Year),TIME (H:M:S),Version No.,".
              "Number,Feature Name,Paragraph Number,Requirement Number\n";

        while(<>) {
            # $_ =~ Author : foo\nNumber : abc....

            # These regexps may need changing if you allow
            # other characters in them.  You may find something
            # more general such as what I use for Feature
            # best for all fields...
            my ($author)   = m/^Author\s+:\s+([\w ]+)$/m;
            my ($number)   = m/^Number\s+:\s+([\w ]+)$/m;
            my ($version)  = m/Version Number\s+:\s+([\w ]+)$/m;
            my ($feature)  = m/Feature\s+:\s+([^\s].*)$/m;
            my ($filename) = m/File Name\s+:\s+([\w._-]+)$/m;
            my ($mod_date) = m!Modification Date\s+:\s+(\d{2}/\d{2}/\d{4})!m;

            # Hope that Paragraph Number etc occurs at the end of the
            # record.
            my ($otherjunk) = m/Paragraph(.*)$/s;

            my @paragraphs = (split /\n/, $otherjunk);
            shift @paragraphs;   # don't need headings;

            foreach my $line (@paragraphs) {
                my ($para, $requirement) = split(/\s+/, $line);
                # The empty field is the TIME column.
                print qq{"$filename","$author","$mod_date","","$version",}.
                      qq{"$number","$feature","$para","$requirement"\n};
            }
        }
    will produce:
        File Name,Author,Date (MM/DD/Year),TIME (H:M:S),Version No.,Number,Feature Name,Paragraph Number,Requirement Number
        "house.doc","tom jones","05/16/2002","","17","abc123","nothing was changed","BCBLUE-BC-191.a","SMAPSFS-VPU-1232"
        "house.doc","tom jones","05/16/2002","","17","abc123","nothing was changed","BCBLUE-BC-232.g","SMAPSFS-VPU-2342"
        "house.doc","fred jones","05/18/2002","","18","abc124","nothing much was changed","BCBLUE-BC-191.a","SMAPSFS-VPU-1232"
        "house.doc","fred jones","05/18/2002","","18","abc124","nothing much was changed","BCBLUE-BC-232.g","SMAPSFS-VPU-2342"
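As the comment in the code above hints, the per-field regexps could probably be folded into one more general pattern. The following is only a minimal sketch, assuming every header line has the form `Name : value` and that the table rows follow the line starting with `Paragraph Number`. The helper names `parse_record` and `csv_line` are made up for illustration and are not part of the code above. `csv_line` also doubles any embedded double quotes, which simply wrapping a value in quotes would not handle.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record (a paragraph of text) into a
# hash of header fields and a list of [paragraph, requirement] rows.
sub parse_record {
    my ($text) = @_;
    my (%field, @rows);
    my $in_table = 0;
    for my $line (split /\n/, $text) {
        if ($in_table) {
            # split ' ' ignores leading whitespace; the third column
            # (Last Modified) is dropped here.
            my ($para, $req) = split ' ', $line;
            push @rows, [$para, $req] if defined $req;
        }
        elsif ($line =~ /^\s*Paragraph Number\b/) {
            $in_table = 1;      # column headings; data rows follow
        }
        elsif ($line =~ /^\s*(.+?)\s+:\s+(.*?)\s*$/) {
            $field{$1} = $2;    # e.g. $field{'Version Number'} = '17'
        }
    }
    return (\%field, \@rows);
}

# Hypothetical helper: quote every field and double embedded quotes.
sub csv_line {
    my @quoted = map {
        my $value = defined $_ ? $_ : '';
        $value =~ s/"/""/g;
        qq{"$value"};
    } @_;
    return join(',', @quoted) . "\n";
}

# Demo on one record shaped like the sample input above.
my $sample = join "\n",
    'Author : tom jones',
    'Number : abc123',
    'Version Number : 17',
    'Feature : nothing was changed',
    'File Name : house.doc',
    'Modification Date : 05/16/2002',
    'Paragraph Number Requirement Number Last Modified',
    'BCBLUE-BC-191.a SMAPSFS-VPU-1232 17',
    'BCBLUE-BC-232.g SMAPSFS-VPU-2342 17';

my ($field, $rows) = parse_record($sample);
for my $row (@$rows) {
    print csv_line(
        @{$field}{'File Name', 'Author', 'Modification Date'},
        '',     # TIME column left empty, as in the code above
        @{$field}{'Version Number', 'Number', 'Feature'},
        @$row,
    );
}
```

In a real script the record text would come from the paragraph-mode `while(<>)` loop rather than from a hard-coded string.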

    If your input is reasonably well formed, ie you can rely on having "Author" be the first field, but records are not separated by 2 newlines, run something like the following over your data file first:

        while(<>) {
            if(/^Author\s+:\s+/) {
                print "\n";
            }
            print;
        }
    The resulting output will be fine for my program above.

    I hope this will prove helpful to you.

    jarich

Re: Converting logs to CSV format (desperate help)
by diotalevi (Canon) on Oct 16, 2002 at 06:49 UTC

    Fix this to taste - it's my best guess at what you need. It reads one header per file and then multiple records per file.

        use strict;
        use warnings;

        use constant HEADER => q[
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
        <html>
        <head>
        <title>Untitled Document</title>
        <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
        <style type="text/css">
        <!--
        body {
            font-family: verdana;
            font-size: 7px;
        }
        tbody {
            font-family: verdana;
            font-size: 9px;
        }
        -->
        </style>
        </head>

        <body>
        <TABLE width=955 cellPadding=1 cellSpacing=0>
          <TBODY>
          <TR vAlign=top>
            <TD width="59" height=16>
              <P class=Table>File Name, &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </P></TD>
            <TD width="68" height=16>
              <P class=Table>Author,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </P></TD>
            <TD width="144" height=16>
              <P class=Table>Date (MM/DD/Year),&nbsp;&nbsp;&nbsp; </P></TD>
            <TD width="96" height=16>
              <P class=Table>TIME (H:M:S),&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </P></TD>
            <TD width="114" height=16>
              <P class=Table>Version No.,&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </P></TD>
            <TD width="59" height=16>
              <P class=Table>Number,</P></TD>
            <TD width="97" height=16>
              <P class=Table>Feature Name,</P></TD>
            <TD width="125" height=16>
              <P class=Table>Paragraph Number,</P></TD>
            <TD width="173" height=16>
              <P class=Table>Requirement Number</P></TD>
          </TR>
        ];

        use constant FOOTER => q[
          </TBODY>
        </TABLE>
        </body>
        </html>];

        # Fill any missing field with a non-breaking space so that
        # the table cell is never empty.
        sub fixupRecord {
            my $record = shift;

            for (qw(FileName Author Date Time Version Number
                    FeatureName ParagraphNumber RequirementNumber)) {
                $record->{$_} = '&nbsp;' unless defined $record->{$_}
            }
        }

        # Return one table row (as HTML) for the given record.
        sub formatRecord {
            my $record = shift;
            fixupRecord( $record );

            qq[
          <TR vAlign=top>
            <TD width="59" height=16>
              <P class=Table>@{[$record->{FileName}]}</P></TD>
            <TD width="68" height=16>
              <P class=Table>@{[$record->{Author}]},</P></TD>
            <TD width="144" height=16>
              <P class=Table style="MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px">@{[$record->{Date}]}</P></TD>
            <TD width="96" height=16>
              <P class=Table style="MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px">@{[$record->{Time}]}</P></TD>
            <TD width="114" height=16>
              <P class=Table style="MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px">@{[$record->{Version}]},</P></TD>
            <TD width="59" height=16>
              <P class=Table style="MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px">@{[$record->{Number}]},</P></TD>
            <TD width="97" height=16>
              <P class=Table style="MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px">@{[$record->{FeatureName}]}</P></TD>
            <TD width="125" height=16>
              <P class=Table style="MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px">@{[$record->{ParagraphNumber}]},</P></TD>
            <TD width="173" height=16>
              <P class=Table style="MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px">@{[$record->{RequirementNumber}]}</P></TD>
          </TR>
        ];
        }

        print HEADER;

        for my $filename (@ARGV) {
            # Slurp the whole file into one string.
            my $file = do { local (@ARGV, $/) = $filename; <> };

            # Pull out the header fields that are shared by every row.
            my %record;
            for ([ Author      => qr/^Author\s+:\s+(.+)/m ],
                 [ Number      => qr/^Number\s+:\s+(.+)/m ],
                 [ Version     => qr/^Version Number\s+:\s+(.+)/m ],
                 [ FeatureName => qr/^Feature\s+:\s(.+)/m ],
                 [ FileName    => qr/^File Name\s+:\s(.+)/m ],
                 [ Date        => qr/^Modification Date\s+:\s(.+)/m ]) {
                @record{$_->[0]} = $file =~ $_->[1];
                $record{$_->[0]} =~ s/^\s+//;
                $record{$_->[0]} =~ s/\s+$//;
            }

            # Drop everything up to and including the column headings,
            # leaving only the data rows.
            $file =~ s/^.+?Paragraph Number\s+Requirement Number\s+Last Modified\n//s;
            for (split /\n/, $file) {
                my %instance = %record;
                @instance{qw(ParagraphNumber RequirementNumber)} = split;
                print formatRecord( \%instance );
            }
        }

        print FOOTER;
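The `do { local (@ARGV, $/) = $filename; <> }` line in the code above is a compact way of reading a whole file into one string: `$/` is left undefined inside the block, so a single read returns the entire file. For readers new to Perl, roughly the same thing could be spelled out as below. This is only a sketch; `slurp_file` is a made-up name and not something the code above defines.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole content of $filename as one string.
sub slurp_file {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    local $/;               # undefined $/ means "read up to end of file"
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

With such a helper, the slurp line in the loop could be written as `my $file = slurp_file($filename);`.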
    __SIG__ printf "You are here %08x\n", unpack "L!", unpack "P4", pack "L!", B::svref_2object(sub{})->OUTSIDE;
Re: Converting logs to CSV format (desperate help)
by IndyZ (Friar) on Oct 16, 2002 at 05:02 UTC
    Well, your input data doesn't seem to match your desired output data very well. Are the date, time, and feature name missing from the output intentionally, or were they omitted from the result file by accident?

    Are you allowed to write your program in C? Crunch time isn't where you should be learning a new language, and I think with your experience you should be able to punch this out in C++ overnight.

    --
    IndyZ

      IndyZ:

      I intentionally left out date, time and feature name. I'm not too concerned about that now.

      Actually, someone said it would be quicker in Perl than C++.

      So far I'm on a good track.

        Actually, someone said it would be quicker in Perl than C++.
        You have a work task due tomorrow.

        You have 3 years experience with C++, and only 2 months experience with perl, and someone told you it'd be quicker to do it in perl?

        Wha?????
Re: Converting logs to CSV format (desperate help)
by pelp (Initiate) on Oct 16, 2002 at 03:09 UTC
    Please use these links instead. Sorry for all these corrections but YAHOO is acting funny.
    http://www.geocities.com/pelpme/current_format.txt (original file)
    http://www.geocities.com/pelpme/csv.htm (new format)