As already noted, splitting on whitespace is a faulty assumption in your algorithm, because company names can themselves contain whitespace.
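To make the problem concrete, here is a minimal sketch of what a whitespace split does to a multi-word company name. The sample line is borrowed from the test log shown further down; the variable names are mine and purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line borrowed from the test log; variable names are illustrative only.
my $line = 'GOOD Acme Toy Company 2010-01-01 2011-12-31';

# Naive approach: assume exactly four whitespace-separated fields.
my ($status, $company, $start, $end) = split ' ', $line;

# "Acme Toy Company" occupies three of the six resulting fields, so the
# company name is truncated and the "dates" are really pieces of the name.
print "status=$status company=$company start=$start end=$end\n";
# prints: status=GOOD company=Acme start=Toy end=Company
```

The two real dates end up in the fifth and sixth fields, which the four-variable assignment silently discards.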

This, in my experience, is a common error for someone parsing a log for the first time, so don't feel bad.  :-) I prefer to parse logs based on their predictable components. The wilder the potential format, the more complicated the code gets, but for a relatively simple format like the one you are suggesting, I think it's fairly straightforward (assuming you have a basic understanding of Regular Expressions).

You have to craft your Regular Expression to match the data you are expecting. A technique I have become fond of is wrapping the match in an if statement, which provides the additional feature of filtering out lines that don't match my preconceived format. I often capture those lines to another file for occasional review, to see if the parsing routine needs to compensate for previously unknown formats or conditions. I won't do that in this example so we can save space.
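For what it's worth, the reject-capture idea can be sketched in a few lines. This is not part of the script below; the sample lines and the rejects.log file name are made up for illustration, and you would adapt both to your own setup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; the second and fourth lines do not fit the format.
my @lines = (
    'GOOD Acme Toy Company 2010-01-01 2011-12-31',
    'BAD XYZZY 1972-01-01',
    'UGLY Enron 2001-10-01 2011-09-11',
    '# comment line',
);

# Hypothetical reject file name; choose whatever suits you.
my $rejfnm = 'rejects.log';
open my $rejfil, '>', $rejfnm
    or die "ERROR: Cannot open reject file '$rejfnm': $!\n";

my @parsed;
my $rejcnt = 0;
foreach my $inpbuf (@lines) {
    if ($inpbuf =~ /^(\w+)\s+(.+)\s+(\d{4}-\d{2}-\d{2})\s+(\d{4}-\d{2}-\d{2})\s*$/) {
        # Line fits the expected format: keep the four captured fields.
        push @parsed, [$1, $2, $3, $4];
    }
    else {
        # Line does not fit: set it aside for later review.
        print {$rejfil} "$inpbuf\n";
        $rejcnt++;
    }
}
close $rejfil;

print "Parsed ", scalar(@parsed), " line(s), rejected $rejcnt\n";
```

Reviewing the reject file now and then tells you whether the pattern needs to grow to cover formats you had not anticipated.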

    C:\Steve\Dev\PerlMonks\P-2013-10-27@0838-Log-Parse>type test1.log
    GOOD Acme Toy Company 2010-01-01 2011-12-31
    BAD XYZZY 1972-01-01 1972-06-18
    UGLY Enron 2001-10-01 2011-09-11

    C:\Steve\Dev\PerlMonks\P-2013-10-27@0838-Log-Parse>parselog.pl test1.log

    Status  Company Name      Start Date  End Date
    GOOD    Acme Toy Company  2010-01-01  2011-12-31
    BAD     XYZZY             1972-01-01  1972-06-18
    UGLY    Enron             2001-10-01  2011-09-11

    #!/usr/bin/perl
    use strict;
    use warnings;

    # ---------------------------------------------------------------
    # Parse log with following format:
    #     Status Company Name Start Date End Date
    #
    # Assumptions:  Status contains no whitespace
    #               Dates are in YYYY-MM-DD format
    #               Company names have nothing that looks like a date
    # ---------------------------------------------------------------

    foreach my $inpfnm (@ARGV)
    {
        if (!open INPFIL, '<', $inpfnm)
        {
            print "ERROR: Cannot open input file '$inpfnm'\n";
        }
        else
        {
            print "<HTML>\n";
            print "<BODY>\n";
            print "<TABLE BORDER>\n";
            print "  <TR>\n";
            print "    <TH>Status</TH>\n";
            print "    <TH>Company Name</TH>\n";
            print "    <TH>Start Date</TH>\n";
            print "    <TH>End Date</TH>\n";
            print "  </TR>\n";
            while (my $inpbuf = <INPFIL>)
            {
                chomp $inpbuf;
                if ($inpbuf =~ /^(\w+)\s+(.+)\s+(\d{4}\-\d{2}\-\d{2})\s+(\d{4}\-\d{2}\-\d{2})\s*$/)
                {
                    my $inpsts = $1;
                    my $inpnam = $2;
                    my $stadat = $3;
                    my $enddat = $4;
                    print "  <TR>\n";
                    print "    <TD>$inpsts</TD>\n";
                    print "    <TD>$inpnam</TD>\n";
                    print "    <TD>$stadat</TD>\n";
                    print "    <TD>$enddat</TD>\n";
                    print "  </TR>\n";
                }
            }
            close INPFIL;
            print "</TABLE>\n";
            print "</BODY>\n";
            print "</HTML>\n";
        }
    }
    exit;

    __END__
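The regular expression is the heart of the script, so it may help to see it taken apart. Below is a sketch of an equivalent pattern spread out with the /x modifier and commented piece by piece; the variable names and the sample line are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Equivalent pattern, written with /x so whitespace and comments are ignored.
my $logre = qr/
    ^ (\w+)                  # status: a single word, no whitespace
    \s+ (.+)                 # company name: anything, spaces included
    \s+ (\d{4}-\d{2}-\d{2})  # start date, YYYY-MM-DD
    \s+ (\d{4}-\d{2}-\d{2})  # end date, YYYY-MM-DD
    \s* $                    # optional trailing whitespace, then end of line
/x;

# Illustrative sample line taken from the test log.
my $line = 'GOOD Acme Toy Company 2010-01-01 2011-12-31';

# In list context the match returns the four captured fields.
if (my ($sts, $nam, $sta, $end) = $line =~ $logre) {
    print "$sts|$nam|$sta|$end\n";
}
```

Because (.+) is greedy, it first swallows the rest of the line and then backs off until the two date patterns can match at the end. That is what lets the company name keep its internal spaces.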

In reply to Re: Parsing Text from a File to HTML Table by marinersk
in thread Parsing Text from a File to HTML Table by anupchandu
