I'm writing a Perl module that tries to autodetect whether a spreadsheet/CSV file has a header row. It's a somewhat tricky problem, especially when you factor in the kind of malformed data people might feed in, so I'm sticking with simple cases for now. I'd like to use some statistical analysis of the data to help, but my knowledge of statistics is very weak and I'm feeling my way in the dark. Take this sample column, for example:

STATE NY NY NY NJ NY

It's obviously a column of states. One thing that might jump out to a computer is that the first cell is 5 letters long while every other cell is 2 letters. Things can of course get fuzzier: the first cell might have 5 letters while the rest of the cells have 2 or 3. Other tell-tale signs: the header cell might be a string while the rest of the column is full of numbers, or the header might be named "STATUS" while the data contains only the words "ACTIVE" or "RETIRED". The number of rows is also an important factor: the more data there is, the more reliable any statistical approach is likely to be.
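
To make two of those signs concrete, here is a minimal sketch, not a tested solution. The function name `column_signals`, the hash keys, and the "at most half as many distinct values as rows" cutoff are all assumptions made up for illustration; it also assumes every cell is defined. `looks_like_number` comes from the core Scalar::Util module.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: given an array ref holding one column (first element
# is the candidate header), return a hash ref of 0/1 signals.
sub column_signals {
    my ($column) = @_;
    my ($first, @rest) = @$column;
    return { type_mismatch => 0, novel_first => 0 } unless @rest;

    # Signal 1: every cell below the first is numeric, but the first is not.
    my $numeric_body  = !grep { !looks_like_number($_) } @rest;
    my $type_mismatch = ($numeric_body && !looks_like_number($first)) ? 1 : 0;

    # Signal 2: the body repeats a small vocabulary (e.g. ACTIVE/RETIRED)
    # and the first cell is not part of that vocabulary.
    my %seen;
    $seen{$_}++ for @rest;
    my $distinct        = keys %seen;
    my $low_cardinality = $distinct <= @rest / 2;   # arbitrary cutoff (assumption)
    my $novel_first     = ($low_cardinality && !exists $seen{$first}) ? 1 : 0;

    return { type_mismatch => $type_mismatch, novel_first => $novel_first };
}

my $signals = column_signals([qw(STATE NY NY NY NJ NY)]);
print "novel_first=$signals->{novel_first} type_mismatch=$signals->{type_mismatch}\n";
```

Each signal is deliberately crude; on a short column, or one with missing cells, neither should be trusted on its own.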

The Math::NumberCruncher module has some functions I think I could use, such as standard deviation, to measure how diverse the dataset is for a given property (length, number of unique values, etc.). But I'm not really sure how to apply them in a practical way. I found this interesting article useful, but it has no code, and not being familiar with statistics, I'm still not clear on exactly how that analysis was done.
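
One common way to put standard deviation to work here is a z-score: how many standard deviations the first cell's length sits from the mean length of the other cells. The sketch below computes the mean and standard deviation by hand with the core List::Util module rather than calling Math::NumberCruncher. The function name `header_length_zscore` and the 0.5 floor on the standard deviation (to avoid dividing by zero when every cell has the same length) are assumptions for illustration only.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: z-score of the first cell's length relative to the
# lengths of the remaining cells in the column (array ref).
sub header_length_zscore {
    my ($column) = @_;
    my ($first, @rest) = @$column;
    return 0 unless @rest;

    my @lengths  = map { length } @rest;
    my $mean     = sum(@lengths) / @lengths;
    my $variance = sum(map { ($_ - $mean) ** 2 } @lengths) / @lengths;
    my $sd       = sqrt($variance);

    # Assumption: floor the standard deviation so that a perfectly uniform
    # body (sd of 0) still yields a finite score.
    $sd = 0.5 if $sd < 0.5;

    return (length($first) - $mean) / $sd;
}

printf "z = %.2f\n", header_length_zscore([qw(STATE NY NY NY NJ NY)]);
```

A large absolute value suggests the first cell is unlike the rest in length; where to draw the line (2? 3?) is a judgment call that would need tuning on real files.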

I intend to analyze each column and come up with some kind of "likely has header" score for it. If most columns look like they have a header, the module will conclude that the spreadsheet has a header row.
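
The per-column vote could look roughly like the following sketch. Everything named here is hypothetical: `likely_has_header`, the 0.5 default threshold, and the toy scorer `first_cell_is_unique`, which stands in for whatever real per-column analysis ends up being used. It assumes the rows are rectangular and every cell is defined.

```perl
use strict;
use warnings;

# Toy scorer (assumption): 1 if the first cell never reappears below it.
sub first_cell_is_unique {
    my ($column) = @_;
    my ($first, @rest) = @$column;
    return (grep { $_ eq $first } @rest) ? 0 : 1;
}

# Hypothetical driver: $rows is an array ref of row array refs, $scorer is a
# code ref returning a score between 0 and 1 for one column.
sub likely_has_header {
    my ($rows, $scorer, $threshold) = @_;
    $threshold //= 0.5;
    return 0 unless @$rows > 1;          # nothing to compare against

    my $width = @{ $rows->[0] };
    my $votes = 0;
    for my $i (0 .. $width - 1) {
        # Slice column $i out of the row-oriented data.
        my @column = map { $_->[$i] } @$rows;
        $votes++ if $scorer->(\@column) >= $threshold;
    }

    # Simple majority: more than half the columns must look headed.
    return $votes > $width / 2 ? 1 : 0;
}

my @rows = (
    [qw(STATE STATUS)],
    [qw(NY ACTIVE)],
    [qw(NY RETIRED)],
    [qw(NJ ACTIVE)],
);
print likely_has_header(\@rows, \&first_cell_is_unique) ? "header\n" : "no header\n";
```

Passing the scorer as a code ref keeps the voting logic separate from the heuristics, so individual signals can be swapped or weighted later without touching the driver.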

Sorry this question is so open-ended. But any tips/advice you can think of would be appreciated.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;


In reply to Useful heuristics for analyzing arrays of data to determine column header by nysus
