Re: Useful heuristics for analyzing arrays of data to determine column header

This is a very interesting endeavour! Here are my two cents:

If the first row has a string and everything else is numbers, the column has a header. Scalar::Util::looks_like_number could be useful.
If the first row has a number, it is not likely to be a header.
If the first row is a string but the same string repeats further below, it is not likely to be a header.
If the value of the first row is unique but other values appear multiple times it is likely a header. This should be easy to implement.
I would assign a likelihood to each column. If the average across columns is above a threshold, or if one or more columns are certain to have a header, the first row is a header row.
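
To make the above a bit more concrete, here is a minimal sketch of how the scoring could look. The function names, the individual scores (0.1, 0.2, 0.5, 0.8) and the default threshold of 0.6 are all made up for illustration and would need tuning against real data. It also assumes the data is a rectangular array of rows with no undefined cells.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use List::Util qw(sum);

# Score one column: 1 = treated as certain to have a header,
# values near 0 = unlikely. $col is an array ref whose element 0
# is the candidate header cell. All scores are illustrative guesses.
sub column_header_score {
    my ($col) = @_;
    my ($first, @rest) = @$col;
    return 0.5 unless defined $first && @rest;   # nothing to compare against

    my $numeric_rest = grep { looks_like_number($_) } @rest;

    # String on top of an all-numeric column: treat as certain.
    return 1 if !looks_like_number($first) && $numeric_rest == @rest;

    # A number in the first row: unlikely (but not impossible) to be a header.
    return 0.2 if looks_like_number($first);

    # First value repeats further down: unlikely to be a header.
    return 0.1 if grep { $_ eq $first } @rest;

    # First value is unique while other values repeat: likely a header.
    my %seen;
    $seen{$_}++ for @rest;
    return 0.8 if grep { $_ > 1 } values %seen;

    return 0.5;   # no evidence either way
}

# $rows is an array ref of row array refs.
# Returns 1 if row 0 looks like a header row, 0 otherwise.
sub has_header_row {
    my ($rows, $threshold) = @_;
    $threshold //= 0.6;                            # assumed default, tune as needed
    return 0 unless @$rows && @{ $rows->[0] };

    my $last_col = $#{ $rows->[0] };
    my @scores = map {
        my $c = $_;
        column_header_score([ map { $_->[$c] } @$rows ]);
    } 0 .. $last_col;

    return 1 if grep { $_ == 1 } @scores;          # one certain column is enough
    return sum(@scores) / @scores > $threshold ? 1 : 0;
}

my @data = ( [ 'name', 'age' ], [ 'alice', 30 ], [ 'bob', 25 ] );
print has_header_row(\@data) ? "header row\n" : "no header row\n";
```

Returning a flat score per rule keeps each heuristic easy to adjust or replace on its own; a weighted combination of several rules per column would be the obvious next step.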

Re^2: Useful heuristics for analyzing arrays of data to determine column header by Laurent_R (Canon) on Feb 15, 2019 at 09:40 UTC
Hi hdb, these are very interesting ideas, but I'm not really convinced by this one: "If the first row has a number, it is not likely to be a header." The header could consist of years, month numbers, quarters, test IDs, etc., all of which look numeric. @nysus: in general, a very strong principle is "know your data." Do you know anything about the data you're going to deal with, or is this a general-purpose tool where you can't know anything about the type of your data in advance?
Re^3: Useful heuristics for analyzing arrays of data to determine column header by nysus (Parson) on Feb 17, 2019 at 10:33 UTC
Most of the data I'm dealing with will be related to people's contact info. But I'm also interested in trying to write a general-purpose tool that can be used by others, just for the challenge.