PerlMonks  

Re: Useful heuristics for analyzing arrays of data to determine column header

by hdb (Monsignor)
on Feb 15, 2019 at 07:29 UTC (#1229934)


in reply to Useful heuristics for analyzing arrays of data to determine column header

This is a very interesting endeavour! Here are my two cents:

  • If the first row holds a string and every other row holds a number, the column has a header. Scalar::Util::looks_like_number could be useful for this test.
  • If the first row holds a number, it is not likely to be a header.
  • If the first row holds a string that repeats further down the column, it is not likely to be a header.
  • If the first-row value is unique within the column while other values appear multiple times, it is likely a header. This should be easy to implement.
  • I would assign a likelihood to each column. If the average is above a threshold, or at least one column is certain to have a header, treat the first row as a header row.
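
The heuristics above could be combined roughly as follows. This is only a minimal sketch: the function names column_header_score and has_header_row, the individual weights (0.1, 0.2, 0.5, 0.8) and the default threshold of 0.6 are assumptions made for illustration, not part of the suggestion itself. It also assumes a rectangular table with no undefined cells.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use List::Util qw(sum);

# Score one column: 1 = certainly a header, lower values = less likely.
# All weights below are illustrative assumptions.
sub column_header_score {
    my ($first, @rest) = @_;
    return 0.5 unless @rest;              # nothing to compare against

    # String on top of a purely numeric column: treat as certain.
    my $numeric = grep { looks_like_number($_) } @rest;
    return 1 if !looks_like_number($first) && $numeric == @rest;

    # A number in the first row makes a header less likely.
    return 0.2 if looks_like_number($first);

    my %count;
    $count{$_}++ for @rest;
    # The first value repeats further down: probably data.
    return 0.1 if $count{$first};
    # The first value is unique while other values repeat: probably a header.
    return 0.8 if grep { $_ > 1 } values %count;

    return 0.5;                           # no evidence either way
}

# Whole table: header if any column is certain or the average beats the threshold.
sub has_header_row {
    my ($rows, $threshold) = @_;
    $threshold //= 0.6;                   # hypothetical default
    return 0 unless @$rows && @{ $rows->[0] };

    my $ncols  = @{ $rows->[0] };
    my @scores = map {
        my $c = $_;
        column_header_score(map { $_->[$c] } @$rows);
    } 0 .. $ncols - 1;

    return 1 if grep { $_ == 1 } @scores;
    return sum(@scores) / @scores > $threshold ? 1 : 0;
}

# Example: the 'age' column is a string above numbers, so the row counts as a header.
my @table = (
    [ 'name',  'age' ],
    [ 'Alice', 34    ],
    [ 'Bob',   27    ],
    [ 'Alice', 41    ],
);
print has_header_row(\@table) ? "header row\n" : "no header row\n";
```

Returning a score per column rather than a yes/no keeps the individual rules independent, so a single rule can be reweighted without touching the final decision.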


Re^2: Useful heuristics for analyzing arrays of data to determine column header
by Laurent_R (Canon) on Feb 15, 2019 at 09:40 UTC
    Hi hdb,

    these are very interesting ideas, but I'm not really convinced by this one:

    "If the first row has a number, it is not likely to be a header."
    The header could consist of years, month numbers, quarters, test IDs, etc., all of which look numeric.
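
    A quick check illustrates the objection. The sample header cells below are made up for the purpose (years, month numbers and a test ID); every one of them passes looks_like_number, so a strict "number means no header" rule would reject the whole row.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical header cells: years, zero-padded month numbers, a test ID.
my @numeric_headers = (2017, 2018, 2019, '01', '12', '1001');

# Collect the cells that a numeric test would flag as "not a header".
my @flagged = grep { looks_like_number($_) } @numeric_headers;

printf "%d of %d header cells look numeric\n",
    scalar @flagged, scalar @numeric_headers;
```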

    @nysus: in general, a very strong principle is "know your data." Do you know anything about the data you're going to deal with, or is this a general-purpose tool where you can't know anything about the type of your data in advance?

      Most of the data I'm dealing with will be related to people's contact info. But I'm also interested in trying to write a general purpose tool that can be used by others just for the challenge.

      $PM = "Perl Monk's";
      $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
      $nysus = $PM . ' ' . $MCF;
