Module for parsing tables from plain text document

by GrandFather (Saint)
on Jan 07, 2023 at 06:33 UTC ( [id://11149401] )

GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to be able to extract tabular data from plain text documents that look something like the sample data below. Perfect would be something smart enough to figure out the columns by "magic", but I don't mind giving hints if required. I don't want to manually count columns if I can avoid it. Do you know of a module that has that sort of foo?

Annular-Total Eclipse of 2023 Apr 20 - multisite predictions

                                            1st Contact
Site              Longitude Latitude  Elvn  U.T.        PA  Alt
                    o   '    o   '      m   h  m  s      o    o

Auckland          174 45.  -36 55.       0  4 33 59     313  13
Blenheim          173 55.  -41 35.      30  4 40 34     326  11
Cape Palliser     175 25.  -41 35.       0  4 42 28     327   9
Cape Reinga       172 45.  -34 25.      50  4 30 11     307  17
Carterton         175 35.  -41 5.        0  4 40 35     324  10
Dannevirke        176 5.   -40 15.     200  4 39 9      321  10
East Cape         178 35.  -37 45.       0  4 37 58     315  10
Featherston       175 25.  -41 5.       40  4 40 36     325  10
Gisborne          178 5.   -38 45.       0  4 38 29     317  10
Great Barrier Is  175 25.  -36 15.       0  4 34 15     312  13

The actual table for this example has about five times as many columns!

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

Replies are listed 'Best First'.
Re: Module for parsing tables from plain text document
by Tux (Canon) on Jan 07, 2023 at 10:01 UTC

    If those *really* are fixed-width columns, I'd use unpack.
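    A minimal sketch of that approach (my illustration, not code from the post); the template widths are guesses measured once from the sample table and would need adjusting to the real file:

use strict;
use warnings;

while (my $line = <DATA>) {
    chomp $line;
    # A18 = 18-char Site field, A9 = Longitude, A10 = Latitude, etc.
    my @fields = map { s/^\s+|\s+$//gr }
                 unpack 'A18 A9 A10 A7 A12 A5 A2', $line;
    print join('|', @fields), "\n";    # Auckland|174 45.|-36 55.|0|4 33 59|313|13
}

__DATA__
Auckland          174 45.  -36 55.       0  4 33 59     313  13
Great Barrier Is  175 25.  -36 15.       0  4 34 15     312  13

    The win over splitting on whitespace is that embedded single spaces (site names, the h m s values) can't shift the fields.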


    Enjoy, Have FUN! H.Merijn

      There are lots of ways to do it if I want to count characters for each of the tables I need to deal with. What I'd like is something that looks at the table and uses heuristics to figure out the column widths and names. All the tables I'm dealing with in the first instance are machine generated, so the columns are unlikely to change within a table, but they do change between tables.

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
        I wrote something similar for PDF once, and also wrote Data::TableReader, but I never got around to making PDF one of the decoders. For PDF, it made sense to look at the starting X coordinates of text fragments, and to call something a column if roughly as many fragments started at a given X as the estimated number of lines. Plain text has less granularity, so if I were writing this for text I would iterate over the lines, keeping a history of which character columns hold whitespace, and at EOF (or the first blank line) see which runs of whitespace lasted from the first line to the last. Concatenate adjacent whitespace columns, and then report the spans in between as the data columns.
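        A rough sketch of that whitespace-run idea (my own illustration, not code from Data::TableReader): record which character positions are blank on every line, and treat the runs in between as columns.

use strict;
use warnings;

my @lines = grep { /\S/ } map { s/\s+$//r } <DATA>;
my $width = 0;
for (@lines) { $width = length if length > $width }

# A position is a "whitespace column" if it is blank on every line
my @blank = (1) x $width;
for my $line (@lines) {
    my $padded = sprintf '%-*s', $width, $line;
    for my $pos (0 .. $width - 1) {
        $blank[$pos] = 0 if substr($padded, $pos, 1) ne ' ';
    }
}

# Runs of non-blank positions between the whitespace runs are the data columns
my @cols;
my $pos = 0;
while ($pos < $width) {
    if ($blank[$pos]) { ++$pos; next }
    my $beg = $pos;
    ++$pos while $pos < $width and not $blank[$pos];
    push @cols, [ $beg, $pos - $beg ];    # [ start, length ]
}
printf "column at %2d, width %2d\n", @$_ for @cols;

__DATA__
Auckland          174 45.  -36 55.       0  4 33 59     313  13
Great Barrier Is  175 25.  -36 15.       0  4 34 15     312  13

        Note that with only a few rows it will also split Longitude, Latitude and U.T. at their internal spaces wherever those happen to line up on every line, which is exactly the ambiguity discussed further down the thread.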

        It would be really awesome if you wanted to contribute a Decoder for Data::TableReader :-)

Re: Module for parsing tables from plain text document
by tybalt89 (Monsignor) on Jan 07, 2023 at 13:57 UTC

    My usual trick is to split on two or more spaces. For your example it would work on the table data but not the label lines at the top.
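    For illustration (my snippet, not tybalt89's):

use strict;
use warnings;

my $line = 'Cape Palliser     175 25.  -41 35.       0  4 42 28     327   9';
my @fields = split /\s{2,}/, $line;    # split on runs of two or more spaces
print "$_\n" for @fields;              # Cape Palliser / 175 25. / -41 35. / 0 / 4 42 28 / 327 / 9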

      > For your example it would work on the table data

      That's a matter of interpretation; for instance, I could understand h m s as separate columns.

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

Re: Module for parsing tables from plain text document
by roboticus (Chancellor) on Jan 10, 2023 at 15:56 UTC

    GrandFather:

    I've got a chunk of code that does this, but I've not turned it into a module because it's a bit temperamental. Rather, the code isn't temperamental, but the problem keeps changing for different projects. Consequently, for each project I find myself either tweaking the code a bit or the table a bit to make it load up.

    I'm at work right now, so I don't have it handy, but I can dig it up this evening if you want it. The gist of it, though, is to:

    • Find the split between the column headers and the data
    • Find the column widths
    • Read the records
    • While reading the records, accumulate stats to help infer each column's data type

    To simplify the first two tasks, I tweak the data and add a line of dashes to the table (as the automatic method I used to use is too finicky). While reading the file, I keep lines before the dash bar (back to the first non-empty line) to build the field keys.

    The ugly bit(s) are that there are so many special cases I wind up with for different projects. If you leave the special cases out, it's all fairly straightforward:

$ cat pm_11149401.pl
#!env perl

use strict;
use warnings;

use Data::Dumper;

### Find the table start and column header lines
my ($dashes, @tmp);
while (<DATA>) {
    # We've found the end of the column headings when we find a line of
    # dashes and blanks with at least eight sequential dashes
    $dashes=$_, last if /^[-\s]*-{8}[-\s]+$/;
    push @tmp, $_;

    # The data we'll build the column headers / keys from is only from
    # lines immediately before the dash bar
    @tmp=(), next if /^\s*$/;
}
die "No dash bar found!" unless defined $dashes;

### Build the column descriptions
# First need the starting position and width of each column
my $col=0;
my @coldefs;
while ($dashes ne '' and $dashes =~ /^(\s*)(-*)/) {
    # skip blanks
    $col += length($1);
    if (length $2) {
        push @coldefs, { beg=>$col, len=>length($2) };
        $col += length($2);
    }
    $dashes = substr($dashes, length($1)+length($2));
}

# Build the column keys
for my $tmp (@tmp) {
    for my $ar (@coldefs) {
        my $chunk = substr($tmp, $ar->{beg}, $ar->{len});
        $chunk =~ s/(^\s+|\s+$)//g;
        $chunk =~ s/[^-a-zA-Z0-9_]+/_/g;
        $ar->{key} .= $chunk;
    }
}

# Parse the table
my @records;
while (<DATA>) {
    last if /^\s*$/;
    my $hr = {};
    for my $ar (@coldefs) {
        my $chunk = substr($_, $ar->{beg}, $ar->{len});
        $chunk =~ s/(^\s+|\s+$)//g;
        $hr->{$ar->{key}} = $chunk;
    }
    push @records, $hr;
}

print Dumper(\@records);

__DATA__
Annular-Total Eclipse of 2023 Apr 20 - multisite predictions

                                            1st Contact
Site              Longitude Latitude  Elvn  U.T.        PA  Alt
                    o   '    o   '      m   h  m  s      o    o
----------------- -------- --------- -----  --------    ---  --
Auckland          174 45.  -36 55.       0  4 33 59     313  13
Blenheim          173 55.  -41 35.      30  4 40 34     326  11
Cape Palliser     175 25.  -41 35.       0  4 42 28     327   9
Cape Reinga       172 45.  -34 25.      50  4 30 11     307  17
Carterton         175 35.  -41 5.        0  4 40 35     324  10
Dannevirke        176 5.   -40 15.     200  4 39 9      321  10
East Cape         178 35.  -37 45.       0  4 37 58     315  10
Featherston       175 25.  -41 5.       40  4 40 36     325  10
Gisborne          178 5.   -38 45.       0  4 38 29     317  10
Great Barrier Is  175 25.  -36 15.       0  4 34 15     312  13

$ perl pm_11149401.pl
$VAR1 = [
          {
            'Elvnm' => '0',
            'lto' => '13',
            'Longitudo_' => '174 45.',
            'Site' => 'Auckland',
            '1st_ContU_T_h_m_s' => '4 33 59',
            'Latitudeo_' => '-36 55.',
            'PAo' => '313'
          },
          {
            'lto' => '11',
            'Elvnm' => '30',
            'Site' => 'Blenheim',
            'Longitudo_' => '173 55.',
            'PAo' => '326',
            'Latitudeo_' => '-41 35.',
            '1st_ContU_T_h_m_s' => '4 40 34'
          },
          {
            'Elvnm' => '0',
            'lto' => '9',
            'Site' => 'Cape Palliser',
            'Longitudo_' => '175 25.',
            '1st_ContU_T_h_m_s' => '4 42 28',
            'PAo' => '327',
            'Latitudeo_' => '-41 35.'
          },
          {
            'Site' => 'Cape Reinga',
            'Longitudo_' => '172 45.',
            'PAo' => '307',
            'Latitudeo_' => '-34 25.',
            '1st_ContU_T_h_m_s' => '4 30 11',
            'lto' => '17',
            'Elvnm' => '50'
          },
          {
            'Latitudeo_' => '-41 5.',
            'PAo' => '324',
            '1st_ContU_T_h_m_s' => '4 40 35',
            'Longitudo_' => '175 35.',
            'Site' => 'Carterton',
            'lto' => '10',
            'Elvnm' => '0'
          },
          {
            'Longitudo_' => '176 5.',
            'Site' => 'Dannevirke',
            '1st_ContU_T_h_m_s' => '4 39 9',
            'Latitudeo_' => '-40 15.',
            'PAo' => '321',
            'Elvnm' => '200',
            'lto' => '10'
          },
          {
            'Elvnm' => '0',
            'lto' => '10',
            '1st_ContU_T_h_m_s' => '4 37 58',
            'PAo' => '315',
            'Latitudeo_' => '-37 45.',
            'Longitudo_' => '178 35.',
            'Site' => 'East Cape'
          },
          {
            'Longitudo_' => '175 25.',
            'Site' => 'Featherston',
            '1st_ContU_T_h_m_s' => '4 40 36',
            'Latitudeo_' => '-41 5.',
            'PAo' => '325',
            'Elvnm' => '40',
            'lto' => '10'
          },
          {
            'lto' => '10',
            'Elvnm' => '0',
            'PAo' => '317',
            'Latitudeo_' => '-38 45.',
            '1st_ContU_T_h_m_s' => '4 38 29',
            'Site' => 'Gisborne',
            'Longitudo_' => '178 5.'
          },
          {
            'PAo' => '312',
            'Latitudeo_' => '-36 15.',
            '1st_ContU_T_h_m_s' => '4 34 15',
            'Longitudo_' => '175 25.',
            'Site' => 'Great Barrier Is',
            'lto' => '13',
            'Elvnm' => '0'
          }
        ];

    The special cases, though, are where I basically tweak things that drive me crazy. There's a 'translation table' at the start that lets me map the incoming column names to a better one, as well as tie it to a function handle that parses the resulting string into a better format. Another version somewhere has a control-break handler that lets you specify key columns so when the key values are blank, it makes 'sub records' and so on.
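    For flavour, here is a hypothetical shape such a translation table might take (my sketch, not roboticus's actual code; the names and parsers are purely illustrative): each raw header maps to a nicer name plus a coderef that cleans up the value.

use strict;
use warnings;

# Hypothetical translation table: raw header => [ better name, parser ]
my %xlate = (
    'Longitudo_' => [ 'longitude',   sub { my ($d, $m) = split ' ', $_[0]; $d + $m / 60 } ],
    'Elvnm'      => [ 'elevation_m', sub { 0 + $_[0] } ],
);

my %raw = ( 'Longitudo_' => '174 45.', 'Elvnm' => '0', 'Site' => 'Auckland' );
my %rec;
while (my ($key, $val) = each %raw) {
    # Unknown columns fall through untouched
    my ($name, $parse) = @{ $xlate{$key} // [ $key, sub { $_[0] } ] };
    $rec{$name} = $parse->($val);
}
printf "%-12s %s\n", $_, $rec{$_} for sort keys %rec;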

    I've never created and published a module before, but if I had, I'd still be reluctant to try to build this thing out because of the ugly cases that keep coming up. But on the off chance that you might find it useful enough, I'll dig one of them up for you.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      I'm working through an idea that would effectively generate the hyphen row in a moderately robust fashion with help from hints like the number of header rows and their expected justification. It's very much a work in progress and I'd be happy to see anything you might have to show.

      Some examples of the types of problem tables you have had to deal with would help define the problem space. Quite likely I won't cleanly deal with all variations, but if I can handle common cases it would be worth turning this into a module and publishing it.

      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Module for parsing tables from plain text document
by LanX (Saint) on Jan 07, 2023 at 17:00 UTC
    > to figure out the columns by "magic"

    This could be done with a combination of statistics/histograms (to identify empty columns) and heuristics, but it can't be perfect.

    A small amount of data would result in ambiguity.

    But a semi-automatic approach, where you choose the result you like best, sounds very feasible.

    Think of the way Excel asks about the best way to import CSV after showing a best guess. °

    Would that be good enough?

    The first step would require collecting sufficient training data.

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

    updates

    °) e.g. Text-Import-Wizard

Re: Module for parsing tables from plain text document
by Anonymous Monk on Jan 07, 2023 at 17:39 UTC
    I often have to parse such astronomical data. I think you have to get your hands dirty. I try to keep it as simple as possible:
#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

# hash for data
my $data = {};
# array of column names
my @cols = ();

for (<DATA>) {
    # normalize the data
    next if /^\s/ or /^Annular/;
    if (/^Site/) {
        # grab column titles, minus the key (Site)
        @cols = map {$_} split /\s+/;
        shift @cols;
    }
    elsif (/\S+/) {
        # parse column data
        my @line = /
            ([A-Za-z ]+)\s+        # Site
            (\S+\s+\S+)\s+         # Longitude
            (\S+\s+\S+)\s+         # Latitude
            (\S+)\s+               # Elvn
            (\S+\s+\S+\s+\S+)\s+   # U.T.
            (\S+)\s+               # PA
            (\S+)                  # Alt
        /x;
        # clean the key
        my $site = do { $_ = shift @line; s/\s+$//; $_ };
        next unless $site;
        # map titles to data
        my %line = do { @_{@cols} = @line; %_ };
        # add to hash
        $data->{$site} = \%line;
    }
}

# tada
print Dumper $data;

__DATA__
Annular-Total Eclipse of 2023 Apr 20 - multisite predictions

                                            1st Contact
Site              Longitude Latitude  Elvn  U.T.        PA  Alt
                    o   '    o   '      m   h  m  s      o    o

Auckland          174 45.  -36 55.       0  4 33 59     313  13
Blenheim          173 55.  -41 35.      30  4 40 34     326  11
Cape Palliser     175 25.  -41 35.       0  4 42 28     327   9
Cape Reinga       172 45.  -34 25.      50  4 30 11     307  17
Carterton         175 35.  -41 5.        0  4 40 35     324  10
Dannevirke        176 5.   -40 15.     200  4 39 9      321  10
East Cape         178 35.  -37 45.       0  4 37 58     315  10
Featherston       175 25.  -41 5.       40  4 40 36     325  10
Gisborne          178 5.   -38 45.       0  4 38 29     317  10
Great Barrier Is  175 25.  -36 15.       0  4 34 15     312  13
    Output:
    $VAR1 = {
              'Auckland' => {
                              'Longitude' => '174 45.',
                              'Alt' => '13',
                              'Elvn' => '0',
                              'Latitude' => '-36 55.',
                              'PA' => '313',
                              'U.T.' => '4 33 59'
                            },
    
Re: Module for parsing tables from plain text document
by kcott (Archbishop) on Jan 14, 2023 at 04:51 UTC

    G'day GrandFather,

    [Sorry, a bit late to the party; I haven't logged in for a week and a half.]

    I don't know what sort of variations may exist for your input data. Purely on what's shown, here's how I might get the data into a canonical form suitable for subsequent processing (via split, Text::CSV, or other).

    parse_000.pl:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $data_file = 'data_000.txt';
my $re = qr{^([^0-9-]+?)\s+([0-9-][ 0-9.-]+?)\s*$};
my @data;

open my $fh, '<', $data_file;

while (<$fh>) {
    next unless /$re/;
    my ($site, $info) = ($1, $2);
    $info =~ y/ /\t/s;
    push @data, join "\t", $site, $info;
}

# For demo only
use Data::Dump;
dd \@data;

    Input:

$ cat data_000.txt
Annular-Total Eclipse of 2023 Apr 20 - multisite predictions

                                            1st Contact
Site              Longitude Latitude  Elvn  U.T.        PA  Alt
                    o   '    o   '      m   h  m  s      o    o

Auckland          174 45.  -36 55.       0  4 33 59     313  13
Blenheim          173 55.  -41 35.      30  4 40 34     326  11
Cape Palliser     175 25.  -41 35.       0  4 42 28     327   9
Cape Reinga       172 45.  -34 25.      50  4 30 11     307  17
Carterton         175 35.  -41 5.        0  4 40 35     324  10
Dannevirke        176 5.   -40 15.     200  4 39 9      321  10
East Cape         178 35.  -37 45.       0  4 37 58     315  10
Featherston       175 25.  -41 5.       40  4 40 36     325  10
Gisborne          178 5.   -38 45.       0  4 38 29     317  10
Great Barrier Is  175 25.  -36 15.       0  4 34 15     312  13

    Output:

$ ./parse_000.pl
[
  "Auckland\t174\t45.\t-36\t55.\t0\t4\t33\t59\t313\t13",
  "Blenheim\t173\t55.\t-41\t35.\t30\t4\t40\t34\t326\t11",
  "Cape Palliser\t175\t25.\t-41\t35.\t0\t4\t42\t28\t327\t9",
  "Cape Reinga\t172\t45.\t-34\t25.\t50\t4\t30\t11\t307\t17",
  "Carterton\t175\t35.\t-41\t5.\t0\t4\t40\t35\t324\t10",
  "Dannevirke\t176\t5.\t-40\t15.\t200\t4\t39\t9\t321\t10",
  "East Cape\t178\t35.\t-37\t45.\t0\t4\t37\t58\t315\t10",
  "Featherston\t175\t25.\t-41\t5.\t40\t4\t40\t36\t325\t10",
  "Gisborne\t178\t5.\t-38\t45.\t0\t4\t38\t29\t317\t10",
  "Great Barrier Is\t175\t25.\t-36\t15.\t0\t4\t34\t15\t312\t13",
]
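    Once the rows are canonical tab-separated text, downstream handling is trivial. A tiny illustration (mine, not from kcott's post):

use strict;
use warnings;

my @data = ("Auckland\t174\t45.\t-36\t55.\t0\t4\t33\t59\t313\t13");
for my $row (@data) {
    my ($site, @values) = split /\t/, $row;
    print "$site: @values\n";    # Auckland: 174 45. -36 55. 0 4 33 59 313 13
}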

    — Ken
