PerlMonks  

Re: parsing table .doc

by kcott (Archbishop)
on Jun 01, 2020 at 08:49 UTC ( [id://11117549]=note )


in reply to parsing table .doc

G'day IB2017,

Here's code which generates the regex I think you were after. I've provided two sets of output: the one shown in your OP, and another which I think is more useful as it gives you access to all of the actual values in the table (blank cells are represented as zero-length strings).

#!/usr/bin/env perl

use strict;
use warnings;

my $content = join '', <DATA>;

# Everything before the first double BEL is taken as the header row
my ($header, undef) = split /\a\a/, $content, 2;

# The number of header cells gives the column count
my $cols = scalar split /\a/, $header;

# A row: $cols cells (each possibly empty, each followed by BEL), then one more BEL
my $re = qr{((?:(?:|[^\a]+)\a){$cols}\a)};

{
    print "*** WANTED ***\n";
    while ($content =~ /$re/g) {
        my $row = $1;
        $row =~ s/\a/(BEL)/g;
        print "$row\n";
    }
}

{
    print "\n*** PROBABLY MORE USEFUL ***\n";
    my @rows;
    while ($content =~ /$re/g) {
        my $row = $1;
        $row =~ s/\a$//;    # drop the row terminator
        push @rows, [ split /\a/, $row ];
    }
    print join('|', @$_), "\n" for @rows;
}

__DATA__
Agreement^GACAP^GACAP^GAccord^G^Galbatross^G^G^Galbatros^G^Galleged violation^G^G^Ginfraction présumée^G^Gallowable^G^G^Gadmissible^G^Ganchovy^G^G^Ganchois^G^Gangler fish, burbot^G^G^Glotte^G^G

Note: all of the '^G's are actually BEL (U+0007) characters, which I embedded in the DATA section.

Output:

*** WANTED ***
Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL)
albatross(BEL)(BEL)(BEL)albatros(BEL)(BEL)
alleged violation(BEL)(BEL)(BEL)infraction présumée(BEL)(BEL)
allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL)
anchovy(BEL)(BEL)(BEL)anchois(BEL)(BEL)
angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL)(BEL)

*** PROBABLY MORE USEFUL ***
Agreement|ACAP|ACAP|Accord
albatross|||albatros
alleged violation|||infraction présumée
allowable|||admissible
anchovy|||anchois
angler fish, burbot|||lotte
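If you'd rather not embed literal control characters in a script, here is a minimal sketch that builds the same cell-BEL, row-BEL-BEL layout with "\a" escapes and parses it back. The helper names build_content and parse_content are made up for illustration, and the sample rows are a shortened, ASCII-only subset of the table above. One deliberate difference from the code above: split is given a limit of -1 so that a blank cell in the last column would be kept rather than dropped.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: serialise rows so that every cell is followed
# by BEL and every row is closed by one extra BEL.
sub build_content {
    my @rows = @_;
    my $content = '';
    for my $row (@rows) {
        $content .= "$_\a" for @$row;   # each cell followed by BEL
        $content .= "\a";               # extra BEL closes the row
    }
    return $content;
}

# Hypothetical helper: parse that layout back into a list of array refs.
sub parse_content {
    my ($content, $cols) = @_;

    # $cols cells (possibly empty, each followed by BEL), then the row's closing BEL
    my $re = qr{((?:[^\a]*\a){$cols})\a};

    my @rows;
    while ($content =~ /$re/g) {
        my $row = $1;
        $row =~ s/\a\z//;                       # drop the BEL after the last cell
        push @rows, [ split /\a/, $row, -1 ];   # -1 keeps trailing empty cells
    }
    return @rows;
}

my @table = (
    [ 'Agreement', 'ACAP', 'ACAP', 'Accord'   ],
    [ 'albatross', '',     '',     'albatros' ],
    [ 'anchovy',   '',     '',     'anchois'  ],
);

my $content = build_content(@table);
my @rows    = parse_content($content, 4);

print join('|', @$_), "\n" for @rows;
```

The column count is passed in explicitly here; deriving it from the header row, as the code above does, works just as well when the header has no blank cells.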

— Ken
