parsing table .doc

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse word tables in .doc format. Need some help with regex.

I get a dump of the table and the number of columns (first row is header) with:

use strict;
use warnings;

require Text::Extract::Word;
Text::Extract::Word->import( qw(get_all_text) );

my $InputFileReadable ='Fish.doc';
my $content = get_all_text($InputFileReadable);

#detecting structure
my ($header, $body)= split /(\a)\1+/, $content;
my $NrColumns = () = $header =~ /\a/g;
$NrColumns++;
print $NrColumns;
[download]

Now I need to parse $content. I now that columns are separated by \a (BEL) and end of rows by \a\a. Simply splitting $content everytime I match \a\a is not enough, since if a column is empty I will have \a\a also as non end-of-row delimiter. So, I need to keep track that \a\a is matching only at $NrColumns positions.

Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL)albatross(BEL)(BEL)(BE
+L)albatros(BEL)(BEL)alleged violation(BEL)(BEL)(BEL)infraction présum
+ée(BEL)(BEL)allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL)anchovy(BEL)(
+BEL)(BEL)anchois(BEL)(BEL)angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL
+)(BEL)
[download]

With $NrColumns = 4 I should get:

Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL)
albatross(BEL)(BEL)(BEL)albatros(BEL)(BEL)
alleged violation(BEL)(BEL)(BEL)infraction présumée(BEL)(BEL)
allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL)
anchovy(BEL)(BEL)(BEL)anchois(BEL)(BEL)
angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL)(BEL)
[download]

Comment on parsing table .doc Select or Download Code

Replies are listed 'Best First'.
Re: parsing table .doc by tybalt89 (Monsignor) on May 31, 2020 at 14:42 UTC
Just for the TMTOWTDI of it... #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11117527 use warnings; use List::Util qw( first ); $_ = do { local $/; <DATA> }; my @fields = split /($BEL$)/, $_; my $inrow = 2 + first { $fields[$_] eq '' } 0 .. $#fields; print splice( @fields, 0, $inrow), "\n" while @fields >= $inrow; __DATA__ Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL)albatross(BEL)(BEL)(BE +L)albatros(BEL)(BEL)alleged violation(BEL)(BEL)(BEL)infraction présum +ée(BEL)(BEL)allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL)anchovy(BEL)( +BEL)(BEL)anchois(BEL)(BEL)angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL +)(BEL) [download]	[reply] [d/l]
Re: parsing table .doc by kcott (Archbishop) on Jun 01, 2020 at 08:49 UTC
G'day IB2017, Here's code which generates the regex I think you were after. I've provided two sets of output: the one shown in your OP; another which I think is more useful as it gives you access to all of the actual values in the table (blank cells are represented as zero-length strings). #!/usr/bin/env perl use strict; use warnings; my $content = join '', <DATA>; my ($header, undef) = split /\a\a/, $content, 2; my $cols = scalar split /\a/, $header; my $re = qr{((?:(?:\|[^\a]+)\a){$cols}\a)}; { print "* WANTED \n"; while ($content =~ /$re/g) { my $row = $1; $row =~ s/\a/(BEL)/g; print "$row\n"; } } { print "\n PROBABLY MORE USEFUL \n"; my @rows; while ($content =~ /$re/g) { my $row = $1; $row =~ s/\a$//; push @rows, [ split /\a/, $row ]; } print join('\|', @$_), "\n" for @rows; } __DATA__ Agreement^GACAP^GACAP^GAccord^G^Galbatross^G^G^Galbatros^G^Galleged vi +olation^G^G^Ginfraction présumée^G^Gallowable^G^G^Gadmissible^G^Ganch +ovy^G^G^Ganchois^G^Gangler fish, burbot^G^G^Glotte^G^G [download] Note: all of the '`^G`'s are actually BELL (U+0007) characters which I embedded in the `DATA` section. Output: ` WANTED * Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL) albatross(BEL)(BEL)(BEL)albatros(BEL)(BEL) alleged violation(BEL)(BEL)(BEL)infraction présumée(BEL)(BEL) allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL) anchovy(BEL)(BEL)(BEL)anchois(BEL)(BEL) angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL)(BEL) * PROBABLY MORE USEFUL * Agreement\|ACAP\|ACAP\|Accord albatross\|\|\|albatros alleged violation\|\|\|infraction présumée allowable\|\|\|admissible anchovy\|\|\|anchois angler fish, burbot\|\|\|lotte` [download] — Ken	[reply] [d/l] [select]
Re: parsing table .doc by IB2017 (Pilgrim) on May 31, 2020 at 11:12 UTC
While writing this post, I just thought: why Regex? So I came up with another solution, not elegant, but it seems to work just fine. use strict; use warnings; require Text::Extract::Word; Text::Extract::Word->import( qw(get_all_text) ); my $InputFileReadable ='Input.doc'; my $content = get_all_text($InputFileReadable); #detecting structure my ($header, $body)= split /(\a)\1+/, $content; my $NrSeparators = () = $header =~ /\a/g; my @cells = split(/\a/, $content); my $count=0; open(FH, '>', "Output.txt") or die $!; foreach (@cells){ $count++; print FH $_; if ($count eq $NrSeparators+2){ print FH "\n"; }else{ print FH "\t"; } $count = 0 if $count eq $NrSeparators+2; } close(FH); [download]	[reply] [d/l]
Re: parsing table .doc by perlfan (Parson) on May 31, 2020 at 10:15 UTC
There is a command line tool that will dump Word documents for you such that they are nice to read (tables and all), I am sorry that I can find it atm. I did look. But I suggestion you could start with this. I suspect this will be nicer for you to parse. That said, Word docs are XML these days, so you the advice to not parse *ML with regular expressions still applies; that said, you likely can use any number of the scraper tools on CPAN.	[reply]
Re^2: parsing table .doc by marto (Cardinal) on May 31, 2020 at 10:23 UTC
"That said, Word docs are XML these days" Docx (OOXML) files are compressed archives, containing XML among other things. Doc files are a proprietary binary format.	[reply]
Re^2: parsing table .doc by IB2017 (Pilgrim) on May 31, 2020 at 10:25 UTC
Please note that I am talking about .doc and not .docx. Parsing the same table in .docx works like a charm (but I can not convert/upgrade all files to .docx). The above is the best I could come out with to extract tables from .doc. It just misses a clear identification of end-of-row. But since I can easily spot this end-of-row if I know the number of columns, there must be a way to automate this. All my attempts with regex failed though.	[reply]