IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

I need to parse word tables in .doc format. Need some help with regex.

I get a dump of the table and the number of columns (first row is header) with:

use strict; use warnings; require Text::Extract::Word; Text::Extract::Word->import( qw(get_all_text) ); my $InputFileReadable ='Fish.doc'; my $content = get_all_text($InputFileReadable); #detecting structure my ($header, $body)= split /(\a)\1+/, $content; my $NrColumns = () = $header =~ /\a/g; $NrColumns++; print $NrColumns;

Now I need to parse $content. I now that columns are separated by \a (BEL) and end of rows by \a\a. Simply splitting $content everytime I match \a\a is not enough, since if a column is empty I will have \a\a also as non end-of-row delimiter. So, I need to keep track that \a\a is matching only at $NrColumns positions.

Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL)albatross(BEL)(BEL)(BE +L)albatros(BEL)(BEL)alleged violation(BEL)(BEL)(BEL)infraction présum +ée(BEL)(BEL)allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL)anchovy(BEL)( +BEL)(BEL)anchois(BEL)(BEL)angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL +)(BEL)

With $NrColumns = 4 I should get:

Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL) albatross(BEL)(BEL)(BEL)albatros(BEL)(BEL) alleged violation(BEL)(BEL)(BEL)infraction présumée(BEL)(BEL) allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL) anchovy(BEL)(BEL)(BEL)anchois(BEL)(BEL) angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL)(BEL)

Replies are listed 'Best First'.
Re: parsing table .doc
by tybalt89 (Monsignor) on May 31, 2020 at 14:42 UTC

    Just for the TMTOWTDI of it...

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11117527 use warnings; use List::Util qw( first ); $_ = do { local $/; <DATA> }; my @fields = split /(\(BEL\))/, $_; my $inrow = 2 + first { $fields[$_] eq '' } 0 .. $#fields; print splice( @fields, 0, $inrow), "\n" while @fields >= $inrow; __DATA__ Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL)albatross(BEL)(BEL)(BE +L)albatros(BEL)(BEL)alleged violation(BEL)(BEL)(BEL)infraction présum +ée(BEL)(BEL)allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL)anchovy(BEL)( +BEL)(BEL)anchois(BEL)(BEL)angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL +)(BEL)
Re: parsing table .doc
by kcott (Archbishop) on Jun 01, 2020 at 08:49 UTC

    G'day IB2017,

    Here's code which generates the regex I think you were after. I've provided two sets of output: the one shown in your OP; another which I think is more useful as it gives you access to all of the actual values in the table (blank cells are represented as zero-length strings).

    #!/usr/bin/env perl use strict; use warnings; my $content = join '', <DATA>; my ($header, undef) = split /\a\a/, $content, 2; my $cols = scalar split /\a/, $header; my $re = qr{((?:(?:|[^\a]+)\a){$cols}\a)}; { print "*** WANTED ***\n"; while ($content =~ /$re/g) { my $row = $1; $row =~ s/\a/(BEL)/g; print "$row\n"; } } { print "\n*** PROBABLY MORE USEFUL ***\n"; my @rows; while ($content =~ /$re/g) { my $row = $1; $row =~ s/\a$//; push @rows, [ split /\a/, $row ]; } print join('|', @$_), "\n" for @rows; } __DATA__ Agreement^GACAP^GACAP^GAccord^G^Galbatross^G^G^Galbatros^G^Galleged vi +olation^G^G^Ginfraction présumée^G^Gallowable^G^G^Gadmissible^G^Ganch +ovy^G^G^Ganchois^G^Gangler fish, burbot^G^G^Glotte^G^G

    Note: all of the '^G's are actually BELL (U+0007) characters which I embedded in the DATA section.

    Output:

    *** WANTED *** Agreement(BEL)ACAP(BEL)ACAP(BEL)Accord(BEL)(BEL) albatross(BEL)(BEL)(BEL)albatros(BEL)(BEL) alleged violation(BEL)(BEL)(BEL)infraction présumée(BEL)(BEL) allowable(BEL)(BEL)(BEL)admissible(BEL)(BEL) anchovy(BEL)(BEL)(BEL)anchois(BEL)(BEL) angler fish, burbot(BEL)(BEL)(BEL)lotte(BEL)(BEL) *** PROBABLY MORE USEFUL *** Agreement|ACAP|ACAP|Accord albatross|||albatros alleged violation|||infraction présumée allowable|||admissible anchovy|||anchois angler fish, burbot|||lotte

    — Ken

Re: parsing table .doc
by IB2017 (Pilgrim) on May 31, 2020 at 11:12 UTC

    While writing this post, I just thought: why Regex? So I came up with another solution, not elegant, but it seems to work just fine.

    use strict; use warnings; require Text::Extract::Word; Text::Extract::Word->import( qw(get_all_text) ); my $InputFileReadable ='Input.doc'; my $content = get_all_text($InputFileReadable); #detecting structure my ($header, $body)= split /(\a)\1+/, $content; my $NrSeparators = () = $header =~ /\a/g; my @cells = split(/\a/, $content); my $count=0; open(FH, '>', "Output.txt") or die $!; foreach (@cells){ $count++; print FH $_; if ($count eq $NrSeparators+2){ print FH "\n"; }else{ print FH "\t"; } $count = 0 if $count eq $NrSeparators+2; } close(FH);
Re: parsing table .doc
by perlfan (Parson) on May 31, 2020 at 10:15 UTC
    There is a command line tool that will dump Word documents for you such that they are nice to read (tables and all), I am sorry that I can find it atm. I did look. But I suggestion you could start with this. I suspect this will be nicer for you to parse. That said, Word docs are XML these days, so you the advice to not parse *ML with regular expressions still applies; that said, you likely can use any number of the scraper tools on CPAN.

      "That said, Word docs are XML these days"

      Docx (OOXML) files are compressed archives, containing XML among other things. Doc files are a proprietary binary format.

      Please note that I am talking about .doc and not .docx. Parsing the same table in .docx works like a charm (but I can not convert/upgrade all files to .docx). The above is the best I could come out with to extract tables from .doc. It just misses a clear identification of end-of-row. But since I can easily spot this end-of-row if I know the number of columns, there must be a way to automate this. All my attempts with regex failed though.