in reply to Reading tables in MS Word

Newer Word documents (.docx) are zipped XML files, so you can most likely extract all tables by looking for the appropriate XML tags and then pulling those elements and their content out with XML::LibXML.
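A rough sketch of the XML route, assuming a .docx file whose main part is word/document.xml and whose tables use the usual w:tbl / w:tr / w:tc / w:t elements. The helper names tables_from_document_xml and tables_from_docx are made up for illustration, and nested tables or merged cells are not handled specially.

```perl
use strict;
use warnings;
use XML::LibXML;

# Namespace used by WordprocessingML elements such as w:tbl, w:tr, w:tc, w:t
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Takes the XML text of word/document.xml and returns a list of tables.
# Each table is an array ref of rows, each row an array ref of cell strings.
sub tables_from_document_xml {
    my ($xml) = @_;
    my $dom = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(w => $W_NS);

    my @tables;
    for my $tbl ($xpc->findnodes('//w:tbl')) {
        my @rows;
        for my $tr ($xpc->findnodes('./w:tr', $tbl)) {
            my @cells;
            for my $tc ($xpc->findnodes('./w:tc', $tr)) {
                # Concatenate all text runs in the cell; paragraph breaks are lost
                push @cells, join '', map { $_->textContent } $xpc->findnodes('.//w:t', $tc);
            }
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }
    return @tables;
}

# Pulls word/document.xml out of the zip container (needs Archive::Zip)
sub tables_from_docx {
    my ($path) = @_;
    require Archive::Zip;
    my $zip = Archive::Zip->new;
    $zip->read($path) == Archive::Zip::AZ_OK() or die "Cannot read $path\n";
    my $xml = $zip->contents('word/document.xml');
    defined $xml or die "No word/document.xml in $path\n";
    return tables_from_document_xml($xml);
}
```

Calling tables_from_docx('report.docx') (a placeholder file name) would then give one nested array per table, so $tables[0][1][2] is the third cell of the second row of the first table.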

The alternative approach on Windows would be to use OLE automation through Win32::OLE and enumerate the Table objects in a document. Most VBA code translates easily to Perl.
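As an illustration of how closely the Perl follows the VBA, here is a minimal sketch that walks the Tables collection of a document object and reports each table's size. The helper name table_sizes is hypothetical; $doc is assumed to be a Word Document object obtained through Win32::OLE, and the path in the usage comment is a placeholder.

```perl
use strict;
use warnings;

# VBA equivalent:
#   For i = 1 To doc.Tables.Count
#       Set t = doc.Tables(i)
#       Debug.Print t.Rows.Count, t.Columns.Count
#   Next
# Returns one [rows, columns] pair per table in the document.
sub table_sizes {
    my ($doc) = @_;
    my $tables = $doc->Tables;
    my @sizes;
    for my $i (1 .. $tables->Count) {        # Word collections are 1-based
        my $table = $tables->Item($i);
        push @sizes, [ $table->Rows->Count, $table->Columns->Count ];
    }
    return @sizes;
}

# Usage on Windows with Word installed:
#   use Win32::OLE;
#   my $word = Win32::OLE->new('Word.Application', 'Quit')
#       or die Win32::OLE->LastError;
#   my $doc = $word->Documents->Open('C:\path\to\report.docx');
#   printf "%d rows x %d columns\n", @$_ for table_sizes($doc);
#   $doc->Close(0);    # 0 is the value of wdDoNotSaveChanges
```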

Re^2: Reading tables in MS Word
by spiral (Novice) on Jul 10, 2025 at 16:16 UTC
    Thanks, I am looking at Win32::OLE. I am not sure how to do this though. I can do a $word->ActiveDocument->{Tables} to get all tables, and similarly for paragraphs, but what I really need to do is read sequentially, get the title from the paragraph header, then get the table belonging to that paragraph. Any ideas how to do that?

      For general maintainability, I recommend not scattering $word->ActiveDocument through your code. If you'll only be working with the current document, do my $doc = $word->ActiveDocument once, somewhere near the top of your code, and use $doc from then on.

      Going from this Stackoverflow answer, you could use $word->Selection->GoTo as suggested in the Microsoft documentation (Selection hangs off the application object, not the document; untested):

      use Win32::OLE::Const 'Microsoft Word';   # wdGoToTable, wdGoToNext

      my $sel = $word->Selection;
      for my $i (1 .. $doc->Paragraphs->Count) {
          my $para = $doc->Paragraphs($i);   # collections are 1-based, not Perl arrays
          $para->Range->Select;              # move the selection to paragraph $i
          $sel->GoTo({ What => wdGoToTable, Which => wdGoToNext });   # named arguments go in a hash ref
          my $table = $sel->Tables(1);
          ...
      }

      This will still trip up on paragraphs that have no table of their own, because jumping to the next table then lands on a table that belongs to a later paragraph. Maybe in that case, it makes sense to step through all tables instead and then go backwards to find the corresponding paragraph or heading, depending on how your document is structured.
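
      The "step through the tables and go backwards" idea could look roughly like this. It is only a sketch: heading_before_table is a made-up helper, it assumes the titles are real heading paragraphs (outline level other than body text), and the constant 10 is the value of Word's wdOutlineLevelBodyText.

      ```perl
      use strict;
      use warnings;

      use constant WD_OUTLINE_LEVEL_BODY_TEXT => 10;   # value of wdOutlineLevelBodyText

      # Returns the text of the nearest heading paragraph above $table,
      # or undef if there is none. $doc and $table are Win32::OLE objects.
      sub heading_before_table {
          my ($doc, $table) = @_;
          my $start = $table->Range->Start;
          return if $start == 0;                   # nothing precedes the table

          # All paragraphs between the start of the document and the table
          my $paras = $doc->Range(0, $start)->Paragraphs;
          for (my $i = $paras->Count; $i >= 1; $i--) {
              my $para = $paras->Item($i);
              next if $para->OutlineLevel == WD_OUTLINE_LEVEL_BODY_TEXT;
              # Strip the trailing paragraph mark (and cell marker, if any)
              (my $text = $para->Range->Text) =~ s/[\r\x07]+\z//;
              return $text;
          }
          return;
      }
      ```

      If the titles are plain paragraphs rather than styled headings, the outline-level test would have to be replaced by whatever distinguishes them in your document.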

        It turns out all the info I need is in tables. I can do something like:
        my $tables = $word->ActiveDocument->{Tables};
        for my $table (in $tables) {
            my $text = $table->ConvertToText(wdSeparateByTabs)->Text;
            ...
        }
        The question is: instead of getting the content as a blob of text, can I get it per row and column?
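
        One way that might work, as an untested sketch: walk the table with $table->Cell($row, $col) instead of converting it to text. The helper names table_to_rows and clean_cell_text are made up for illustration, and the code assumes a regular grid without merged cells.

        ```perl
        use strict;
        use warnings;

        # Word ends the text of every cell with an end-of-cell marker:
        # a carriage return followed by chr(7).
        sub clean_cell_text {
            my ($text) = @_;
            $text =~ s/[\r\x07]+\z//;
            return $text;
        }

        # Returns an array ref of rows, each an array ref of cell strings.
        # $table is a Word Table object obtained through Win32::OLE.
        sub table_to_rows {
            my ($table) = @_;
            my $n_rows = $table->Rows->Count;
            my $n_cols = $table->Columns->Count;
            my @rows;
            for my $r (1 .. $n_rows) {               # Word indexes from 1
                my @cells;
                for my $c (1 .. $n_cols) {
                    # A failed Cell() call (e.g. a merged-away position) yields undef
                    my $cell = $table->Cell($r, $c);
                    push @cells, defined $cell ? clean_cell_text($cell->Range->Text) : undef;
                }
                push @rows, \@cells;
            }
            return \@rows;
        }
        ```

        Inside the loop over the tables this would be used as my $rows = table_to_rows($table); and then $rows->[0][1] is the second cell of the first row.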