spiral has asked for the wisdom of the Perl Monks concerning the following question:

I need to automate some tasks, and this involves reading some MS Word files. I recently came across MsOffice::Word::Surgeon. Specifically, I want to read a Word file, recognize the tables in it, and read those tables, where specific columns may hold relevant information, such as a key. Does anyone have, or can anyone give, an example of this? E.g.
use MsOffice::Word::Surgeon;
my $surgeon = MsOffice::Word::Surgeon->new(docx => $file);
my $main_text = $surgeon->document->plain_text; # if I serialize this way, I lose table info

Re: Reading tables in MS Word
by Corion (Patriarch) on Jul 09, 2025 at 06:14 UTC

    Modern Word documents (.docx) are zipped XML files, so you can most likely find all tables by looking for the appropriate XML tags and then extract them and their content using XML::LibXML.
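A minimal sketch of the XML route, in case it helps. It assumes the usual .docx layout (main text in word/document.xml, tables as w:tbl containing w:tr rows and w:tc cells) and uses Archive::Zip to unpack the file. The function names tables_from_xml and tables_from_docx are made up for this example. Nested tables are not treated specially here: they show up both as their own table and as text inside the enclosing cell.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use XML::LibXML;

# Namespace used by WordprocessingML elements such as w:tbl, w:tr, w:tc, w:t.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Hypothetical helper: takes the XML of word/document.xml and returns a list
# of tables, each an array ref of rows, each row an array ref of cell texts.
sub tables_from_xml {
    my ($xml) = @_;
    my $dom = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(w => $W_NS);

    my @tables;
    for my $tbl ($xpc->findnodes('//w:tbl')) {
        my @rows;
        for my $tr ($xpc->findnodes('./w:tr', $tbl)) {
            my @cells;
            for my $tc ($xpc->findnodes('./w:tc', $tr)) {
                # A cell's text is spread over one or more w:t runs.
                push @cells, join '',
                    map { $_->textContent } $xpc->findnodes('.//w:t', $tc);
            }
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }
    return @tables;
}

# Hypothetical helper: unzip the .docx and hand the main part to the parser.
sub tables_from_docx {
    my ($file) = @_;
    my $zip = Archive::Zip->new;
    $zip->read($file) == AZ_OK or die "Cannot read $file as a zip archive\n";
    my $xml = $zip->contents('word/document.xml');
    defined $xml or die "No word/document.xml in $file\n";
    return tables_from_xml($xml);
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    for my $table (tables_from_docx($ARGV[0])) {
        print join("\t", @$_), "\n" for @$table;
        print "\n";
    }
}
```

From there, picking out a key column is just a matter of indexing into each row, e.g. $row->[0].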

    The alternative approach on Windows would be to use OLE automation through Win32::OLE and enumerate the Table objects in a document. Most VBA code translates easily to Perl.
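A rough sketch of the OLE route, for illustration only. It needs Windows with Word installed; the names clean_cell_text and read_word_tables are made up for this example. It assumes that Cell() fails (and so returns undef) for positions swallowed by merged cells, which are then reported as empty strings, and it may need more care for tables with vertically merged cells.

```perl
use strict;
use warnings;
use File::Spec;

# Word terminates the text of every cell with a CR plus an end-of-cell
# mark (\x07); strip those so only the visible text remains.
sub clean_cell_text {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/[\r\x07]+\z//;
    return $text;
}

# Hypothetical helper: open a document in Word and return its tables as
# a list of array refs of rows, each row an array ref of cell texts.
sub read_word_tables {
    my ($path) = @_;
    require Win32::OLE;    # loaded lazily, only available on Windows

    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError() . "\n";
    my $doc = $word->Documents->Open($path)
        or die "Cannot open $path: " . Win32::OLE->LastError() . "\n";

    my @tables;
    for my $t (1 .. $doc->Tables->Count) {    # Word collections are 1-based
        my $table = $doc->Tables->Item($t);
        my @rows;
        for my $r (1 .. $table->Rows->Count) {
            my @cells;
            for my $c (1 .. $table->Columns->Count) {
                # Assumption: a missing (merged) cell yields undef here.
                my $cell = $table->Cell($r, $c);
                push @cells, $cell ? clean_cell_text($cell->Range->Text) : '';
            }
            push @rows, \@cells;
        }
        push @tables, \@rows;
    }

    $doc->Close(0);    # 0 = do not save changes
    return @tables;
}

# Only runs when a file name is given; Word wants an absolute path.
if (@ARGV) {
    for my $table (read_word_tables(File::Spec->rel2abs($ARGV[0]))) {
        print join("\t", @$_), "\n" for @$table;
        print "\n";
    }
}
```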

      Thanks, I am looking at Win32::OLE. I am not sure how to do this though. I can do a $word->ActiveDocument->{Tables} to get all tables, and similarly for paragraphs, but what I really need to do is read sequentially, get the title from the paragraph header, then get the table belonging to that paragraph. Any ideas how to do that?

        For general maintainability, I recommend not scattering $word->ActiveDocument through your code. If you'll only be working with the current document, do my $doc = $word->ActiveDocument once, somewhere at the top of your code, and use $doc from then on.

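        Going from this Stack Overflow answer, you could use the Selection object's GoTo method as suggested in the Microsoft documentation:

        use Win32::OLE::Const 'Microsoft Word';    # loads Word's wd... constants

        # Selection belongs to the application or a window, not the document
        my $sel = $doc->ActiveWindow->Selection;
        for my $i (1 .. $doc->Paragraphs->Count) {
            # Win32::OLE takes named arguments as a hash reference
            $sel->GoTo({ What => wdGoToParagraph, Which => wdGoToAbsolute, Count => $i });
            my $para = $sel->Paragraphs->Item(1);    # Word collections are 1-based
            $sel->GoTo({ What => wdGoToTable, Which => wdGoToNext });
            my $table = $sel->Tables->Item(1);
            ...
        }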

        This will still trip over paragraphs that have no table, because the second GoTo then jumps ahead to a table that belongs to a later paragraph. In that case it may make more sense to step through all tables instead and then go backwards from each one to find the corresponding paragraph or heading, depending on how your document is structured.
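The "step through the tables and walk backwards" idea could look roughly like this. It is only a sketch: heading_before and tables_with_headings are made-up names, and it assumes the titles are real Word headings, i.e. paragraphs whose OutlineLevel is 1 to 9 (body text has outline level 10). If the titles are ordinary paragraphs with some custom formatting, the test inside heading_before would have to change.

```perl
use strict;
use warnings;

# Outline level that Word assigns to ordinary body text.
use constant BODY_TEXT_LEVEL => 10;

# Hypothetical helper: starting at the given paragraph object, walk
# backwards via ->Previous until a heading is found. Returns the heading
# paragraph, or nothing if the start of the document is reached first.
sub heading_before {
    my ($para) = @_;
    while ($para) {
        return $para if $para->OutlineLevel < BODY_TEXT_LEVEL;
        $para = $para->Previous;
    }
    return;
}

# Hypothetical helper: takes a Word Document object (e.g. obtained via
# Win32::OLE) and pairs every table with the text of the nearest
# heading above it.
sub tables_with_headings {
    my ($doc) = @_;
    my @found;
    for my $t (1 .. $doc->Tables->Count) {    # Word collections are 1-based
        my $table = $doc->Tables->Item($t);
        # First paragraph inside the table, then step out backwards.
        my $first = $table->Range->Paragraphs->Item(1);
        my $head  = heading_before($first->Previous);
        my $title = $head ? $head->Range->Text : '';
        $title =~ s/\r\z//;                    # paragraph text ends in a CR
        push @found, { title => $title, table => $table };
    }
    return @found;
}
```

Called as my @pairs = tables_with_headings($doc), this gives one hash per table, so tables without any heading above them simply come back with an empty title instead of derailing the loop.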

Re: Reading tables in MS Word
by Fletch (Bishop) on Jul 09, 2025 at 01:31 UTC

    Another option might be to use pandoc to convert from Word to something more amenable to text processing (Markdown, HTML) and see if the table content you're interested in is more accessible that way (with e.g. HTML::TreeBuilder or whatnot).
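Something along these lines might do for the pandoc route. It is a sketch under a few assumptions: pandoc is installed and on the PATH, the conversion is done with -f docx -t html, and the function name tables_from_html is made up for this example. Nested tables are not handled separately.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse an HTML string and return its tables as a
# list of array refs of rows, each row an array ref of trimmed cell texts.
sub tables_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @tables;
    for my $table ($tree->look_down(_tag => 'table')) {
        my @rows;
        for my $tr ($table->look_down(_tag => 'tr')) {
            # Header cells (th) and data cells (td) are treated alike.
            push @rows,
                [ map { $_->as_trimmed_text } $tr->look_down(_tag => qr/^t[dh]$/) ];
        }
        push @tables, \@rows;
    }
    $tree->delete;    # free the parse tree
    return @tables;
}

# Only runs when a file name is given: let pandoc convert the document
# to HTML and read the result from a pipe.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '-|', 'pandoc', '-f', 'docx', '-t', 'html', $file
        or die "Cannot run pandoc: $!\n";
    binmode $fh, ':encoding(UTF-8)';
    my $html = do { local $/; <$fh> };
    close $fh or die "pandoc exited with an error\n";

    binmode STDOUT, ':encoding(UTF-8)';
    for my $table (tables_from_html($html)) {
        print join("\t", @$_), "\n" for @$table;
        print "\n";
    }
}
```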

    The cake is a lie.