in reply to Extracting text from MS Word files on a Linux box

The following works for me with LibreOffice 5.1:

use IPC::System::Simple qw/capturex/;
my $text = capturex('libreoffice', '--convert-to', 'txt:Text (encoded):UTF8',
    $filename, '--cat', '--headless');
utf8::decode($text);
$text =~ s/\A\x{FEFF}//; # remove BOM
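
If you need this for more than one file, the same call can be wrapped in a sub. This is only a rough sketch: the names clean_text and word_to_text are made up for illustration, and it assumes libreoffice is on your PATH and behaves as it does for me with 5.1.

```perl
use strict;
use warnings;
use IPC::System::Simple qw/capturex/;

# Hypothetical helper: decode LibreOffice's UTF-8 output and
# drop a leading BOM, if there is one.
sub clean_text {
    my ($bytes) = @_;
    utf8::decode($bytes);        # in place: UTF-8 bytes -> Perl characters
    $bytes =~ s/\A\x{FEFF}//;    # remove BOM
    return $bytes;
}

# Hypothetical wrapper around the same libreoffice call, one file at a time.
sub word_to_text {
    my ($filename) = @_;
    die "No such file: $filename\n" unless -f $filename;
    # capturex bypasses the shell and dies if the command exits non-zero
    return clean_text( capturex('libreoffice', '--convert-to',
        'txt:Text (encoded):UTF8', $filename, '--cat', '--headless') );
}

# Convert every file named on the command line (.doc or .docx)
binmode STDOUT, ':encoding(UTF-8)';
for my $file (@ARGV) {
    print "==== $file ====\n", word_to_text($file), "\n";
}
```

Since capturex throws an exception on failure, wrap the call to word_to_text in an eval if one bad file should not stop the whole run.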

Re^2: Extracting text from MS Word files on a Linux box
by Laurent_R (Canon) on Jun 21, 2018 at 11:50 UTC
    Thank you very much, haukex, for your suggestion. I'll try it. I suspect it may well work with recent .docx files (whose format is very similar to the OpenOffice format), but probably not with the old proprietary binary format used by MS Office 2003 and earlier. I'll give it a try anyway.

      I tested with an older format .doc file (not .docx), and AFAIK LibreOffice supports both the older and newer formats.