There is a command-line tool that will dump Word documents for you in a form that is nice to read (tables and all). I am sorry that I can't find it at the moment; I did look. I suggest you start with that, as I suspect its output will be nicer for you to parse. That said, Word docs are XML these days, so the advice not to parse *ML with regular expressions still applies; you can likely use any number of the scraper tools on CPAN instead.
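
To show what "use a real parser" looks like in practice, here is a minimal sketch, assuming the file is a .docx (a zip archive whose main body lives in `word/document.xml`). It is written in Python using only the standard library; the same approach should carry over to a zip module plus an XML parser from CPAN. The helper name `docx_paragraphs` is made up for illustration, and the sketch only pulls out paragraph text, so table layout is flattened.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx file.

    `source` is a path or a binary file-like object.
    This is a hypothetical helper, not part of any library.
    """
    # A .docx is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # w:p elements are paragraphs (including those inside table cells).
    for para in root.iter(f"{{{W_NS}}}p"):
        # Paragraph text is split across runs; each run holds w:t nodes.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

Calling `docx_paragraphs("report.docx")` (the file name is just a placeholder) should give you one string per paragraph, with no regular expressions involved in the XML handling.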