in reply to parsing table .doc

There is a command-line tool that will dump Word documents for you so that they are nice to read (tables and all). I am sorry that I can't find it at the moment; I did look. But I suggest you start with that tool, as I suspect its output will be nicer for you to parse. That said, Word docs are XML these days, so the advice not to parse *ML with regular expressions still applies; you can likely use any of the scraper tools on CPAN instead.

Re^2: parsing table .doc
by marto (Cardinal) on May 31, 2020 at 10:23 UTC

    "That said, Word docs are XML these days"

    Docx (OOXML) files are compressed archives, containing XML among other things. Doc files are a proprietary binary format.
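
The distinction above can be checked from the first few bytes of a file. The following is only a rough sketch in Python, not something posted in this thread: the function names `sniff_container` and `sniff_file` are made up for illustration. It relies on two well-known signatures: ZIP archives start with `PK\x03\x04`, and the OLE2 compound file container used by legacy Office formats starts with `D0 CF 11 E0 A1 B1 1A E1`. Neither signature proves the file is a Word document, since other formats share the same containers.

```python
# Signatures of the two container formats (assumed sufficient for a first guess).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary Office container
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header


def sniff_container(header: bytes) -> str:
    """Guess the container type from the leading bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"      # possibly .doc (also .xls, .ppt, ...)
    if header.startswith(ZIP_MAGIC):
        return "zip"       # possibly .docx (also any other ZIP archive)
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file and classify its container."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))
```

A file reported as "ole2" needs a binary .doc reader, while one reported as "zip" can be unpacked and its XML parts handed to an XML parser.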

Re^2: parsing table .doc
by IB2017 (Pilgrim) on May 31, 2020 at 10:25 UTC

    Please note that I am talking about .doc and not .docx. Parsing the same table in .docx works like a charm (but I cannot convert/upgrade all files to .docx). The above is the best I could come up with to extract tables from .doc. It just lacks a clear marker for end-of-row. But since I can easily spot the end of a row if I know the number of columns, there must be a way to automate this. All my attempts with regex have failed, though.
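
If the number of columns really is known, the end-of-row problem may not need a regex at all. Here is a minimal sketch in Python of the idea, under the assumption that the cells have already been extracted as a flat list in reading order and that the table has no merged cells. The function name `rows_from_cells` and the sample data are hypothetical, not taken from the original post.

```python
from typing import List


def rows_from_cells(cells: List[str], n_columns: int) -> List[List[str]]:
    """Group a flat list of cell texts into rows of n_columns cells each.

    Assumes every row has exactly n_columns cells (no merged cells).
    A trailing incomplete row is kept as-is so that no data is dropped.
    """
    if n_columns <= 0:
        raise ValueError("n_columns must be a positive integer")
    return [cells[i:i + n_columns] for i in range(0, len(cells), n_columns)]


if __name__ == "__main__":
    # Hypothetical cell texts, in the order they appear in the extracted stream.
    sample = ["Name", "Qty", "Price", "Apple", "3", "0.50", "Pear", "7", "0.30"]
    for row in rows_from_cells(sample, 3):
        print(" | ".join(row))
```

Every `n_columns`-th cell closes a row, so the row boundary comes from counting rather than from a marker in the text. If some tables contain merged cells, this assumption breaks and the grouping would need a different approach.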