Re: parsing word on non-windows platform
by b10m (Vicar) on Jan 25, 2004 at 22:00 UTC
|
As discussed before, a "good" way to start is by running your MS Word files through antiword. That will create a plaintext version, which you might parse.
HTH
| [reply] |
|
Antiword is available for many platforms (GNU-licensed, source and binary): "...Antiword is able to convert Word documents to plain text, to PostScript and to XML/DocBook ..."
Thanks. This is just the tool I need. It is a shorter path than doc->html->ps->pdf.
While the tool has no Perl bindings (to my knowledge), I can just cron a Perl (or Python) script to convert a directory of doc files to PDF via PostScript.
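A minimal sketch of such a script in Python, assuming antiword and Ghostscript's ps2pdf are both on the PATH. The function names and the a4 paper size are illustrative assumptions; `-p` is antiword's PostScript output option, and the PostScript goes to stdout.

```python
import subprocess
from pathlib import Path


def conversion_commands(doc, paper="a4"):
    """Return the (antiword, ps2pdf) argument lists for one .doc file.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    doc = Path(doc)
    ps = doc.with_suffix(".ps")
    pdf = doc.with_suffix(".pdf")
    return (
        ["antiword", "-p", paper, str(doc)],  # doc -> PostScript on stdout
        ["ps2pdf", str(ps), str(pdf)],        # PostScript -> PDF
    )


def convert_directory(directory, paper="a4"):
    """Convert every *.doc file in `directory` to PDF; return the PDF paths."""
    written = []
    for doc in sorted(Path(directory).glob("*.doc")):
        to_ps, to_pdf = conversion_commands(doc, paper)
        # antiword writes to stdout, so redirect it into the .ps file.
        with open(doc.with_suffix(".ps"), "wb") as ps_file:
            subprocess.run(to_ps, stdout=ps_file, check=True)
        subprocess.run(to_pdf, check=True)
        written.append(doc.with_suffix(".pdf"))
    return written
```

Calling `convert_directory("/path/to/docs")` from a small wrapper run by cron would cover the batch case; the path here is a placeholder.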
update:
matts (AxKit) just left a message on my use.perl journal recommending HTMLDOC for converting HTML to PDF. It also has an AxKit plugin. I checked out the man page and FAQ: GNU licensing, Perl bindings and end-user support are available.
| [reply] |
Re: parsing word on non-windows platform
by neuroball (Pilgrim) on Jan 25, 2004 at 21:58 UTC
|
Hi MCS, we already have a node where we discussed converting Word files to PDF. You might want to take a look at it and just replace 'PDF' with 'Textfile'. I.e. the OpenOffice solutions should be what you are looking for.
/oliver/
| [reply] |
Re: parsing word on non-windows platform
by dominix (Deacon) on Jan 25, 2004 at 22:15 UTC
|
There is a library, wvWare (wv), that does this on *nix; you may want to write a Perl module that wraps it. :-)
| [reply] |
Re: parsing word on non-windows platform
by punchcard_don (Beadle) on Jan 26, 2004 at 14:45 UTC
|
Sometimes low-tech is just as effective and a lot simpler.
Just open the Word files in Word, hit "Save as...", and opt to save as .txt.
Reload the files on the server and parse them.
| [reply] |
|
| [reply] |
|
Had (apparently mistakenly) imagined that "recipes my uncle typed up" meant a reasonable number of documents. How many recipes can one uncle type? At ~20 seconds per document (open, save, close), that's 180 recipes an hour, far less time than it takes to develop, test and implement a server-side solution. Probably less than the time to post a question and read silly replies! :-)
| [reply] |
Re: parsing word on non-windows platform
by Willard B. Trophy (Hermit) on Jan 26, 2004 at 14:35 UTC
|
For very quick parsing, this usually works for me:
strings file.doc | fmt -s
Yes, you do get some extra guff, but if you are looking for particular words, this will pick them out.
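For anyone without strings(1) handy, roughly the same filter can be sketched in Python. This is an approximation (runs of four or more printable ASCII bytes), not a reimplementation of strings; the function name and the sample bytes are made up for illustration. Like plain strings, it will miss any text that Word stored as 16-bit characters.

```python
import re


def ascii_strings(data, min_len=4):
    """Yield runs of at least `min_len` printable ASCII bytes (tab included)."""
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


# Made-up sample: readable text separated by binary junk, as in a .doc file.
sample = b"\x00\x01Flour and sugar\x00ab\x00\xffBake 20 min\x00"
for text in ascii_strings(sample):
    print(text)
```

For a real file, pass in `open("file.doc", "rb").read()` and search the resulting strings for the words you are after.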
-- bowling trophy thieves, die!
| [reply] |