Re: A copyeditor needs help to get started with a Perl project

While Perl is a very good tool for these types of jobs, I wonder if in this case VB macros aren't more suitable. I should think that by interacting with Word's object model you could get the Word engine itself to do many of these tasks. Having said that, if you do go with Perl, I would probably convert the Word files into a more suitable text-only form (Word documents are actually binary objects with lots of stuff in them) and then work on that.

Possibly a hybrid approach would be to make a macro in VB that automatically extracts each chapter to a text file, and then have a Perl script that operates on the result. There is also the option of driving Word's COM interface from Perl, but until I was familiar with doing it in VB I wouldn't bother. The whole VBA editor/design process can be quite illuminating as to how the object model works, and it comes with reasonably good documentation and support such as a visual debugger and automatic method/property completion. Recording macros and then studying the generated code is a good way to learn. Once you have working VB code it's not too difficult to translate it to Perl via the Win32 modules.
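To give a rough idea of what the Perl side of that could look like, here is a minimal sketch that asks Word to save a document as plain text through Win32::OLE. It assumes a Windows machine with Word and the Win32::OLE module installed. The helper names txt_name_for and save_as_text are made up for illustration, and the value 2 is Word's wdFormatText constant written out by hand. Treat it as a starting point, not a finished tool.

```perl
use strict;
use warnings;
use File::Spec;

# Word's numeric value for wdFormatText (save as plain text).
use constant WD_FORMAT_TEXT => 2;

# Hypothetical helper: swap a .doc/.docx extension for .txt,
# or append .txt if there is no such extension.
sub txt_name_for {
    my ($path) = @_;
    my $out = $path;
    $out =~ s/\.docx?\z/.txt/i or $out .= '.txt';
    return $out;
}

# Hypothetical helper: open one document in Word and save a text copy.
sub save_as_text {
    my ($doc_path) = @_;
    require Win32::OLE;    # loaded lazily, only available on Windows

    # 'Quit' tells Win32::OLE to call Word's Quit method on cleanup.
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: ", Win32::OLE->LastError();

    # Word wants an absolute path.
    my $full = File::Spec->rel2abs($doc_path);
    my $doc  = $word->Documents->Open($full)
        or die "Cannot open $full: ", Win32::OLE->LastError();

    my $out = txt_name_for($full);
    $doc->SaveAs($out, WD_FORMAT_TEXT);
    $doc->Close(0);        # 0 = wdDoNotSaveChanges
    return $out;
}

# Convert every file named on the command line.
save_as_text($_) for @ARGV;
```

Run it as something like `perl doc2txt.pl chapter01.doc chapter02.doc` (the script name is just an example). The method calls mirror what the VBA macro recorder would produce, which is why learning the object model in VB first pays off.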

Altogether I imagine the difficulty will depend on how automated you want the process to be. If you simply save the documents as text and then have a script that does the various jobs you describe, you avoid a lot of the complexity of interfacing with the Word document/engine. As your Perl skills improve you could look at automating the process.
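As a very rough sketch of the save-as-text route, the script below scans plain text for hyphenated terms, acronyms, and words that turn up with more than one capitalization. The function name scan_text and the regexes are my own assumptions about what counts as a "term", so expect to tune them. It avoids anything newer than Perl 5.8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scan plain text and return a hash ref with three keys:
#   hyphenated => { term => count }
#   acronyms   => { term => count }
#   conflicts  => { lowercase word => [ spellings seen ] }
sub scan_text {
    my ($text) = @_;
    my (%hyphenated, %acronyms, %variants);

    # Hyphenated terms: runs of letters joined by internal hyphens.
    $hyphenated{$1}++ while $text =~ /\b([A-Za-z]+(?:-[A-Za-z]+)+)\b/g;

    # Acronyms: two or more capital letters in a row (DNA, HPLC).
    $acronyms{$1}++ while $text =~ /\b([A-Z]{2,})\b/g;

    # Record every spelling of each word under its lowercase form.
    $variants{lc $1}{$1}++ while $text =~ /\b([A-Za-z][A-Za-z-]*)\b/g;

    # A conflict is any word seen with more than one capitalization.
    my %conflicts;
    for my $word (keys %variants) {
        my @seen = sort keys %{ $variants{$word} };
        $conflicts{$word} = \@seen if @seen > 1;
    }

    return {
        hyphenated => \%hyphenated,
        acronyms   => \%acronyms,
        conflicts  => \%conflicts,
    };
}

# When files are given on the command line, pool them and print a report.
if (@ARGV) {
    my $text = '';
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Cannot read $file: $!";
        local $/;    # slurp the whole file at once
        my $chunk = <$fh>;
        $text .= (defined $chunk ? $chunk : '') . "\n";
        close $fh;
    }

    my $r = scan_text($text);
    print "Hyphenated terms:\n";
    print "  $_ ($r->{hyphenated}{$_})\n" for sort keys %{ $r->{hyphenated} };
    print "Acronyms:\n";
    print "  $_ ($r->{acronyms}{$_})\n" for sort keys %{ $r->{acronyms} };
    print "Case conflicts:\n";
    print "  ", join(' / ', @{ $r->{conflicts}{$_} }), "\n"
        for sort keys %{ $r->{conflicts} };
}
```

Called as `perl termreport.pl ch01.txt ch02.txt` (again, an example name), it pools all the chapters so that conflicts across chapters show up. Note that the case check is naive: any word that starts a sentence will be reported as conflicting with its lowercase form, so some filtering would be needed before the report is really useful.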

BTW, as an editor you may like to know that your node was put up for editorial/janitorial consideration to add some markup and make it more readable. Normally you should use P tags like: <p>blah blah</p> to break up your nodes. Reading a long paragraph like yours is difficult on a screen. There are markup tips and links underneath the box where you post; please review them. ;-)

I'm an ex-IT person who has recently switched to copyediting STM books for a living. I would like to use Perl to automate the more mechanical aspects of my work. The manuscript files come in MS Word format, and I want to generate a report that will extract all hyphenated terms, capitalized phrases, acronyms (expanded and unexpanded), etc. Conflicting terms (e.g., the same term could be uppercase in one chapter and lowercase in another) would also be identified in the report. Also, the utility should be able to flag all the terms in the manuscript file that appear in another text file containing keywords input by me.

One of the tasks of the copyeditor is to make sure the book is consistent in the way terms appear in different chapters. I'm hoping to write a utility program in Perl that will generate such a report and help me make the manuscript consistent. A few questions:

I'm assuming Perl is the right tool for such a utility. Stupid question, still would appreciate confirmation from an expert.

Do any readymade libraries exist for such tasks? I'd prefer to write the code myself because I could tweak it to suit my private perversions; still, it would be nice to know.
Can I search MS Word files directly using Perl or do I need to save the Word file as a text file? If I could work directly with the Word file, the added functionality of searching headers, tables, etc., would be a great plus.

In passing, is the MS Word format a state secret? My Perl level, you ask? Tyro, sir, tyro. Just downloaded ActiveState 5.8, have bookmarked an online book, have a Perl primer in my drawer, and am looking forward to having fun. I'm just rarin' to go. I have done a fair amount of programming in the past and am not afraid of writing code.

Thanks in advance! Wordsmith

Cheers,

---
demerphq

Re^2: A copyeditor needs help to get started with a Perl project by wordsmith (Acolyte) on Nov 04, 2004 at 13:17 UTC
Thanks, everyone, for all your suggestions. I think I'll save as text and unleash Perl on the text files for starters. Later, I'll try and wade into the more complex part. Yes, your comments on the poor formatting of my post were on target. I'll use the HTML tags from now on.