in reply to A copyeditor needs help to get started with a Perl project

FWIW, if you prefer Perl to VB, you can save your data as HTML in Word, edit that, and reread it. Word reads its own HTML very well; as far as I know, with no loss of information at all.

Re^2: A copyeditor needs help to get started with a Perl project
by tachyon (Chancellor) on Nov 04, 2004 at 10:01 UTC

    Word reads its own HTML very well

    That is a very good thought. As horrid as Word HTML is to the naked eye, HTML::Parser should let you whip through it with ease, editing the text but leaving the puke, er, vomit, er, markup and formatting untouched. Then, as you say, let Word convert its own excreta back into native format. This conversion is essentially just padding with huge numbers of null bytes for every real character, thus 'Hello World!' as a text file is 13 bytes but in .DOC format it needs a mere 19,456 :-)
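For what it's worth, the "edit the text, leave the markup" idea could look roughly like the sketch below. It is untested against real Word output and assumes HTML::Parser from CPAN is installed; `edit_text_only`, the sample markup and the "teh" fix are made up for illustration. Tags, comments and declarations are copied through verbatim, and only text outside `<style>`/`<script>` is handed to the edit callback.

```perl
use strict;
use warnings;
use HTML::Parser;    # from CPAN, not a core module

# Apply $fix to ordinary text only; copy all markup through unchanged.
sub edit_text_only {
    my ($html, $fix) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Tags, comments, declarations: passed through as they appeared.
        default_h => [ sub { $out .= $_[0] // '' }, 'text' ],
        # Text events: edit them, unless they are CSS or script content.
        text_h => [ sub {
            my ($text, $is_cdata) = @_;
            $out .= $is_cdata ? $text : $fix->($text);
        }, 'text, is_cdata' ],
    );
    $parser->unbroken_text(1);    # do not split a run of text into chunks
    $parser->parse($html);
    $parser->eof;
    return $out;
}

# Hypothetical stand-in for Word's HTML; the real thing is far noisier.
my $word_html = q{<p class=MsoNormal><span style='color:red'>teh cat</span></p>};

my $fixed = edit_text_only($word_html, sub {
    my $t = shift;
    $t =~ s/\bteh\b/the/g;
    return $t;
});
print "$fixed\n";
```

The callback sees the raw source text (entities not decoded), which keeps the round trip back into Word as close to byte-for-byte as possible.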

      *shrug*

      A VB runtime sounds like it's going to be bigger than either :)

      (BTW, some versions of Word had a bug where the first time you saved a file after an edit, its size would be about double what it would be if you immediately saved it again. Or did this happen only when the file was saved as RTF?)

        It's just me showing my age. Some of my favorite games of all time fitted on ROMs smaller than 20K. Hell my first machine had 16K and I was over the moon to go 64K. I was cramming whole disk indexes in memory!

        I have this conspiracy theory. M$ takes kickbacks from Intel to write more and more bloated software. This in turn fuels the need for ever faster processors. Look at the reality. In terms of memory you needed DOS <640K ;-), Win 3.0 4M, 95 16M, 98 64M, 2000 256M, XP 512M with processors to match.

        I just find it wrong that you need 20K to store 12 bytes of text plus Times New Roman as the default font. It seems analogous to using the Titanic to make crushed ice for your drink.

      tachyon's phrasing is admirable -- wish I'd said it first! (and read 405192 et seq, as well -- inserted 1700GMT) -- but it sparks a small question.

      Based on wordsmith's original description, won't cleaning up the "puke" "vomit" bovine manure (ok, "markup" or "formatting") be almost as important to the desired "consistency" as emending the actual text? The boldfacing, font changes, etc. in the original .doc may be formatting conventions for which consistency is also desired.

      Also, I'm curious (enough so to read/experiment soon, unless some oracle here knows with certainty) whether HTML::Parser can actually make sense of the many conditionals Word inserts while saving as .html garbage.

Re^2: A copyeditor needs help to get started with a Perl project
by wordsmith (Acolyte) on Nov 05, 2004 at 08:44 UTC

    Working with the HTML version of the Word file is an interesting idea. By the way, I just need to read the file; writing will be to a new report file. The original Word file will remain untouched. I will enter the corrections manually. The "automation" in my original post only refers to the process of identifying inconsistencies in the document.

    But can Perl recognize document elements such as headers, fonts, superscripts, tables, etc., in the HTML file? Let me give you a sample real-life scenario. Let us say a chapter has 100 numbered reference items at the end of the chapter, which are cited in text by superscripted integers. The problem: generate a report that will identify the reference items that have not been cited in the text.

    To accomplish this, Perl would have to recognize superscripted elements in the HTML file. Pardon my ignorance, but can it?

      Yes, it can.

      I don't have sample data at hand right now, so I can't give you the exact example, but if you inserted the footnote (or endnote) with the standard Insert Footnote menu item, the integer is well tagged with something like '<span class="footnoteReference">....</span>' tags. If you're on a Windows machine, just save a demo document as HTML and look at the result with a text editor; it should be clear enough. Don't be intimidated by the several KB of CSS at the beginning :)
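For the uncited-references report, here is a rough sketch, not tested against real Word output. It assumes the citations are exported as plain `<sup>` elements (Word may wrap them in a span with its own class instead, so check your own export first), that the references are simply numbered 1..N, and that HTML::Parser from CPAN is installed. The sample markup, `cited_numbers` and `$n_refs` are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;    # from CPAN, not a core module

# Collect every integer that appears inside a <sup> element.
# ASSUMPTION: citations are exported as <sup>; change the tag test
# if your Word export marks them differently.
sub cited_numbers {
    my ($html) = @_;
    my %cited;
    my $in_sup = 0;    # nesting depth of <sup> elements

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { $in_sup++ if $_[0] eq 'sup' }, 'tagname' ],
        end_h   => [ sub { $in_sup-- if $_[0] eq 'sup' && $in_sup > 0 }, 'tagname' ],
        text_h  => [ sub {
            return unless $in_sup;
            # A single superscript may hold several numbers, e.g. "4, 5".
            $cited{$_}++ for $_[0] =~ /(\d+)/g;
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);    # keep each run of text in one piece
    $parser->parse($html);
    $parser->eof;
    return \%cited;
}

# Hypothetical chapter text; the real export is far noisier.
my $sample = <<'HTML';
<p>One claim<sup>1</sup> and another<sup>3</sup>.</p>
<p>A double citation<sup>4, 5</sup> and a plain number 2 in running text.</p>
HTML

my $n_refs  = 5;    # number of items in the reference list (assumed known)
my $cited   = cited_numbers($sample);
my @uncited = grep { !$cited->{$_} } 1 .. $n_refs;
print "Uncited references: @uncited\n";
```

One caveat: any other superscripted number (an exponent, "m<sup>2</sup>", an ordinal) would be counted as a citation too, so a real report would probably need to look at the surrounding text or the element's class as well.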