Word reads its own HTML very well

That is a very good thought. As horrid as Word HTML is to the naked eye, an HTML parser should let you whip through it with ease, editing the text but leaving the puke, vomit, markup, formatting untouched. Then, as you say, let Word convert its own excreta back into native format. That conversion is essentially just padding every real character with huge numbers of null bytes: 'Hello World!' is 13 bytes as a text file, but in .DOC format it needs a mere 19,456 :-)
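For what it's worth, here is a rough sketch of the "edit the text, leave the markup" idea. It is written in Python with the standard-library `html.parser` rather than in Perl, purely as an illustration. The names `TextEditor` and `edit_text` and the sample string are made up for this sketch, and it is not a drop-in tool for real Word output.

```python
from html.parser import HTMLParser


class TextEditor(HTMLParser):
    """Copy markup through as-is; hand only text nodes to `edit`."""

    def __init__(self, edit):
        # convert_charrefs=False keeps entities such as &nbsp; as written.
        super().__init__(convert_charrefs=False)
        self.edit = edit   # hypothetical callback: str -> str
        self.out = []
        self.raw = False   # True while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as it appeared.
        self.out.append(self.get_starttag_text())
        if tag in ("style", "script"):
            self.raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")  # note: tag name comes back lowercased
        if tag in ("style", "script"):
            self.raw = False

    def handle_data(self, data):
        # Only ordinary text is edited; stylesheet/script bodies are copied.
        self.out.append(data if self.raw else self.edit(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def edit_text(html, edit):
    """Return `html` with `edit` applied to its text nodes only."""
    parser = TextEditor(edit)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<p class=MsoNormal><b style="mso-bidi-font-weight:normal">'
          'teh cat</b>&nbsp;sat</p>')
print(edit_text(sample, lambda s: s.replace("teh", "the")))
```

The sketch does not try to preserve processing instructions or non-comment `<![...]>` declarations, so anything run through it should be diffed against the original before being handed back to Word.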

Re^3: A copyeditor needs help to get started with a Perl project
by gaal (Parson) on Nov 04, 2004 at 11:06 UTC
    *shrug*

    A VB runtime sounds like it's going to be bigger than either :)

    (BTW, some versions of Word had a bug where the first time you saved a file after an edit, its size would be about double what it would be if you immediately saved it again. Or did this happen only when the file was saved as RTF?)

      It's just me showing my age. Some of my favorite games of all time fitted on ROMs smaller than 20K. Hell, my first machine had 16K and I was over the moon to go to 64K. I was cramming whole disk indexes into memory!

      I have this conspiracy theory: M$ takes kickbacks from Intel to write more and more bloated software, which in turn fuels the need for ever faster processors. Look at the reality. In terms of memory, you needed <640K for DOS ;-), 4M for Win 3.0, 16M for 95, 64M for 98, 256M for 2000 and 512M for XP, with processors to match.

      I just find it wrong that you need 20 K to store 12 bytes of text plus Times New Roman as the default font. It seems analogous to using the Titanic to make crushed ice for your drink.

        Oh, I used to own a machine with 1 kilobyte myself, and I like the Titanic analogy :)

        But I was merely pointing out that the bloat in the exported data is negligible compared to the size of the environment required to access it in its native format.

        As for your conspiracy theory: I had the same thoughts! I once calculated that Wintel share about $4G a year between them just by making software that sucks.

Re^3: A copyeditor needs help to get started with a Perl project
by ww (Archbishop) on Nov 04, 2004 at 14:55 UTC
    Tachyon's phrasing is admirable -- wish I'd said it first! (And read 405192 et seq. as well -- inserted 1700 GMT.) But it sparks a small question.

    Based on wordsmith's original description, won't cleaning up the "puke" "vomit" bovine manure (ok, "markup" or "formatting") be almost as important to the desired "consistency" as emending the actual text? The boldfacing, font changes, etc. in the original .doc may be formatting conventions for which consistency is also desired.

    Also, I'm curious (enough so to read/experiment soon, unless some oracle here knows with certainty) whether an HTML parser can actually make sense of the many conditionals Word inserts while saving as .html garbage.
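One quick way to experiment with that question is sketched below, again in Python with the standard-library `html.parser` rather than a Perl module. The assumption is that the conditionals mostly take the `<!--[if ...]> ... <![endif]-->` form, which an HTML parser should see as one ordinary comment. The bare `<![if ...]>` form is reported differently depending on the Python version, so the sketch listens on both callbacks. `ConditionalFinder` and `find_conditions` are made-up names.

```python
from html.parser import HTMLParser


class ConditionalFinder(HTMLParser):
    """Collect the condition text of each conditional block seen."""

    def __init__(self):
        super().__init__()
        self.conditions = []

    def _record(self, text):
        text = text.strip()
        if text.startswith("[if "):
            # "[if gte mso 9]><xml>..." -> "gte mso 9"
            self.conditions.append(text[4:].partition("]")[0].strip())
        elif text.startswith("if "):
            # "if !supportLists" -> "!supportLists"
            self.conditions.append(text[3:].strip())

    def handle_comment(self, data):
        # <!--[if ...]> ... <![endif]--> arrives here as a single comment.
        self._record(data)

    def unknown_decl(self, data):
        # A bare <![if ...]> may arrive here instead, depending on version.
        self._record(data)


def find_conditions(html):
    """Return the list of conditions found in `html`, in document order."""
    finder = ConditionalFinder()
    finder.feed(html)
    finder.close()
    return finder.conditions


doc = ('<!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]-->'
       '<p class=MsoNormal>text</p><!-- plain comment -->')
print(find_conditions(doc))
```

If the parser reports each conditional as a single comment, passing it through untouched is straightforward. Whether Word then re-reads the result faithfully is something only a round trip through Word itself would confirm.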