Intermediate Dave has asked for the wisdom of the Perl Monks concerning the following question:

Is there a way to get Perl to print out a character that Microsoft Word will recognize as a page break?

I'm trying to create a Kindle ebook, which Amazon can automatically generate from a Microsoft Word document. I'm using Perl to scrape text from a JavaScript program that prints out one page at a time, and I need some way to indicate (to Microsoft Word) where the page breaks should go.


Replies are listed 'Best First'.
Re: Can Perl generate a page break character that Microsoft Word will recognize?
by tobyink (Canon) on Dec 31, 2019 at 01:15 UTC
Re: Can Perl generate a page break character that Microsoft Word will recognize?
by marto (Cardinal) on Dec 31, 2019 at 05:00 UTC

    If you absolutely have to do this in Word, .docx files are essentially a compressed collection of XML files. I've used Mojo::DOM to work with parts of this; however, there's now MsOffice::Word::Surgeon, which looks promising in terms of editing existing documents.

Re: Can Perl generate a page break character that Microsoft Word will recognize?
by jcb (Parson) on Dec 31, 2019 at 00:42 UTC

    In other words, is a Word "page break" an actual character or some other object from the Stygian Depths of Redmond?

    Try an ASCII FF (form feed) character, Control-L or "\014". If that does not work, you will need to use COM Windows-isms to build up the text in Word bit by bit. Or try another trick: WordPad actually wrote RTF with a .doc extension and Word will silently accept RTF documents, so you might be able to output RTF and get Amazon to process it.
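The RTF route suggested above can be sketched in Perl as follows. This is a minimal sketch, not a tested Kindle workflow: it relies on the RTF control words `\par` (end of paragraph) and `\page` (required page break), and it assumes the scraped pages are already available as plain-text strings. The helper names `rtf_escape` and `pages_to_rtf` and the sample pages are made up for illustration, and characters outside the Basic Multilingual Plane are not handled.

```perl
use strict;
use warnings;

# Escape one page of plain text for inclusion in an RTF document.
# Assumption: the text contains no characters above U+FFFF.
sub rtf_escape {
    my ($text) = @_;
    $text =~ s/([\\{}])/\\$1/g;    # backslash and braces are RTF syntax
    $text =~ s/\r?\n/\\par\n/g;    # each newline ends a paragraph
    # Non-ASCII characters become \uN? where N is a signed 16-bit decimal
    $text =~ s/([^\x00-\x7F])/
        sprintf('\\u%d?', ord($1) > 32767 ? ord($1) - 65536 : ord($1))/gex;
    return $text;
}

# Join the pages into one RTF document, with \page between pages.
sub pages_to_rtf {
    my (@pages) = @_;
    my $body = join "\\page\n", map { rtf_escape($_) . "\\par\n" } @pages;
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n"
         . "\\f0\\fs24\n"
         . $body
         . "}\n";
}

# Hypothetical sample input: two scraped pages.
my @pages = ("First page text", "Second page {with braces}");
print pages_to_rtf(@pages);
```

Redirecting the output to a file with an .rtf (or .doc) extension and opening it in Word should show each input string on its own page. Whether Amazon's converter accepts the result is something to check separately.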

      A Word page break is a \n character in the old DOC file format. Newer Word documents are DOCX files, which are essentially ZIP files containing several XML documents, one of which is called document.xml. This one contains the document text itself. For example, I created a simple document with two lines, "AAA" and "BBB". This was the content of the document.xml file:

      <w:body>
        <w:p w:rsidR="00D96BA8" w:rsidRDefault="00D96BA8">
          <w:r>
            <w:t>AAA</w:t>
          </w:r>
        </w:p>
        <w:p w:rsidR="00D96BA8" w:rsidRDefault="00D96BA8">
          <w:r>
            <w:t>BBB</w:t>
          </w:r>
        </w:p>
        <w:sectPr w:rsidR="00D96BA8" w:rsidSect="00354B3C">
          <w:pgSz w:w="12240" w:h="15840" />
          <w:pgMar w:top="1008" w:right="1008" w:bottom="1008" w:left="1008" w:header="720" w:footer="720" w:gutter="0" />
          <w:cols w:space="720" />
          <w:docGrid w:linePitch="360" />
        </w:sectPr>
      </w:body>

      and this was the DOC file hex dump somewhere in the middle. I am not going to copy the entire file here. I know, some of you are like "whew!" lol

      Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
      000009B0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      000009C0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      000009D0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      000009E0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      000009F0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000A00  41 41 41 0D 42 42 42 0D 00 00 00 00 00 00 00 00  AAA.BBB.........
      00000A10  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000A20  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000A30  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000A40  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000A50  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

        Interesting. Word seems to use ASCII CR as paragraph break, so does it use ASCII LF or ASCII FF as page break? (There is also a forced end-of-line produced by Shift-Enter that does not start a new paragraph. Simply pressing Enter actually starts a new paragraph, which starts a new line as a side-effect.)
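One way to settle the question empirically would be to save a test document containing AAA, a page break, and BBB as DOC, then look at which control byte sits between the two strings. The sketch below only lists the control bytes found in a byte string, with their ASCII names. The function name `control_bytes` is invented for illustration, and the sample bytes are copied from the dump above rather than read from a real file.

```perl
use strict;
use warnings;

# ASCII names for the control characters most likely to show up in text.
my %ASCII_NAME = (
    0x09 => 'HT',
    0x0A => 'LF',
    0x0B => 'VT',
    0x0C => 'FF',
    0x0D => 'CR',
);

# Return a list of [offset, byte value, ASCII name] for each control byte.
sub control_bytes {
    my ($bytes) = @_;
    my @found;
    while ($bytes =~ /([\x01-\x1F])/g) {
        my $code = ord $1;
        push @found, [ pos($bytes) - 1, $code, $ASCII_NAME{$code} // '?' ];
    }
    return @found;
}

# The text run at offset 0xA00 in the dump above: "AAA", CR, "BBB", CR.
my $text_run = pack 'H*', '4141410D4242420D';
printf "offset %d: 0x%02X (%s)\n", @$_ for control_bytes($text_run);
# prints:
#   offset 3: 0x0D (CR)
#   offset 7: 0x0D (CR)
```

Running the same function over the text run of a document that does contain a page break would show whether LF, FF, or something else appears in that position.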

        If we want to consider producing DOCX, it would be fairly easy to input AAA [Control-Enter to insert a page break] BBB and see what turns up in document.xml. Word DOC format uses Microsoft's "OLE Container" format, which turns out to be a miniature FAT filesystem, complete with its own allocation tables, and (if I remember correctly) a second FAT filesystem with smaller blocks stored inside a "file" in the outer container file. At least they only did that to one level of recursion, instead of producing a "filesystems all the way down" crawling horror.
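For the DOCX route, the WordprocessingML schema expresses an explicit page break as a `<w:br w:type="page"/>` element inside a run. The sketch below builds only the paragraphs that would go inside `<w:body>`, with such a break between pages. It does not produce a complete .docx package (that still needs the surrounding document.xml, the other package parts, and the ZIP container), and the function names and the input shape (one array reference of paragraph strings per page) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# One paragraph holding a single run of text.
sub paragraph_xml {
    my ($text) = @_;
    return '<w:p><w:r><w:t xml:space="preserve">'
         . xml_escape($text)
         . '</w:t></w:r></w:p>';
}

# A paragraph whose only run is an explicit page break.
sub page_break_xml {
    return '<w:p><w:r><w:br w:type="page"/></w:r></w:p>';
}

# Build the paragraph elements for a list of pages, inserting a page
# break before every page except the first.
sub body_fragment {
    my (@pages) = @_;
    my @chunks;
    for my $i (0 .. $#pages) {
        push @chunks, page_break_xml() if $i > 0;
        push @chunks, map { paragraph_xml($_) } @{ $pages[$i] };
    }
    return join "\n", @chunks;
}

# Hypothetical sample input matching the AAA / BBB test document.
print body_fragment([ 'AAA' ], [ 'BBB' ]), "\n";
```

Comparing this output against the document.xml of a test file made with Control-Enter would confirm whether Word writes the break the same way.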

Re: Can Perl generate a page break character that Microsoft Word will recognize?
by Anonymous Monk on Jan 01, 2020 at 02:50 UTC