in reply to Re: Can Perl generate a page break character that Microsoft Word will recognize?
in thread Can Perl generate a page break character that Microsoft Word will recognize?

In the old DOC file format, breaks are stored as single control characters in the text stream. The hex dump below shows each paragraph ending in CR (0x0D), not \n; as far as I know, a manual page break is stored as form feed (0x0C). Newer Word documents are DOCX files, which are essentially ZIP files containing several XML documents. One of these, document.xml, contains the document text itself. As an example, I created a simple document with two lines, "AAA" and "BBB". This was the content of the document.xml file:

<w:body>
  <w:p w:rsidR="00D96BA8" w:rsidRDefault="00D96BA8">
    <w:r>
      <w:t>AAA</w:t>
    </w:r>
  </w:p>
  <w:p w:rsidR="00D96BA8" w:rsidRDefault="00D96BA8">
    <w:r>
      <w:t>BBB</w:t>
    </w:r>
  </w:p>
  <w:sectPr w:rsidR="00D96BA8" w:rsidSect="00354B3C">
    <w:pgSz w:w="12240" w:h="15840" />
    <w:pgMar w:top="1008" w:right="1008" w:bottom="1008" w:left="1008" w:header="720" w:footer="720" w:gutter="0" />
    <w:cols w:space="720" />
    <w:docGrid w:linePitch="360" />
  </w:sectPr>
</w:body>

And this is a hex dump from somewhere in the middle of the corresponding DOC file. I am not going to copy the entire file here. I know, some of you are like "whew!" lol

Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
000009B0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000009C0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000009D0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000009E0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
000009F0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000A00  41 41 41 0D 42 42 42 0D 00 00 00 00 00 00 00 00  AAA.BBB.........
00000A10  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000A20  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000A30  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000A40  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000A50  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
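To make the dump easier to read, here is a minimal sketch in Python (not Perl, purely for illustration) that takes the eight text bytes at offset 0x0A00 and splits them on CR. The cp1252 decoding is an assumption that happens to work for this plain ASCII sample; a real DOC file may store its text differently.

```python
# Bytes copied by hand from offset 0x0A00 of the hex dump above.
raw = bytes.fromhex("41 41 41 0D 42 42 42 0D")

# Each paragraph ends with CR (0x0D), so splitting on it recovers the lines.
# Decoding as cp1252 is an assumption that holds for this ASCII-only sample.
paragraphs = [chunk.decode("cp1252") for chunk in raw.split(b"\r") if chunk]

print(paragraphs)
```

This only shows that the 0x0D bytes act as paragraph terminators in the sample; it is not a DOC parser.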

Re^3: Can Perl generate a page break character that Microsoft Word will recognize?
by jcb (Parson) on Jan 01, 2020 at 02:43 UTC

    Interesting. Word seems to use ASCII CR as paragraph break, so does it use ASCII LF or ASCII FF as page break? (There is also a forced end-of-line produced by Shift-Enter that does not start a new paragraph. Simply pressing Enter actually starts a new paragraph, which starts a new line as a side-effect.)

    If we want to consider producing DOCX, it would be fairly easy to input AAA [Control-Enter to insert a page break] BBB and see what turns up in document.xml. Word DOC format uses Microsoft's "OLE Container" format, which turns out to be a miniature FAT filesystem, complete with its own allocation tables, and (if I remember correctly) a second FAT filesystem with smaller blocks stored inside a "file" in the outer container file. At least they only did that to one level of recursion, instead of producing a "filesystems all the way down" crawling horror.
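Looking at what turns up in document.xml does not require unpacking the file by hand. The following is a rough sketch in Python (the same idea applies with any ZIP library in Perl). The function name `read_document_xml` is made up for this example, and the part name `word/document.xml` is assumed to be where Word puts the main document.

```python
import zipfile
import xml.dom.minidom


def read_document_xml(docx):
    """Return the main document part of a DOCX file, pretty-printed.

    `docx` may be a file name or a file-like object. The part name
    "word/document.xml" is an assumption about how Word lays out the archive.
    """
    with zipfile.ZipFile(docx) as archive:
        raw = archive.read("word/document.xml")
    # Re-indent the XML so that paragraphs and runs are easy to spot.
    return xml.dom.minidom.parseString(raw).toprettyxml(indent="  ")


# Usage, assuming a file saved from Word as "test.docx":
# print(read_document_xml("test.docx"))
```

Saving the test document once with and once without the page break and comparing the two outputs should show exactly which element Word adds.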

      Or just look up the XML to do what you want:

      <?xml version="1.0" encoding="UTF-8"?>
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
                  xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
                  xmlns:o="urn:schemas-microsoft-com:office:office"
                  xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
                  xmlns:v="urn:schemas-microsoft-com:vml"
                  xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
                  xmlns:w10="urn:schemas-microsoft-com:office:word"
                  xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
                  xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
        <w:body>
          <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
            <w:r>
              <w:t>1234</w:t>
            </w:r>
          </w:p>
          <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
            <w:r>
              <w:t>5678</w:t>
            </w:r>
          </w:p>
          <w:sectPr w:rsidR="00D479B1" w:rsidSect="00D479B1">
            <w:pgSz w:w="11906" w:h="16838" />
            <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="708" w:footer="708" w:gutter="0" />
            <w:cols w:space="708" />
            <w:docGrid w:linePitch="360" />
          </w:sectPr>
        </w:body>
      </w:document>

      becomes:

      <?xml version="1.0" encoding="UTF-8"?>
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
                  xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
                  xmlns:o="urn:schemas-microsoft-com:office:office"
                  xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
                  xmlns:v="urn:schemas-microsoft-com:vml"
                  xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
                  xmlns:w10="urn:schemas-microsoft-com:office:word"
                  xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
                  xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
        <w:body>
          <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
            <w:r>
              <w:t>1234</w:t>
            </w:r>
          </w:p>
          <w:p>
            <w:r>
              <w:br w:type="page" />
            </w:r>
          </w:p>
          <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
            <w:r>
              <w:t>5678</w:t>
            </w:r>
          </w:p>
          <w:sectPr w:rsidR="00D479B1" w:rsidSect="00D479B1">
            <w:pgSz w:w="11906" w:h="16838" />
            <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="708" w:footer="708" w:gutter="0" />
            <w:cols w:space="708" />
            <w:docGrid w:linePitch="360" />
          </w:sectPr>
        </w:body>
      </w:document>
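The change between the two documents above is small enough to script. Below is a minimal sketch in Python using only the standard library; the helper names `page_break_paragraph` and `insert_page_break` are invented for this example, and the sample input is a cut-down version of the XML above with only the `w` namespace declared. One caveat: ElementTree drops namespace declarations that are not used, so treat this as an illustration of the structure rather than a safe round-trip for a real document.xml.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as declared in the XML above.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)  # keep the familiar "w:" prefix on output


def page_break_paragraph():
    """Build <w:p><w:r><w:br w:type="page"/></w:r></w:p>."""
    p = ET.Element(f"{{{W}}}p")
    r = ET.SubElement(p, f"{{{W}}}r")
    ET.SubElement(r, f"{{{W}}}br", {f"{{{W}}}type": "page"})
    return p


def insert_page_break(document_xml, after_paragraph):
    """Insert a page-break paragraph after the given 0-based paragraph."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W}}}body")
    paragraphs = body.findall(f"{{{W}}}p")
    # Position of the chosen paragraph among all children of <w:body>.
    position = list(body).index(paragraphs[after_paragraph]) + 1
    body.insert(position, page_break_paragraph())
    return ET.tostring(root, encoding="unicode")


# Cut-down sample with the same two paragraphs as above (an assumption,
# not a complete Word document).
sample = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>1234</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>5678</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
result = insert_page_break(sample, 0)
print(result)
```

To get a file Word will open, the modified XML still has to be written back into the DOCX archive as the document.xml part alongside the other parts.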

      See also the other links already provided in this thread, and their associated links. To be honest, your workflow ('I'm using Perl to scrape text from a JavaScript that printed out one page at a time..') seems somewhat convoluted, but you don't go into much detail.