Re: Extract Paragraph From Text

Careful, careful ... if this text was simply copied and pasted into a <<HereDoc, it may not actually contain the end-of-line characters you expect. The problem could then be as simple as having the wrong record separator ($/) specified to Perl, and it would not be reproducible when the rest of us try it ourselves. Hard to spot, easy to fix.

What I would do first is look at the Perl source file with a tool such as hexdump, which can display the binary content of the file side-by-side with the characters. Look, within the heredoc section, at how the lines and paragraphs are separated: exactly what byte sequence is used there?
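If no hex viewer is handy (on most Unix systems, `hexdump -C` on the file does the job), the same question can be put to Perl itself. The following is only a minimal sketch: `count_line_endings` and the sample string are hypothetical names and data made up for illustration, not anything from the original poster's script.

```perl
use strict;
use warnings;

# Count the line-ending byte sequences found in a chunk of raw bytes.
# CRLF is tried first so that "\r\n" counts once, not as a CR plus an LF.
sub count_line_endings {
    my ($bytes) = @_;
    my %count = (CRLF => 0, LF => 0, CR => 0);
    while ($bytes =~ /(\r\n|\n|\r)/g) {
        my $eol = $1;
        if    ($eol eq "\r\n") { $count{CRLF}++ }
        elsif ($eol eq "\n")   { $count{LF}++ }
        else                   { $count{CR}++ }
    }
    return \%count;
}

# Hypothetical sample: Windows-style lines followed by one Unix-style line.
my $sample = "first line\r\nsecond line\r\n\r\nnew paragraph\n";
my $seen   = count_line_endings($sample);
printf "%-4s %d\n", $_, $seen->{$_} for qw(CRLF LF CR);
```

A mix of counts (for example both CRLF and bare LF) is a strong hint that the text was altered somewhere between its source and the heredoc.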

Further confusion can be introduced if you retrieve the file from some source and, in handling it (e.g. to put it into a heredoc), you inadvertently mess up the sequence or introduce additional, conflicting bytes.

For this reason, it might be advantageous to simply read the source file directly, instead of attempting to embed its text in the code. (Which, I understand, might have been done here for the sake of example ...)
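Reading directly from a filehandle might look something like the sketch below. It assumes the paragraphs really are separated by blank lines made of plain "\n" characters, which is exactly the thing that still needs checking; `read_paragraphs` is a hypothetical helper, and the in-memory filehandle merely stands in for the real file so that the example is self-contained.

```perl
use strict;
use warnings;

# Read paragraphs from an open filehandle using Perl's paragraph mode.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = '';            # paragraph mode: runs of blank lines end a record
    my @paragraphs = <$fh>;
    chomp @paragraphs;        # in paragraph mode, chomp strips all trailing newlines
    return @paragraphs;
}

# Hypothetical stand-in for the real file: an in-memory filehandle.
my $text = "Alpha one.\nAlpha two.\n\nBeta one.\n\n\nGamma one.\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @paras = read_paragraphs($fh);
close $fh;

printf "%d paragraphs\n", scalar @paras;
```

For a real file, the open would take the file's path instead of `\$text`. Note that if the file turns out to use "\r\n" line endings and is read without any translation, paragraph mode will not see a blank line at all, which would produce just the sort of surprise described in this thread.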

Re^2: Extract Paragraph From Text by perlbeginneraaa (Novice) on Sep 08, 2015 at 15:31 UTC
Hi sundialsvc4, Thanks for your reply! I will look into that and check what paragraph separators the text uses. Maybe it is that in the text the paragraph separator is not a blank line, so I got the unexpected output. I am not sure about this...
Re^3: Extract Paragraph From Text by locked_user sundialsvc4 (Abbot) on Sep 09, 2015 at 22:11 UTC
What I would expect is that text such as this might not contain any “end-of-line” character sequences at all. Instead, the rendering engine would pour the text into the graphic container, line by line, according to the size of the container and the selected font and font size ... both of which presumably could change. The only trustworthy “end-of-something” marker would be “end of paragraph,” but what might that be? Who knows.

In this situation, I would suggest two specific things:

1. Get the information directly from the original source file, and do it in binary mode. (In other words, don’t tell Perl to expect record separators of any sort. All you want Perl to do is to read exactly the bytes that are there, exactly as they are. And you really need to read the entire file at once ... slurp!)

2. Before writing the code to do that, look at the original source file with the hex editor as previously discussed, to see what is actually there and what might reasonably be relied upon.

Don’t attempt to copy-and-paste into Perl source code: you have no idea what your text editor might actually do. (And anything it might do would only muddy the waters further.)

Perl is an extremely powerful data-extraction tool that can most certainly do whatever it is that you determine needs to be done. So, please follow up in this thread and tell us what you’ve found. We’ll be happy to then help you further.
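To make the “binary mode, slurp” suggestion concrete, here is a rough sketch. The helper names `slurp_raw` and `split_paragraphs` are hypothetical, and the separator it splits on (two or more consecutive line endings of any common flavor) is purely an assumption until the hex dump confirms what the file actually contains. The temporary file exists only so the example can run by itself.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Slurp an entire file as raw bytes: no layers, no record separator.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                 # undef: read everything in one go
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# ASSUMPTION: a paragraph ends at two or more consecutive line endings.
# The atomic group (?>...) keeps a lone "\r\n" from being counted as
# a "\r" followed by a "\n".
sub split_paragraphs {
    my ($bytes) = @_;
    my @paras = split /(?>\r\n|\n|\r){2,}/, $bytes;
    s/[\r\n]+\z// for @paras;     # drop a trailing line ending, if any
    return grep { length } @paras;
}

# Hypothetical stand-in for the original source file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "First paragraph,\r\nstill first.\r\n\r\nSecond paragraph.\r\n";
close $tmp_fh or die "Cannot close $tmp_path: $!";

my @paragraphs = split_paragraphs(slurp_raw($tmp_path));
printf "%d paragraphs\n", scalar @paragraphs;
```

If the hex dump shows some other paragraph marker, only the pattern inside `split_paragraphs` should need to change; the raw slurp stays the same.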