
I'm working on some scripts that digest Project Gutenberg etext files into manageable sections, which are then converted into Plucker format for reading on a Palm handheld device. What I'm trying to figure out is how to "bite" the etext into sections I can manage in pieces, and I'm running into some trouble.

Basically, most of the Project Gutenberg etexts have this format:

   The Project Gutenberg Etext of <etext title> 
   by <etext author>

   Copyright (c) <copyright holder>

   **This is a COPYRIGHTED Project Gutenberg Etext, 
   Details Below**

   <Project Gutenberg Header>

   ... # Gutenberg Copyright information 

   <known range of delimiting text>

   <actual body of etext>

   ... # etext body/document/book is here
   
   End of the Project Gutenberg Etext of <etext title>
   by <etext author>
   Copyright (c) <copyright holder>

Some of this is going to rely on an array of possible opening Gutenberg strings and delimiters, based on the translator and year of translation, but what I'd like to do is stuff those sections into arrays I can manipulate, then reassemble them into a "sectional" document, so it ends up like this (pseudocode):

   my $pg_author  = "<etext author>";
   my $pg_title   = "<etext title>";
   my @pg_header  = "<Project Gutenberg Header>";
   my @etext_body = "<actual body of etext>";
   my @pg_footer  = "<Project Gutenberg Footer>";
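
One way to keep the "array of possible opening strings" manageable is a list of compiled patterns that are tried in order. This is only a sketch: the two marker patterns are illustrative guesses rather than a verified list of what every etext contains, and `find_marker` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical marker patterns; real etexts vary, so extend this list
# as new variants turn up.
my @start_markers = (
    qr/^\*END\*THE SMALL PRINT!.*$/m,
    qr/^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG.*$/m,
);

# Return the offset just past the first marker that matches, or
# nothing (undef in scalar context) if none of the patterns match.
sub find_marker {
    my ($text, @patterns) = @_;
    for my $re (@patterns) {
        return $+[0] if $text =~ $re;   # $+[0] is the end of the match
    }
    return;
}

my $text = "Some header\n"
         . "*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*END*\n"
         . "Body\n";
my $pos = find_marker($text, @start_markers);
print defined $pos ? "marker ends at offset $pos\n" : "no marker found\n";
```

Keeping the patterns in one array means a newly discovered header variant is a one-line addition rather than a change to the splitting logic.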

With the relevant sections in a series of arrays, I can reassemble them into a clickable XHTML 1.0-compliant document, where the copyright, document body, etc. are all reachable from the "contents" page when viewed on the Palm device. Right now, it would be one huge flat text file whose first 10 or so pages are the copyright info, which doesn't work well on a 160x160 screen. Having each "chapter" of the etext, including the Gutenberg copyright, on its own "page", clickable (tappable, via stylus/finger), is preferable.
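
To make the "contents page" idea concrete, here is a rough sketch of turning a list of sections into a linked contents list plus one anchored block per section. The `[ title, text ]` pair layout and the names `xml_escape` and `build_xhtml_body` are assumptions for illustration, and it only produces the body markup. Whether each section should instead become a separate file is left open.

```perl
use strict;
use warnings;

# Escape the characters that are special in XHTML text and attributes.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Build a contents list followed by one anchored <div> per section.
# @sections is a list of [ title, text ] pairs (a hypothetical layout).
sub build_xhtml_body {
    my (@sections) = @_;
    my ($toc, $pages) = ('', '');
    my $n = 0;
    for my $sec (@sections) {
        my ($title, $text) = map { xml_escape($_) } @$sec;
        $n++;
        $toc   .= qq{  <li><a href="#sec$n">$title</a></li>\n};
        $pages .= qq{<div id="sec$n">\n<h2>$title</h2>\n<pre>$text</pre>\n</div>\n};
    }
    return qq{<ul>\n$toc</ul>\n$pages};
}

print build_xhtml_body(
    [ 'Copyright', 'Gutenberg header text' ],
    [ 'Chapter I', 'Body text & more' ],
);
```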

I'm also rewrapping the text with Text::Autoformat to 55 columns wide, fully justified, which looks very good compared to the "chainsaw" effect of the original etext packed in natively. The code for that is very basic, and looks like:

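   use strict;
   use warnings;
   use CGI qw(:standard);
   use Text::Autoformat;

   my $file = "pg_exext.txt";
   open(my $pgw, '<', $file) or die "Can't open $file: $!";
   my $data = do { local $/; <$pgw> };   # slurp the whole file
   close($pgw);

   my $formatted = autoformat $data, {justify => 'full',
                                      left    => 4,
                                      right   => 55,
                                      all     => 1};
   print pre($formatted);

   # No second "my" here: redeclaring $data would match against undef.
   # "the" is optional because the footer wording varies.
   my ($footer) = $data =~ m/End of (?:the )?Project Gutenberg Etext (.*)/s;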

The part I'm confused about is how to walk/seek through the text file/stream (assuming the file body itself is in an object, via Net::FTP or HTTP::Request) and bite those sections off into arrays. I'm able to pull the whole file into an array, but not specific delimited sections.
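
One possible approach, sketched here under assumptions, is to slurp the text into a scalar and cut it at the match offsets Perl records in `@-` and `@+`. The function name `split_etext`, the sample text, and both delimiter patterns are made up for illustration; the real patterns would come from whatever delimiter table ends up being used.

```perl
use strict;
use warnings;

# Split a slurped etext into header, body, and footer.
# $start_re should match the last line of the Gutenberg header and
# $end_re the first line of the footer (both are assumptions).
sub split_etext {
    my ($text, $start_re, $end_re) = @_;

    my ($header, $rest) = ('', $text);
    if ($text =~ $start_re) {
        my $cut = $+[0];                 # offset just past the match
        $header = substr($text, 0, $cut);
        $rest   = substr($text, $cut);
    }

    my ($body, $footer) = ($rest, '');
    if ($rest =~ $end_re) {
        my $cut = $-[0];                 # offset where the match starts
        $body   = substr($rest, 0, $cut);
        $footer = substr($rest, $cut);
    }

    # One array element per line, newlines kept.
    return ([split /^/, $header], [split /^/, $body], [split /^/, $footer]);
}

my $sample = join '',
    "The Project Gutenberg Etext of Example\n",
    "*END*THE SMALL PRINT!*END*\n",
    "Chapter text line 1\n",
    "Chapter text line 2\n",
    "End of the Project Gutenberg Etext of Example\n";

my ($pg_header, $etext_body, $pg_footer) = split_etext(
    $sample,
    qr/^\*END\*THE SMALL PRINT!.*\n/m,
    qr/^End of (?:the )?Project Gutenberg Etext/m,
);

printf "header: %d, body: %d, footer: %d line(s)\n",
    scalar @$pg_header, scalar @$etext_body, scalar @$pg_footer;
```

If a delimiter is not found, the sketch falls back to treating everything as body, which makes unrecognised files easy to spot without losing text.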

The next phase, if I can get this sorted out, is to try to build a table of heuristics to detect actual chapter breaks in the texts and separate those into their own pages, but being able to separate header, etext, and footer into their own bits is important for this first pass.
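
A table of heuristics for chapter breaks could start as nothing more than a list of heading patterns checked against each line. The patterns and the `split_chapters` name below are hypothetical, and they will misfire on body lines that happen to begin with "Chapter 1", so treat this as a starting point rather than a finished detector.

```perl
use strict;
use warnings;

# Hypothetical heading patterns; add more as new styles are found.
my @chapter_patterns = (
    qr/^\s*CHAPTER\s+[IVXLC\d]+\b.*$/i,
    qr/^\s*BOOK\s+[IVXLC\d]+\b.*$/i,
);

# Group body lines into chapters. Returns a list of hashes, each with a
# title and an array of lines. Text before the first heading is filed
# under "Front matter".
sub split_chapters {
    my (@lines) = @_;
    my @chapters = ({ title => 'Front matter', lines => [] });
    for my $line (@lines) {
        if (grep { $line =~ $_ } @chapter_patterns) {
            (my $title = $line) =~ s/^\s+|\s+$//g;   # trim whitespace
            push @chapters, { title => $title, lines => [] };
            next;
        }
        push @{ $chapters[-1]{lines} }, $line;
    }
    return @chapters;
}

my @chapters = split_chapters(
    "Preface text\n",
    "CHAPTER I\n",
    "It was a dark night.\n",
    "Chapter 2. The Storm\n",
    "Rain fell.\n",
);
print "$_->{title}\n" for @chapters;
```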

TIA for any help and suggestions.

In reply to "Biting" text files into managable sections by hacker
