Basically, most of the Project Gutenberg etexts have this format:
    The Project Gutenberg Etext of <etext title> by <etext author>
    Copyright (c) <copyright holder>
    **This is a COPYRIGHTED Project Gutenberg Etext, Details Below**
    <Project Gutenberg Header>
    ...                               # Gutenberg Copyright information
    <known range of delimiting text>
    <actual body of etext>
    ...                               # etext body/document/book is here
    End of the Project Gutenberg Etext of <etext title> by <etext author>
    Copyright (c) <copyright holder>
Some of this is going to rely on an array of possible opening Gutenberg strings and delimiters, which vary with the translator and year of translation. What I'd like to do is stuff those sections into arrays I can manipulate, then reassemble them into a "sectional" document, so it ends up like this (pseudocode):
    my $pg_author  = "<etext author>";
    my $pg_title   = "<etext title>";
    my @pg_header  = "<Project Gutenberg Header>";
    my @etext_body = "<actual body of etext>";
    my @pg_footer  = "<Project Gutenberg Footer>";
With the relevant sections in a series of arrays I can manipulate, I can then reassemble them into an XHTML 1.0-compliant document, where the copyright, document body, etc. are all reachable from the "contents" page of the document when viewed on the Palm device. Right now, it would be a huge flat text file, where the first 10 or so pages are the copyright info, which doesn't work well on a 160x160 screen. Having each "chapter" of the etext, including the Gutenberg copyright, on its own "page", clickable (tappable, via stylus/finger), would be much better.
I'm also rewrapping the text, using Text::Autoformat to 55 columns wide, with full-justify, which looks very good, compared to the "chainsaw" effect of the original etext packed in natively. The code for that is very basic, and looks like:
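    use strict;
    use warnings;
    use CGI qw(:standard);
    use Text::Autoformat;

    my $file = "pg_exext.txt";
    open(my $pgw, '<', $file) or die "Can't open $file: $!";
    local $/ = undef;               # slurp mode: read the whole file at once
    my $data = <$pgw>;
    close $pgw;

    # Rewrap to 55 columns, fully justified
    my $formatted = autoformat $data,
        {justify => 'full', left => 4, right => 55, all => 1};
    print pre($formatted);

    # Capture whatever follows the closing marker ("the" is optional)
    my ($tail) = $data =~ m/End of (?:the )?Project Gutenberg Etext (.*)/s;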
The part I'm confused about is how to walk/seek through the text file/stream (assuming the file body itself is in an object, via Net::FTP or HTTP::Request) and bite those sections off into arrays. I'm able to pull the whole file into an array, but not specific delimited sections.
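One possible way to cut the sections apart, as a minimal sketch only: match a list of candidate header delimiters against the slurped text, then use the match offsets in Perl's built-in @- and @+ arrays with substr. The split_etext sub, the sample text, and the "*END*THE SMALL PRINT" delimiter pattern are assumptions made for illustration; a real list of delimiters would have to be built from the etexts themselves.

```perl
use strict;
use warnings;

# Hypothetical helper: split slurped etext into (header, body, footer).
# $delims is an array ref of candidate patterns marking the END of the
# Gutenberg header; the first one that matches wins.
sub split_etext {
    my ($text, $delims) = @_;
    my ($header, $rest) = ('', $text);

    for my $re (@$delims) {
        if ($text =~ $re) {
            # $+[0] is the offset just past the end of the match
            $header = substr($text, 0, $+[0]);
            $rest   = substr($text, $+[0]);
            last;
        }
    }

    my ($body, $footer) = ($rest, '');
    if ($rest =~ /^End of (?:the )?Project Gutenberg Etext/mi) {
        # $-[0] is the offset where the match starts
        $body   = substr($rest, 0, $-[0]);
        $footer = substr($rest, $-[0]);
    }
    $body =~ s/^\s+|\s+$//g;    # trim blank lines around the body
    return ($header, $body, $footer);
}

# Made-up sample in the general shape described above
my $sample = join "\n",
    'The Project Gutenberg Etext of Sample Title by A. Author',
    'Copyright information goes here.',
    '*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*',
    '',
    'Chapter 1',
    '',
    'It was a dark and stormy night.',
    '',
    'End of the Project Gutenberg Etext of Sample Title by A. Author',
    '';

# Assumed delimiter list; extend with one pattern per known variant
my @delims = (qr/^\*END\*THE SMALL PRINT[^\n]*\n/m);

my ($header, $body, $footer) = split_etext($sample, \@delims);

my @pg_header  = split /\n/, $header;        # one element per line
my @etext_body = split /\n{2,}/, $body;      # one element per paragraph
my @pg_footer  = split /\n/, $footer;
```

Because the text is already in a single scalar, it does not matter whether it came from a local file, Net::FTP, or an HTTP response; only the slurped string is needed.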
The next phase, if I can get this sorted out, is to try to build a table of heuristics where I can detect actual chapter breaks in the texts, and separate those into their own pages, but being able to separate header, etext, footer, into their own bits is important for this first pass.
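As a starting point for that heuristics table, here is a rough sketch that covers a single rule: a chapter begins at a line starting with "Chapter" followed by an Arabic or uppercase Roman numeral. The pattern, the split_chapters name, and the sample text are all assumptions for illustration; real etexts will need more rules (BOOK, PART, bare numerals, and so on).

```perl
use strict;
use warnings;

# Assumed heading rule: optional indentation, "chapter" in any case,
# then digits or an uppercase Roman numeral, then the rest of the line.
my $chapter_re = qr/^[ \t]*(?i:chapter)[ \t]+(?:\d+|[IVXLC]+)\b[^\n]*$/m;

# Hypothetical helper: split the body text into one string per chapter.
sub split_chapters {
    my ($body) = @_;
    # The lookahead splits *before* each heading, so every heading
    # stays attached to the chapter text that follows it.
    my @chapters = split /(?=$chapter_re)/, $body;
    return grep { /\S/ } @chapters;    # drop whitespace-only pieces
}

my $body = "Preface text.\n\nCHAPTER I\n\nFirst.\n\nChapter 2\n\nSecond.\n";
my @chapters = split_chapters($body);

# Any text before the first heading (here the preface) comes back
# as its own leading element.
printf "%d sections found\n", scalar @chapters;
```

Each element of @chapters could then become its own "page" in the generated document.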
TIA for any help and suggestions.
In reply to "Biting" text files into managable sections by hacker