Re: Parsing Gutenberg Catalog Index

For the other readers, here is a snippet of the index:

Anna Karenina, by Lev Nikolaevica Tolstoi               13214
  [Language: Dutch]
Night Before Christmas & Other Popular Stories For Children, by Various  13213
The Wild Olive, by Basil King                           13212
The Pearl, by Sophie Jewett                             13211
El Comendador Mendoza, by Juan Valera                   13210
  [Subtitle: Obras Completas Tomo VII]
  [Language: Spanish]

It seems that there is a good bit of structure here. Each new entry starts on a new line. The title and author are separated by /, by/. The ID is at the end of the first line of the entry. Combining these, a first stab at a regexp would be

$line =~ /^(\w.*?), by (.*?)\s+(\d+)$/;
$title = $1;
$author = $2;
$id = $3;
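
To show how that regexp could fit into a loop over the whole index, here is a minimal sketch. It assumes that every entry starts in column one and that indented `[Key: Value]` lines belong to the entry above them; that is only what the snippet suggests, not a documented format. The name `parse_index` and the hash layout are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn index lines into a list of hashrefs (hypothetical helper).
# Assumption: "[Key: Value]" lines are indented and follow their entry.
sub parse_index {
    my @lines = @_;
    my @entries;
    for my $line (@lines) {
        if ($line =~ /^(\w.*?), by (.*?)\s+(\d+)$/) {
            # First line of an entry: title, author, numeric ID
            push @entries, { title => $1, author => $2, id => $3 };
        }
        elsif (@entries && $line =~ /^\s+\[(\w+):\s*(.*?)\]\s*$/) {
            # Attribute line: store it under a lower-cased key
            my ($key, $value) = (lc $1, $2);
            $entries[-1]{$key} = $value;
        }
    }
    return @entries;
}

my @sample = (
    'The Pearl, by Sophie Jewett                             13211',
    'El Comendador Mendoza, by Juan Valera                   13210',
    '  [Subtitle: Obras Completas Tomo VII]',
    '  [Language: Spanish]',
);

for my $e (parse_index(@sample)) {
    my $language = defined $e->{language} ? $e->{language} : '';
    print join("\t", $e->{id}, $e->{title}, $e->{author}, $language), "\n";
}
```

Lines that match neither pattern are silently skipped here; a real run would probably want to log them, since they are the entries the regexp misses.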

-Mark


Re^2: Parsing Gutenberg Catalog Index by Anonymous Monk on Aug 30, 2004 at 06:37 UTC
Wouldn't it just make sense to use their XML-formatted catalog? ... http://gutenberg.net/browse/rdf/catalog.rdf.bz2
Re^3: Parsing Gutenberg Catalog Index by tachyon (Chancellor) on Aug 30, 2004 at 07:58 UTC
I would second that, given that the RDF looks like:

    <rdf:Description rdf:ID="etext13218">
      <dc:publisher>&pg;</dc:publisher>
      <dc:title rdf:parseType="Literal">Don Orsino</dc:title>
      <dc:creator>Crawford, F. Marion (Francis Marion) (1854-1909)</dc:creator>
      <dc:language>en</dc:language>
      <dc:created>2004-08-19</dc:created>
      <dc:rights rdf:resource="&lic;" />
    </rdf:Description>

Then all you need is something trivial like this to create a file ready for a MySQL 'load data local infile .....'

    #!/usr/bin/perl
    local $/ = "\n\n";
    open RDF, $ARGV[0] or die $!;
    while(<RDF>){
        next unless m/<rdf:Description rdf:ID="etext(\d+)"/;
        my $id = $1;
        next unless m/<dc:title[^>]+>([^<\n]+)</;
        my $title = $1;
        next unless m/<dc:creator>([^<\n]+)</;
        my $author = $1;
        $title =~ s/\s+/ /g;
        $author =~ s/\s+/ /g;
        print "$id\t$title\t$author\n";
    }

cheers

tachyon
Re^2: Parsing Gutenberg Catalog Index by lidden (Curate) on Aug 30, 2004 at 12:08 UTC
Looking a little closer, and also at older index files, I found this:

    ***A "C" Following a Project Gutenberg eBook Number Indicates Copyright
    **A "" Following a Project Gutenberg eBook Number Indicates Reserved ****
    [snip]
    The Life of John Ruskin, by W. G. Collingwood                  13076
    A Hero and a Great Man, by Francis Kruckvich                   13075C
      [Illustrator: Fritz]
    Punch, Vol. 100, February 7, 1891, Ed. by Sir Francis Burnand  13074
    [snip]
    Feb 1995 Moon and Sixpence by Somerset Maugham [Maugham #1][moonaxxx.xxx]  222
    Feb 1995 The Return of Sherlock Holmes [Magazine Edition] [rholmxxb.xxx]  221B
    Feb 1995 The Secret Sharer, by Joseph Conrad [Conrad #2] [ssharxxx.xxx]  220

I did not find the meaning of the 'B' though.
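
Given those suffixes, a pattern that ends in `(\d+)$` would miss lines such as `13075C`. Here is a small sketch of one way to pick up the number and the flag separately. It assumes the flag is at most one uppercase letter directly after the digits, which is only a guess from the lines above; `split_id` is a made-up name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a trailing ID field such as "13075C" or "221B" into number and flag.
# Assumption: the flag, if any, is a single uppercase letter after the digits.
sub split_id {
    my ($line) = @_;
    return unless $line =~ /\s(\d+)([A-Z]?)\s*$/;
    return ($1, $2);    # flag is the empty string when absent
}

for my $line (
    'The Life of John Ruskin, by W. G. Collingwood                  13076',
    'A Hero and a Great Man, by Francis Kruckvich                   13075C',
    'Feb 1995 The Return of Sherlock Holmes [Magazine Edition] [rholmxxb.xxx]  221B',
) {
    my ($id, $flag) = split_id($line);
    printf "%s flag=%s\n", $id, $flag eq '' ? 'none' : $flag;
}
```

Because the pattern is anchored at the end of the line, the `1995` in the older-style lines is not mistaken for the ID.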
Re^3: Parsing Gutenberg Catalog Index by GotToBTru (Prior) on Mar 14, 2016 at 12:58 UTC
221B - as in 221B Baker Street. It could mean something else, but I think it's somebody's attempt at humor. Wonder how this project turned out?

But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)