in reply to Parsing Gutenberg Catalog Index
It seems that there is a good bit of structure here:

    Anna Karenina, by Lev Nikolaevica Tolstoi 13214
    [Language: Dutch]
    Night Before Christmas & Other Popular Stories For Children, by Various 13213
    The Wild Olive, by Basil King 13212
    The Pearl, by Sophie Jewett 13211
    El Comendador Mendoza, by Juan Valera 13210
    [Subtitle: Obras Completas Tomo VII]
    [Language: Spanish]

Each new entry starts on a new line. The title and author are separated by /, by/. The ID is at the end of the first line of the entry. Combining these, a first stab at a regexp would be:
    # $1 is the title, $2 the author, $3 the ID
    $line =~ /^(\w.*?), by (.*?)\s+(\d+)$/;
    $title = $1; $author = $2; $id = $3;
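To see how that regexp behaves on the sample lines, here is a rough sketch that loops over them and collects one hash per entry. The second pattern, which attaches bracketed lines such as [Language: Dutch] to the entry before them, is my own assumption about the format and not something the sample proves. The names @lines and @entries are just for illustration, and the `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Sample lines from the catalog excerpt, one array element per line.
my @lines = (
    'Anna Karenina, by Lev Nikolaevica Tolstoi 13214',
    '[Language: Dutch]',
    'Night Before Christmas & Other Popular Stories For Children, by Various 13213',
    'The Wild Olive, by Basil King 13212',
    'The Pearl, by Sophie Jewett 13211',
    'El Comendador Mendoza, by Juan Valera 13210',
    '[Subtitle: Obras Completas Tomo VII]',
    '[Language: Spanish]',
);

my @entries;    # one hash ref per catalog entry
for my $line (@lines) {
    if ( $line =~ /^(\w.*?), by (.*?)\s+(\d+)$/ ) {
        # First line of an entry: title, author, ID.
        push @entries, { title => $1, author => $2, id => $3 };
    }
    elsif ( @entries && $line =~ /^\s*\[(\w+):\s*(.*?)\]\s*$/ ) {
        # Assumed continuation line: store "[Key: value]" on the latest entry.
        my ( $key, $value ) = ( lc $1, $2 );
        $entries[-1]{$key} = $value;
    }
}

# Print one summary line per entry; entries without a language tag say so.
for my $entry (@entries) {
    printf "%s | %s | %s | %s\n",
        $entry->{id}, $entry->{title}, $entry->{author},
        $entry->{language} // 'unspecified';
}
```

Because the title group is non-greedy, a title that itself contains ", by " would be split at the first occurrence, so this is only a starting point for the real index.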
-Mark
Replies are listed 'Best First'.

Re^2: Parsing Gutenberg Catalog Index
by Anonymous Monk on Aug 30, 2004 at 06:37 UTC
by tachyon (Chancellor) on Aug 30, 2004 at 07:58 UTC

Re^2: Parsing Gutenberg Catalog Index
by lidden (Curate) on Aug 30, 2004 at 12:08 UTC
by GotToBTru (Prior) on Mar 14, 2016 at 12:58 UTC