PerlMonks  

parsing a bibliography

by patrickrock (Beadle)
on Dec 01, 2004 at 22:23 UTC ( [id://411598]=perlquestion )

patrickrock has asked for the wisdom of the Perl Monks concerning the following question:

OK, I have been handed an annotated bibliography created in MS Word. I have extracted each entry onto its own line, thus:

==========begin biblio===========
-Lightfoot, J. B. St. Paul’s Epistle to the Philippians. Grand Rapids: Zondervan, 1953 (= 1913). Classic commentary by one of the greatest English-speaking NT scholars of all time. 2
-Martin, Ralph P. Philippians. Rev. ed.; NCB. Grand Rapids: Eerdmans, 1980. Clear and informed. 2
O'Brien, Peter, T. Commentary on Philippians. NIGTC. Grand Rapids: Eerdmans, 1991. Thorough and insightful comments on the Greek text. 1
-Silva, Moisés. Philippians. Baker Exegetical Commentary. Grand Rapids: Baker, 1993. Sound comments on the Greek text. 2
-Barth, Markus and Helmut Blanke. The Letter to Philemon: A New Translation with Notes and Commentary. Grand Rapids: Eerdmans, 2000. With over 500 pages devoted to a letter that was probably written on a single sheet of papyrus, this work will be consulted by all who want the most thorough treatment of Philemon and avoided by the rest of us. 3
-Bruce, F. F. The Epistles to the Colossians, to Philemon, and to the Ephesians. NIC. Grand Rapids: Eerdmans, 1984. See comments under “Commentaries on Ephesians.” 2
==========end biblio===========


Any ideas how you would parse this into its constituent parts for insertion into a database? Like Author(s), Title, Publisher, comments, etc.

There isn't anything obvious to split() on, nor any regex wizardry that occurs to me either.

Thought I'd run it by you guys before hiring a temp to type it all in by hand. Thanks in advance, Pat Rock

Replies are listed 'Best First'.
Re: parsing a bibliography
by BrowserUk (Patriarch) on Dec 01, 2004 at 23:46 UTC

    It'll probably need tweaking constantly for different input, but it works for the sample data. I've no idea what $thing is though.

    #! perl -slw
    use strict;
    $^W=0;

    while( <DATA> ) {
        my( $authors, $title, $thing, $pub, $date, $comment, $no ) = m/
            ^
            -( .*? \. ) \s(?=[A-Z][a-z])
            ( .+ ) \.\s+
            ( [^:]+? ) : \s+
            (\S+), \s+
            ( \d{4} ) [^.]* \. \s+
            ( [^\[]+ )
            \[ ( \d+ ) \] \s*
            $
        /x;

        printf "   Author:'%s'\n"
             . "    Title:'%s'\n"
             . "   Thing?:'%s'\n"
             . "Publisher:'%s'\n"
             . "     Date:'%4d'\n"
             . "  Comment:'%s'\n"
             . "       No:'%d'\n\n",
            $authors, $title, $thing, $pub, $date, $comment, $no;
    }

    __DATA__
    -Lightfoot, J. B. St. Paul’s Epistle to the Philippians. Grand Rapids: Zondervan, 1953 (= 1913). Classic commentary by one of the greatest English-speaking NT scholars of all time. [2]
    -Martin, Ralph P. Philippians. Rev. ed.; NCB. Grand Rapids: Eerdmans, 1980. Clear and informed. [2]
    -O'Brien, Peter, T. Commentary on Philippians. NIGTC. Grand Rapids: Eerdmans, 1991. Thorough and insightful comments on the Greek text. [1]
    -Silva, Moisés. Philippians. Baker Exegetical Commentary. Grand Rapids: Baker, 1993. Sound comments on the Greek text. [2]
    -Barth, Markus and Helmut Blanke. The Letter to Philemon: A New Translation with Notes and Commentary. Grand Rapids: Eerdmans, 2000. With over 500 pages devoted to a letter that was probably written on a single sheet of papyrus, this work will be consulted by all who want the most thorough treatment of Philemon and avoided by the rest of us. [3]
    -Bruce, F. F. The Epistles to the Colossians, to Philemon, and to the Ephesians. NIC. Grand Rapids: Eerdmans, 1984. See comments under “Commentaries on Ephesians.” [2]

    Output:


    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      One small but worthwhile modification would be to open an errors file and output non-matching records to it. You could then tune the regex for these or post-process them.

      open ERRS, ">error.log" or die $!;
      while( ... ) {
          my ( ... ) = m/ ... /x;
          if ( $authors ) {
              # output as desired
          }
          else {
              # no match: set the record aside for later inspection
              print ERRS "$_\n";
          }
      }
      close ERRS;

      cheers

      tachyon

      I think "thing" is the location of the publisher. Seems that Grand Rapids, (MI?) has some serious biblical publishing houses ...

      - j

      BrowserUK, oh. my. stars.

      I am in awe of you. Thanks ever so much.

      wonderful! Wish I could ++ your solution repeatedly! This writeup led to a "Eureka!" moment; the kind of haze-clearing that makes PM so valuable to beginners like me.

      Request: please add to our understanding by commenting the lines of the regex, esp. the part of line 8 reading

      (?=[A-Z][a-z])

      (grouped but non-capture??)

      and in line 13,

      ( [^\[]+ )

      which, as I read Owl (pocket ref), means capture one-or-more of a class including not-an-open_BRKT and close_BRKT ...which doesn't make sense to me, and -- more importantly, doesn't seem to WORK that way.

        The regex commented.

        my( $authors, $title, $thing, $pub, $date, $comment, $no ) = m/
            ^
            ## Author(s): Capture the minimum needed to satisfy that:
            ##   a) It ends with a '. '
            ##   b) And the next word is not an initial
            ##      IE: Lookahead and check the next word starts with
            ##      1 uppercase *and* one lowercase character.
            -( .*? \. ) \s(?=[A-Z][a-z])

            ## Title: Greedily capture something that ends with '. '
            ( .+ ) \.\s+

            ## Location: Non-greedily capture
            ##   Ends with a ': '.
            ##   Doesn't contain a ':'
            ( [^:]+? ) : \s+

            ## Publisher:
            ##   Single word followed by a ', '
            (\S+), \s+

            ## Year: Capture Four digits
            ##   Discard anything else upto '. '
            ( \d{4} ) [^.]* \. \s+

            ## Comment: Greedy capture non-'[' characters
            ##   Ie. Stop capturing when you see a '['
            ( [^\[]+ )

            ## No: Capture 1 (or more) digits between '[' & ']'
            ##   Discard any trailing space to the EOS.
            \[ ( \d+ ) \] \s*
            $
        /x;

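        For anyone else puzzling over the same two constructs, here is a minimal sketch that isolates each one. The shortened sample strings and the variable $rest are assumptions made for illustration; they are not part of the code posted in this thread.

```perl
use strict;
use warnings;

# (?=...) is a zero-width lookahead: the pattern inside must match at
# that point, but it consumes nothing and captures nothing.
my $entry = "-Bruce, F. F. The Epistles. NIC.";
my( $authors, $rest ) =
    $entry =~ m/ ^ -( .*? \. ) \s (?=[A-Z][a-z]) ( .* ) $ /x;

print "authors: '$authors'\n";   # the initials stay with the author
print "rest:    '$rest'\n";      # 'The' was only peeked at, so it is still here

# [^\[] is a negated class with a single member, an escaped '['.
# The final ']' closes the class; it is not a member of it.
my( $comment, $no ) =
    "Clear and informed. [2]" =~ m/ ^ ( [^\[]+ ) \[ ( \d+ ) \] $ /x;

print "comment: '$comment'\n";
print "no:      '$no'\n";
```

        So the lookahead groups without capturing or consuming, and ( [^\[]+ ) means "one or more characters that are not '['", which is why the comment capture stops just before the bracketed number.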
Re: parsing a bibliography
by Your Mother (Archbishop) on Dec 01, 2004 at 22:57 UTC

    This is a great case for Parse::RecDescent. It can take a record-oriented matching strategy (A is possible only if B follows, and that then excludes C, that kind of thing). The trick is that writing grammars for it is not super easy or quick, or else I'd take a stab--who knows, maybe some good and kind monk who's a pro with it is doing so right now.

Re: parsing a bibliography
by bprew (Monk) on Dec 01, 2004 at 23:03 UTC

    When I have had to do things similar to this (read: parse unordered data), I've found that attempting to pass it through several pre-filters and trying for the 90% (or sometimes only 80%) solution works well.

    It's sometimes faster to do the unthinkable (enter the data by hand) than it is to write a highly complex regex/parsing engine for some unordered data format you're unlikely to see again. Or write a partial solution that gets most of the data, spot-check it, and then enter the rest by hand.
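    A rough sketch of what such a multi-pass partial solution might look like for the sample data. The helper name peel_entry, the field names, and the choice of passes are all assumptions for illustration. It deliberately leaves author and title together in one field, and it gives up on anything that does not fit, so those records can be entered by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: peel the easy fields off in separate passes.
# Returns a hash ref, or nothing as soon as a pass fails.
sub peel_entry {
    my( $line ) = @_;
    my %rec;

    # Pass 1: normalise. Drop the leading '-' marker and outer whitespace.
    $line =~ s/^\s*-?\s*//;
    $line =~ s/\s+$//;

    # Pass 2: the trailing rating, with or without square brackets.
    $line =~ s/\s*\[?(\d+)\]?$// or return;
    $rec{rating} = $1;

    # Pass 3: anchor on the first "Place: Publisher, YYYY." and split there.
    # Author(s) and title stay together in 'front' for a later pass.
    $line =~ m/^(.*?)\.\s+([^.:]+):\s+([^,.:]+),\s+(\d{4})[^.]*\.\s*(.*)$/
        or return;
    @rec{qw( front place publisher year comment )} = ( $1, $2, $3, $4, $5 );

    return \%rec;
}

my $sample = "-Martin, Ralph P. Philippians. Rev. ed.; NCB. "
           . "Grand Rapids: Eerdmans, 1980. Clear and informed. 2";

if( my $rec = peel_entry( $sample ) ) {
    print "$_: '$rec->{$_}'\n" for sort keys %$rec;
}
else {
    print STDERR "needs a human: $sample\n";
}
```

    A publisher name containing a period or comma would fail pass 3 and land in the by-hand pile, which is the intended trade-off of an 80% solution.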

    Ultimately, you're dealing with the problem that a human can "understand" the data they're reading and make correct assumptions about it, but the computer has a much harder time; it's not well suited to the task.

    If the data is in Word, perhaps you could try formatting it *more*, and then try possibly getting it into Excel, or as mentioned above, using italics or bolding, etc.


    --
    Ben
    "Naked I came from my mother's womb, and naked I will depart."
Re: parsing a bibliography
by jimbojones (Friar) on Dec 01, 2004 at 22:51 UTC
    Hi

    Looking at what you have, it's going to be difficult. I thought about splitting on periods, but then initials and abbreviations start to mess you up.
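    To make the problem concrete, here is a small sketch of the naive period split applied to one shortened entry from the sample (the comment and rating are left off for brevity, which is an assumption of this example).

```perl
use strict;
use warnings;

my $entry = "-Bruce, F. F. The Epistles to the Colossians, to Philemon, "
          . "and to the Ephesians. NIC. Grand Rapids: Eerdmans, 1984.";

# Naive approach: every period followed by whitespace is a field boundary.
my @pieces = split /\.\s+/, $entry;

print scalar( @pieces ), " pieces\n";
print "  '$_'\n" for @pieces;
```

    The author alone is broken into '-Bruce, F' and 'F', so the number of pieces per entry depends on how many initials and abbreviations it happens to contain.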

    Is it possible that there is more information in the Word Doc? For example, is the title in italics, and if so, can you somehow highlight that information when extracting to your plaintext?

    With that information, you might be able to get author; title; location, publisher, date.

    - j

Re: parsing a bibliography
by ikegami (Patriarch) on Dec 01, 2004 at 22:56 UTC
    I don't think Perl can help you much here, but I do have a tip to offer. Load up the text in Notepad or something similar, start a search for a space, and cancel the search (i.e. make the dialog disappear). Keep pressing F3 (Find Next). When the found space separates two fields, press Enter. You'll end up with newline-separated fields, with a leading '-' marking a new record. Then, Perl might be able to help you a bit.
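    Once the fields are on separate lines, the Perl side could be as small as the sketch below. The helper name group_fields is hypothetical, and the code assumes every record really does begin with a '-' line.

```perl
use strict;
use warnings;

# Hypothetical helper: group newline-separated fields into records,
# starting a new record at every line that begins with '-'.
sub group_fields {
    my( @lines ) = @_;
    my @records;
    for my $line ( @lines ) {
        $line =~ s/^\s+|\s+$//g;       # trim; also drops a trailing newline
        next unless length $line;      # skip blank lines
        if( $line =~ s/^-// or !@records ) {
            push @records, [];         # leading '-' (or first line) opens a record
        }
        push @{ $records[-1] }, $line;
    }
    return @records;
}

my @records = group_fields(
    "-Martin, Ralph P.", "Philippians. Rev. ed.; NCB.", "Grand Rapids:",
    "Eerdmans,", "1980.", "Clear and informed.", "2",
);
print join( ' | ', @$_ ), "\n" for @records;
```

    Note that a record missing its '-' (as the O'Brien line in the original sample appears to be) would be folded into the previous record, so a quick check of the field count per record is worthwhile.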
Re: parsing a bibliography
by erix (Prior) on Dec 02, 2004 at 01:21 UTC

    If your bibliography file is large enough, and you can find the appropriate library database, it may be worthwhile to extract only basic data (author, year, title), and with those search the library database to receive standard-format bibliographic data. Your bibliography file seems to have 'comments' as its only specific part, and that part can be easily separated.

Re: parsing a bibliography
by hsmyers (Canon) on Dec 02, 2004 at 16:05 UTC
    Not a solution, but if you plan on a lot of work involving bibliographies, you might want to read up on the 'bib' form (BibTeX) used by LaTeX (read up on LaTeX while you're at it). I keep my citations in this form (there is even an editor that makes initial entry trivial---WinEdt) and adding references to a document with that as a base is quite simple.
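    As a rough illustration of that form, the sketch below renders one already-parsed record as a BibTeX @book entry. The helper name to_bibtex, the citation-key convention (surname plus year), and putting the annotation in the note field are assumptions of this example, not something prescribed by BibTeX.

```perl
use strict;
use warnings;

# Hypothetical helper: render one parsed record as a BibTeX @book entry.
sub to_bibtex {
    my( %f ) = @_;

    # Citation key: first author's surname plus year, letters and digits only.
    my( $surname ) = $f{author} =~ m/^([^,\s]+)/;
    my $key = lc( $surname ) . $f{year};
    $key =~ s/[^a-z0-9]//g;

    return join "\n",
        "\@book{$key,",
        "  author    = {$f{author}},",
        "  title     = {$f{title}},",
        "  address   = {$f{address}},",
        "  publisher = {$f{publisher}},",
        "  year      = {$f{year}},",
        "  note      = {$f{note}}",
        "}";
}

print to_bibtex(
    author    => 'Martin, Ralph P.',
    title     => 'Philippians',
    address   => 'Grand Rapids',
    publisher => 'Eerdmans',
    year      => 1980,
    note      => 'Clear and informed.',
), "\n";
```

    Values containing characters that are special to LaTeX (such as & or unbalanced braces) would need escaping before being written out; the sketch does not attempt that.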

    --hsm

    "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: parsing a bibliography
by inkdroid (Initiate) on Dec 02, 2004 at 18:00 UTC
    While it looks like you've got a nice solution for this bibliography, you might want to take a look at ParaTools. ParaTools is a suite of Perl libraries used in the eprints self-archiving application that is in heavy use in libraries around the world. I'm not sure if ParaTools will help you, but it might be interesting to look at. It appears they've recently started adding to CPAN as well.
      BrowserUK, thanks so much for the update with the code commented. I have to admit that I was completely lost as to how it worked, but it did work, with about an 80% success rate. Given the nature of bibliographies, I am incredibly impressed. I really appreciate your help and education. This was a huge leap forward in my understanding of putting regexes to work.

      Inkdroid, Paracite works like a champ! Great utility that I will be using in the future.

Node Type: perlquestion [id://411598]
Approved by Arunbear
Front-paged by davidj