All,
The project I am currently working on requires me to extract citations from PDF files. I have tried several tools for extracting the text from the PDFs, such as CAM::PDF and xpdf. As you probably already know, the results are far from perfect.

Fortunately for me, I don't require perfection. Unfortunately for me, it is rather difficult to see where one citation ends and the next begins. With CAM::PDF, the entire bibliography seems to be crammed onto a single line. With xpdf, which does a better job of preserving layout, a citation can still span multiple lines (as it does in the PDF).

I would like to use Biblio::Citation::Parser::Standard, which has about 400 templates. The problem is that I want to feed it a chunk of data and have it extract the citations, but it is designed to be passed a single citation and identify its constituent parts. All I need is to identify the publication (book, magazine, journal, etc.), which Biblio::Citation::Parser::Standard can do if I can identify the citation boundaries.

I have implemented a very rudimentary approach, and I am interested in what others may be able to come up with (ideas, proof of concept, full working solutions, better tools, etc.). I am including a small sample of xpdf output below. Ultimately, I would like a solution that works for both tools (the CAM::PDF output looks identical except that it has no newlines). Any and all help will be greatly appreciated.

Biggs, S. F. & Mock, T. J., An Investigation of Auditor Decision Processes in the Evaluation of Internal Controls and Audit Scope Decisions, Journal ofAccounting Research (Spring 1983) pp. 234255. Ericsson, K_A. & Simon, H. A., Verbal Reports as Data, Psychological Revieu' (May 1980), pp. 2 15-25 1. Feinstcin, A. R., An Analysis of Diagnostic Reasoning: the Construction of Clinical Algorithms, YuleJournal of Biology andMedicine ( 1974), pp. 5-32. Kennedy, R. E. & Wilson, M. H.. The Corporate Information Investors Really Want, Business (January- February 1980) pp. 42-46. Larcker. D. F. & Lcssig, V. P.. An Examination of the Linear and Retrospective Process TracingApproaches to Judgment Modeling, TheAccounting Review (January 1983) pp. 58-77. Lcr. B., Financial StutementAnaIysis: a NewApproach (Inglewood Cliffs, NJ: Prentice-Hall, 1978). Lindsay, R K, Buchanan, B. G., Feigenbaum, E. A. & Lederberg, J.AppIications ofArtificialIntelIigencefor Organic Chemis~q~: the DENDRAL Project (New York: McGraw-Hill. 1980). Miller, R. A., Pople, H. E.. and Meyers, J. I).. INTERNIST-I, an Experimental Computer-Based Diagnostic Consultant for General Internal Medicine, :Veul England Journal of Medicine (August 1982), pp. 46% 4'6 Newell. A. 8r Simon, 11.A, Human Problem Solzv'ng (Englcwood Cliffs, NJ, Prentice-Hall. 1972). Payne, J. W., BKUUBtein, hl. 1.. bi Carroll, J. S., Exploring Predecisional Behavior: an Alternative Approach to Decision Kc-search, Orgunizutional Behuzjior and Human Performance (February 1978) pp. 17-44. Porcano. `I`.M., A Comparison of Information Needs and Sources of the Investment Community, Akron Business andEconomic Kezjieuv(Fall 1981) pp. 43-52. Reilly. F K, Im%Thnents (New York: Dryden Press, 19&Z). Ricchiutc, D. N.. An Empirical Assessment of the Impact of AlterIXItive Task Presentation Modes on Decision-Making Research in Auditing, Journal of Accounting Reseurcb (Spring 19&i), pp. 34 I-350.
Rich, E., Art~ficiul Intelligence (New York: McGraw-llill, 1983). Shields. X1.D Some Effects of Information Load on Search Patterns Used to Analyze Performance Reports, Accounling, Organizahons and Socie[y ( 1980) pp. 429-442. HOW DO FINANCIAL ANALYSTS MAKE DECISIONS? 29 Shields, M. D., Effects of Information Supply and Demand on Judgment Accuracy: Evidence from Corporate Managers, TheAccountingReview(April 1983) pp. 284-303. Siegel, J. G. & Simon, A., What Financial Statements Really Tell, Credit G Financial Management (September 1981) pp. 22,26-29.
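
As a starting point for discussion, here is a minimal sketch of one possible boundary heuristic. It is not necessarily the rudimentary approach mentioned above. It is written in Python purely for illustration, although the regular expression uses only constructs that Perl also supports. The assumptions, all hypothetical, are that a citation ends in a digit, a closing parenthesis, or a period, and that the next citation starts with a capitalised surname followed by a comma or period and then an initial. The names `BOUNDARY` and `split_citations` are invented for this example, and the test string is a handful of the shorter entries from the sample.

```python
import re

# Hypothetical heuristic (an assumption, not a known-good rule):
# split on whitespace that sits between the end of one citation and
# something that looks like "Surname, X" or "Surname. X".
BOUNDARY = re.compile(
    r"(?<=[\d).])"                    # previous citation ends in a digit, ')' or '.'
    r"\s+"                            # spaces or newlines between citations
    r"(?=[A-Z][a-z]+[.,]\s*[A-Z`])"   # surname, then ',' or '.', then an initial
)

def split_citations(text):
    """Split flattened bibliography text into candidate citations."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]

# A few of the shorter entries from the sample, OCR errors left intact.
sample = (
    "Ericsson, K_A. & Simon, H. A., Verbal Reports as Data, "
    "Psychological Revieu' (May 1980), pp. 2 15-25 1. "
    "Lcr. B., Financial StutementAnaIysis: a NewApproach "
    "(Inglewood Cliffs, NJ: Prentice-Hall, 1978). "
    "Reilly. F K, Im%Thnents (New York: Dryden Press, 19&Z).\n"
    "Rich, E., Art~ficiul Intelligence (New York: McGraw-llill, 1983)."
)

for citation in split_citations(sample):
    print(citation)
```

Because newlines count as ordinary whitespace here, the same split should apply to both the xpdf and the CAM::PDF output. Each resulting piece could then be passed one at a time to Biblio::Citation::Parser::Standard. The heuristic has known weaknesses: a running page header such as "HOW DO FINANCIAL ANALYSTS MAKE DECISIONS? 29" ends up attached to the preceding citation, and a surname that OCR has mangled into all capitals or a leading digit would be missed.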

Cheers - L~R


In reply to Extracting Bibliography Citations by Limbic~Region
