Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
The current project I am working on requires me to extract citations from PDF files. I have used a variety of methods to extract the text from the PDF, such as CAM::PDF and xpdf. As you probably already know, the results are far from perfect.

Fortunately for me, I don't require perfection. Unfortunately for me, it is rather difficult to see where one citation ends and the next begins. In the case of CAM::PDF - the bibliography seems to be crammed all on a single line. In the case of xpdf, which does a better job of layout, the citation can still span multiple lines (as it does in the PDF).

I would like to use Biblio::Citation::Parser::Standard, which has about 400 templates. The problem is that I want to feed it a chunk of data and have it extract the citations, but it is designed to be passed a single citation and to identify its constituent parts. All I need to do is identify the publication (book, magazine, journal, etc.), which Biblio::Citation::Parser::Standard can do if I can identify the citation boundaries.

I have implemented a very rudimentary approach. I am interested in what others may be able to come up with (ideas, proof of concept, full working solutions, better tools, etc). I am including a small sample below from xpdf. Ultimately, I would like a solution that works for both (CAM::PDF looks identical except no newlines). Any and all help will be greatly appreciated.

Biggs, S. F. & Mock, T. J., An Investigation of Auditor Decision Proce +sses in the Evaluation of Internal Controls and Audit Scope Decisions, Journal ofAccounting Research (Spr +ing 1983) pp. 234255. Ericsson, K_A. & Simon, H. A., Verbal Reports as Data, Psychological R +evieu' (May 1980), pp. 2 15-25 1. Feinstcin, A. R., An Analysis of Diagnostic Reasoning: the Constructio +n of Clinical Algorithms, YuleJournal of Biology andMedicine ( 1974), pp. 5-32. Kennedy, R. E. & Wilson, M. H.. The Corporate Information Investors Re +ally Want, Business (January- February 1980) pp. 42-46. Larcker. D. F. & Lcssig, V. P.. An Examination of the Linear and Retro +spective Process TracingApproaches to Judgment Modeling, TheAccounting Review (January 1983) pp. 58-77. Lcr. B., Financial StutementAnaIysis: a NewApproach (Inglewood Cliffs, + NJ: Prentice-Hall, 1978). Lindsay, R K, Buchanan, B. G., Feigenbaum, E. A. & Lederberg, J.AppIic +ations ofArtificialIntelIigencefor Organic Chemis~q~: the DENDRAL Project (New York: McGraw-Hill. 1980). Miller, R. A., Pople, H. E.. and Meyers, J. I).. INTERNIST-I, an Exper +imental Computer-Based Diagnostic Consultant for General Internal Medicine, :Veul England Journal of Med +icine (August 1982), pp. 46% 4'6 Newell. A. 8r Simon, 11.A, Human Problem Solzv'ng (Englcwood Cliffs, N +J, Prentice-Hall. 1972). Payne, J. W., BKUUBtein, hl. 1.. bi Carroll, J. S., Exploring Predecis +ional Behavior: an Alternative Approach to Decision Kc-search, Orgunizutional Behuzjior and Human Performance (Fe +bruary 1978) pp. 17-44. Porcano. `I`.M., A Comparison of Information Needs and Sources of the +Investment Community, Akron Business andEconomic Kezjieuv(Fall 1981) pp. 43-52. Reilly. F K, Im%Thnents (New York: Dryden Press, 19&Z). Ricchiutc, D. N.. An Empirical Assessment of the Impact of AlterIXItiv +e Task Presentation Modes on Decision-Making Research in Auditing, Journal of Accounting Reseurcb ( +Spring 19&i), pp. 34 I-350. 
Rich, E., Art~ficiul Intelligence (New York: McGraw-llill, 1983). Shields. X1.D Some Effects of Information Load on Search Patterns Used + to Analyze Performance Reports, Accounling, Organizahons and Socie[y ( 1980) pp. 429-442. HOW DO FINANCIAL ANALYSTS MAKE DECISIONS? 29 Shields, M. D., Effects of Information Supply and Demand on Judgment A +ccuracy: Evidence from Corporate Managers, TheAccountingReview(April 1983) pp. 284-303. Siegel, J. G. & Simon, A., What Financial Statements Really Tell, Cred +it G Financial Management (September 1981) pp. 22,26-29.

Cheers - L~R

Re: Extracting Bibliography Citations
by shmem (Chancellor) on Sep 02, 2008 at 15:12 UTC

    It seems all the citations (update: begin with a capital letter and) end with "pp. something\n", so for starters

    perl -p0777 -e 'print "<<$1>>\n" while /(.*?(?:pp\..*?\n))/gs' 708496.pl

    erm. update:

    perl -n0777 -e 'print "<<$1>>\n" while /(.*?(?:pp\..*?\n))(?=[A-Z])/gs' 708496.pl
      shmem,
      This isn't far off from my approach. I have some additional error handling though. Consider that the newlines are not guaranteed to be there (remember CAM::PDF omits them) and also that pp may get mistranslated. Since it doesn't have to be perfect this is fine but I am looking to improve if possible.

      Cheers - L~R

        This isn't far off from my approach.

        Oh, no. That isn't fair. Saying "mine is better, but I won't show you" just isn't fair :-(

        And keep in mind that only you have the full variety of input at hand, so this is my last suggestion:

        { local $/; $_= <DATA> }
        print "<<$1>>\n"
            while /(.*?(?:[\dZ]\)\.|pp\..*?))\r?\n?(?=[A-z]{2})/gs;
        __DATA__
        ...(data as per OP)...
        __END__
        <<Biggs, S. F. & Mock, T. J., An Investigation of Auditor Decision Processes in the Evaluation of Internal Controls and Audit Scope Decisions, Journal ofAccounting Research (Spring 1983) pp. 234255.>>
        <<Ericsson, K_A. & Simon, H. A., Verbal Reports as Data, Psychological Revieu' (May 1980), pp. 2 15-25 1.>>
        <<Feinstcin, A. R., An Analysis of Diagnostic Reasoning: the Construction of Clinical Algorithms, YuleJournal of Biology andMedicine ( 1974), pp. 5-32.>>
        <<Kennedy, R. E. & Wilson, M. H.. The Corporate Information Investors Really Want, Business (January- February 1980) pp. 42-46.>>
        <<Larcker. D. F. & Lcssig, V. P.. An Examination of the Linear and Retrospective Process TracingApproaches to Judgment Modeling, TheAccounting Review (January 1983) pp. 58-77.>>
        <<Lcr. B., Financial StutementAnaIysis: a NewApproach (Inglewood Cliffs, NJ: Prentice-Hall, 1978).>>
        <<Lindsay, R K, Buchanan, B. G., Feigenbaum, E. A. & Lederberg, J.AppIications ofArtificialIntelIigencefor Organic Chemis~q~: the DENDRAL Project (New York: McGraw-Hill. 1980).>>
        <<Miller, R. A., Pople, H. E.. and Meyers, J. I).. INTERNIST-I, an Experimental Computer-Based Diagnostic Consultant for General Internal Medicine, :Veul England Journal of Medicine (August 1982), pp. 46% 4'6>>
        <<Newell. A. 8r Simon, 11.A, Human Problem Solzv'ng (Englcwood Cliffs, NJ, Prentice-Hall. 1972).>>
        <<Payne, J. W., BKUUBtein, hl. 1.. bi Carroll, J. S., Exploring Predecisional Behavior: an Alternative Approach to Decision Kc-search, Orgunizutional Behuzjior and Human Performance (February 1978) pp. 17-44.>>
        <<Porcano. `I`.M., A Comparison of Information Needs and Sources of the Investment Community, Akron Business andEconomic Kezjieuv(Fall 1981) pp. 43-52.>>
        <<Reilly. F K, Im%Thnents (New York: Dryden Press, 19&Z).>>
        <<Ricchiutc, D. N.. An Empirical Assessment of the Impact of AlterIXItive Task Presentation Modes on Decision-Making Research in Auditing, Journal of Accounting Reseurcb (Spring 19&i), pp. 34 I-350.>>
        <<Rich, E., Art~ficiul Intelligence (New York: McGraw-llill, 1983).>>
        <<Shields. X1.D Some Effects of Information Load on Search Patterns Used to Analyze Performance Reports, Accounling, Organizahons and Socie[y ( 1980) pp. 429-442.>>
        <<HOW DO FINANCIAL ANALYSTS MAKE DECISIONS? 29 Shields, M. D., Effects of Information Supply and Demand on Judgment Accuracy: Evidence from Corporate Managers, TheAccountingReview(April 1983) pp. 284-303.>>
        Finding the end of the line is going to be hard if you are also going to include corrupted characters. The version with too many line ends works well, but I'm not sure about a munged-up mess.
Re: Extracting Bibliography Citations
by UnderMine (Friar) on Sep 02, 2008 at 15:15 UTC
    A quick and dirty fudge would be :-
    my $data='';
    while (<>) {
        my $line=$_;
        $line=~s/[\n\r\l]//g;
        if (length($line)<10) {
            $data.=$line;
        } else {
            if ($data=~m/(?:\s[12]\d{3}\s*\)|\spp.\s+)/) {
                print "$data\n\n";
                $data=$line;
            } else {
                $data.=$line;
            }
        }
    }
    print "$data\n\n";
    Looks like that catches the other cases
    UnderMine

    update: added last print to catch final entry
      UnderMine,
      Interesting. I had abandoned a similar approach because it is possible to get runaway lines (where the year gets mistranslated or the trailing paren is omitted). I may use a variation on this approach where I use "year) pp" as stop point one; if I feel like I have gone too far, back up to "pp" only; if that fails, back up to the year; and if that fails, impose a max length on a citation.

      Cheers - L~R
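The tiered stop-point idea above could be sketched roughly like this. It is only an illustration: `split_citations`, the three patterns, and the 120-character cap are assumptions made for the example, not L~R's actual code, and "gone too far" is approximated by restricting each search to a fixed-size window.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a newline-free blob into
# citations using tiered stop points, with a hard cap as the last resort.
sub split_citations {
    my ($text, $max_len) = @_;

    # Stop points, strongest first. These patterns are illustrative guesses.
    my @tiers = (
        qr/[12]\d{3}\s*\),?\s*pp\.[^A-Za-z]*/,  # "year) pp. 42-46. "
        qr/pp\.[^A-Za-z]*/,                     # "pp. 42-46. " only
        qr/[12]\d{3}\s*\)\.?\s*/,               # "year). " (typical for books)
    );

    my @citations;
    while (length $text) {
        # Only search the first $max_len characters, so a missed stop
        # point cannot produce a runaway citation.
        my $window = substr($text, 0, $max_len);
        my $cut;
        for my $re (@tiers) {
            if ($window =~ $re) {
                $cut = $+[0];    # offset just past the end of the match
                last;
            }
        }
        $cut = length $window unless defined $cut;    # no stop point: hard cap

        my $chunk = substr($text, 0, $cut, '');       # remove and return the chunk
        $chunk =~ s/^\s+|\s+$//g;
        push @citations, $chunk if length $chunk;
    }
    return @citations;
}

my $blob = 'Kennedy, R. E., The Corporate Information, Business (January 1980) pp. 42-46. '
         . 'Lev, B., Financial Statement Analysis (Englewood Cliffs, NJ: Prentice-Hall, 1978). '
         . 'Rich, E., Artificial Intelligence (New York: McGraw-Hill, 1983).';
print "<<$_>>\n" for split_citations($blob, 120);
```

Because the tiers are tried in order, a short book entry followed closely by a journal entry can still be merged when both fit inside the window, so the cap would need tuning against real data.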

        Sounds like a double parse is your best bet then. i.e. use something like the above and then trap for exceptional lines and reparse them to split them again.

        Corruption is giving you the real headache. Why is the data so corrupt? It sounds like you are using PDFs generated by OCR. Have you looked at Tesseract? I have used it when I needed to train a system to handle OCR, and improving the quality of your source data is always an option.

        UnderMine
Re: Extracting Bibliography Citations
by ikegami (Patriarch) on Sep 03, 2008 at 01:38 UTC

    You asked if one could leverage the templates from Biblio-Citation-Parser to define a grammar.

    Biblio::Citation::Parser::Standard matches the entire input against each template, and picks the most "reliable" and "concrete" of the matches. (The templates are weighted.) I don't know what effect it would have if you didn't require the entire input to be matched (i.e. changed $ref =~ /^$currtemplate$/ to $ref =~ /^$currtemplate/).

    The problem is that longer is considered better.

    use Biblio::Citation::Parser::Standard qw( );
    BEGIN {
        *get_reliability = \&Biblio::Citation::Parser::Standard::get_reliability;
    }
    my $r1 = get_reliability('_AUTHORS_ (_YEAR_) _TITLE_. _PUBLICATION_');
    my $r2 = get_reliability('_AUTHORS_ (_YEAR_) _TITLE_'               );
    print(($r1 > $r2 ? 'ok' : 'not ok'), " (r1: $r1, r2: $r2)\n");
    ok (r1: 2.7, r2: 2.05)

    So if the text matched by _PUBLICATION_ was supposed to be the start of the next citation, we've got a problem.

    Another likely problem is the use of _ANY_. Those that end with _ANY_ are a definite problem.
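To make the trailing-field problem concrete, here is a toy illustration. The regex is a hand-written stand-in for a template shaped like '_AUTHORS_ (_YEAR_) _TITLE_. _ANY_'; it is an assumption for demonstration only and not the pattern the module builds internally. `match_prefix` is likewise a hypothetical helper.

```perl
use strict;
use warnings;

# Toy stand-in for a template ending in a catch-all field.
# NOT the regex Biblio::Citation::Parser::Standard generates.
my $template = qr/(?<authors>[^(]+)\((?<year>\d{4})\)\s*(?<title>[^.]+)\.\s*(?<any>.*)/;

# Match without the trailing '$' anchor and return a copy of the named captures.
sub match_prefix {
    my ($ref) = @_;
    return $ref =~ /^$template/ ? +{ %+ } : undef;
}

my $two = 'Rich, E. (1983) Artificial Intelligence. '
        . 'Lev, B. (1978) Financial Statement Analysis.';

if (my $m = match_prefix($two)) {
    print "title: $m->{title}\n";
    print "any:   $m->{any}\n";    # holds the whole second citation
}
```

Dropping the end anchor does not help here: the greedy last field still swallows the next citation, so the boundary is lost either way.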

Re: Extracting Bibliography Citations
by Illuminatus (Curate) on Sep 02, 2008 at 15:41 UTC
    Interesting problem. I looked at the source for the parser, and have a couple of (completely untested) suggestions:
    1) Pass the whole citation block as a single line to the parser. Maybe in most instances it either ignores stuff after the end of the best match, or it lumps the last part of the string into some tag. In the former case, you could match the various returned tags against the current line, to find where the parser stopped, and start again from there. In the latter case, you could find the tag with the lumped data and work with that.
    2) Create your own small parser to find citation ends. Don't almost all of them either end in a page reference or a publication reference, and also ultimately with a period?
    In the event that (1) isn't feasible, and performance is not an issue (probably not, given the way the parser works), (2) could probably be written to catch 80+% of the situations. When it punts, go a few tokens past where it punted, then iterate parse calls, subtracting one token at a time, and keep track of reliability. It will most likely max out at the correct place. You can then continue again from that point in the string.
    Hope you hadn't already thought of all this, and that it was helpful.
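The "overshoot, then back off one token at a time" idea might look something like the sketch below. `best_boundary` and `toy_score` are hypothetical names; in particular `toy_score` is a crude stand-in for whatever reliability figure a real parse call would yield, since the thread does not show how to obtain one per candidate string.

```perl
use strict;
use warnings;

# Hypothetical helper: starting a few tokens past the punt position, drop one
# token at a time and keep the end index with the highest score.
# $score is any callback that returns a number for a candidate string.
sub best_boundary {
    my ($tokens, $start, $punt, $extra, $score) = @_;
    my $hi = $punt + $extra;
    $hi = $#$tokens if $hi > $#$tokens;

    my ($best_end, $best_score);
    for (my $end = $hi; $end >= $start; $end--) {
        my $s = $score->(join ' ', @{$tokens}[$start .. $end]);
        if (!defined $best_score || $s > $best_score) {
            ($best_end, $best_score) = ($end, $s);
        }
    }
    return $best_end;
}

# Toy scorer (an assumption): 1 if the candidate contains exactly one
# "year)." and ends with it, otherwise 0.
sub toy_score {
    my ($candidate) = @_;
    my @ends = $candidate =~ /\d{4}\)\./g;
    return (@ends == 1 && $candidate =~ /\d{4}\)\.$/) ? 1 : 0;
}

my @tokens = split ' ',
    'Rich, E., Artificial Intelligence (New York: McGraw-Hill, 1983). '
  . 'Lev, B., Financial Statement Analysis (Englewood Cliffs, NJ: Prentice-Hall, 1978).';

my $end = best_boundary(\@tokens, 0, 4, 5, \&toy_score);
print join(' ', @tokens[0 .. $end]), "\n";
```

Ties go to the longest candidate because the loop walks from longest to shortest and only replaces the best on a strictly higher score; whether that is the right tie-break would depend on how the real scores behave.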