in reply to Extracting Bibliography Citations

It seems all the citations (update: begin with a capital letter and) end with "pp. something\n", so for starters

perl -p0777 -e 'print "<<$1>>\n" while /(.*?(?:pp\..*?\n))/gs' 708496.pl

erm. update:

perl -n0777 -e 'print "<<$1>>\n" while /(.*?(?:pp\..*?\n))(?=[A-Z])/gs' 708496.pl
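
A script form of the updated one-liner may be easier to tweak than the command line. This is only a sketch: the sub name `extract_citations` and the sample text are made up, and the `\z` alternative in the lookahead is an assumption added so that a citation at the very end of the input isn't skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every chunk that runs up to "pp. ...\n",
# provided the newline is followed by a capital letter (start of the
# next citation) or by the end of the input (assumed addition).
sub extract_citations {
    my ($text) = @_;
    my @found;
    while ( $text =~ /(.*?pp\..*?\n)(?=[A-Z]|\z)/gs ) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample; the second citation spans two lines.
my $sample = "Smith, J., A Title, Some Journal (May 1980), pp. 1-10.\n"
           . "Jones, K., Another Title,\nOther Journal (1981) pp. 20-30.\n";

print "<<$_>>\n" for extract_citations($sample);
```

Like the one-liner, this still skips entries that have no "pp." at all (books, for example).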

Re^2: Extracting Bibliography Citations
by Limbic~Region (Chancellor) on Sep 02, 2008 at 15:20 UTC
    shmem,
    This isn't far off from my approach. I have some additional error handling though. Consider that the newlines are not guaranteed to be there (remember, CAM::PDF omits them) and also that "pp" may get mistranslated. Since it doesn't have to be perfect, this is fine, but I am looking to improve if possible.
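
To make those two concerns concrete, here is a rough sketch that does not rely on newlines and accepts a few spellings of "pp.". It is not the error handling mentioned above: the variant list in `$pp`, the sub name `split_citations`, and the rule that a citation ends at the period after the page range are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical list of ways "pp." might come out of the extraction;
# adjust it to whatever the real data shows.
my $pp = qr/(?:pp|p\sp|PP|pP)\s?[.,]/;

# Hypothetical helper: a citation is taken to end at the period that
# closes the page range (digits, spaces, hyphens) after the page marker.
# Newlines are not required; any whitespace between citations is skipped.
sub split_citations {
    my ($text) = @_;
    my @found;
    while ( $text =~ /(.+?$pp\s*[\d\s-]+\.)\s*/gs ) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample with no newlines and one mangled page marker.
my $sample = "Smith, J., A Title (May 1980), pp. 1-10. "
           . "Jones, K., Another Title (1981) p p, 20-30.";

print "<<$_>>\n" for split_citations($sample);
```

Entries without any page marker would get glued onto the following citation, so this only covers the journal-style references.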

    Cheers - L~R

      This isn't far off from my approach.

      Oh, no. That isn't fair. Saying "mine is better, but I won't show you" just isn't fair :-(

      And keep in mind that only you have the full variety of input at hand, so this is my last suggestion:

      { local $/; $_= <DATA> }
      print "<<$1>>\n" while /(.*?(?:[\dZ]\)\.|pp\..*?))\r?\n?(?=[A-z]{2})/gs;
      __DATA__
      ...(data as per OP)...
      __END__
      <<Biggs, S. F. & Mock, T. J., An Investigation of Auditor Decision Processes in the Evaluation of Internal Controls and Audit Scope Decisions, Journal ofAccounting Research (Spring 1983) pp. 234255.>>
      <<Ericsson, K_A. & Simon, H. A., Verbal Reports as Data, Psychological Revieu' (May 1980), pp. 2 15-25 1.>>
      <<Feinstcin, A. R., An Analysis of Diagnostic Reasoning: the Construction of Clinical Algorithms, YuleJournal of Biology andMedicine ( 1974), pp. 5-32.>>
      <<Kennedy, R. E. & Wilson, M. H.. The Corporate Information Investors Really Want, Business (January- February 1980) pp. 42-46.>>
      <<Larcker. D. F. & Lcssig, V. P.. An Examination of the Linear and Retrospective Process TracingApproaches to Judgment Modeling, TheAccounting Review (January 1983) pp. 58-77.>>
      <<Lcr. B., Financial StutementAnaIysis: a NewApproach (Inglewood Cliffs, NJ: Prentice-Hall, 1978).>>
      <<Lindsay, R K, Buchanan, B. G., Feigenbaum, E. A. & Lederberg, J.AppIications ofArtificialIntelIigencefor Organic Chemis~q~: the DENDRAL Project (New York: McGraw-Hill. 1980).>>
      <<Miller, R. A., Pople, H. E.. and Meyers, J. I).. INTERNIST-I, an Experimental Computer-Based Diagnostic Consultant for General Internal Medicine, :Veul England Journal of Medicine (August 1982), pp. 46% 4'6>>
      <<Newell. A. 8r Simon, 11.A, Human Problem Solzv'ng (Englcwood Cliffs, NJ, Prentice-Hall. 1972).>>
      <<Payne, J. W., BKUUBtein, hl. 1.. bi Carroll, J. S., Exploring Predecisional Behavior: an Alternative Approach to Decision Kc-search, Orgunizutional Behuzjior and Human Performance (February 1978) pp. 17-44.>>
      <<Porcano. `I`.M., A Comparison of Information Needs and Sources of the Investment Community, Akron Business andEconomic Kezjieuv(Fall 1981) pp. 43-52.>>
      <<Reilly. F K, Im%Thnents (New York: Dryden Press, 19&Z).>>
      <<Ricchiutc, D. N.. An Empirical Assessment of the Impact of AlterIXItive Task Presentation Modes on Decision-Making Research in Auditing, Journal of Accounting Reseurcb (Spring 19&i), pp. 34 I-350.>>
      <<Rich, E., Art~ficiul Intelligence (New York: McGraw-llill, 1983).>>
      <<Shields. X1.D Some Effects of Information Load on Search Patterns Used to Analyze Performance Reports, Accounling, Organizahons and Socie[y ( 1980) pp. 429-442.>>
      <<HOW DO FINANCIAL ANALYSTS MAKE DECISIONS? 29 Shields, M. D., Effects of Information Supply and Demand on Judgment Accuracy: Evidence from Corporate Managers, TheAccountingReview(April 1983) pp. 284-303.>>
        shmem,
        Oh, no. That isn't fair. Saying "mine is better, but I won't show you" just isn't fair :-(

        Hrmm. I certainly wasn't complaining, nor was I trying to tell you it wasn't good enough. I purposely avoided sharing what I had come up with so as not to influence solutions. The intent of my comment was more along the lines of "The basic strategy is sound with minor tweaking". When I said "but I am looking to improve if possible", I apologize if that came across as "try harder shmem, but do so blind". I wanted to indicate that a different strategy altogether might be better, since minor refinements on the existing one are going to lead to diminishing returns.

        Thank you for your contributions. They are valued and appreciated.

        Cheers - L~R

      Finding the end of the line is going to be hard if you are also going to include corrupted characters. The version with too many line ends works well, but I am not sure about a munged-up mess.