kiz,
You are not just comparing two strings. You are mentally tokenizing them. Further, you are giving them context: things in parens at the end, for instance, do not seem to be important to you. In other words, you have to build a solution that applies the same mental process you do to determine whether two reference sources are the same. I have built a solution that does this in the past, but since I was hired to do it, I can't share the code with you.
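To make the "tokenize and ignore the trailing parens" idea concrete, here is a minimal sketch in Python (it ports to Perl almost line for line). It is not the code I wrote for hire. The rule that a trailing parenthetical is noise, the function name tokenize, and the sample strings are all assumptions made up for illustration; your data may need different rules.

```python
import re

# Assumption: a parenthetical at the very end of a reference carries no
# identifying information, so it is stripped before tokenizing.
TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")

def tokenize(reference):
    """Drop one trailing parenthetical, lowercase, and split into word tokens."""
    core = TRAILING_PARENS.sub("", reference)
    return re.findall(r"[a-z0-9]+", core.lower())

# Made-up sample references, not taken from the original question.
a = tokenize("Journal of Applied Physics (Online)")
b = tokenize("journal of applied physics")
print(a == b)  # True: the two differ only in case and the trailing parens
```

Comparing token lists instead of raw strings means case, punctuation, and spacing stop mattering before any fuzzy matching starts.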

There are plenty of tools for building your own, but you have to figure out how to glue them together. The problem is not simple, so your approach can't be either. I used a layered approach. Let me give you some things to consider.

Now consider all the tools in your tool bag and how they may be useful. Here are some examples:

I can see you have already searched CPAN and know about things like Text::Compare and Text::PhraseDistance, but your strings seem to be publication references. There are a number of modules on CPAN for citations and bibliography references, and you may be able to leverage them as well. It would also be helpful to know more about the overall project, because other tools may apply. For instance, do you have a known list of publications plus a list that needs to be identified against it, or do you have one huge bunch in which you are trying to identify duplicates? The approach I would take is different in each case.
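As a rough illustration of how a layered comparison could serve both cases, here is a small Python sketch using only the standard library. Everything in it is an assumption for the sake of example: the three layers chosen, the 0.8 threshold, and the helper names similarity, best_match, and find_duplicates. It is one possible arrangement, not the solution I built.

```python
import re
from difflib import SequenceMatcher

def tokens(text):
    """Strip one trailing parenthetical, lowercase, split into word tokens."""
    core = re.sub(r"\s*\([^()]*\)\s*$", "", text)
    return re.findall(r"[a-z0-9]+", core.lower())

def similarity(a, b):
    """Layered score in [0, 1]: cheap exact checks first, fuzzier ones after."""
    ta, tb = tokens(a), tokens(b)
    if ta == tb:                      # layer 1: identical once normalized
        return 1.0
    sa, sb = set(ta), set(tb)
    if not sa or not sb:
        return 0.0
    jaccard = len(sa & sb) / len(sa | sb)   # layer 2: token overlap, order ignored
    # layer 3: character-level similarity, which tolerates typos inside tokens
    ratio = SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    return max(jaccard, ratio)

def best_match(candidate, known, threshold=0.8):
    """Case 1: identify one reference against a known list (None if no match)."""
    if not known:
        return None
    match = max(known, key=lambda k: similarity(candidate, k))
    return match if similarity(candidate, match) >= threshold else None

def find_duplicates(refs, threshold=0.8):
    """Case 2: pairwise scan of one big pile for probable duplicates."""
    return [(refs[i], refs[j])
            for i in range(len(refs))
            for j in range(i + 1, len(refs))
            if similarity(refs[i], refs[j]) >= threshold]
```

The pairwise scan is quadratic, so for a large pile you would want to bucket the references first (for example by a shared token) and only compare within buckets. The threshold needs tuning against references you have already matched by hand.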

I have a stack full of notes on the topic of text comparison and analysis I have been meaning to write about at length. If you need more help, speak up.

Cheers - L~R


In reply to Re: Fuzzy text matching... again by Limbic~Region
in thread Fuzzy text matching... again by kiz
