t0mas has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I've just finished a small learning project of mine: an image archive browser (to browse my small personal image archive), which I wrote to learn Apache::ASP. (Yes, I know there are already lots of image archive browsers out there, but I thought it was fun anyway...) Now, the next thing I want to learn is Berkeley DB (using the BerkeleyDB module from CPAN), and I thought that extending the image archive browser with a database of data about each image would be a nice learning project for me.

I want to be able to describe each image with categorizing fields, e.g. artist, size, year, published in, depth, etc. I would also like to describe the contents of the image with words like "flower", "wolves", "people" and so on. I don't think I'll have any problems creating that :)

However, the more I think about it, the more I want to be able to describe the content of the image in a way that the computer can understand, so I can weed out duplicates. By this I don't mean exact duplicates, since I already weed those out with a script using an MD5 checksum, but rather the same image content in a different size or maybe scanned by another person (different colors). A sort of query-by-image-content kind of thing.
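For reference, the exact-duplicate check mentioned above can be sketched roughly as follows. This is only an illustration in Python, not the actual script; the function names and the in-memory `files` mapping are assumptions made for the example. It also shows why a checksum cannot catch a resized or rescanned copy: any change to the bytes gives a different digest.

```python
import hashlib
from collections import defaultdict


def md5_of_bytes(data):
    """Return the hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def find_exact_duplicates(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` is a hypothetical mapping of file name -> raw bytes.
    Returns a list of groups, each a sorted list of duplicate names.
    """
    groups = defaultdict(list)
    for name, data in files.items():
        groups[md5_of_bytes(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Two identical files and one that differs in a single byte.
files = {
    "a.jpg": b"same bytes",
    "b.jpg": b"same bytes",
    "c.jpg": b"same bytes!",
}
duplicates = find_exact_duplicates(files)
```

Here `a.jpg` and `b.jpg` end up in one group, while `c.jpg` is treated as unrelated even though it differs by a single byte.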

I thought of creating some sort of representation for each image and storing it in the database, then creating a representation for each new image I get, comparing it to the ones stored in the db, and having the program yell "Hey, t0mas, you've already got that one! I'm 98% sure of that." at me if I already have an image that is 98% the same (content-wise).
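To make the idea concrete, here is a minimal sketch of one possible representation, a so-called average hash. It is not taken from this thread and is only one of several possible approaches. It assumes each image has already been decoded into a grid of grayscale values (a list of rows, at least 8x8); the names `shrink`, `signature` and `similarity` are made up for the illustration, and the sketch is in Python but should port to Perl easily.

```python
def shrink(pixels, size=8):
    """Average a grayscale grid down to size*size cells (row-major list).

    Assumes the grid is at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(size):
        y0, y1 = row * height // size, (row + 1) * height // size
        for col in range(size):
            x0, x1 = col * width // size, (col + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    return cells


def signature(pixels, size=8):
    """One bit per cell: 1 if the cell is brighter than the image's mean."""
    cells = shrink(pixels, size)
    mean = sum(cells) / len(cells)
    return [1 if cell > mean else 0 for cell in cells]


def similarity(sig_a, sig_b):
    """Percentage of signature bits that agree."""
    same = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return 100.0 * same / len(sig_a)


# A synthetic 16x16 "image", a copy at twice the size, and a brighter copy.
small = [[x * 10 + y for x in range(16)] for y in range(16)]
big = [[small[y // 2][x // 2] for x in range(32)] for y in range(32)]
brighter = [[value + 20 for value in row] for row in small]
inverted = [[255 - value for value in row] for row in small]

resized_score = similarity(signature(small), signature(big))
brighter_score = similarity(signature(small), signature(brighter))
inverted_score = similarity(signature(small), signature(inverted))
```

Because the signature only records whether each region is brighter or darker than the image's own average, a resized or uniformly brightened copy scores the same as the original, while a genuinely different picture (here, the inverted one) scores much lower. Cropping or rotation would still defeat this simple scheme.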

My problem is that I'm completely lost in this area. I've searched for algorithms for this kind of stunt, but found none. I've found some papers describing the general concepts, but no gory details. There is some software out there that can do object recognition on images, but that seems like overkill for my little app (and I haven't found any with an open API), since I only want to weed out duplicates.

I haven't done anything like this before, so now I'm asking you all: Has anyone done anything like this in Perl? Do you have any algorithms or snippets to share? Any URLs to open source projects (with an API) that do this?

Thanks for your time.

/brother t0mas

Re: Query by Image Content
by Corion (Patriarch) on Jan 15, 2001 at 17:07 UTC

    You're going into an interesting area, information management, which in my opinion will stay interesting for the next 10 years. I do the admin stuff for a small business that works in this field (plug: navicon.de (German, not really useful except for the demo download of Cernato)).

    To achieve meaningful results, you need two things:

    1. A fine categorization to find the differences between images
    2. A thesaurus to make searching and cross-linking more efficient

    Number 1, the fine categorization, can be helped by having your program ask you for the differences between two images every time there is a detected identity of more than (say) 75%. Either you claim that they are identical (and thus remove the image of lower preference) or you introduce a new criterion (or specify an existing criterion) to distinguish them.

    Number 2, the thesaurus (a dictionary of synonyms, that is, of different words that mean the same thing), is used to create coarser granularities from the fine-grained specifications. This helps to find, for example, images with a book on them, even though you only filed the image under "lexicon" (bad example, but I can't think of a good one right now).
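    A thesaurus lookup of this kind could be sketched as below. This is only an illustration in Python; the thesaurus entries, the `index` layout and the function names are all hypothetical, chosen to mirror the "book"/"lexicon" example.

```python
# Hypothetical thesaurus: a coarse term mapped to the finer terms it covers.
THESAURUS = {
    "book": {"lexicon", "dictionary", "novel"},
    "plant": {"flower", "tree"},
}


def expand(term, thesaurus):
    """Return the search term together with every finer term filed under it."""
    return {term} | thesaurus.get(term, set())


def search(term, index, thesaurus):
    """Find images whose keywords match the term or any of its synonyms.

    `index` is a hypothetical mapping of image name -> list of keywords.
    """
    wanted = expand(term, thesaurus)
    return sorted(name for name, keywords in index.items()
                  if wanted & set(keywords))


index = {
    "img1.jpg": ["lexicon", "desk"],
    "img2.jpg": ["flower"],
    "img3.jpg": ["wolves"],
}
book_hits = search("book", index, THESAURUS)
```

    Searching for "book" finds `img1.jpg` although it was only filed under "lexicon"; a term with no thesaurus entry simply matches itself.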

    Of course, the coolest thing would be a file system based on this concept - taking the need for paths and their ilk away from the user, making it possible to concentrate on the file content rather than the semantic and syntactic administrivia of file management...

Re: Query by Image Content
by Braindead_One (Monk) on Jan 15, 2001 at 19:56 UTC
    There is a project called "GiST", which is about indexing various types of data in a database.
    One part of this project is a system called "Blobworld", which allows the user to search for similar images by selecting and rating regions of an existing image. I don't know how "Perlish" this one is, but it could be worth a look.
    You can find the Project here.
    And here you can test Blobworld.

    Hope this helps you
    Braindead One
Re: Query by Image Content
by IndyZ (Friar) on Jan 15, 2001 at 23:38 UTC
    You have chosen a very difficult project for yourself. Automatically recognizing similar images is the type of problem that comp-sci graduate students write doctoral theses about. It is unbelievably difficult to have computers recognize images.

    Your program would probably look for distinctive features of the images. Probably simple shapes. The line of the horizon in a shot of the desert. The circle of a noon sun. The rectangle of a television screen. Then, you would compare these shapes to a precompiled database of your other images.

    Unfortunately, there are still problems with this approach. Users might have scanned at a different magnification or resolution. Their scan might be cropped differently. Of course, the most daunting challenge is how to program a computer to recognize shapes in an image. I, personally, can't even begin to think of how to do this.

    --
    IndyZ