Libpuzzle Indexing millions of pictures? Libpuzzle Indexing millions of pictures? php php

Libpuzzle Indexing millions of pictures?


So, let's take a look at the example they give and try to expand.

Let's assume you have a table that stores information relating to each image (path, name, description, etc). In that table, you'll include a field for the compressed signature, calculated and stored when you initially populate the database. Let's define that table thus:

CREATE TABLE images (    image_id INTEGER NOT NULL PRIMARY KEY,    name TEXT,    description TEXT,    file_path TEXT NOT NULL,    url_path TEXT NOT NULL,    signature TEXT NOT NULL);

When you initially compute the signature, you're also going to compute a number of words from the signature:

// this will be run once for each image:$cvec = puzzle_fill_cvec_from_file('img1.jpg');$words = array();$wordlen = 10; // this is $k from the example$wordcnt = 100; // this is $n from the examplefor ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {    $words[] = substr($cvec, $i, $wordlen);}

Now you can put those words into a table, defined thus:

CREATE TABLE img_sig_words (    image_id INTEGER NOT NULL,    sig_word TEXT NOT NULL,    FOREIGN KEY (image_id) REFERENCES images (image_id),    INDEX (image_id, sig_word));

Now you insert into that table, prepending the position index of where the word was found, so that you know when a word matches that it matched in the same place in the signature:

// the signature, along with all other data, has already been inserted into the images// table, and $image_id has been populated with the resulting primary keyforeach ($words as $index => $word) {    $sig_word = $index.'__'.$word;    $dbobj->query("INSERT INTO img_sig_words (image_id, sig_word) VALUES ($image_id,        '$sig_word')"); // figure a suitably defined db abstraction layer...}

Your data thus initialized, you can grab images with matching words relatively easily:

// $image_id is set to the base image that you are trying to find matches to$dbobj->query("SELECT i.*, COUNT(isw.sig_word) as strength FROM images i JOIN img_sig_words    isw ON i.image_id = isw.image_id JOIN img_sig_words isw_search ON isw.sig_word =    isw_search.sig_word AND isw.image_id != isw_search.image_id WHERE    isw_search.image_id = $image_id GROUP BY i.image_id, i.name, i.description,    i.file_path, i.url_path, i.signature ORDER BY strength DESC");

You could improve the query by adding a HAVING clause that requires a minimum strength, thus further reducing your matching set.

I make no guarantees that this is the most efficient setup, but it should be roughly functional to accomplish what you're looking for.

Basically, splitting and storing the words in this manner allows you to do a rough distance check without having to run a specialized function on the signatures.


I've experimented with libpuzzle before - got about as far as you. Didnt really start on a proper implementation. Was also unclear how exactly to do it. (and abandoned the project for lack of time - so didnt really persist with it)

Anyway, looking now, will try to offer my understanding - maybe between us we can work it out :)

Queries use a 2 stage process -

  1. first you use the words table.
    1. take the 'reference' image and work out its signature.
    2. work out its component words,
    3. consult the words table to find all the possible matches. This can use the database engines 'indexes' for efficient queries.
    4. compile a list of all sig_ids. (will get some duplicates in 3. )
  2. Then consult the signatures table
    1. retreive and decompress all possible from signatures (because you have a prefiltered list the number should be relatively small)
    2. use puzzle_vector_normalized_distance to work out an actual distance.
    3. sort and rank the results as required

(ie you only use compression on the signatures table. words remains uncompressed, so can run fast queries on it)

The words table is a form of inverted index. In fact I have in mind to use https://stackoverflow.com/questions/tagged/sphinx instead the words database table, because that is designed specifically as a very fast inverted index.

... in theory anyway...


I am also working on libpuzzle in php and am having some doubts about how to generate the words from the image signatures.Jasons answer above seems right but I have a problem with this part:

// this will be run once for each image:$cvec = puzzle_fill_cvec_from_file('img1.jpg');$words = array();$wordlen = 10; // this is $k from the example$wordcnt = 100; // this is $n from the examplefor ($i=0; $i<min($wordcnt, strlen($cvec)-$wordlen+1); $i++) {    $words[] = substr($cvec, $i, $wordlen);}