High-Performance Hierarchical Text Search High-Performance Hierarchical Text Search database database

High-Performance Hierarchical Text Search


take a look at Apache Lucene. You can implement very flexible yet efficient searches using Lucene. It may be useful.

Also take a look at the Search Patterns - What you are describing may fit into the Faceted Search pattern.

It is quite easy implement a trivial "Aaron House Living Door" algorithm, but not sure the regular SVM/classification/entropy based algorithms would scale to a large data set. You may also want to look at the "approximation search" concepts by Motwani and Raghavan.

Please post back what you find, if possible :-)


Hi Aarron, I have the following idea:
From your description I have the following image in my mind:

          Aaron         /     \        /       \       /         \  House           Cars    |            /    \Living Room   Ferrari  Mercedes    |Liquor Cabinet    /    |    \Table   Door   Window

This is how your search tree might look like. Now I would sort the nodes on every level:

               Aaron              /     \             /       \            /         \         Cars         House         /   \       /        /     \     /       /       \   /      /         \ /     /           X    /           / \   /           /   \  /           /     \ /           /       \|           /         \|          /           \ Ferrari   Living Room   Mercedes                        |                  Liquor Cabinet                   /    |    \               Door   Table   Window

Now it should be easy and fast to process a query:

  1. Start with the last word in the query and the lowest node level(leafs)
  2. Since all the nodes are sorted within one level, You can use binary search and therefore find a match in O(log N) time, where N is the node count.
  3. Do this for every level. There are O(log N) levels in the tree.
  4. Once You find a match, process all parent nodes to see, if the path matches your query. The path has length O(log N). If it matches, store it in the results, that should be shown to the user.

Let be M the number of overall matches (number of nodes matching the last word in the query). Then our processing time is: O( (log N)^2 + M * (log N) ):
Binary search takes O(log N) time per level and there are O(log N) levels, therefore we have to spend at least O( (log N)^2 ) time. Now, for every match, we have to test, whether the complete path from our matching node up to the root matches the complete query. The path has length O(log N). Thus, given M matches overall, we spend another M * O(log N) time, thus the resulting execution time is O( (log N)^2 + M * (log N) ).

When You have few matches, the processing time approaches O( (log N)^2 ), which is pretty good. On the opposite if the worst case occurs (every single path matches the query (M = N)), the processing time approaches O(N log N) which is not too good, but also not too likely.

Implementation:You said, that You only wanted an idea. Further my knowledge on databases is very limited, so I won't write much here, just outline some ideas.
The node table could look like this:

  • ID : int
  • Text : string
  • Parent : int -> node ID
  • Level : int //I don't expect this to change too often, so You can save it and update it, as the database changes.

This table would have to be sorted by the "Text" column. Using the algorithm described above a sql query inside the loop might look like:
SELECT ID FROM node WHERE Level = $i AND Text LIKE $text
Hope one can get my point.

One could speed things even more up, by not only sorting the table by the "Text" column, but by the combined "Text" and "Level" columns, that is, all entries within Level=20 sorted, all entries within Level=19 sorted etc.(but no overall sorting over the complete table necessary). However, the node count PER LEVEL is in O(N), so there is no asymptotic runtime improvement, but I think it's worth to try out, considering the lower constants You get in reality.

Edit: Improvement

I just noticed, that the iterative algorithm is completely unnecessary(thus the Level information can be abandoned). It is fully sufficient to:

  1. Store all nodes sorted by their text value
  2. Find all matches for the last word of the query at once using binary search over all nodes.
  3. From every match, trace the path up to the root and check if the path matches the whole query.

This improves the runtime to O(log N + M * (log N)).