If you’ve never worked with a search engine like Apache Solr or Elasticsearch (or the lower-level Lucene library), some of the concepts and conventions can feel a bit foreign at first. In this article, I’ll give you a deeper understanding of how search engines work and how they relate to many of the concepts you already know as a data scientist. For a better idea of why you should even care about bringing search into your data practice, be sure to check out my previous article “7 Benefits of Using Search Engine Tools for Data Analysis” first.

Modern search engines are like chameleons, but instead of changing colors to fit their background, they change in response to the different workloads asked of them. This chameleon-like quality is a direct result of a concerted effort by search engineers to make it ever easier to mash together disparate features (text, categorical, numerical, spatial) to produce a ranked list of results, along with really fast ways of counting common attributes in those result lists (often called faceting or, more broadly, aggregations). In adding support for all of these data types and ways of counting objects, a cool thing came about as a byproduct: people realized that since the data is already in the engine for search, they might as well do more general computations on the data where it sits instead of moving it out to some other processing layer. Turns out, search engine tools can do this pretty quickly. This has led to a massive shift in the way people conduct search, business intelligence, and IT monitoring/analytics, and has spawned a whole new industry of companies who sell search-driven analytics (like yours truly).

What follows, then, are some of the different ways I think about search engine technology. Not all of them are a perfect view of what search is these days, but they all illuminate the ways the chameleon changes. I guarantee at least one of the views, if not multiple, will resonate with some part of your data practice.

A Vector Space Engine

Most search engines take in content one document at a time (if you come from databases, a document is the equivalent of a row), apply some transformations to it, and turn it into an n-dimensional vector of weights, which is then stored in a data store (usually called an index) designed for really fast, partial/fuzzy lookups and counting. At query time, matching vectors can be sorted in traditional database ways, such as by date. However, they can also be scored according to a variety of criteria, including a document model, index statistics, and a variety of runtime factors.

Figure 1. An example of mapping and scoring a query and a document in a vector space.

For instance, in building a restaurant search engine (e.g., Yelp), each document in the index might contain fields (columns, for the database folks) for title, description, food type, hours open, ratings, reviews, location, average price, and so on. At query time, the engine may choose to match on some or all of the criteria in the document and factor in different attributes using different weights. It might also factor in user information like prior behavior or intent. For instance, one might heavily weight the fact that the user’s keywords appear in the title, but then also factor in how far the user is from the location and the fact that the user prefers spicy food when sorting the final results. In practical terms, this matching process is a highly optimized cosine similarity calculation between two vectors (the query vector and the document vector) in a vector space, as illustrated in Figure 1. Note, I use the word query here in a very loose sense. This query is just an expression of what you want to match on and isn’t tied to a notion of text. Much of leveraging search comes down to understanding how to create these vectors to get the results you want. Chances are you’ve come across similar concepts if you’ve ever done something like k-means clustering, k-nearest neighbors, or even collaborative filtering-style recommenders.
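To make that concrete, here’s a minimal sketch in Python using scikit-learn. The restaurant snippets are made up, and a real engine uses BM25-style weights over compressed index structures rather than dense library calls, but the shape of the computation is the same:

```python
# A toy illustration of the vector space view: documents and the query become
# TF-IDF weight vectors, and ranking is a cosine similarity between them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Spicy Szechuan noodles, open late, great reviews",
    "Quiet French bistro with an extensive wine list",
    "Late-night taqueria famous for spicy salsa",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)          # one weight vector per document
query_vector = vectorizer.transform(["spicy late night food"])

scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```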

Doing this query/document calculation across all documents (or at least those that match some preliminary matching criteria) yields the notion that a search engine based on a vector space representation is really just a fast way of multiplying a vector (the query) by a dynamically defined, usually sparse, matrix (the documents). I call it dynamically defined because most engines leverage some form of boolean logic to winnow down the size of the matrix, but thanks to the massive speedups in compute, it is entirely possible to do many of these calculations in a reasonable time across all rows in the matrix.
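The same idea in matrix form, as a rough sketch (the weights below are invented for illustration; an engine derives them from index statistics):

```python
# Ranking as one sparse matrix-vector product: rows are documents, columns are terms.
import numpy as np
from scipy.sparse import csr_matrix

doc_term = csr_matrix(np.array([
    [0.0, 1.2, 0.0, 0.4],   # doc 0
    [0.9, 0.0, 0.0, 0.0],   # doc 1
    [0.0, 0.7, 1.1, 0.0],   # doc 2
]))
query = np.array([0.0, 1.0, 0.5, 0.0])  # weights for the terms the query mentions

scores = doc_term @ query           # one matrix-vector multiply scores every document
ranked = np.argsort(scores)[::-1]   # an engine would winnow rows with boolean logic first
print(ranked, scores[ranked])
```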

An Inverted Index

The main data structure most search engines build is usually called an inverted index. In practice, search engines use a number of different techniques for tracking where things are in the data, but the inverted index is often the primary data structure. In its simplest form, an inverted index is akin to the index in the back of a book: a listing of all the terms (usually called tokens) seen to date and where they are located. Query/document matching then happens much like how you, as a human, use the index of a book to look up key concepts. The big difference is that a search engine can efficiently keep track of millions of unique terms and query and score billions of documents in milliseconds.

Figure 2. The first page of the index of Novus Atlas Sinensis by Martino Martini (published as a section of Volume 10 of Joan Blaeu's Atlas Maior in 1655).
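Here’s a stripped-down sketch of the structure in Python. A plain dict of sets is the teaching version; real engines compress postings lists heavily and keep them on disk:

```python
# A toy inverted index: map each token to the set of documents containing it.
from collections import defaultdict

docs = {
    0: "spicy noodle house open late",
    1: "french bistro wine list",
    2: "late night spicy salsa",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

# AND query: documents containing every query token.
def search(query: str) -> set[int]:
    return set.intersection(*(index[t] for t in query.split()))

print(search("spicy late"))  # {0, 2}
```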

A search index can also keep a lot more metadata about any given term: not just which documents it occurred in, but also where in the document it occurred, its relative importance compared to other terms, and its part of speech or other arbitrary attributes. Most engines can also easily index terms from all languages or, for that matter, really any arbitrary string of characters, numbers, or symbols. Finally, search engines provide a myriad of ways to query the index that combine complex operations like boolean logic, phrase matching, date ranges, spatial queries, wildcards, and regular expressions.
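To show what that per-term metadata buys you, here’s a toy extension of the same idea that records term positions, which is enough to answer phrase queries (an illustration, not how Lucene actually lays out postings):

```python
# A toy positional index: token -> {doc_id: [positions]}. Positions make
# phrase queries possible: "noodle house" matches only adjacent tokens.
from collections import defaultdict

docs = {0: "spicy noodle house open late", 1: "noodle bar and tea house"}

index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, token in enumerate(text.split()):
        index[token][doc_id].append(pos)

def phrase_search(phrase: str) -> list[int]:
    first, *rest = phrase.split()
    hits = []
    for doc_id, positions in index[first].items():
        for p in positions:
            # Each following term must appear at the next position over.
            if all(p + i + 1 in index[t].get(doc_id, []) for i, t in enumerate(rest)):
                hits.append(doc_id)
                break
    return hits

print(phrase_search("noodle house"))  # [0]; doc 1 has both words but not adjacent
```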

Other Models

While the Vector Space Model and the inverted index are the most common models for search, it’s worth noting that there are a few others that may be useful. In the interest of space, I’ll highlight them here and leave the details as an exercise for the reader.

Key-Value Store: Most modern search engines have the notion of a single unique key that represents the ID or primary key of a document (although it is possible to forgo them in some engines). This ID must be unique and is therefore often a field of very high cardinality. Thus, most engines have built-in data structures optimized for very fast lookup of those keys that often rival the performance of purpose-built key-value stores, while still enabling search over the values. In my search travels, I have encountered a number of cases where the search engine was used primarily as a key-value store and easily supported 100,000+ key lookups per second.
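As an illustration, key-value-style usage with the elasticsearch Python client looks something like the following (8.x-style keyword arguments; the index name, ID, and fields are invented for the example):

```python
# Using a search engine as a key-value store: write and read documents by ID.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local node

# Put: index a document under an explicit primary key.
es.index(index="users", id="u42", document={"name": "Ada", "plan": "pro"})

# Get: a direct lookup by key, no query parsing or scoring involved.
doc = es.get(index="users", id="u42")
print(doc["_source"])  # {'name': 'Ada', 'plan': 'pro'}
```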

Columnar Store: Lucene-based engines all support representing data in columnar format (known in Lucene as doc values) in order to speed up large-scale aggregations of a single field/column.
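A rough sketch of why the layout matters; the access pattern, not the toy data, is the point:

```python
# Row vs. column layout for aggregating one field across many documents.
import numpy as np

docs = [{"title": f"doc {i}", "price": float(i % 50)} for i in range(100_000)]

# Row-oriented: every document object is touched just to read one field.
row_sum = sum(d["price"] for d in docs)

# Column-oriented (the doc-values idea): one contiguous array per field,
# so an aggregation is a single vectorized pass over packed memory.
price_column = np.array([d["price"] for d in docs])
col_sum = price_column.sum()

assert row_sum == col_sum
```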

Tools for Text Analysis: Since search engines have been built to deal with noisy, unstructured content (read why I hate that phrase), they usually come feature-packed with all kinds of tools for turning that content into tokens. Whether it’s dealing with the languages of the world, stemming plurals down to a base form, or keeping up with the latest Unicode standard, chances are a search engine has a library for processing that data, and you’d be remiss not to leverage it, lest your results suffer because you assume everything is whitespace-delimited.
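For a feel of what an analysis chain does, here’s a deliberately naive sketch; real analyzers bring language-aware tokenizers, Unicode normalization, and proper stemmers like Snowball:

```python
# A minimal analysis chain: tokenizer plus token filters.
import re

STOPWORDS = {"the", "a", "of", "and"}

def analyze(text: str) -> list[str]:
    tokens = re.findall(r"\w+", text.lower())           # tokenize on word characters
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word filter
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # toy plural stemmer

print(analyze("The Menus of the Spicy Noodle Houses"))
# ['menu', 'spicy', 'noodle', 'house']
```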

A Rank-First NoSQL Engine, aka a Top X NoSQL Engine: By now, you know one of the key things a search engine does is rank things by how closely they match a query, and not just for text. And since most search engines’ main query dialect is not SQL, search engines are thus NoSQL engines. To that end, most modern search engines are perfectly capable of being the authoritative store for your data, and can scale to trillions of records without losing data. In fact, I have long contended that search was NoSQL before NoSQL was cool. Ironically, in the past few years, many search engines have added SQL dialects, so it’s really your choice!

If you’ve made it this far, you now know there is so much more to a search engine these days than just fast text matching. I encourage you to take on the search engine challenge for your next data project: try to map all the different ways you analyze data onto the search engine of your choosing and see just how many answers you can get without having to write any code!

Grant Ingersoll

Grant Ingersoll is the CTO and co-founder of Lucidworks as well as an active member of the Apache Lucene community – a Lucene and Solr committer and co-founder of the Apache Mahout machine learning project. He is also the lead author of “Taming Text” from Manning Publications. Grant’s experience includes engineering a variety of search, question answering, and natural language processing applications across several domains and languages. Grant earned his bachelor’s degree in Math and Computer Science from Amherst College and his master’s degree in Computer Science from Syracuse University.