Mention the word “search” to most laypeople and it conjures images of Google and Bing. Mention it to most data scientists and it usually conjures notions of keywords and text retrieval, and maybe a passing reference to open source projects like Elasticsearch, Apache Solr, or—if they are particularly well-versed—Apache Lucene. However, many of the data scientists I work with don’t understand the full breadth and depth of capabilities a modern search engine can bring to bear, not just on text problems, but on nearly every data problem they encounter.

Most data scientists are quite comfortable discussing things like k-nearest neighbors, collaborative filtering, and k-means clustering—as well as the basics of linear algebra and statistics—but don’t seem to realize that the same techniques and tools that power their everyday work also power search. So they’re often left either reinventing the wheel (usually poorly) or working less efficiently than they should. By reframing many of their approaches to data exploration and analysis, data scientists can often see a significant increase in their productivity and in their ability to analyze more types of data at scale with less overhead.

I know, it sounds too good to be true, but I’ve seen it happen in my own career as well as with countless other data scientists and engineers. In this article (and future articles), I will walk you through how search technology can transform your data analysis—and also show you the places where it can fall short.

Benefits of Search Engine Tools

At a high level, search engine tools can provide a number of features that make everything from the mundane to the magnificent easier. Here are 7 benefits of using search engine tools for data science tasks, in no particular order:

1. Data exploration is a snap. 

Your boss just sent you a new data set you’ve never seen before? Load it into a search engine and wow them with your first-cut analysis in minutes, with no code required. Modern search engines are very forgiving about things like file formats and schemas, thanks to their years of practice dealing with the noisiest of all content types: text. Most search engines can routinely handle file formats like JSON, CSV, XML, PDF, Office documents, images/audio, source code, GIS/spatial, CAD, and custom types (Parquet/Avro/Protobuf). Ingestion is often incredibly fast and flexible. Data can be sparse or dense: one document can have thousands of fields, while the next can have two. Once the data is ingested, search engines offer a vast and flexible query language that supports ad hoc querying and can stream out large result sets.
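To make the sparse-ingestion point concrete, here is a minimal sketch (assuming an Elasticsearch-style bulk API; the index name and fields are hypothetical) that turns CSV rows into the NDJSON bulk format, dropping empty values so each document can carry a different set of fields:

```python
import csv
import io
import json

def csv_to_bulk_ndjson(csv_text, index_name):
    """Convert CSV rows into Elasticsearch-style bulk NDJSON.

    Each row becomes an action line plus a document line; missing
    values are simply omitted from the document, since the engine
    happily accepts documents with different field sets.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {k: v for k, v in row.items() if v not in (None, "")}
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Two rows with different field coverage -- sparse data is fine.
raw = "name,city,revenue\nAcme,Newark,1200\nGlobex,,\n"
payload = csv_to_bulk_ndjson(raw, "customers")
print(payload)
```

The resulting payload is what you would POST to a bulk-ingestion endpoint; the point is that no schema needs to be declared up front.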

2. Training/test/validation data splits are easy to generate. 

Several of my company’s customers have data science teams who use search simply as a cheaper, faster, more flexible way to store data sets to be consumed by data-hungry deep learning systems. Built into most engines is support for complex joins across multiple data sets, as well as easy selection of specific rows and columns (often called documents and fields). Think about what it would mean to your organization to have all your data, your experiments, and the logs from those experiments in one place, with Google-like access to them.
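One common trick for stable splits is hashing the document ID into buckets, which can then be mirrored as a filter query so each split streams straight out of the index. A minimal sketch (the function and ratios are my own illustration, not from the article):

```python
import hashlib

def split_of(doc_id, ratios=(0.8, 0.1, 0.1)):
    """Deterministically assign a document to train/test/validation.

    Hashing the ID means the assignment is stable across runs and
    machines, and the same bucket ranges can be expressed as a
    filter query inside the engine itself.
    """
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % 1000
    bounds, acc = [], 0.0
    for r in ratios:
        acc += r
        bounds.append(int(acc * 1000))
    for name, b in zip(("train", "test", "validation"), bounds):
        if h < b:
            return name
    return "train"

counts = {"train": 0, "test": 0, "validation": 0}
for i in range(10000):
    counts[split_of(f"doc-{i}")] += 1
print(counts)  # roughly 8000 / 1000 / 1000
```

Because the split is a pure function of the ID, re-indexing or re-running experiments never shuffles documents between splits.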

3. Data reduction and feature selection/engineering.  

Modern search engines come with a plethora of tools for mapping a variety of content (text, numeric, categorical, spatial, custom) into a vector space (more on that in part two), and offer rich support for constructing weights, capturing metadata, imputing values, handling nulls, and generally shaping data to your will. They also have a broad swath of support for natural language (and not just English), including tokenization, stemming/lemmatization, and sentence detection. Many newer engines are also capable of handling word embeddings and synonyms.
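To make the vector-space mapping concrete, here is a toy sketch of the analyzer-plus-weighting pipeline: plain tokenization and TF-IDF in pure Python. A real engine's analyzer chains and scoring functions are far richer; this only illustrates the shape of the transformation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters -- a stand-in for the
    analyzer chain (tokenization, stemming, stopwords) an engine
    applies at index time."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Map each document into a sparse TF-IDF vector, the same
    weighting family search engines use for relevance scoring."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()
    for tf in tokenized:
        df.update(tf.keys())
    return [
        {t: c * math.log(n / df[t]) for t, c in tf.items()}
        for tf in tokenized
    ]

docs = ["the cat sat", "the dog sat", "a cat and a dog"]
vecs = tfidf_vectors(docs)
```

Each document becomes a sparse dict of term weights; terms that appear everywhere are down-weighted, rare terms are boosted, which is exactly the intuition behind engine relevance scores.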

4. Search-driven analytics.

Up until now, you’ve probably been thinking, “These are all great for the drudgery work, but where’s the data science?” Don’t worry, it’s in there. Just as Hadoop moved the computation to the data, so have search engines. Most modern engines can not only perform very fast scoring of objects as part of building a retrieval set, but can then also easily analyze those results. Whether that means finding all the customers in New Jersey who complained about bad service and aggregating the revenue lost, or doing on-the-fly regression analysis, search engines are designed not just to find data, but to make sense of it. Most modern engines can also incorporate prediction tools and advanced content scoring functions, and perform anomaly detection and trend analysis.
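As a concrete sketch of the New Jersey example, here is what such a request body might look like in an Elasticsearch-style query DSL (field names like `complaint_text` and `revenue_lost` are hypothetical): the match and filter clauses build the found set, and the aggregation summarizes it in the same round trip.

```python
# Hedged sketch of a search-plus-analytics request body: find NJ
# customers whose complaints mention bad service, and sum the lost
# revenue over the matching set in one query. Field names are
# hypothetical, but the query/aggs structure follows the common
# Elasticsearch request-body shape.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"complaint_text": "bad service"}}],
            "filter": [{"term": {"state": "NJ"}}],
        }
    },
    "aggs": {
        "lost_revenue": {"sum": {"field": "revenue_lost"}}
    },
    "size": 0,  # we only want the aggregate, not the individual hits
}
```

The analytics run next to the index, so there is no export step: the same request that ranks the documents also summarizes them.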

5. Your tools are welcome. 

Yep, whether you like the command line or a notebook, Python or R or even Scala, chances are your tools work with your search engine (especially if it is open source). Want CSV instead of JSON out of the engine? Engines can output most of the common formats data scientists are used to working with.

6. Quick and dirty FTW. 

Shh, don’t tell anyone, but doing things like k-nearest neighbors, quickly creating a cheap little classifier, or even building a full-fledged dynamic content and collaborative filtering-based recommendation engine is practically trivial in a search engine (don’t worry, I’ll explain in my next article). Are they state of the art in a research sense? Probably not (although I’ve yet to see many approaches beat the recommendation engine one). Are they 100x easier to set up and test, with “good enough” results at scale? So much so that many of the engines you use every day are actually just search under the hood. As my Mom always says: You betcha!
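To give a flavor of the “quick and dirty” point, here is a toy k-nearest-neighbors classifier in pure Python that mimics what an engine does when you treat the query text as a document and vote over the labels of the top-k hits (the data and function names are illustrative, not from the article):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query_text, labeled_docs, k=3):
    """Classify by majority vote over the k most similar documents --
    essentially 'search with the query as a document' plus a vote
    over the top-k hits' labels."""
    q = Counter(query_text.lower().split())
    scored = sorted(
        ((cosine(q, Counter(text.lower().split())), label)
         for text, label in labeled_docs),
        reverse=True,
    )
    top = [label for _, label in scored[:k]]
    return Counter(top).most_common(1)[0][0]

train = [
    ("cheap flights to paris", "travel"),
    ("hotel deals in rome", "travel"),
    ("python pandas dataframe tutorial", "tech"),
    ("machine learning with python", "tech"),
]
print(knn_classify("flights and hotel in paris", train, k=3))
```

In a real engine the scoring loop disappears: the index does the similarity ranking for you, and you only tally labels from the result page.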

7. Key-value, columnar, and mixed storage.

Modern Lucene-powered engines (Solr and Elasticsearch) both operate quite efficiently on key-value and columnar data storage, and they beat all comers when it comes to mixed content types (i.e., text + numeric + categorical + spatial). Over the years, the open source communities have added specific data structures to handle these kinds of data workloads.
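For a concrete example of the mixed-storage point, a Solr schema can mark fields for columnar storage via docValues while keeping the inverted index for lookup; a hedged sketch with hypothetical field names:

```xml
<!-- Sketch of Solr schema fields (field names are hypothetical).
     docValues="true" stores the field column-wise on disk, which is
     what makes sorting, faceting, and analytics fast; indexed="true"
     keeps the inverted index for lookup and filtering; text fields
     get full analysis for free-text search. -->
<field name="customer_id" type="string"       indexed="true"  docValues="true"/>
<field name="revenue"     type="pdouble"      indexed="false" docValues="true"/>
<field name="complaint"   type="text_general" indexed="true"  stored="true"/>
```

One index can thus serve key-value lookups, columnar analytics, and full-text search over the same documents.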

Room to Grow

Where is a modern search engine not ready for data science? There are a few areas in which these engines are still evolving.

Graph analysis: While Solr and others do have some graph operations, graphs are still not a first-class citizen in most search engines.

Iterative compute tasks: While some parts of search engines are based on multi-pass algorithms, and some engines have supported Map/Reduce-style operations, most engines cannot yet do the general-purpose, iterative computation one would run in an engine like Spark, which is often crucial for building more complex, large-scale models. However, most modern engines can easily plug in to Spark for iterative compute.

Tensors and Dense Vector operations: While the inputs and outputs to deep learning systems can coexist nicely with traditional search engines, most engines do not—as of this writing—have first-class support for the artifacts most often produced by deep learning systems. I am aware of a few initiatives working to unlock this great new potential.

Images and audio: While most engines can store images and audio and operate on metadata, many do not yet have best-in-class search (i.e. where the image is the query) support for images and audio files out of the box (mostly due to the previous reason above). I expect this to rapidly change in the next year.

Still not convinced? In part two, I’ll go through some of the underpinnings of search technology and map them onto common data science constructs like vector spaces, statistics, and linear algebra. If you take nothing else away from this article (and its follow-ons), remember these two things about search technology: 1) search engines are designed to rank things that match an information need (i.e., even though items A and B both match your need, B may be more important than A), and 2) after that ranking, search engines are really good at slicing and dicing the found set and summarizing the results. Stay tuned.

Grant Ingersoll

Grant Ingersoll is the CTO and co-founder of Lucidworks as well as an active member of the Apache Lucene community – a Lucene and Solr committer, and co-founder of the Apache Mahout machine learning project. He is also the lead author of “Taming Text” from Manning Publications. Grant’s experience includes engineering search, question answering, and natural language processing applications across a variety of domains and languages. Grant earned his bachelor’s degree in Math and Computer Science from Amherst College and his master’s degree in Computer Science from Syracuse University.