Machine Learning at SurveyMonkey
SurveyMonkey collects around 3 million survey responses each day, which over the years has amassed into an extensive trove of structured and unstructured survey data. While the raw data has been analyzed and processed for the SurveyMonkey data businesses—Benchmarks and Audience—we’ve only just begun to perform more sophisticated statistical machine learning on the data. More specifically, some of the problems we are currently tackling are in the area of natural language processing (NLP), such as:
- Automatically categorizing text responses
- Finding relevant questions to recommend to survey creators
- Detecting when survey respondents are not giving thoughtful responses
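To make the first of these concrete, here is a minimal sketch of categorizing free-text responses, using scikit-learn as an illustrative baseline rather than our actual production models; the categories and sample responses are invented.

```python
# Hypothetical sketch: categorizing free-text survey responses into topics.
# A TF-IDF bag-of-words plus a linear classifier is a common text baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data: responses labeled with a topic category.
responses = [
    "The checkout process was confusing",
    "Support resolved my issue quickly",
    "I could not find the pricing page",
    "The agent was friendly and helpful",
]
labels = ["usability", "support", "usability", "support"]

# Feature transform and model travel together as one pipeline object.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(responses, labels)

# Predict a category for a new, unseen response.
print(model.predict(["The help desk was friendly"])[0])
```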
In order to solve a wide-ranging set of machine learning challenges, we’ve first focused on architecting a machine learning platform to efficiently and effectively manage the data workflows surrounding machine learning.
When most folks think about machine learning, they focus their attention immediately on the algorithms (e.g. neural networks, regression-based techniques, Bayesian models, clustering, dimensionality reduction, etc.). However, machine learning in production applications is becoming less and less about the algorithms themselves and more and more about the data workflows surrounding them.
So, what is a data workflow? Simply put, it is all the steps required to train machine-learning models in the lab, deploy those models into production, monitor and evaluate their performance, and iteratively improve those models. More formally stated, the primary components of any production machine learning workflow are the following:
- Data acquisition and retrieval
- Analysis of data and model building
- Model management and deployment
- Model access and evaluation
The above is an ETL (Extract, Transform, Load) based data workflow commonly found in companies that utilize machine learning. Here’s how each of the data workflow components is implemented in this ETL scheme:
- Data acquisition and retrieval – In order for meaningful models to be trained, production data is ETL’ed into a data warehouse where data scientists can then analyze the data. This is typically done periodically on a trailing day or week basis.
- Analysis of data and model building – Here’s where the machine learning is performed. Data is exported from the data warehouse onto the workstations of the data scientists. The data is then cleaned and analyzed to find the appropriate feature transforms, and models are trained and evaluated on the transformed data. This prototyping is typically performed in R or another data science language.
- Model management and deployment – Once a suitable model is trained, a data scientist works with a systems engineer to implement the model in the repository of code that will be shipped into production, usually requiring the model to be rewritten in another language. Management of the model parameters and versioning is done manually.
- Model access and evaluation – When the code is production ready, the model is shipped as part of a release process and is typically accessed in a bespoke manner specific to that model (e.g. an API built solely for a recommender system). Model performance is tracked and stored in the production database. To iterate and improve accuracy, the cycle starts again at the first step with an ETL.
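The "rewritten in another language" step above is where much of the cost hides. As a toy illustration of this double-implementation problem, here is a lab model whose parameters must be hand-ported into production code; all numbers and names are invented.

```python
# Illustrative sketch of the double-implementation problem: a model prototyped
# in the lab must be manually re-expressed in production code, and the two
# copies must be kept in sync on every retrain. All values here are invented.

# Lab: a data scientist fits a simple linear model and reads off its parameters.
lab_coefficients = {"intercept": 0.4, "response_length": 0.02}

# Production: an engineer hand-ports those parameters into application code.
def predict_quality(response_length: float) -> float:
    # Manually copied from the lab model -- drifts out of date after retraining.
    return 0.4 + 0.02 * response_length

print(predict_quality(30.0))  # 1.0
```

Every retrain repeats this manual port, which is exactly the iteration cost the ETL workflow incurs.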
The primary issues with the ETL based workflow above are:
- Long and expensive iterations to improve models.
- Double implementation costs between Lab and Production. This is further exacerbated if you need to rewrite models in a separate language or rewrite for parallelization.
- Manual management and deployment of models and their transforms. Large-scale production environments often contain hundreds or even thousands of models, some of which need to be updated daily. The models themselves may not be that complicated, but keeping track of them is certainly non-trivial.
- Bespoke API management to allow access per model, which requires coordination with other teams and code releases.
It’s clear this is not the optimal data workflow. So, how can we do better?
Machine Learning as a Platform
At SurveyMonkey, we’re building out a platform that approaches machine learning problems holistically and aims to reduce the inefficiencies from the ETL based workflow.
Here’s how it looks:
The platform is called ML Service—short for Machine Learning Service. The fundamental improvement here is the use of H2O.ai, which allows us to develop, deploy, and iterate on models just like we do with code. Here’s how ML Service implements each of the data workflow components:
- Data acquisition and retrieval – Production data is available to ML Service in three forms: a readable SQL secondary, a Cassandra data center with Spark, and HDFS with Hadoop. While data still needs to be ETL’ed into those forms, each data store has been optimized with parallelized data access and efficient computation in mind.
- Analysis of data and model building – ML Service leverages the H2O.ai machine learning framework to train models and serve predictions. H2O.ai models are designed for parallelization and scale, meaning that once a model is prototyped and evaluated in the lab, the same model can be used in production. Further, H2O.ai interfaces with Spark, Hadoop, SQL, and flat files for exploration and model training, which offers a great deal of flexibility depending on the machine learning problem at hand.
- Model management and deployment – When ready, the H2O.ai models and their corresponding transforms—collectively known as a pipeline—are serialized. We use git to track these serialized pipelines to allow version tracking of both models and their transforms. For instance, if we used the same feature transform but updated the parameters in our model, the entire serialized pipeline would be versioned.
- Model access and evaluation – Serialized pipelines get assigned to a particular use case, which encapsulates business requirements for a particular domain, such as Survey Responses. Each use case can have any number of active pipelines that perform different functions on the same domain, such as classification or sentiment analysis of Survey Responses. ML Service routes requests to a dynamically generated RESTful API based on use cases and their pipelines. This generalized API allows us to push updated serialized pipelines without requiring a release to update the API. Finally, H2O.ai has prediction accuracy tracking built in, which forms the basis for a feedback loop that helps us iterate on the performance of our models.
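The data acquisition step above can be sketched as follows, using an in-memory SQLite database as a stand-in for a readable SQL secondary; the table, columns, and rows are all invented for illustration.

```python
# Sketch of pulling training data from a readable SQL replica. SQLite stands in
# for the real data store here; table and column names are invented.
import sqlite3

# Stand-in for a readable secondary populated by the production ETL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (id INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?)",
    [(1, "Love the new editor"), (2, "Export kept failing")],
)

# Reads like this run against the secondary, never the production primary.
rows = conn.execute("SELECT text FROM responses ORDER BY id").fetchall()
texts = [r[0] for r in rows]
print(texts)
```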
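The use-case routing described above can be sketched as a plain dispatch table; the use case, function names, and toy pipelines below are invented, and a real deployment would expose this through the generated RESTful API rather than a Python function.

```python
# Minimal sketch of use-case based routing: requests are dispatched to whatever
# pipelines are registered for a use case, so new pipelines can be activated
# without changing the API surface. All names and toy models are invented.

# Registry: use case -> {function name -> callable pipeline}.
registry = {
    "survey_responses": {
        "sentiment": lambda text: "positive" if "good" in text else "negative",
        "category": lambda text: "support" if "help" in text else "other",
    }
}

def handle(use_case: str, function: str, payload: str) -> str:
    # Analogous to routing a request like POST /predict/<use_case>/<function>.
    return registry[use_case][function](payload)

print(handle("survey_responses", "sentiment", "good help"))  # positive
print(handle("survey_responses", "category", "good help"))   # support
```

Activating a new pipeline is then just a registry update, not an API change.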
In short, the ML Service approach offers the following benefits:
- Single implementation of models and transforms in the lab with performance and parallelization built in.
- Git based version tracking of serialized pipelines to scalably manage all active and historical pipelines.
- Generalized RESTful API that routes requests to use cases and their active pipelines.
- Fast deploys: new pipelines can be pushed without a full release, and built-in feedback loops on model performance let us quickly iterate and improve models over time.
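The serialized-pipeline versioning above might look roughly like the following sketch, with scikit-learn and pickle standing in for the H2O.ai framework; the repository layout, training data, and use of a content hash as a version label are all invented for illustration.

```python
# Sketch of serializing a model-plus-transform pipeline as a single versioned
# artifact. scikit-learn and pickle stand in for H2O.ai's serialization here;
# the path layout and training data are invented.
import hashlib
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The model and its feature transform travel together as one pipeline object.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(["great product", "terrible support"], ["positive", "negative"])

# Serialize the whole pipeline; the resulting blob is what git would track.
blob = pickle.dumps(pipeline)
version = hashlib.sha256(blob).hexdigest()[:12]
path = f"pipelines/survey_responses/{version}.pkl"  # hypothetical layout

# Changing either the transform or the model parameters changes the blob,
# so the serialized pipeline is versioned as a single unit.
restored = pickle.loads(blob)
print(path, restored.predict(["great product"])[0])
```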
We’ve designed ML Service to be a machine learning platform that takes the pain out of the data workflow, namely by abstracting away the headaches of deployment, model and API management, and by allowing for quick, measurable iterations. This enables our data engineers to focus more on the interesting bit: the actual machine learning.
The journey has just begun! If you’d like to know more about ML Service or any of the specific machine learning problems we are solving, please shoot me an email at firstname.lastname@example.org.