For those of you interested in what’s under the covers here at SurveyMonkey, here’s an overview of our tech stack and architecture.
For most of SurveyMonkey’s 16-year life, the website was a monolithic application written in C#, sitting atop a single SQL Server database. Six years ago, for reasons that were as much logistical as technical, we realized that we needed to re-architect the system. We decided to replace the monolith with a microservice architecture. This bought us several nice benefits:
- We built out an API layer that we could use both internally and externally.
- The engineering organization can scale more easily, with multiple teams working on the site independently and in parallel.
- We can more easily scale horizontally by adding new nodes to services that get more traffic.
We chose Python for the rewrite both for the ease of the language and to allow us to tap into the vibrant Python community in Silicon Valley. Over the last few years, piece by piece, we’ve been rewriting and re-architecting SurveyMonkey. Then, to ensure that we didn’t overwhelm the new servers or negatively impact any of our core metrics, we carefully moved batches of users over from the old C# servers onto the new Python servers. At this point, the vast majority of our traffic is served by the new stack.
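As a rough illustration of moving users over in batches (not the production code), one common approach is to hash each user ID into a stable bucket and send a user to the new stack only when their bucket falls below the current rollout percentage. The function names, the use of SHA-256, and the 100-bucket scheme are all assumptions made for this sketch; the post does not say how batches were actually chosen.

```python
import hashlib


def rollout_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in the range [0, buckets)."""
    # Hashing keeps the assignment deterministic: the same user always
    # lands in the same bucket, on every server and on every request.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def use_new_stack(user_id: str, percent_migrated: int) -> bool:
    """Return True if this user should be served by the new stack.

    percent_migrated is a hypothetical setting between 0 and 100.
    """
    return rollout_bucket(user_id) < percent_migrated


if __name__ == "__main__":
    for uid in ("alice", "bob", "carol"):
        print(uid, use_new_stack(uid, percent_migrated=25))
```

Because the bucket for a user never changes, raising the percentage only adds users to the new stack; nobody who has already been moved gets bounced back to the old one.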
The F5 sits at the perimeter of our data center and sees all external requests coming from our users. Its job is to handle basic encryption (the “S” part of our HTTPS connections) and ensure that all requests are routed to an appropriate server. We also rely upon the F5 for basic load balancing and visibility into the shape and size of our incoming traffic load.
Most of the time, the F5 routes a request to a virtualized Ubuntu machine that’s running one of our microservices. We run a private cloud based on OpenStack, which allows us to easily provision new virtualized servers and deploy to them via automation.
Architecturally, for our application logic, we’ve split the Model-View-Controller paradigm into two layers: Webs, which are mostly the View part of MVC, and Services, which are mostly the Model part. The Controller logic gets split across the two layers, but most of it resides in the Service layer where it can be easily reused by many different clients. However, when the F5 routes a request from an external user, it’s destined for one of the servers in the top layer, the Webs, which host the front-end for our site.
Although some of these Webs have all the information they need to build a web page (typically those are static pages), most need to call out to our back-end services to authenticate the user, get data about a survey, or maybe gather recommendations on what questions to suggest during survey creation. These actions are accomplished by making web service calls to our Service layer, either when the page is initially being constructed or via AJAX calls for pages that are really single-page apps.
Our Service layer is also composed of 20+ applications, each of which handles a different logical area of our business. We’ve got one that handles authorization requests, another that processes all requests about survey structure, etc. Just like the Web layer, they’re Python apps that use the Pyramid web framework. Unlike the Web layer, much of their work involves rooting around in databases, processing messages, searching for data, or doing calculations. Although many other companies have had success using other protocols like Thrift, we’ve enjoyed having all our Services based on HTTP. The comfort with which we all interact with HTTP and the mature tool set around it have been great for developer ease and productivity.
A typical request to the Service layer involves talking to one of our SQL Server databases. We have a suite of sharded databases that contain all of the core survey and user data created by our system and our users. For example, when one of our users wants to view their survey results, a call is made to AnalyzeSvc asking for the aggregated results for one of the questions on the survey. In real time we’ll count the number of responses for each possible answer by navigating our way through the other 65 billion responses in the DB, and return those results to the user’s browser. Most of the time that call takes a couple dozen milliseconds. How do we make sure that it stays fast? Mostly it’s a matter of being smart about how we set up and use SQL Server, but that’s probably an entire blog post of its own. The other important part of the equation is caching whenever possible.
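To make the aggregation step concrete, here is a small sketch of counting responses per answer against a sharded store. It uses in-memory SQLite databases as a stand-in for SQL Server so that it runs anywhere, and it assumes shards are chosen by survey ID modulo the shard count. The table layout, the shard key, and every function name are hypothetical; none of them come from AnalyzeSvc itself.

```python
import sqlite3

NUM_SHARDS = 2  # hypothetical shard count, just for the sketch


def shard_for(survey_id: int) -> int:
    """Pick a shard for a survey (assumed scheme: ID modulo shard count)."""
    return survey_id % NUM_SHARDS


def make_shards():
    """Create one in-memory database per shard with a responses table."""
    shards = []
    for _ in range(NUM_SHARDS):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE responses "
            "(survey_id INTEGER, question_id INTEGER, answer TEXT)"
        )
        shards.append(conn)
    return shards


def record_response(shards, survey_id, question_id, answer):
    """Store a single response on the shard that owns the survey."""
    conn = shards[shard_for(survey_id)]
    conn.execute(
        "INSERT INTO responses VALUES (?, ?, ?)",
        (survey_id, question_id, answer),
    )


def aggregate_question(shards, survey_id, question_id):
    """Count responses per answer for one question of one survey."""
    conn = shards[shard_for(survey_id)]
    rows = conn.execute(
        "SELECT answer, COUNT(*) FROM responses "
        "WHERE survey_id = ? AND question_id = ? GROUP BY answer",
        (survey_id, question_id),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    shards = make_shards()
    for answer in ("yes", "no", "yes"):
        record_response(shards, survey_id=7, question_id=1, answer=answer)
    print(aggregate_question(shards, survey_id=7, question_id=1))
```

Routing by survey ID means a single question's results can be answered from one shard, so the query never has to fan out across the whole fleet.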
We save trips to the DB (and thus time-consuming reads from disk) by caching query results in memcached wherever we can. We have a fleet of memcached servers, fronted by mcrouter, an open source memcached router, that stores cached copies of surveys, results, or whatever else might have been time-consuming to fetch or compute. This caching layer is mostly used by our back-end Service tier, but there are multiple places where we use it in our Web tier too.
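The pattern described here is often called cache-aside: check the cache first, and only on a miss do the expensive work and store the result. The sketch below shows the shape of it. To stay self-contained it uses a tiny in-process class in place of a real memcached client; `InMemoryCache`, `cached_fetch`, and the key format are all hypothetical names invented for the example.

```python
import time


class InMemoryCache:
    """A dict-backed stand-in for a memcached client (get/set/delete)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        # Treat expired entries as misses and drop them.
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else time.monotonic() + ttl
        self._data[key] = (value, expires_at)

    def delete(self, key):
        self._data.pop(key, None)


def cached_fetch(cache, key, compute, ttl=None):
    """Return the cached value for key, computing and storing it on a miss.

    Note: a computed value of None is never treated as a hit, so it
    would be recomputed on every call.
    """
    value = cache.get(key)
    if value is None:
        value = compute()  # the slow path, e.g. a database query
        cache.set(key, value, ttl)
    return value


if __name__ == "__main__":
    cache = InMemoryCache()

    def slow_query():
        print("hitting the database...")
        return {"yes": 2, "no": 1}

    print(cached_fetch(cache, "results:7:1", slow_query, ttl=60))
    print(cached_fetch(cache, "results:7:1", slow_query, ttl=60))
```

The second call returns the stored result without running `slow_query` again, which is the whole point: the expensive work happens once per key until the entry expires or is deleted.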
One of the challenges with caching is making sure that we never serve up stale data. This is typically accomplished by either putting a ‘Time To Live’ on the cached data (which merely limits how stale data can get) or by keeping track of when data gets modified so that we never read the stale data. We mostly use the latter strategy, but for performance reasons the application that caches the data isn’t always aware of when that data becomes stale. So, we’re rolling out a publish/subscribe system that will make it easier for all relevant parts of our application stack to get notified when events occur that they care about.
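Here is a minimal sketch of how publish/subscribe can drive cache invalidation: the code that modifies a survey publishes an event, and any cache that holds a copy subscribes and drops its entry. This is an in-process toy, not a description of the actual messaging system; `EventBus`, `SurveyCache`, and the `survey.modified` topic name are assumptions made up for the example.

```python
class EventBus:
    """A tiny in-process publish/subscribe hub."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, topic, handler):
        """Register handler to be called for every event on topic."""
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, payload):
        """Deliver payload to every handler subscribed to topic."""
        for handler in list(self._subscribers.get(topic, [])):
            handler(payload)


class SurveyCache:
    """A cache that evicts entries when it hears a survey was modified."""

    def __init__(self, bus):
        self._entries = {}
        bus.subscribe("survey.modified", self._on_modified)

    def put(self, survey_id, value):
        self._entries[survey_id] = value

    def get(self, survey_id):
        return self._entries.get(survey_id)

    def _on_modified(self, payload):
        # Drop the stale copy; the next reader will refetch fresh data.
        self._entries.pop(payload["survey_id"], None)


if __name__ == "__main__":
    bus = EventBus()
    cache = SurveyCache(bus)
    cache.put(42, {"title": "Old title"})

    # Elsewhere in the stack, a writer announces the change.
    bus.publish("survey.modified", {"survey_id": 42})
    print(cache.get(42))
```

The writer does not need to know which caches exist. It only announces that something changed, and each interested component decides what to do about it.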
That’s a high-level trip through our stack. The processing described above is what powers the majority of actions on SurveyMonkey.com, but there are plenty of pieces that don’t fall into the above framework. Parts of our environment use other DBs like MySQL, Cassandra, and RethinkDB. We’ve got search engines like Solr, a Hadoop cluster, a publish/subscribe messaging system, and a sprinkling of Amazon Web Services. Watching over all of these are multiple layers of monitoring including New Relic, Splunk, and Zabbix. And, of course, we’re looking at new technology all the time, trying to find better, faster, and more reliable ways to serve our customers.
Sound interesting? Check out our jobs page!