An Interview With Heap

From 5-line MVP to 30,000 TPS datastore.

Posted by Vaibhav Mallya on March 12, 2014
Heap tracks everything. While most analytics products require upfront declarations of trackable events, Heap's iOS and JavaScript integrations capture the full firehose of user data. All of it is available for after-the-fact querying via their UI - an incredibly powerful approach. I caught up with the team in SoMa, where I learned how a 5-line MVP evolved into a 30,000 TPS beast with a custom storage layer.

Walk into the Heap office, and you notice the oranges first - there's a small pile on the kitchen counter.
"Oranges? Nice," I say, eyeing them.
"Feel free!" Matin responds. "We have plenty."
Sadly, raiding someone's kitchen isn't the best way to start a conversation. I hastily decline the offer, and we start talking.
Matin Movassate - Founder. Ex-Facebook, Stanford graduate.

Matin Movassate

So, how did Heap start?
I was a PM at Facebook. Every time I wanted to analyze or monitor some new piece of data, I would have to go bother an engineer and then wait a few days until it was in production. Even small changes took absurd amounts of time.
What about technically?
I experimented a bunch after leaving Facebook. The MVP that became Heap was five lines of JavaScript backed by MongoDB. That validated the technical side. I got into YC, and got Ravi and Dan on board not long after. We have over a thousand customers right now, a strong revenue stream, and good growth. There's a lot more work ahead, but we're obviously happy about our trajectory so far.
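That original five-line snippet isn't public, so purely as an illustration: a "track every click" MVP in that spirit might look something like the sketch below. The /collect endpoint and the payload fields are hypothetical, not Heap's actual API.

```javascript
// Hypothetical sketch of a "track everything" MVP - the /collect endpoint
// and the payload shape are invented for illustration, not Heap's real API.
document.addEventListener('click', function (e) {
  var t = e.target;
  var event = { type: 'click', tag: t.tagName, id: t.id, path: location.pathname, ts: Date.now() };
  navigator.sendBeacon('/collect', JSON.stringify(event));
}, true); // capture phase, so every click on the page is observed
```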
Are you still on Mongo?
Oh god no. Initially we thought the document store and aggregation framework would be the perfect fit, since we're literally just storing lots of key-value pairs. But the database had all sorts of bizarre performance issues. We saw on-disk storage spike to hundreds of megs after ingesting a relatively small amount of data. The indices just ballooned.
So how long did your Mongo backend last? And what did you do after?
Mongo lasted one week. We then moved to a single huge Postgres node on EC2. It scaled surprisingly well - we just kept tacking on memory and disk space. But eventually we saw that starting to fall over, so we began moving data over to Amazon Redshift.
Wait, why not just stick with Redshift? It's meant for large-scale data processing.
It still has limits. First, Redshift doesn't have an hstore datatype. We managed some hacks involving lots of extra columns instead, but that was obviously not going to scale. Second, the columnar storage model was a poor fit for the kind of funnel analyses we wanted to run on demand.
And now you're on a massive custom data store.
Yep. Dan built most of it, and can talk details about that.
Any advice for folks setting out to build their first non-trivial storage layer?
[Matin pauses]
Just one - replication is not a backup strategy.
Dan Robinson - Engineer. Avid hiker, enjoys building physical things. Stanford graduate.

Dan Robinson

What is it like building this huge Postgres storage layer? It seems like a really hard problem.
We have a lot of Postgres instances on EC2 serving a lot of traffic. We lean pretty heavily on CitusDB, a tool from Citus Data that makes it manageable to build out a sharded Postgres schema. In particular, it handles distributing subqueries and reassembling results, so we can treat the cluster like one big Postgres instance at query time.
But you're using EC2. You don't have visibility into the underlying OS or hardware. Has that been good enough?
We've had surprisingly few problems with EC2. There's some variance in performance, but we've gotten mileage out of experimenting with different instance types. I/O performance is supposed to be a typical concern on EC2, but hosting everything performance-sensitive on ephemeral storage has been sufficient thus far. The results are pretty powerful - our import pipeline can handle on the order of 30,000 requests per second.
You have an iOS client out now. How does your firehose approach translate to native?
We have a single library with implicit integration - drop it in, and you're done. We use method swizzling to intercept calls to the native APIs, and then send them to our servers.
I have absolutely no idea what method swizzling is.
Any time a call is made to a function that handles meaningful UI events - taps, swipes, and so forth - our functions intercept it and send the event back to our servers. They're injected using the dynamic features of the Objective-C runtime. Of course, since this is the same firehose model, we have to be careful about how we do it because of battery life, network bandwidth, and so on.
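Method swizzling itself is an Objective-C runtime trick, but the underlying idea - wrapping an existing method so your code observes every call before delegating to the original - has a close JavaScript cousin in monkey-patching. A minimal sketch of that analogue, with an entirely hypothetical recordEvent helper:

```javascript
// Hypothetical event sink - stands in for a real client pipeline.
function recordEvent(event) {
  navigator.sendBeacon('/collect', JSON.stringify(event));
}

// The swizzling idea in JavaScript terms: swap in a wrapper that records the
// call, then delegates to the original so behavior is unchanged.
var originalPushState = history.pushState;
history.pushState = function () {
  recordEvent({ type: 'pushState', url: arguments[2], ts: Date.now() });
  return originalPushState.apply(this, arguments);
};
```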
And what's your strategy?
Mostly batching, compression, and local storage when a connection isn't present, with retries once it comes back. Battery life is key, and we preserve it by waking the radio as infrequently as possible. But there's definitely more work to do to make it more intelligent.
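Heap's client code isn't public, so the sketch below only illustrates the batch-and-retry idea in browser JavaScript; the storage key, flush interval, and /import endpoint are invented for the example.

```javascript
// Illustrative batching/offline-queue sketch - queue key, interval, and
// endpoint are made up; this is not Heap's client code.
var QUEUE_KEY = 'pendingEvents';

function enqueue(event) {
  var queue = JSON.parse(localStorage.getItem(QUEUE_KEY) || '[]');
  queue.push(event);
  localStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
}

function flush() {
  var queue = JSON.parse(localStorage.getItem(QUEUE_KEY) || '[]');
  if (queue.length === 0 || !navigator.onLine) return; // nothing to send, or offline
  // One batched request instead of one per event keeps the radio mostly idle.
  fetch('/import', { method: 'POST', body: JSON.stringify(queue) })
    .then(function () { localStorage.setItem(QUEUE_KEY, '[]'); })
    .catch(function () { /* keep the queue and retry on the next flush */ });
}

setInterval(flush, 30 * 1000);            // flush a batch every 30 seconds
window.addEventListener('online', flush); // retry as soon as connectivity returns
```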
Ravi Parikh - Founder. UI engineer, sales, marketing. Former professional music producer. Stanford graduate.

Ravi Parikh

What's the frontend built in?
The frontend is a lot of Backbone.js and D3. The visualization work I did in school really comes in handy here.
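As a rough illustration of how those two libraries tend to fit together (the view and data names below are invented, not Heap's code): a Backbone view can own a DOM element and redraw a D3 chart whenever its collection syncs.

```javascript
// Invented example of pairing Backbone and D3: a view that redraws a simple
// bar chart whenever its collection syncs. Assumes Backbone and d3 are loaded.
var EventCountsView = Backbone.View.extend({
  initialize: function () {
    this.listenTo(this.collection, 'sync', this.render);
  },
  render: function () {
    var data = this.collection.toJSON(); // e.g. [{ name: 'click', count: 120 }, ...]
    d3.select(this.el).selectAll('svg').remove(); // simple full redraw
    var svg = d3.select(this.el).append('svg')
      .attr('width', 400).attr('height', 200);
    svg.selectAll('rect').data(data).enter().append('rect')
      .attr('x', function (d, i) { return i * 40; })
      .attr('y', function (d) { return 200 - d.count; })
      .attr('width', 30)
      .attr('height', function (d) { return d.count; });
    return this;
  }
});
```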
You have a relatively small number of users, especially compared to consumer apps. How do you iterate meaningfully without a lot of data?
There's a certain set of features we know we want in the UI, so that gives us focus. As far as feedback goes, we actively solicit customer feedback, keep an eye on our support forms, and hold one-on-one feedback sessions. We also use Heap on our own site, which is a good way to dogfood. And sometimes features are so obviously good or bad that we can still use our own data to kill them or iterate on them.
Example?
We used to have a cumulative event graph. Turned out basically no one used it, ever. Nor did they want to - it was awful for insights. We ultimately killed it.
One of your explicit goals is to build UIs for non-technical people. Statistics and analytics are hard. How do you communicate the complexity inherent in these concepts?
We start off by aggressively managing expectations about what Heap can do. We also try to help people understand the danger of false positives. But at the end of the day, we're dealing with customers who know the ins and outs of their own products. We've found that even non-technical people have a good grasp of what their products are and what the data can and should refer to, which is really important. We're also not trying to be overly prescriptive. Instead, we're upfront about the fact that we're giving users a set of tools they can use to analyze their data.