An Interview With Sourcegraph

Better coding, better communities.

Posted by Vaibhav Mallya on May 12, 2014
I stumbled on Sourcegraph when I was Googling for a usage example of one of the Flask APIs. Apparently, that's exactly what the team wants - Sourcegraph is trying to become the most comprehensive utility for code search, comprehension, and reputation.

Sourcegraph HQ is in a short building on the border of the Financial District. I buzz the intercom, and Charles, the community manager, lets me know that one of the founders is coming down. Quinn waves me inside, and we walk up the stairs and past a broken elevator. The office at the top is surprisingly large, with several large TVs lining the walls. A drone rests in one corner, and a pile of dog toys lies strewn haphazardly in another. "Milton's at doggy day care today, but he's here most days. Want to eat?" Quinn asks. "Yep!" I exclaim. We start talking, alternating with bites of our sandwiches.
Beyeng Liu - Founder. Stanford grad, ex-Palantir engineer. Enjoys eating, hiking, and eating.

Beyeng Liu

How did it start?
Quinn and I went to Stanford together, worked on a bunch of class projects together, and worked at Palantir together. One day I was chatting with him at my housewarming in Corona Heights - we were discussing issues we'd experienced as programmers. That's where the discussion really started.
What was the most compelling part of the idea?
Programmers today are mostly reading code, making decisions about the libraries to use, and figuring out how to put those libraries together. The other tools available today don't take advantage of all the information that's available. I like to say that the right example is worth a thousand lines of documentation - an example can show you at a glance how you should use an API and reason about it.
If Google Code Search had still been around, do you think you still would have started this?
We do things quite a bit differently than Google Code Search. It was this really fast trigram index. It was great if you had a particular expression in mind. But they didn't go through and parse the code like we do. It couldn't tell you who else uses this particular function or repository. We can. The use-case that we're targeting is: "I'm writing some code. I want to use a specific function and figure out how to use it as quickly as possible."
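To make the contrast in this answer concrete, here is a minimal Python sketch. It is not Google Code Search's or Sourcegraph's actual implementation; the sample files and the helper names (`build_index`, `search`, `call_sites`) are assumptions made up for illustration. The first half builds a toy trigram index that finds any text containing a query string. The second half parses the code with Python's standard `ast` module, so it can tell a real call from a mere mention.

```python
import ast
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index

def search(index, docs, query):
    """Return names of documents containing query (at least 3 characters)."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least 3 characters")
    # A match must contain every trigram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing trigrams is necessary but not sufficient, so confirm each hit.
    return sorted(name for name in candidates if query in docs[name])

def call_sites(source, func_name):
    """Return line numbers where func_name is called as a plain name."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    )

# Hypothetical sample corpus, invented for this example.
docs = {
    "app.py": "from flask import url_for\nlink = url_for('index')\n",
    "notes.txt": "remember to call url_for in the template\n",
    "util.py": "def for_url(u):\n    return u\n",
}
index = build_index(docs)

print(search(index, docs, "url_for"))         # text matches, including prose
print(call_sites(docs["app.py"], "url_for"))  # actual call sites only
```

The text search returns both `app.py` and `notes.txt`, because a trigram index only knows about characters. The parsed version reports the one line in `app.py` where `url_for` is actually called, which is the kind of question ("who else uses this function?") that requires parsing the code.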
You're using PySonar and RubySonar to do deep analysis and type inferencing. In terms of the accuracy, are you seeing one language be easier than others to get right? For example, Ruby and Rails have a lot of metaprogramming patterns, as does Python.
Go is by far the easiest, since it's both statically typed and has good libraries for introspection. PySonar does a really good job at Python. For Ruby, we're using RubySonar and YARD. The particular tools we're using are in flux, but we're going to open source them soon. As far as metaprogramming goes, it turns out with a few heuristics, you can capture a lot of the magic that people do in Ruby and Python.
What's your existing stack, and what kind of data volumes are you looking at?
Mostly Go on AWS. There's a massive PostgreSQL instance and a massive Elasticsearch node. We also have a lot of file storage on S3. Basically, all history of every repository we've ever looked at, we store. We also have this really cool setup with Makefiles - when you're processing, analyzing, indexing repositories, you have a lot of complex and interdependent steps. Our system does all that with Makefiles very elegantly. We're both systems folks, so we gravitate towards those kinds of solutions. Overall we have 5 terabytes of data right now. EBS has the git repositories for speed and cost - it would be too expensive to get things from S3 for every request.
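The core idea behind using Makefiles for interdependent steps is dependency ordering: each step declares what it needs, and the tool works out a valid order. The Python sketch below illustrates only that idea, using the standard library's `graphlib` (Python 3.9+). The step names are hypothetical, not Sourcegraph's actual pipeline, and a real Makefile also skips targets that are already up to date, which this sketch does not model.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the steps it depends on,
# much like a Makefile target lists its prerequisites.
steps = {
    "clone": set(),
    "resolve_deps": {"clone"},
    "analyze": {"clone", "resolve_deps"},
    "index": {"analyze"},
}

def build_order(graph):
    """Return the steps in an order that satisfies every dependency."""
    return list(TopologicalSorter(graph).static_order())

def run(graph, actions):
    """Run each step's action after all of its dependencies have run."""
    log = []
    for step in build_order(graph):
        log.append(actions[step]())
    return log

# Placeholder actions that just report which step ran.
actions = {name: (lambda n=name: f"ran {n}") for name in steps}

for line in run(steps, actions):
    print(line)
```

Declaring dependencies instead of hand-writing the order means a new step only has to state what it needs. A circular dependency raises `graphlib.CycleError` rather than silently running steps in the wrong order.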
Something like this would be super useful for companies. Have you given any thought to that?
We have. We've talked to Twitter, Facebook, and a bunch of other companies about private installations. Down the road, yes, we are open. But for now we want to focus on building an awesome product that every single programmer can use. Sourcegraph should be one of the top three applications open while you're coding.
Quinn Slack - Founder. Stanford graduate, ex-Palantir engineer. Dog owner and avid reader of historical nonfiction.

Quinn Slack

EBS has historically been pretty terrible on a lot of fronts. What's been your experience using it?
It is slow, but it's the simplest thing that works (that we've found). We need a filesystem because storing git/hg repositories on S3 or other object stores would be much slower and quite costly (due to the number of random reads/writes, and the total number of objects stored per repository). Currently we get ~40msec latency for fetching a file from git, and ~100msec from hg. Those numbers spike when the server's under heavy load, because of two bottlenecks: spawning the git/hg processes (which we can sometimes but not always avoid) and the EBS disk itself.
How do people use Sourcegraph?
So there are several main ways right now. One is if you're a programmer and you're writing some code, you can figure out how to use a specific piece of code. We can give you exact, real examples right now - nobody else can do that. The other thing - if you're an author, you can see how people are using your code. It's just a really amazing feeling to see other people using your stuff. We've also had Django, libcloud, and several other big projects use Sourcegraph to figure out: Can we simply deprecate this function? Or do we need a helper function? Finally, if you have a library or snippet, you can see who is using that snippet, line, or library. And you'll see how it filters down. So you can see the impact you're having.
I see - this isn't just a pull-request-based graph. You want to make a deep reputation system for code. Very interesting. What's next?
Well, Java is the next language we'll want to support. But more broadly, Sourcegraph is an interesting combination of systems and PL challenges. A friend who works at one of the Stanford security labs thinks Sourcegraph can help find a new class of exploit. Say some library makes some assumption with an API. It has its own static analysis tools, it has tests, and it's fine internally. But a lot of other things might call into it, and some of those callers might violate those assumptions.

The simplest example - some function casts to a C++ class representing the 'div' element. A lot of callers pass in something pretty close to this 'div' element's class, and that works fine. But say they give it something that's different, and the user can control something and overwrite memory. We'd love to make it so the world can write these kinds of tools very easily.
Anything else?
Yeah, something else we want to do is really just foster more community around open source with Sourcegraph. Better ways of reasoning about bugs, decision tracking, and discussions. We'll be adding those features over time.
Mailing lists feel a bit closed off, even today. God forbid you make a mistake or need someone patient to guide you. It would be cool to imagine democratizing aspects of that discussion - GitHub is close, but not quite there.
I totally agree - and if you and I feel like these mailing lists are closed off, imagine how people who are new to programming feel. If we can make more ways for people to stick their feet in and get even a little bit involved, we can make a meaningful difference in terms of people contributing to the open source world.
Milton - Australian Shepherd. Not an engineer.


[disclaimer - I never actually met Milton, but his picture was too adorable not to share.]