NoSQL, Parallelism, and Teaching in the Face of Uncertainty
Yesterday’s Amazon outage should serve to emphasize something we already know: people are trying to build modern web applications for scale. The layer of abstraction offered by virtualization and/or cloud providers can help. And there are a lot of businesses, even technology-driven businesses (Engine Yard and Heroku were among those down yesterday), that don’t want to manage the complexities behind that abstraction.
The outage is a good, if painful, reminder for businesses that those complexities exist. For those of us in CS education, it should also be a wake-up call of a different sort: The future is parallel and distributed, but we’re not teaching for that future.
Right now, most CS1 courses include object-oriented concepts like inheritance. Most don’t include parallel programming concepts. When I was in college, we studied “multithreading” and learned to avoid it. A lot has changed in the 10 years since. Today, interesting computations are being done not only across multiple cores, but across multiple machines and multiple data centers. This stuff is no longer the domain of esoteric supercomputing centers and research projects; it’s available for a low monthly cost via well-documented APIs. There are some well-articulated systems engineering ideas that attend distributed computing. That doesn’t mean distributed computing should automatically be a part of an intro course. But are we making that decision consciously and intentionally?
Maybe new languages or compilation techniques will be able to automatically parallelize computations that we express in our existing languages. But that’s not the trend we see. Some details, particularly the details about computational complexity, can’t be pushed down behind abstractions. Relational databases hide a lot of work behind statements like JOIN, and that hidden work doesn’t scale once the data is spread across machines. Facebook, for example, uses MySQL and distributes its database over multiple shards. You can’t JOIN across shards. Facebook also has a big layer of memcached between the database and the application logic. So they’re effectively building their application against a key-value store and doing all the JOIN-like work in the application layer.
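To make "JOIN-like work in the application layer" concrete, here is a minimal sketch in Python. It is not Facebook's code or architecture. The in-memory dictionaries stand in for MySQL shards and for memcached, and every name in it (ShardedStore, CachedStore, posts_with_authors, the modulo routing, the shard count) is a hypothetical chosen for illustration.

```python
from collections import defaultdict

NUM_SHARDS = 4  # assumption: a fixed shard count with modulo routing


def shard_for(key: int) -> int:
    """Route a key to a shard (hypothetical scheme: key modulo shard count)."""
    return key % NUM_SHARDS


class ShardedStore:
    """Toy key-value store split across in-memory 'shards'."""

    def __init__(self) -> None:
        self.shards = [dict() for _ in range(NUM_SHARDS)]

    def put(self, key: int, value: dict) -> None:
        self.shards[shard_for(key)][key] = value

    def multi_get(self, keys) -> dict:
        # Group keys by shard so each shard is visited only once.
        by_shard = defaultdict(list)
        for key in keys:
            by_shard[shard_for(key)].append(key)
        found = {}
        for shard_id, shard_keys in by_shard.items():
            shard = self.shards[shard_id]
            for key in shard_keys:
                if key in shard:
                    found[key] = shard[key]
        return found


class CachedStore:
    """Read-through cache in front of a ShardedStore (stand-in for memcached)."""

    def __init__(self, backing: ShardedStore) -> None:
        self.backing = backing
        self.cache = {}
        self.misses = 0  # number of keys that had to go to the shards

    def multi_get(self, keys) -> dict:
        keys = list(keys)
        result = {k: self.cache[k] for k in keys if k in self.cache}
        missing = [k for k in keys if k not in self.cache]
        if missing:
            self.misses += len(missing)
            fetched = self.backing.multi_get(missing)
            self.cache.update(fetched)
            result.update(fetched)
        return result


def posts_with_authors(posts, users) -> list:
    """Application-layer version of: posts JOIN users ON author_id = id."""
    # Collect the distinct foreign keys, fetch them in one batch, then
    # stitch the rows together in ordinary application code.
    author_ids = {post["author_id"] for post in posts}
    authors = users.multi_get(author_ids)
    return [
        {"title": post["title"], "author": authors[post["author_id"]]["name"]}
        for post in posts
        if post["author_id"] in authors
    ]


def demo():
    store = ShardedStore()
    store.put(1, {"name": "Ada"})    # lands on shard 1
    store.put(6, {"name": "Grace"})  # lands on shard 2
    users = CachedStore(store)
    posts = [
        {"title": "On engines", "author_id": 1},
        {"title": "On compilers", "author_id": 6},
        {"title": "More engines", "author_id": 1},
    ]
    return posts_with_authors(posts, users), users
```

The point of the sketch is where the cost shows up. The batching, the routing, and the handling of missing rows are all visible in application code, where a single-machine relational database would have hidden them behind one JOIN.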
But what are we teaching in introductory database classes? I’d be surprised if the answer is anything but relational databases. I’m not saying that SQL is irrelevant, or that object-oriented programming is irrelevant. But when we focus on these topics to the exclusion of others that we know to be important (and quite interesting from a CS perspective), can we honestly say we’re introducing students to the realities of our field?