April 22, 2011 / Ben Chun

NoSQL, Parallelism, and Teaching in the Face of Uncertainty

Yesterday’s Amazon outage should serve to emphasize something we already know: people are trying to build modern web applications for scale. The layer of abstraction offered by virtualization and/or cloud providers can help. And there are a lot of businesses, even technology-driven businesses (Engine Yard and Heroku were among those down yesterday), that don’t want to manage the complexities behind that abstraction.

The outage is a good, if painful, reminder for businesses that those complexities exist. For those of us in CS education, it should also be a wake-up call of a different sort: The future is parallel and distributed, but we’re not teaching for that future.

Right now, most CS1 courses include object-oriented concepts like inheritance. Most don’t include parallel programming concepts. When I was in college, we studied “multithreading” and learned to avoid it. A lot has changed in the 10 years since. Today, interesting computations are being done not only across multiple cores, but multiple machines, and multiple data centers. This stuff is no longer the domain of esoteric supercomputing centers and research projects — it’s available for a low monthly cost via well-documented APIs. There are some well-articulated systems engineering ideas that attend distributed computing. That doesn’t mean it should automatically be a part of an intro course. But are we making that decision consciously and intentionally?

Maybe new languages or compilation techniques will be able to automatically parallelize computations that we express in our existing languages. But that’s not the trend we see. Some details, particularly the details about computational complexity, can’t be pushed down behind abstractions. Relational databases hide a bunch of work behind statements like JOIN. That doesn’t scale. Facebook, for example, uses MySQL and distributes their database over multiple shards. You can’t JOIN across shards. Facebook has a big layer of memcached between the database and the application logic. So they’re effectively building their application against a key-value store, and doing all the JOIN-like work in the application layer.
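To make that concrete, here is a minimal sketch of what "JOIN-like work in the application layer" looks like. A plain Python dict stands in for memcached or a sharded key-value store, and the schema (users, posts, a hand-maintained index of recent posts) is hypothetical, not Facebook's actual data model:

```python
# A dict stands in for a key-value store like memcached.
# All keys and the schema here are illustrative assumptions.
kv = {
    "user:1": {"name": "Ada"},
    "user:2": {"name": "Grace"},
    "post:10": {"author_id": 1, "text": "Hello"},
    "post:11": {"author_id": 2, "text": "World"},
    "posts:recent": [10, 11],  # an index the application maintains itself
}

def recent_posts_with_authors():
    # What SQL would express declaratively as
    # SELECT ... FROM posts JOIN users ON posts.author_id = users.id
    # becomes a series of explicit lookups in application code.
    result = []
    for post_id in kv["posts:recent"]:
        post = kv["post:%d" % post_id]
        user = kv["user:%d" % post["author_id"]]
        result.append({"author": user["name"], "text": post["text"]})
    return result

print(recent_posts_with_authors())
```

The point isn't that this is hard to write; it's that every join, every index, and every consistency guarantee the relational database used to provide now lives in application code.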

But what are we teaching in introductory database classes? I’d be surprised if the answer is anything but relational databases. I’m not saying that SQL is irrelevant, or that object-oriented programming is irrelevant. But when we focus on these topics to the exclusion of others that we know to be important (and quite interesting from a CS perspective), can we honestly say we’re introducing students to the realities of our field?

5 Comments

  1. Lukas / Apr 22 2011 10:03 am

    There’s another interesting angle to this, which is that large scalable web applications are typically built (hopefully) as a set of loosely coupled services with well-defined interfaces. Sort of a macro scale version of how a student’s code should be structured. Might be interesting to explore.

  2. Adam Marcus / Apr 23 2011 4:41 am

    Most database courses worth their salt do not focus solely on the traditional relational model. Here’s one example of a syllabus, in which the second half of the course focuses on newer techniques (parallel databases, streaming, bigtable, mapreduce, column stores, …): http://db.csail.mit.edu/6.830/sched.html

    It’s important to cover recent approaches in a self-respecting databases course. But I think it’s the wrong approach for introductory courses, at least with respect to data management. There’s an argument to be made for teaching parallelism, distribution, and service-oriented architectures (you know, the foursquare API) early. But I think we’re a long way off from considering ourselves responsible if we teach budding computer scientists NoSQL before or instead of the relational model.

    The majority of workloads and datasets fit in traditional single-node relational database installations just fine. They likely fit into a spreadsheet just fine. Teaching students to discard a single-node approach for a more complex one is teaching them premature optimization. It’s worse than that, actually. By teaching them to discard SQL, you’re teaching them to discard reusable components and declarative data processing.

    Facebook’s design decision to process JOINs in application logic has the benefit of predictable datastore performance. That predictability comes at the expense of each application reimplementing the JOIN, which discards 40 years of deep optimizations. This was a wise design decision on Facebook’s part, but it cannot be appreciated without first understanding systems performance (what’s a disk seek?), networking (packets are dropped? latency?), and distributed coordination protocols (what if half of the servers don’t respond to my key lookups?). Let’s teach those things before we rid our students of a powerful, relevant, provably correct, and new (to them) declarative programming model.

  3. gasstationwithoutpumps / Apr 23 2011 5:13 am

    Some intro programming languages, like Scratch, start out appearing multi-threaded, so that parallelism is not a scary thought. (Though students do have to learn about race conditions earlier than most CS faculty are willing to deal with them.)

  4. Ben Chun / Apr 23 2011 8:10 am

    Adam – I certainly hear what you’re saying about avoiding premature optimization. I’m not advocating that people implement JOIN themselves or shard out their 10k row dataset. I’m just asking if SQL is the best first thing to teach — and also asking if the Object-Oriented paradigm is the best first thing to teach.

    For example, I think OO concepts come very easily when introduced after a few weeks of Scheme. This is SICP talking, but I think it’s a lot easier to understand objects when you know what a stack frame is, and you know how the environment and scope work. Then gluing some data and functions together seems reasonable, and you know how it’s implemented and can reliably predict behavior — or at least know what questions to ask about possible behaviors.

    The goal is for the student to arrive at a correct and consistent mental model for what happens during program evaluation. All that said, I don’t see a clear reason that the OO paradigm must be taught before the functional paradigm or vice versa.

    Likewise, I don’t see why SQL has to come first. I’m not saying that it should be discarded. I just wonder if we’re helping students successfully build correct mental models. I think building up from B-trees is just as valid as starting out by saying “here’s SQL”. I know I would have a much greater appreciation for the 40 years of deep optimization on JOIN if I started out by implementing a primitive version of it myself.
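    For what it’s worth, that primitive version fits in a few lines. Here’s a sketch of a nested-loop join (the tables and key names are made up for illustration) — the O(n·m) scan that real databases replace with hash joins, merge joins, and B-tree index lookups:

    ```python
    def nested_loop_join(left, right, key):
        # Compare every row of the left table against every row of
        # the right table; emit the merged row on matching keys.
        out = []
        for l in left:
            for r in right:
                if l[key] == r[key]:
                    out.append({**l, **r})
        return out

    users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    posts = [{"id": 1, "text": "Hello"}, {"id": 2, "text": "World"}]
    print(nested_loop_join(users, posts, "id"))
    ```

    Writing this once, then watching it fall over on a large input, is exactly the kind of experience that makes a query optimizer feel earned rather than magical.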

  5. Ben Chun / Apr 23 2011 8:10 am

    Kevin – Yep, you’re right on. I discovered that earlier this year:
    https://itmoves.wordpress.com/2011/01/10/scratching-deep/
