Reporting with MongoDB

I gave a talk at the 2011 Iowa Code Camp on Saturday, April 30th.

Given a lot of events like this in MongoDB:

db.events.insert({
    report: "file1",
    time: { dateHour: "2011042412",
            date: "20110424",
            month: "201104" },
    data: { product: "coffee",
            age: "young",
            height: "tall",
            gender: "m",
            mood: "happy" }
})

And “pivots” defined like this:

db.pivots.insert({
  company: "JavaCo",
  dimensions: ["age", "gender"]
})

And some Scala that more or less does this:

val buffer = ... // local filesystem file buffer
val aggregator = ... // a wrapper for MongoDB
while( true ) {
  for( report <- buffer.unprocessedFiles ) {
    aggregator.loadFile(report)
    for( pivot <- pivots ) {
      events.???(pivot, report, "dateHour")
    }
    buffer.remove(report)
    aggregator.purge(report)
  }
  pause
}
aggregator.close
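
The loop above assumes a pivots value already in scope. A hypothetical way to load it with the legacy mongo-java-driver (the "reporting" database name and connection details are assumptions, not the loader's actual code):

import com.mongodb.{Mongo, BasicDBObject, DBObject}
import scala.collection.JavaConverters._

// Fetch the pivot definitions for one company; each DBObject carries
// the "dimensions" array used when building map-reduce keys.
val db = new Mongo("localhost", 27017).getDB("reporting")
val pivots: Seq[DBObject] =
  db.getCollection("pivots")
    .find(new BasicDBObject("company", "JavaCo"))
    .toArray.asScala.toSeq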

We’re left to define what “???” does.

My current thought is that something like this is appropriate:

m = function() {
    emit(
        { "pivot" : ObjectId("4dba..."),
          "time"  : { "dateHour" : this.time.dateHour },
          "data"  : { "age"    : this.data.age,
                      "gender" : this.data.gender } },
        1 );
};

r = function(key, values) {
    var total=0;
    for ( var i=0; i < values.length; i++ ) {
        total += values[i];
    }
    return total;
};

db.events.mapReduce(m, r, {
    out  : { reduce: "aggregates" },
    query: {
        "filename": "file1",
        "data.product": "coffee" }
})
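
With out : { reduce: "aggregates" }, each run is re-reduced into whatever is already in the aggregates collection, so one file's counts fold into the running totals. Every output document carries the emitted key under _id and the summed count under value. As a rough illustration (the "reporting" database name is an assumption), reading one hour's totals back from Scala might look like this:

import com.mongodb.{Mongo, BasicDBObject}
import scala.collection.JavaConverters._

// Query the map-reduce output collection by part of the emitted key.
// Each document looks like { _id: { pivot, time, data }, value: <count> }.
val db = new Mongo("localhost", 27017).getDB("reporting")
val hourTotals = db.getCollection("aggregates")
  .find(new BasicDBObject("_id.time.dateHour", "2011042412"))
  .toArray.asScala

hourTotals.foreach(println)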

My sample dataset is 2 million events spread across 114 files.

I’m using the SAFE WriteConcern.
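
For reference, setting that on the collection the loader writes to is a one-liner in the legacy Java driver (a sketch; the "reporting" database name is an assumption):

import com.mongodb.{Mongo, WriteConcern}

// SAFE waits for the server to acknowledge each insert, so load errors
// surface immediately instead of being fire-and-forget.
val events = new Mongo("localhost", 27017).getDB("reporting").getCollection("events")
events.setWriteConcern(WriteConcern.SAFE)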

Earlier today I turned off atime updates on the EBS volume that I had run the load test on. This was purely loads — no pivots were defined, ergo no mapreduce was performed. With atime on, loading my dataset took 425 seconds; with atime off it took 414 seconds. I don’t have enough data to know whether this 11-second speedup is real, but in any case it’s not a significant one. As expected. I can now check off that little optimization item.

The real way to speed up this phase is likely running 8-10 EBS volumes together as a RAID 10 volume using LVM.

The other phase — mapreduce — is CPU-bound. And due to the single-threadedness of mapreduce, I can only bring to bear the power of one core without sharding (or doing the computation in Scala, as sketched below). This is the more serious bottleneck. I hope that MongoDB 2.0 addresses the concurrency problems with mapreduce or provides some new features that solve the aggregation problem.
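
For illustration only, the "computation in Scala" alternative might look roughly like this: fold a file's events into per-(dateHour, dimension values) counts in application code, then write only the totals to MongoDB. The Event type and field names here are assumptions, not the loader's actual model:

// Hypothetical event shape; the real loader's model may differ.
case class Event(dateHour: String, data: Map[String, String])

// Count events per (dateHour, selected dimension values), e.g. ("age", "gender").
def aggregate(events: Iterator[Event],
              dimensions: Seq[String]): Map[(String, Seq[String]), Int] =
  events.foldLeft(Map.empty[(String, Seq[String]), Int]) { (totals, e) =>
    val key = (e.dateHour, dimensions.map(d => e.data.getOrElse(d, "")))
    totals.updated(key, totals.getOrElse(key, 0) + 1)
  }

Running that per file on the loader boxes would use as many cores as there are worker threads, at the cost of moving the aggregation logic out of the database.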

Here are some of the resources I drew upon for the talk.
