Web-based IDEs and AST Formats

I noticed yesterday that GitHub is using ACE to edit files. Apparently ACE is a descendant of the Bespin project.

In a previous post I described a script I had written that parses python into an AST represented using JSON. I’m looking for budding attempts at standardizing such representations. Does ACE have something appropriate?

As I mentioned in that previous post, this was just a small piece of a project on source code search algorithms. It’s a fun project, but when I think about why more progress hasn’t been made by others, the lack of a good standard for representing ASTs seems to be the biggest barrier.

As much as I respect the work of Xtext, MPS and Spoofax, it seems to me that the key to unlocking all this potential is creating AST interchange formats that Javascript can easily manipulate.

Five years ago the strategies with the most buzz were 1) ATerm and 2) MOF. I don’t imagine that projects like ACE will ever give the time of day to MOF. And I’m not aware of any plans to build a Javascript library for ATerm.

OMeta/JS was a cool attempt at creating language-oriented tools in a browser, but as far as I know, it never gained a big following.

The hardest part about the vast majority of software these days is not performance optimization or correctness or proving properties about some fancy type system. It’s decomposing the problem into components that the world is ready to digest.

Many of the existing attempts at multi-language IDEs have just been too big. I’m thinking of the “libraries not frameworks” meme that has been making the rounds on Twitter lately. I’d like to be able to embed text-editing panes for language mash-ups and DSLs in my web applications. The big GUI tools out there at the moment can’t be carved up to suit that purpose.

On the other hand, some attempts like the syntax highlighter I’m now using on this blog don’t seem ambitious enough. Will this project ever evolve into a more powerful tool, or will it be content to simply highlight text? The problem is that — as anyone who has used syntax highlighting in text editors like emacs knows — there are frequently edge cases that stymie anything short of a full description of the language.

Projects like ACE seem to be in the right position to take the lead. I’ll update this post if I find any answers.

Reporting with MongoDB

I gave a talk at the 2011 Iowa Code Camp on Saturday, April 30th.

Given a lot of events like this in MongoDB:

db.events.insert({
    report: "file1",
    time: { dateHour: "2011042412",
            date: "20110424",
            month: "201104" },
    data: { product: "coffee",
            age: "young",
            height: "tall",
            gender: "m",
            mood: "happy" }
})

And “pivots” defined like this:

db.pivots.insert({
  company: "JavaCo",
  dimensions: ["age", "gender"]
})

And some Scala that more or less does this:

val buffer = ... // local filesystem file buffer
val aggregator = ... // a wrapper for MongoDB
while( true ) {
  for( report <- buffer.unprocessedFiles ) {
    aggregator.loadFile(report)
    for( pivot <- pivots ) {
      events.???(pivot, report, "dateHour")
    }
    buffer.remove(report)
    aggregator.purge(report)
  }
  pause
}
aggregator.close

We’re left to define what “???” does.

My current thought is that something like this is appropriate:

m = function() {
    emit( { "pivot" : ObjectId("4dba..."),
            "time" : { "dateHour" : this.time.dateHour },
            "data" : { "age" : this.data.age,
                       "gender" : this.data.gender } },
          1 );
};

r = function(key, values) {
    var total=0;
    for ( var i=0; i < values.length; i++ ) {
        total += values[i];
    }
    return total;
};

db.events.mapReduce(m, r, {
    out  : { reduce: "aggregates" },
    query: {
        "report": "file1",
        "data.product": "coffee" }
})
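To make the key structure concrete, here is a pure-Python simulation of what this map/reduce computes over a couple of in-memory events. The field names and the truncated pivot id follow the documents above; the helper names (`map_event`, `aggregate`) are my own, not part of the real pipeline:

```python
from collections import Counter

def map_event(event, pivot_id, dimensions):
    """Build the map/reduce key for one event: the pivot id, the time
    bucket, and only the dimensions named by the pivot."""
    key = (
        ("pivot", pivot_id),
        ("time", event["time"]["dateHour"]),
        ("data", tuple((d, event["data"][d]) for d in dimensions)),
    )
    return key, 1

def aggregate(events, pivot_id, dimensions):
    """The reduce step: sum the 1s emitted for each distinct key."""
    totals = Counter()
    for event in events:
        key, value = map_event(event, pivot_id, dimensions)
        totals[key] += value
    return totals

events = [
    {"time": {"dateHour": "2011042412"},
     "data": {"product": "coffee", "age": "young", "gender": "m"}},
    {"time": {"dateHour": "2011042412"},
     "data": {"product": "coffee", "age": "young", "gender": "m"}},
    {"time": {"dateHour": "2011042413"},
     "data": {"product": "coffee", "age": "old", "gender": "f"}},
]

totals = aggregate(events, "4dba...", ["age", "gender"])
for key, count in sorted(totals.items()):
    print(dict(key), "->", count)
```

The two identical young/male events collapse into one key with a count of 2, which is exactly what lands in the “aggregates” output collection, one document per distinct (pivot, time, data) combination.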

My sample dataset is 2 million events spread across 114 files.

I’m using the SAFE WriteConcern.

Earlier today I turned off atime updates on the EBS volume that I had run the load test on. This run was loads only: no pivots were defined, ergo no mapreduce was performed. With atime on, my dataset took 425 seconds; with atime off, 414 seconds. I don’t have enough data to know if this 11-second speedup is real, but in any case it’s not significant, as expected. I can now check off that little optimization item.

The real way to speed up this phase is likely running 8-10 EBS volumes together as a RAID 10 array using LVM.

The other phase — mapreduce — is CPU bound. And due to the single-threadedness of mapreduce, I can only bring to bear the power of one core without sharding (or doing the computation in scala). This is the more serious bottleneck. I hope that MongoDB 2.0 addresses the concurrency problems with mapreduce or provides some new features that solve the aggregation problem.

Here are some of the resources I drew upon for the talk.

Convert a Python AST to a JSON Document

I’ve added my python2json.py script to GitHub.

This is a small piece of the source code search algorithm project that I’ve been working on. I think this piece is useful in its own right, and that releasing it doesn’t impinge too much on the larger project.

As an example, let’s say we have the following code in example.py:

x = 1 + 2
print x

The script can be invoked with the -f option, or it can read from stdin. You can pipe the output through json.tool, as I do here, to pretty-print the result:

./python2json.py -f example.py | python -mjson.tool

This will output:

{
    "_lineno": null,
    "node": {
        "_lineno": null,
        "nodes": [
            {
                "_lineno": 2,
                "expr": {
                    "_lineno": 2,
                    "left": {
                        "_lineno": 2,
                        "type": "Const",
                        "value": "1"
                    },
                    "right": {
                        "_lineno": 2,
                        "type": "Const",
                        "value": "2"
                    },
                    "type": "Add"
                },
                "nodes": [
                    {
                        "_lineno": 2,
                        "name": "x",
                        "type": "AssName"
                    }
                ],
                "type": "Assign"
            },
            {
                "_lineno": 3,
                "nodes": [
                    {
                        "_lineno": 3,
                        "name": "x",
                        "type": "Name"
                    }
                ],
                "type": "Printnl"
            }
        ],
        "type": "Stmt"
    },
    "type": "Module"
}
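The script itself targets Python 2’s old compiler module, but the core idea fits in a few lines. Here is a minimal sketch of the same conversion using the standard ast module instead; the `to_dict` helper is my own name, not part of the script:

```python
import ast
import json

def to_dict(node):
    """Recursively convert an ast.AST node into a plain dict,
    tagging each node with its type name and line number."""
    if isinstance(node, ast.AST):
        result = {"type": type(node).__name__,
                  "_lineno": getattr(node, "lineno", None)}
        for field, value in ast.iter_fields(node):
            result[field] = to_dict(value)
        return result
    if isinstance(node, list):
        return [to_dict(item) for item in node]
    return node  # leaf values: ints, strings, None, ...

tree = ast.parse("x = 1 + 2\nprint(x)")
print(json.dumps(to_dict(tree), indent=4, sort_keys=True))
```

The shape differs from the compiler-module output above (ast uses nodes like Assign/BinOp/Call rather than Assign/Add/Printnl), but the JSON structure — a type tag, a line number, and recursively converted child fields — is the same.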

YAML Schema with Moose

A friend asked me recently about parsing YAML with Perl. He needed to impose some additional structure on a set of YAML documents. His initial approach was to define a new grammar for the language, but this was turning out to be non-trivial, due to significant whitespace and other complexities.

I told him about how I had used Moose types together with YAML.pm to achieve the effect of a “schema” for document types defined on top of YAML. But my solution relied on the “!!perl/hash:Bar” syntax to tell the deserializer how to bless the parsed perl data structure. That’s OK for a purely internal document that I was using, but not appropriate for anything that might be more public.

Stack Overflow led me to a cleaner solution that uses Moose’s type coercion. When a raw deserialized Perl data structure is passed to a Moose constructor and a type constraint is not initially met, Moose will look for matching coercions.

I’ve been looking for a schema language for YAML and JSON for a while. Wikipedia suggests that Kwalify, Rx, and Doctrine can all fulfill that role, but in the absence of consensus about a schema language, I’d prefer to use something that I have more control over.

Here’s an example. Let’s say we have the following yaml file:

---
name: Extreme Foo
id: 10
alias: FooX
bars:
  - id: 1
    name: bar1
  - id: 2
    name: bar2

The obviously implied Moose classes (in MooseX::Declare syntax) that would define Foo and Bar are:

class Bar {
    has 'name' => (isa => 'Str', is => 'ro', required => 1);
    has 'id' => (isa => 'Int', is => 'ro', required => 1);
}

class Foo {
    has 'name' => (isa => 'Str', is => 'ro', required => 1);
    has 'id' => (isa => 'Int', is => 'ro', required => 1);
    has 'alias' => (isa => 'Str', is => 'ro', required => 0);

    has 'bars' => (isa => 'ArrayRef[Bar]',
		   is => 'ro',
		   required => 1,
		   default => sub { [] } );

    method print() {
	print $self->name . "\n"; # etc
    }
}

Unfortunately, Moose will complain that the “bars” attribute is not of the correct type. To fix this, we set the coerce flag on the “bars” attribute so that Moose knows to look for a matching coercion during object construction:

    has 'bars' => (isa => 'ArrayOfBars',
		   is => 'ro',
		   coerce => 1, # This tells Moose to look for matching type coercions
		   required => 1,
		   default => sub { [] } );

In this case — because the “Bar” is embedded in the parameterized ArrayRef type — we also need a new type called ArrayOfBars, and a coercion from ArrayRef[HashRef] to ArrayOfBars.

use Moose::Util::TypeConstraints;

subtype 'ArrayOfBars'
    => as 'ArrayRef[Bar]';

coerce 'ArrayOfBars'
    => from 'ArrayRef[HashRef]'
    => via { [ map { Bar->new($_) } @{$_} ] };

Meaning that we can now do this with a YAML file that does not contain the “!!” syntax:

my $foo = Foo->new(LoadFile('example2.yaml'));
$foo->print();
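For comparison, here is a rough Python sketch of the same idea: coerce plain parsed dicts into typed objects at construction time. The dict literal stands in for a YAML loader’s output on the example document above, and all names here are my own, not from the Perl code:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Bar:
    id: int
    name: str

@dataclass
class Foo:
    name: str
    id: int
    alias: Optional[str] = None
    bars: List[Bar] = field(default_factory=list)

    def __post_init__(self):
        # The coercion step: promote plain dicts (as a YAML loader
        # would produce them) into Bar instances.
        self.bars = [b if isinstance(b, Bar) else Bar(**b)
                     for b in self.bars]

# Stand-in for what yaml.safe_load would return for the file above:
doc = {"name": "Extreme Foo", "id": 10, "alias": "FooX",
       "bars": [{"id": 1, "name": "bar1"}, {"id": 2, "name": "bar2"}]}

foo = Foo(**doc)
print(foo.name, [b.name for b in foo.bars])
```

As with the Moose version, the “schema” lives entirely in the class definitions, and the raw document never needs type-tagging syntax.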

The complete code is available on GitHub.

Scala Intro

David Pollack recently gave an overview of the Scala language at a BASE (Bay Area Scala Enthusiasts) meetup. A video has been posted at http://blip.tv/file/4243180. I learned a few things.

Seeing a presentation is always different from reading text. If you’re curious about Scala, or have even been using it for a while, I recommend it.