Monday, February 5, 2007

Sidecar Styles for SpreadsheetML (and Noisy XML Formats)

Spreadsheets are accessible. Business users, in particular, really seem comfortable with spreadsheets. That's a good thing, particularly if you want to ask a business user for some data that can be represented well in tabular form.

For instance, we have a multi-tenant product that requires a fair amount of configuration, and that configuration is often best done by business users with only a moderate background in technology. Those users would find an XML file onerous, and a text-file error-prone, but are more than happy to configure the application using a spreadsheet.

However, these business users are using Excel. Excel is a binary format, which doesn't integrate exceptionally well with a development toolset. For instance, it's fairly difficult to compare two Excel spreadsheets and find the differences. Normal diff tools (commercial examples include BeyondCompare, Araxis Merge) balk at providing a diff on a binary file, and Excel doesn't come with its own capability for comparison (unlike, say, Word). You can invest in a tool like Synkronizer, or you can save the spreadsheet to some kind of non-binary format.

This is where SpreadsheetML comes in. Excel can save its workbooks in an XML format called SpreadsheetML. XML is much more amenable to comparison analysis and human-readability. Unfortunately, most office/document XML formats have a very low signal-to-noise ratio: XML is verbose to begin with, and style and layout information can outweigh actual content by an order of magnitude.

In SpreadsheetML, this means that adding a few cells on a new row will result in a 20-row change, when you factor in the cell, row and data tags, the type, styleid and height attributes, and so forth.

Sidecar Formats
One of the things I've enjoyed about using Photoshop with RAW files (e.g. Canon's CR2 format) is that your current 'raw' conversion settings can be stored alongside the raw file in what's known as a sidecar file. As a result, the original raw file is untouched, and some additional information is stored where it's easily accessible, copyable, and so forth.

SpreadsheetML and Sidecar Styles
Going back to SpreadsheetML, wouldn't it be nice if the styling information were similarly stored in a sidecar file? This would buy a separation of content from form, the ability to provide more than one possible styling per data sheet, the ability to diff the data without having to get confused in all the formatting. This feels both more powerful and more useful.

No comments: