Tech Blog‎ > ‎Geek corner‎ > ‎

Why do I need a schema anyway?

posted Feb 16, 2016, 1:41 PM by Michael Holleran   [ updated Aug 26, 2016, 1:10 PM by LauraC ]
"Art is limitation; the essence of every picture is the frame." — G.K. Chesterton

Summary

This article  is not yet another argument in the tiresome SQL vs. NoSQL debate. I think both technologies have their place. This is an explanation of the benefits of using a schema when the data can benefit from it.

Most NoSQL databases store data either in key/value form, or as XML/JSON documents. In almost all cases, they lack the concept of a schema. This presents certain advantages: programmers can store any data they want, they can change how they store the data over time without migrating old data, etc... That make sense for unstructured data, but when it comes to structured data, these advantages are offset by significant, and (I think) under-reported, downsides regarding the value of the data, and its long-term viability.

In this article, I describe how a schema can be an important asset when dealing with many types of data, and how the concept of schema can be extended to make it even more useful. 

Why a schema?

When writing software, we usually think of what the system is supposed to do. We should also think about what the software is not supposed to do.

In many ways, that's what a schema does. It's a way to define how data should behave, and how it should not behave. It's a way to draw the line between the "good" space, where data is consistent, and the "bad" space, where data is not consistent.

That is the main purpose of a schema. It's not a crutch to help the database engine. It's not an arbitrary set of limits created solely for the purpose of frustrating the programmer's creativity. It's about carving out a well-defined area in an infinite space of possibilities.

Advantages of Having a Schema

The following are the advantages to having a schema:
  • As a communication tool

The first advantage of having a schema is that it brings structure. This may sound tautological but I don't think it is. Having a formally defined structure for your data means that all parts of the system will have at least that much in common. A schema diagram is a great tool for communicating in a team.

  • As an error-catching mechanism

Having a well-defined schema will catch errors that would go undetected otherwise: null values where there shouldn't be, incorrectly spelled attribute/column names, values out of range, referential integrity, etc.

ProblemExample
Invalid dataProduct price = true (meaningless -- should be a number)
Missing dataLine item does not have a price.
Extraneous dataLine item has an extra attribute named "Color" and we don't know what it means.
Referential integrityOrder does not belong to a customer.
  • Discoverability - reports, other apps, etc.
An under-appreciated benefit of having a schema is also the discoverability it brings to your data. A well-defined schema means that other systems may also be able to use your data: ELT tools, reporting tools, even app generators.
  • For performance

A schema makes indexing easier. It also informs how the database retrieves your data.

  • For migration

Perhaps most importantly, a schema makes migrating the data much easier. Data tends to outlive applications. You will have to transform your data in any number of ways over its lifetime.

As Sarah Mei recently wrote in her remarkably clear and cogent piece:

"Schema flexibility sounds like a great idea, but the only time it’s actually useful is when the structure of your data has no value." -- Sarah Mei

Disadvantages of Having a Schema

The following are the advantages to having a schema:

  • It takes more time up front.
  • You can't store whatever you feel like.
  • You have to learn some data modeling.

There is nothing wrong about storing schema-less data if that makes sense for your particular problem. But we should stop pretending that NoSQL is the best solution for everything.