Avoid data coupling as well as code coupling
We talk a lot about reducing code coupling as a way to reduce the interdependency in the code so that changes to one piece of code can be made more safely. Implementations can be refactored, code can be modified, and new features can be added more easily if the code you are changing only has 5 interface points with other code, rather than 50.
Isn’t it the same with data? When you define a table in a relational database, you are defining what data parts an entity can have. And when you put data in that table, you are mixing that entity’s data with all the other entities that have data in that table. You are essentially coupling the data among all the entities in that table. Now you can’t separate the entities very easily and you can’t change the data structure or types for an entity without affecting all the other entities.
In a document-centric model, if each entity has its own document, you can insert, delete, and modify entities with no effect on other entities. You can also create new types of entities, and even convert some old entities into the new type, without affecting the rest.
But even if you are using XML, you may still gravitate toward data coupling if an entity's data gets sharded into too many pieces. I use a rule of thumb that an entity is something that has its own lifecycle. It can reference other entities (like a foreign key), but those are references to other entities, not sharded data parts of the same entity.
For example, a Person entity may have a “name” element, which does not exist independently and does not have its own lifecycle. So “name” is an element of a Person. But that Person may be in a Company, and that Company does exist independently of any Person and has its own lifecycle. So Company should be a separate entity, and each Person entity would hold an id reference to the Company it belongs to.
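To make the Person and Company example concrete, here is a minimal sketch in Python. It is only an illustration: the in-memory `documents` dictionary stands in for a document database, and the URIs, the element names (`company-id`, `id`) and the `company_for` helper are all assumptions made up for this example, not part of any MarkLogic API.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a document store: one XML document per entity,
# keyed by URI. URIs and element names are invented for this sketch.
documents = {
    "/company/c1.xml": "<company><id>c1</id><name>Acme</name></company>",
    "/person/p1.xml": (
        "<person><id>p1</id>"
        # "name" has no lifecycle of its own, so it lives inside the Person.
        "<name><first>Ada</first><last>Lovelace</last></name>"
        # Company has its own lifecycle, so the Person only holds a reference.
        "<company-id>c1</company-id>"
        "</person>"
    ),
}

def company_for(person_uri):
    """Follow a Person's company-id reference; return the Company name or None."""
    person = ET.fromstring(documents[person_uri])
    company_id = person.findtext("company-id")
    for xml in documents.values():
        root = ET.fromstring(xml)
        if root.tag == "company" and root.findtext("id") == company_id:
            return root.findtext("name")
    return None

person_company = company_for("/person/p1.xml")
print(person_company)
```

The point of the layout is that the Person document is complete on its own: everything without an independent lifecycle is nested inside it, and the only link to the Company is a single id.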
Now what happens if a Company entity gets deleted while a Person still references it? Then the Person holds an ID that points to nothing, but the Person entity is still complete: it still exists and can still be viewed, modified, and potentially reassigned to another Company. And when a Person referencing some Company is deleted, the Company doesn’t know and doesn’t care.
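The deletion case can be sketched the same way. This is a hedged illustration using an in-memory dictionary in place of a real database; the `store` dictionary, the URIs, the element names and the `find_company` helper are assumptions for the example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical store: one document per entity, keyed by an invented URI.
store = {
    "/company/c1.xml": "<company><id>c1</id><name>Acme</name></company>",
    "/person/p1.xml": (
        "<person><id>p1</id><name>Ada</name>"
        "<company-id>c1</company-id></person>"
    ),
}

def find_company(company_id):
    """Return the Company element with the given id, or None if it is gone."""
    for xml in store.values():
        root = ET.fromstring(xml)
        if root.tag == "company" and root.findtext("id") == company_id:
            return root
    return None

# Delete the Company. The Person document is not touched at all.
del store["/company/c1.xml"]

person = ET.fromstring(store["/person/p1.xml"])
dangling = find_company(person.findtext("company-id")) is None
person_name = person.findtext("name")  # the Person is still readable

# The Person can simply be reassigned by updating its reference.
person.find("company-id").text = "c2"
store["/person/p1.xml"] = ET.tostring(person, encoding="unicode")
```

After the delete, the reference dangles but nothing else breaks. Whether a dangling id should be cleaned up, ignored, or reported is an application decision that this sketch leaves open.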
I typically expect to be able to count the entity data models in a MarkLogic web application on one hand. Starting with what makes conceptual sense is usually the best approach: which “entities” seem to have their own lifecycles, and how do they relate to each other?
This reduces data coupling so changes can be made easily and with low risk. Entities can be inserted and deleted without affecting other entities. It also tends to foster queries that are simple and “fully searchable” so that they utilize the indexes appropriately for maximum performance and minimal disk reads.