Data systems are hard. We have a running joke on our team about how many times we’ve asked “what does one row of this data mean?” (and how hard it has been to get an answer). In my experience we rarely get this right, and certainly not quickly. Complex systems, many data sources, heterogeneous data formats, changes over time, updated business rules: all of these combine to make data correctness more of a journey than a destination.
Still, there are some patterns that can make this easier. Do you have a clean, untouched record of the data as it was originally captured? You’d be surprised how often the answer is “no”. Do you have an explicit record of the rules (or code) used to transform the data, and can you find the version that was actually applied? Can you rebuild the final data in an automated way? Can it be done quickly enough that you don’t dread doing it?
We are big believers in fungibility (people who know me can tell you how frequently I use this word). Systems should be rebuildable, copyable, and disposable. To do this we capture the raw archive separately from the data lake, and we can build entirely new environments in a few hours: any amount of data, any time you want. This allows new environments to be staged, tested, and accepted. Entire data lakes and their associated indices can be replaced in production seamlessly, and new copies (or subsets) can be stood up painlessly for new projects and other organizations.
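To make the pattern concrete, here is a minimal sketch of the idea in Python. All names here (`archive_record`, `transform`, `rebuild_lake`, the version string) are hypothetical illustrations, not our actual system: raw captures go into an immutable, content-addressed archive; the business rules live in versioned code; and the derived "lake" is something you can throw away and rebuild from the archive at any time.

```python
import hashlib
import json
from pathlib import Path

# Assumed versioning scheme for the transform rules; in practice this
# might be a git commit hash so you can always find "the right version".
TRANSFORM_VERSION = "2024-01-v3"

def archive_record(archive_dir: Path, raw_bytes: bytes) -> Path:
    """Write the raw capture once, keyed by content hash; never mutate it."""
    key = hashlib.sha256(raw_bytes).hexdigest()
    path = archive_dir / f"{key}.raw"
    if not path.exists():
        path.write_bytes(raw_bytes)
    return path

def transform(raw_bytes: bytes) -> dict:
    """The business rules, kept in code so they are explicit and versioned."""
    record = json.loads(raw_bytes)
    return {
        "id": record["id"],
        "amount_cents": round(float(record["amount"]) * 100),
        "_transform_version": TRANSFORM_VERSION,
    }

def rebuild_lake(archive_dir: Path, lake_dir: Path) -> int:
    """Rebuild the derived data entirely from the archive; safe to rerun."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for raw_path in sorted(archive_dir.glob("*.raw")):
        row = transform(raw_path.read_bytes())
        (lake_dir / f"{row['id']}.json").write_text(json.dumps(row))
        count += 1
    return count
```

Because the lake is a pure function of the archive plus the transform code, rebuilding it is idempotent: run it again (or against a new, empty environment) and you get the same result, which is what makes whole environments disposable.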
If you currently maintain systems fully or partially by hand you should give it a try. It’s easy. Contact us to get started.