It usually happens when dealing with messy legacy data sources. Every time the business needs to connect a new data source to the warehouse, a developer has to build a bespoke pipeline for it. You write the code, test it, deploy it, and maintain it. Eventually, you look up and realise your architecture is just a massive pile of near-identical notebooks that all do roughly the same thing, but in slightly different ways.
We recently faced this exact scenario with a client in the pension fund administration space. We weren’t working with tidy, modern data feeds. We were dealing with two decades-old legacy systems spanning hundreds, if not thousands, of files. We had proprietary flat files with complex multivalue structures and 30-year-old COBOL files relying on packed decimal encoding.
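To make "packed decimal" concrete: COBOL COMP-3 fields store two binary-coded decimal digits per byte, with the final half-byte carrying the sign. Here is a minimal Python sketch of a decoder, purely illustrative rather than the client's production code:

```python
from decimal import Decimal

def decode_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a COBOL packed-decimal (COMP-3) field.

    Each byte holds two BCD digits; the low nibble of the final byte
    is the sign (0xD = negative, 0xC / 0xF = positive or unsigned).
    """
    digits, sign = [], 1
    for i, byte in enumerate(raw):
        hi, lo = byte >> 4, byte & 0x0F
        if i < len(raw) - 1:
            digits += [hi, lo]
        else:
            digits.append(hi)                 # high nibble of the last byte is still a digit
            sign = -1 if lo == 0x0D else 1    # low nibble is the sign
    value = int("".join(map(str, digits)) or "0")
    return Decimal(sign * value).scaleb(-scale)

# 0x12 0x34 0x5C encodes +12345; with two implied decimal places -> 123.45
assert decode_comp3(bytes([0x12, 0x34, 0x5C]), scale=2) == Decimal("123.45")
```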
Traditionally, this would mean building a completely custom pipeline for every single file. That is a slow, expensive, and deeply frustrating way to scale an enterprise data platform.
So, we decided to fundamentally flip the problem: instead of writing code per source, what if you write config? Adding a new file becomes a metadata exercise, not an engineering one.
To build the backbone of our client's Bronze and Silver ingestion layers, we implemented DLT-Meta.
Built on top of Databricks’ Lakeflow Declarative Pipelines, DLT-Meta is a metadata-driven ingestion framework. The core concept is beautifully simple: instead of writing separate code for every source, you define your source and target metadata once in a configuration file (a Dataflowspec). From there, a single generic pipeline handles the rest.
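To give a flavour of what that metadata looks like, here is a simplified onboarding entry sketched as a Python dict. The field names and paths are illustrative assumptions, not the exact Dataflowspec schema, so check the DLT-Meta documentation for the real keys:

```python
# Illustrative only: one entry per source file, roughly in the spirit of a
# DLT-Meta onboarding record. Field names and paths are assumptions.
member_contributions_spec = {
    "data_flow_id": "pensions_member_contributions",
    "data_flow_group": "bronze_daily",
    "source_format": "cloudFiles",  # Databricks Auto Loader
    "source_details": {
        "source_path": "/Volumes/landing/pensions/member_contributions/",
        "source_schema_path": "/Volumes/config/schemas/member_contributions.ddl",
    },
    "bronze_database": "pensions_bronze",
    "bronze_table": "member_contributions",
    "silver_database": "pensions_silver",
    "silver_table": "member_contributions_clean",
}
```

Adding the next file is another entry like this, not another notebook.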
For a project dealing with thousands of legacy files, this distinction matters a lot. DLT-Meta means onboarding a new file is largely a configuration task, not a development one.
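The pattern underneath is one that Lakeflow Declarative Pipelines makes straightforward: a single generic pipeline that loops over the metadata and declares one table per entry. The sketch below is not DLT-Meta's internal code, just the shape of the idea, reusing the assumed field names and paths from the entry above:

```python
import dlt    # available inside a Databricks Lakeflow Declarative Pipelines run
import json

# Assumption: the onboarding entries live in a JSON file at this (hypothetical) path.
with open("/Volumes/config/onboarding/onboarding.json") as f:
    onboarding_specs = json.load(f)

def create_bronze_table(spec: dict) -> None:
    """Declare one streaming bronze table from a single metadata entry."""
    @dlt.table(name=spec["bronze_table"])
    def bronze():
        # `spark` is provided by the pipeline runtime; the file format would
        # itself be driven by the spec in a real implementation.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .load(spec["source_details"]["source_path"])
        )

# One generic loop instead of one bespoke notebook per source.
for spec in onboarding_specs:
    create_bronze_table(spec)
```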
For data leaders, the commercial impact of this shift is undeniable. It directly affects how quickly the business can get new data into the hands of analysts, and how much it costs to keep the platform running as the data landscape grows.
Is this ready for the enterprise? In short: yes, it is genuinely ready, but you need to go in with your eyes open.
DLT-Meta is a Databricks Labs project, which means it doesn’t come with the formal SLAs of a first-party product. If an internal team tries to deploy this alone and something breaks, they will find themselves relying on the project's GitHub repository for fixes.
This is exactly why enterprise adoption requires an experienced engineering partner. You need a team that can not only implement the framework but provide the ongoing managed support to keep it robust. We are using it in a real production-bound environment with complex legacy sources, and with the right architectural guardrails, the framework holds up brilliantly. If you are already on Databricks, it is worth adopting now.
But the biggest hurdle to getting this working in a messy corporate environment isn’t the software. It’s source system knowledge. DLT-Meta handles the pipeline mechanics brilliantly, but it cannot tell you what your data actually means. On this project, we're dealing with 30-year-old legacy systems where the routing logic for some files is buried in code that only one person fully understands. If you don't have that domain knowledge captured somewhere, the framework is waiting on you, not the other way around. The tech is the easy part.
So, what does the next evolution of this architecture look like?
Our source systems come with a FILE-DESCR file, which describes the structure of every file at runtime: field names, multivalue groupings, the lot.
The next step in this transformation isn't about heavy engineering; it is about pushing automation to its absolute limit. The goal is to build a fully metadata-driven ingestion layer that makes onboarding completely self-service. The framework reads the FILE-DESCR, derives the metadata it needs automatically, and you are done. No manual configuration step, no manual intervention.
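In practice that self-service layer is a thin translation step: parse the FILE-DESCR, emit onboarding entries, and let the generic pipeline pick them up. A hypothetical sketch, since the real FILE-DESCR layout is proprietary and the row format below is an assumption:

```python
# Hypothetical: derive onboarding metadata straight from FILE-DESCR.
# The pipe-delimited row layout assumed here is illustrative, not the real format.

def parse_file_descr(descr_text: str) -> dict[str, list[dict]]:
    """Assume one 'filename|field_name|field_type|multivalue_group' row per field."""
    files: dict[str, list[dict]] = {}
    for line in descr_text.splitlines():
        if not line.strip():
            continue
        filename, field_name, field_type, mv_group = line.split("|")
        files.setdefault(filename, []).append(
            {"name": field_name, "type": field_type, "multivalue_group": mv_group or None}
        )
    return files

def to_onboarding_entry(filename: str) -> dict:
    """Map one described file to a Dataflowspec-style entry (names are assumptions)."""
    name = filename.lower()
    return {
        "data_flow_id": f"pensions_{name}",
        "source_format": "cloudFiles",
        "source_details": {"source_path": f"/Volumes/landing/pensions/{name}/"},
        "bronze_database": "pensions_bronze",
        "bronze_table": name,
    }

with open("FILE-DESCR") as f:
    described_files = parse_file_descr(f.read())

onboarding = [to_onboarding_entry(name) for name in described_files]
```

The field-level detail parsed above would feed schema and multivalue handling in the same pass, so the onboarding entry and the structural metadata come from a single source of truth.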
The underlying framework is already there. The final step is simply partnering with the right experts to fully leverage what your source systems are already telling you.
If your team is spending more time maintaining one-off pipelines than actually delivering value from the data, it is time to change your approach.