Designing Data Pipelines for Failure
Posted on 2026-01-02
When I first started writing code over 10 years ago, I couldn’t fathom why code would ever fail after you’d developed it. Ever the optimist, I envisioned the standard development process as writing code, validating that it worked against a set of tests, tweaking it until it worked, and then trusting that it could run forever as long as you had covered all the edge cases. Needless to say, this has been an accurate description of only an inconceivably small portion of my day-to-day work for my entire career. In fact, most of my data engineering career has been spent dealing with job failures stemming from all sorts of problems, ranging from network issues to hardware failures, and even the infamous programmer error (which has definitely never been my fault).
Recently, I built a project using Python and Django called Nomadic Atlas, which finds interesting tourist attractions around the world. The core component of this project was a behemoth 12-stage data pipeline. This pipeline was responsible for pulling tourist attractions out of over 2 million individual blog posts from 500 individual travel blog sites, validating the results, and organizing them into an easily digestible format, all while plunging me into madness. Pretty efficient, if I do say so myself.
The pipeline took over six months to fully complete, partially because of the sheer number of data points it had to process, but also because of the different failure modes I experienced along the way. Luckily, I had made some key design decisions that alleviated some of the stress of running this kind of data pipeline!
Failure and the Failures It Causes
Software failures that you encounter in the wild generally fall into one of two categories: other people’s responsibility and my responsibility… wait, no, I meant retryable and non-retryable failures, both of which need to be dealt with in order to avoid further problems downstream.
Non-retryable failures, much like other people’s problems, don’t require too much work to solve, though it’s important to identify them so that you don’t waste resources on something that you can’t solve.
For example, the earlier stages of my data pipeline fed the raw text of each blog post into an LLM and asked it to extract a list of the tourist attractions mentioned in the post. However, because the LLM took generously broad latitude in what it considered a tourist attraction, and in what it considered to be in the article at all, my pipeline had to validate the LLM’s output against an unholy combination of mapping APIs. One of the mapping APIs returned a 404 response if the place didn’t exist. While you could argue that if you retried a 404 long enough, someone might eventually get around to adding an entry to the API’s database for your query, 404s generally fall into the category of non-retryable errors. These types of failures can be pretty insidious, since they use up resources and bog down data pipelines for no reason at all. If you find them, make sure to swallow the failure and move on rather than retrying.
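To make that concrete, here’s a minimal sketch of how a pipeline stage might separate the two categories. The fetch_place function, NonRetryableError exception, and retry policy are all illustrative rather than lifted from Nomadic Atlas:

import time

import requests


class NonRetryableError(Exception):
    """Raised for failures that no amount of retrying will fix."""


def fetch_place(url, attempts=3, backoff=2):
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=10)
        except requests.ConnectionError:
            # Network blips are retryable: back off and try again.
            time.sleep(backoff ** attempt)
            continue
        if response.status_code == 404:
            # The place doesn't exist. Retrying just burns resources,
            # so surface this as a permanent failure.
            raise NonRetryableError(f'no entry found for {url}')
        if response.status_code >= 500:
            # Server-side hiccups are usually transient, so retry.
            time.sleep(backoff ** attempt)
            continue
        return response.json()
    raise RuntimeError(f'gave up on {url} after {attempts} attempts')

The caller can then swallow NonRetryableError and mark the job as done, while letting everything else bubble up to the retry machinery.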
Retryable failures, on the other hand, require a bit of grace to solve since their solutions can often incur other problems.
Updating Your Idea of Updates
One such problem tends to occur when you update the input data of any stage of your pipeline, particularly in a non-transactional way. This is especially a problem for retryable errors, since it can cause subsequent retries to behave differently, but it also hurts your ability to audit and fix issues in your code, since you can no longer verify what the initial state of any run was.
As an example that could have occurred in Nomadic Atlas, take the stage of the pipeline that was responsible for consolidating database entries that referred to the same thing with different spellings. Remember that the data largely came from travel blogs, so not only was I dealing with legitimately different ways to refer to the same tourist attraction, I was often dealing with very illegitimate ones as well. For instance, Machu Picchu and the Sanctuary of Machu Picchu both appeared as entries in the database, as did Macchu Pichu and Macchupichu and Machupicchu. Clearly, the world is divided on the correct spelling of this great ancient wonder.
If I had overwritten the names of all of the entries to be uniform, I wouldn’t have been able to trace back what the original names were.
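For the Django-inclined, the destructive version of that update is a one-liner. This is a hypothetical sketch of the overwrite I avoided, not code from the actual pipeline:

misspellings = ['Macchu Pichu', 'Macchupichu', 'Machupicchu']
# After this runs, the original spellings are gone for good, and no audit
# of the consolidation stage can recover what its input actually looked like.
Experience.objects.filter(name__in=misspellings).update(name='Machu Picchu')

Worse, if a retry ran after this update, it would see a different set of input rows than the original run did.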
The More the Less Merry
Prepare to have your socks knocked off when I tell you that retrying errors means that the same code is run multiple times. I know. Shocking. If you’re reading this article from the committee that grants the Turing Award, you can find my email on the home page of this website.
The issue with running code multiple times is that any writes, particularly non-transactional writes, will also occur multiple times. If you rely on the output of a stage that does multiple non-transactional writes, you may end up duplicating that stage’s result set. This has the obvious problem of increasing the complexity of the code in all downstream stages, since you now have to account for entries that could be duplicates. For example, an important part of Nomadic Atlas was making sure that I was cooler than everyone else by going to off-the-beaten-path destinations. To accomplish this, I counted the number of times each tourist attraction appeared in any given travel blog. Could you imagine the calamity of overcounting tourist attractions because duplicate entries existed in my database? Particularly the lesser-known ones?
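Here’s a hypothetical sketch of how that goes wrong; the Mention model and record_mention stage are invented for illustration:

from django.db.models import Count

def record_mention(experience, blog_post):
    # Non-idempotent: every retry of this stage inserts another row for
    # the same (experience, blog_post) pair.
    Mention.objects.create(experience=experience, blog_post=blog_post)

# Downstream, the popularity count silently inflates with each retry:
mention_counts = (
    Mention.objects
    .values('experience')
    .annotate(total=Count('id'))
)

A couple of unlucky retries and a quiet, lesser-known spot suddenly looks twice as popular as it really is.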
Duplication would also have meant running the data pipeline over input entries that were just duplicates of other database entries. And since jobs could be retried a lot, that would have meant wasting a lot of resources running downstream stages over data I had already processed. Recall that the data pipeline took about six months to complete; duplication could easily have multiplied that several times over. I didn’t want to be cool in a matter of years… I wanted to be cool now!
The Design
Let’s think back to when I mentioned the stage of the data pipeline responsible for consolidating database entries. While the process of consolidation is not the focus of this post, suffice it to say that it involved a dubious combination of search and mapping APIs as well as semantic comparison of article excerpts. Once I had a list of names that I was reasonably confident mapped to the same real-life place, I had a decision to make. I could either update all references to point at the selected entry, which, shockingly, in the previous example was Machu Picchu and not Macchupichu, and delete the others; or I could create a new model that would serve as input into future stages of the data pipeline.
It probably shouldn’t be a surprise that I chose the latter. In this example, I created a database model called an ExperienceSelection. If you’re wondering what that means, it’s time to level with you that tourist attractions in my database were called Experiences, not attractions. It was a highfalutin type of project like that.
The ExperienceSelection looked like this:
import uuid

from django.db import models


class ExperienceSelection(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    # related_name keeps the two reverse accessors on Experience from clashing.
    parent_experience = models.ForeignKey(
        Experience, on_delete=models.CASCADE, related_name='consolidated_children',
    )
    experience = models.ForeignKey(
        Experience, on_delete=models.CASCADE, related_name='selection',
    )

    class Meta:
        constraints = [
            models.UniqueConstraint(
                name='experience_selection_unique_constraint',
                fields=['experience'],
            )
        ]
        indexes = [
            models.Index(fields=['parent_experience']),
        ]

In code, this model was used to query the “true” database entry for each Experience. The unique constraint on experience meant that I wasn’t duplicating the result set, since I could run a no-op update after a retry. It also helped speed up the lookup, since a unique constraint in Django creates a database index in Postgres.
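Here’s a sketch of what a retried consolidation write looked like with that constraint in place; the variable names are illustrative, but update_or_create is standard Django:

# Because 'experience' is unique, rerunning this after a retry performs a
# no-op update instead of inserting a duplicate row.
ExperienceSelection.objects.update_or_create(
    experience=macchupichu_entry,
    defaults={'parent_experience': machu_picchu_entry},
)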
Likewise, the index on parent_experience sped up lookups in the other direction, which was particularly useful when I needed all of the related models that actually referenced the child experiences.
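Concretely, both directions looked something like this (variable names illustrative):

# entry: some Experience row produced by an earlier stage.
# Forward: resolve any Experience to its 'true' entry; this hits the
# index created by the unique constraint.
canonical = ExperienceSelection.objects.get(experience=entry).parent_experience

# Backward: find every child entry consolidated under a canonical
# Experience; this hits the parent_experience index.
children = ExperienceSelection.objects.filter(parent_experience=canonical)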
This mental model of an only-write-forward data pipeline enabled me to retry all stages of my data pipeline without worrying about having to do gratuitous cleanup after each retry. Hopefully it helps you in your endeavors as well!