Vaz · Information engineering

Prompt caching: the one-line change that cuts 90% of LLM cost in production

Thu, 11 Jun 2026 00:00:00 +0000

18 thousand tokens. That was the cost of every run of my news pipeline with 6 parallel sub-agents. After one line of code, it became 4,500. Same model. Same prompt. Same output. I just turned on the cache.

The feature has been in the Anthropic API for over a year. Most teams running LLMs in production still haven’t turned it on. I myself ran for months paying full price before actually reading my invoice. It’s the highest return per minute of work I know of today.

Why LLM cost in production is prefix

Every API call sends 4 things: system prompt, few-shots, context, and the question. In a real pipeline, the first 3 add up to 80 to 95 percent of the tokens, and they repeat on every call. The question changes. The rest is prefix.

Without cache, you pay for the entire prefix every time. In a pipeline running dozens or hundreds of times an hour, that becomes the bill. In a pipeline with parallel fan-out (several sub-agents sharing the same system prompt), it becomes the bill times the number of sub-agents.

With cache, you pay for the prefix once (cache write), then only the delta of each new call (cache read). A cache read costs about 10% of the normal input price.

How Anthropic’s cache works

You mark a block of the prompt with cache_control: ephemeral. Simplified example:

"system": [
  {
    "type": "text",
    "text": "<long, stable system prompt here>",
    "cache_control": {"type": "ephemeral"}
  }
]

Default TTL is 5 minutes. Next call inside that window: the cached prefix is read at 10% of the normal price. Anthropic also offers a 1-hour TTL as a paid option, useful for more spaced-out workflows.

The API returns 2 metrics you need to monitor:

cache_creation_input_tokens: you paid the write.
cache_read_input_tokens: you paid only the read (90% discount).

No model change, no prompt rewrite. Just flag what’s cacheable.

A real benchmark from my daily news pipeline

The number in the opening comes from a pipeline I built and maintain: my daily news skill, running every day at 8am. It fires 6 parallel sub-agents: data engineering, AI, investing, crypto, local politics, international politics. Each one carries a fixed system prompt of roughly 3 thousand tokens with tone rules, output format, prioritized sources, and synthesis style.

Without cache, the bill I was paying is direct math:

6 sub-agents × 3 thousand prefix tokens = 18 thousand tokens paid per run.
Times 1 run per day = 540 thousand tokens a month on prefix alone.

With cache:

1 initial cache write (3 thousand tokens) + 5 cache reads (with a delta of ~300 tokens each) = ~4,500 effective tokens.
Roughly a 75% cut in prefix cost, with zero quality loss and not a comma changed in the output.

In a more aggressive production pipeline (running dozens of times an hour with larger prefixes), the cut reaches 90%.

Where it shines, where it doesn’t

Shines:

Large, fixed system prompt (rules, format spec, examples).
Fan-out: several sub-agents with the same prefix in the same session.
Agents looping over the same context.
Chat with a large attached document and several consecutive questions.

Doesn’t shine:

One-shot calls with no repeated pattern.
Prompts that change significantly on every call.
Workflows with more than 5 minutes between calls (the cache expired).

Caveats that kill the gain if you don’t know them:

A cache write is slower than a normal call. You pay once in latency, you win on every call after. In a nightly pipeline that’s irrelevant. In an interactive chat, it matters.
Don’t cache PII or sensitive data without auditing first. Anthropic’s cache is per-account, but the principle stands.
The 5-minute TTL is a short window. If your job re-runs the pipeline every 10 minutes, the cache never hits. For those cases, use the 1-hour TTL.
You only see the gain if you monitor the 2 metrics. A timestamp at the top of the system prompt is enough for the prefix to never cache, and without watching cache_read you think you turned it on and you didn’t.

It’s not micro-optimization. It’s architecture.

Whoever is paying 100% of the price of every call because “there was no time to configure it” is accumulating debt with Anthropic every month. In a production pipeline with serious volume, that becomes thousands of dollars a year. For one line of code.

The rule I now follow in everything I build: structure the prompt in layers. Stable first (cacheable), volatile last. Mark the stable part with cache_control: ephemeral. Monitor cache_creation and cache_read. Pay once, read many.

It’s the ABC. And there are still teams calling this “advanced optimization”.

SQL is still the most important language in data engineering in 2026

Wed, 10 Jun 2026 00:00:00 +0000

There are devs onboarding into senior teams right now who have never written a GROUP BY in their lives. They learned the ORM before SQL. They think df.groupby() covers it. When a query hangs because the execution plan turned into a full scan over an 80-million-row table, they paste the error into ChatGPT, paste the answer back, and when it hangs again, they paste again. Infinite loop.

That dev is what Akita calls a coder, as opposed to an engineer. And AI is accelerating his extinction.

The coder outsourced the understanding

I learned SQL before any framework, because it was the only way to talk to the database. Today it’s the opposite. Framework before SQL. ORM before SQL. pandas before SQL. Layer upon layer of abstraction hiding the query that will actually run.

The problem with abstraction is not the abstraction. It’s that it hides the cost. You assume User.objects.filter().select_related().prefetch_related() is cheap. It isn’t. It’s a JOIN that can blow up memory if you don’t know why it’s a JOIN, across how many tables, with what cardinality. The ORM writes the right query in 70% of cases. The other 30% destroy your cluster.

In a real pipeline, the abstraction doesn’t fit

A modern data pipeline processes billions of rows a day. Every query decision costs minutes times cluster times DBU times day times month. The gap between a well-written query and one generated by an unprepared ORM is a 10x to 100x factor on the final bill.

A concrete case from a consulting engagement: an accounting close pipeline at a Brazilian fintech. The ORM was generating 47 subqueries for something native SQL solves in 1 CTE with a window function. Databricks/Snowflake bill: about USD 1,600 a month. After someone finally wrote the query in plain SQL: USD 160 a month. Same business result, 10x difference.

It wasn’t an isolated case. It’s the pattern. Wherever there’s a large pipeline generated through abstraction, there’s a 10x fat factor waiting for someone to read the execution plan.

AI generates bad SQL at scale

Every generative AI today produces fluent SQL. It compiles, runs, and returns the right number on the first try. The problem is not correctness. It’s efficiency.

Patterns I keep seeing in LLM-generated SQL that nobody reviewed:

SELECT * in stacked CTEs, dragging columns nobody will use through the whole pipeline.
WHERE column IN (SELECT ...) instead of a JOIN, in cases where the JOIN would be 100x faster.
WHERE UPPER(column) = 'X' on an indexed column, killing the index.
No partition hint in Spark or Snowflake, scanning the whole table when one day of data was needed.
Window functions with the wrong PARTITION BY, computing the wrong thing without throwing an error.

Of these five patterns, there isn’t one I haven’t seen in generated queries. If you don’t read execution plans, you don’t see any of this. It ships to production and you pay the interest at the end of the month. Technical debt with AI is not the same debt as 5 years ago. You take it on 10x faster, convinced you’re getting ahead.

The execution plan is where the difference lives

EXPLAIN ANALYZE in Postgres. EXPLAIN COST in Snowflake. The physical plan in the Spark UI. It’s the first thing I look at before letting a new query run at scale. They all tell you the same thing: how many rows the engine will scan, which joins it picked, where the shuffle is, where the broadcast is, where the queue is.

A coder looks at the plan and doesn’t understand it. An engineer reads it and knows whether it’s fit for production or needs a rewrite. It’s not memorization. It’s reading from cause to cost.

When you ask an LLM to generate SQL, also ask for the estimated plan, ask it to compare against an alternative version, ask it to discuss the partition vs broadcast trade-off. If you can’t evaluate the answer, you’re not doing engineering yet. You’re outsourcing the decision.

The decision comes before the next feature

SQL didn’t die. The people who pretended to know it did.

AI is professional darwinism. Whoever truly learns SQL becomes 10x more productive with it, because they can evaluate what it generates. Whoever outsources ORM plus AI accumulates debt that will break production in 18 months, and on that day there will be nobody left to debug it, because nobody reads execution plans anymore.

The choice happens before the next feature. Will you learn what’s actually running, or bet that AI covers your gap? It’s a bad bet.

YouTube rate-limits its caption endpoint. Audio stays free.

Thu, 04 Jun 2026 00:00:00 +0000

Hit HTTP 429 on 14 consecutive YouTube videos. I tried --sleep-subtitles 60, exponential backoff up to 45s, browser cookies, yt-dlp pre-release. Nothing helped. Every timedtext request came back 429.

Switched to the audio endpoint. Zero 429.

In one sentence: YouTube’s timedtext (captions) and googlevideo (audio/video) are different endpoints. Only the first is aggressively rate-limited in 2026. Downloading audio and transcribing locally is cheaper than insisting on captions.

The problem transcription pipelines ignore

The timedtext rate limit became common enough in 2026 that yt-dlp has 3 open issues (#7123, #13770, #13831) with no definitive fix. The official advice is caching and using the YouTube Data API with OAuth. Both work but shift the problem rather than solving it. Anyone who scheduled 50 URLs and saw half come back empty knows the symptom.

Why `googlevideo` doesn’t fall with it

The discovery that took me too long lives in the two distinct layers YouTube exposes. timedtext is an API layer: serves small XML/VTT under a global per-IP, per-day quota, with heavy caching and bot detection hardened in 2025. Every request counts. googlevideo is the CDN that serves audio and video via DASH segments from Google Global Cache edges, peering directly with your ISP. Its billing layer is aggregated bandwidth at the server serving your ISP, not per-request. The rate limit there only fires on clearly robotic patterns.

In practice I saw this: 60 requests in 5 minutes against timedtext results in guaranteed 429. The same 60 downloads on googlevideo with a natural interval go through with no warning. That detail isn’t documented in any obvious place. I figured it out when my cron broke and I opened Wireshark.

A pipeline that handles real batch loads

I packaged the logic in an open source Python CLI called yt-nota. Combines 3 tools.

Step	Tool	Cost	Where it fails
Metadata + caption URL	`yt-dlp` (Python API)	$0	Private video, region lock
Audio fallback	`yt-dlp` format 139 (m4a 49kbps)	$0	Members-only without cookie
Local transcription	`faster-whisper` int8 CPU	$0	Video > 1h on weak hardware

faster-whisper is 4x faster than openai-whisper on the same model, with the same accuracy (same weights). My CLI’s API looks like this:

result = extract_transcript(
    url,
    whisper_fallback=True,   # default on
    whisper_model="small",   # or tiny/base/medium
)

On 429, it drops to googlevideo, downloads only the audio, transcribes, and returns the same format. The caller doesn’t know if the transcript came from timedtext or Whisper.

CPU benchmark (Intel i7 12th gen, 16 GB, int8)

I ran the pipeline on real videos of varying length to measure wall-clock time. No GPU.

Video duration	`base` (74 MB)	`small` (244 MB)	`medium` (769 MB)
5 min	35 s	1 min 30 s	5 min
13 min	1 min 50 s	4 min	13 min
45 min	6 min	14 min	45 min

On accuracy for technical Portuguese, I did comparative reading over ~14 hours of lecture audio. The base model confuses 1 in every 6 technical terms (95% readable but needs human review). The small confuses 1 in every 20 (default for a reason: the downstream LLM corrects rare errors from context). The medium gets close to zero errors but doubles the time. For my flow (transcript → synthesis via Claude Code), small is the sweet spot.

What about SaaS with Whisper fallback?

They exist. Two main ones in 2026.

Solution	Price	When it makes sense
Supadata	From $0.001/min, free tier 1000 req/month	Company with SLA, doesn’t want to maintain infra
Apify YouTube Transcript Scraper	$0.40 per 1000 actor runs + compute	Pipeline already on Apify
yt-nota self-host	250 MB deps + 244 MB model	Privacy, academic batch, full control

The call is trivial for me: learning notes and Obsidian vault don’t go through third-party APIs. If it were a corporate pipeline with SLA and audit, Supadata wins on operational simplicity. Self-host only makes sense when you are the customer of the data.

Honest verdict

What works: batch of 50+ videos without crashing midway, zero recurring cost after the initial 500 MB, quality on technical Portuguese good enough for an LLM to digest downstream.

What it costs: first install is heavy (pip install yt-nota[whisper]), small model can confuse exotic terminology (for critical audio, bump to medium), and CPU becomes a bottleneck on videos longer than 1h.

When NOT to use it: volume of 10,000 hours per month with tight SLA (OpenAI’s Whisper API at $0.006/min ends up cheaper per engineer-hour than running local infra), or audio with music and multiple simultaneous voices (faster-whisper doesn’t do diarization, pyannote does).

Anti-patterns I saw along the way

Trusting --sleep-subtitles 60 as a silver bullet. I tested it: it doesn’t trigger before the request, it triggers after the first 429. Game over. Reaching for a paid API before trying the local pipeline is also a trap. $36k/year on transcription (the public faster-whisper benchmark) is money that should buy you a mid-range GPU. And deleting the raw audio after transcribing is the mistake of someone who never wanted to re-run with a better model 6 months later. I keep mine.

What this changes for you

If you use YouTube as a learning source, RAG input, or note-taking pipeline:

Does your current pipeline handle 50 URLs in a row without crashing?
Can you tell a 429 from timedtext apart from a 429 from googlevideo?
Do you have automatic fallback or do you handle each failure manually?
Does your monthly transcription bill still fit, or has it passed the cost of an amortized GPU?

If you said “no” to more than one, it’s worth an afternoon of refactoring.

Code Review of My Own Old Repo. Five Things I'd Change Today.

Tue, 02 Jun 2026 00:00:00 +0000

I opened a two-year-old repo of mine. It was still public on GitHub, I cited it in interviews, and I had never re-read the code since I submitted it. This weekend I sat down to re-read it.

I found five anti-patterns. In my own code, written by me. But the kind of problem I see show up in real production pipelines at large companies, not just in interview projects.

I decided to write about it because it’s more honest to critique my own code than to point fingers at someone else’s repo. And because if you have a public repo from two years ago that you still cite in your portfolio, you probably also have at least three of these five.

The database credentials were inside the function

def load_data_to_snowflake(df_merged):
    conn = snowflake.connector.connect(
        user='thaiscxxx',
        password='xxx*',
        account='xxx'
    )

I masked it with xxx before pushing, but the design pattern is the problem, not the string. Credentials inside the function mean each task that talks to Snowflake duplicates the connection, rotating the password requires touching code, and auditing means grepping the entire repo to figure out who connects where.

The honest version would use a Hook (SnowflakeHook) or environment variable, with the connection managed outside the code:

from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook
hook = SnowflakeHook(snowflake_conn_id='snowflake_default')

Encrypted, traceable, and never shows up in a pull request.

The pipeline lost parallelism for free

t1 >> t2 >> t3 >> t4

t1 validated students.json. t2 validated missed_days.json. I chained them sequentially, but they’re independent. No reason for t2 to wait on t1. With a tiny file, it barely matters. When the JSON weighs gigabytes and validation takes minutes, parallelizing cuts the duration in half.

The correct version:

[t1, t2] >> t3 >> t4

Whoever reads the pipeline now understands validation runs in parallel and then joins. Whoever read the original would assume there’s some hidden dependency that doesn’t exist.

Input data was baked into the Docker image

In the Dockerfile:

COPY files/students.json /students.json
COPY files/missed_days.json /missed_days.json

I embedded the input data into the image. Every rebuild assumes the same data. To run the pipeline with a different JSON, I’d have to rebuild the image or change the code. Coupling between execution artifact and input data, in the same place.

The rule I’d preach to others but ignored in my own repo: images are immutable, data is mutable. Data comes in through a mounted volume, S3, GCS, or runtime parameter. Never inside the image.

The pipeline ran daily on static data

with DAG('migrate_student_data_to_snowflake',
         schedule_interval=timedelta(days=1),
         catchup=False) as dag:

I scheduled the pipeline to run every day. The input data is two static JSONs baked into the image (the anti-pattern above). Running daily means processing the exact same files, generating the exact same records, and trying to insert them all again into the same table. On the second run, write_pandas would duplicate the rows. On the third, duplicate again.

The data is static. The correct choice would be schedule_interval=None (manual or external trigger only) or a sensor that detects a new file in the bucket. Scheduling a pipeline without a mutable source is ceremony: it burns a worker slot every day, fires alerts when it breaks, pollutes the execution history. And when you actually need to run it with new data, the operation becomes indistinguishable from the background noise.

It was meant to run once. I scheduled it to run daily. Subtle, but the kind of choice that produces ceremonial DAGs in production: pipelines that exist without a reason to exist on that cadence.

The `fillna(0)` erased an important signal

df_merged['missed_days'].fillna(0, inplace=True)

When a student appears in students.json but not in missed_days.json, the join leaves missed_days null. I replaced it with zero. It seemed right at the time.

Zero absences carries business meaning: the student showed up every day. A missing record carries another meaning: the school didn’t report this student’s attendance. Conflating the two masks an upstream data quality issue. A dashboard filtering “students with zero absences” will surface as model students precisely the kids whose data never arrived.

The honest version leaves null and opens a new column marking whether the record exists:

df_merged['missed_data_source'] = df_merged['missed_days'].notna().map(
    {True: 'reported', False: 'not_reported'}
)

Small change, completely changes what the dashboard shows.

The discomfort of reviewing your own code

Rewriting these five snippets today would take an hour. The discomfort of publicly admitting they were wrong is bigger than the hour. But the repo stayed public with the defects, and I cite that repo in my portfolio. Keeping the repo intact and doing an honest review on top is more useful for someone learning than deleting the history and pretending I always wrote clean code.

If you have an old public repo still listed in your portfolio, open it this week. You’ll find at least three of these five.

Data Flows Ep01: the concept that comes before any tool

Sat, 30 May 2026 00:00:00 +0000

On August 1st, 2012, Knight Capital lost $440 million in 45 minutes.

Not an algorithm bug. Not a market crash. One server out of eight received the new deploy, while another kept an old flag reactivated (Power Peg, 2003 code). The two ran in parallel. The result was a cascade of automated orders nobody could stop.

The SEC documented the case (Release No. 70694, October 2013): the root cause was not a trading logic error. It was state inconsistency between servers that should have been in sync. In data engineering language, a broken data flow.

Knight Capital had sophisticated algorithms. Over a decade of operation. What it did not have was a clear mental model of where the data was born, where it traveled, and where it had to arrive consistently.

That mental model defines everything else. I have worked with data long enough to have seen, at smaller scales, variations of the same failure. Before Apache Spark, before dbt, before Snowflake, before any tool, there is a concept that separates a robust pipeline from a fragile one.

In one sentence

A data flow is the path data travels from source to destination, with every transformation in the middle. Getting that path right is an architectural decision. Getting it wrong is expensive.

Where this idea came from

It is not new. Bill Inmon published Building the Data Warehouse in 1992 defending top-down, normalized, enterprise-wide architecture. Ralph Kimball replied in 1996 with The Data Warehouse Toolkit: bottom-up, dimensional modeling, data marts composing the whole. The Inmon vs Kimball debate dominated the 90s and still shows up in any architecture review.

What changed between 1996 and 2026 was not the concept, it was the scale. In 2017, Martin Kleppmann published Designing Data-Intensive Applications and formalized in chapter 11 the distinction that organizes modern data engineering:

“A stream refers to data that is incrementally made available over time… in contrast to batch processing, where the input is a known, finite size.”

Bounded vs unbounded. A dataset with known size (batch) versus one that never ends (stream). Every data architecture decision starts here.

In 2021, the Lakehouse paper (Armbrust, Ghodsi, Xin, Zaharia, CIDR) proposed unifying warehouse and lake via a metadata layer (Delta, Iceberg, Hudi). In 2020, the dbt Labs team popularized ELT over ETL: transformation inside the warehouse, not before. Each wave changed the tooling, not the principle.

Bounded vs unbounded: the decision that defines everything

Every pipeline decision starts here. Practical summary in a table:

Type	Trait	When to use	Cost
Batch	Finite dataset, processed in a defined window	SLA in hours, accounting reports, historical snapshots	Simple to build, debug, recover
Streaming	Infinite dataset, event processed on arrival	SLA from seconds to a few minutes, real-time fraud, ops dashboards	Complex, requires watermarks, exactly-once, heavy observability
Micro-batch	Streaming in short windows (seconds to minutes)	Middle ground: minute-level dashboards, ML feature stores near real-time	Spark Structured Streaming, Flink mini-batches

Tyler Akidau and team (Google) published in VLDB 2015 The Dataflow Model paper that formalized the modern vocabulary: event time, processing time, watermarks, triggers, windowing. The central line:

“A practical approach to balancing the inherent tension between correctness, latency, and cost in massive-scale, unbounded, out-of-order data.”

Translation: streaming is right on three variables at the same time. You do not maximize the three, you pick two and pay for the third.

When batch, when streaming

The practical rule I use is simple: acceptable latency SLA defines the answer.

SLA above 1h leans to batch. Simple reprocessing, direct debugging, cheap infrastructure.
SLA below 1 minute demands streaming. Whoever tries to force batch in that scenario creates windows so short that it reinvents streaming with the worst of both worlds.
SLA between 1 minute and 1h is micro-batch territory. Spark Structured Streaming or Flink mini-batches solve it.

Jay Kreps, Confluent founder, wrote in 2014 the essay Questioning the Lambda Architecture attacking the model proposed by Nathan Marz, which kept two parallel layers (batch + speed). The line that stuck:

“The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems.”

Kreps proposed Kappa: unified log (Kafka) as source of truth, reprocessing via replay. Kappa became standard among teams running serious streaming.

The most common mistake I see is forcing streaming because “it sounds modern”. Streaming is not a better version of batch. It is a different contract, different cost, different mental model. When the decision is taken by trend instead of by SLA, the team spends months building complexity the problem never asked for, and I have walked into that trap more than once.

What goes wrong when the flow is ignored

Knight Capital was not an isolated accident. The pattern repeats at other scales.

GitHub, October 2018: 24-hour outage. Root cause documented by Jason Warner (official post-mortem): 43 seconds of network partition between US East data centers caused divergence in MySQL Orchestrator failover, replication storm and cross-DC inconsistency. Pure data flow failure at the replication layer.

Airbnb, before Minerva: different teams calculated “active user” with divergent queries on the same Spark cluster. Metrics collided in executive meetings. The fix was not another dashboard, it was a single metric definition layer with explicit lineage from source to destination. Minerva indexes over 200K data assets today.

These cases fit named patterns in the literature. Worth knowing each:

Pipeline jungle (Sculley et al, NeurIPS 2015, Hidden Technical Debt in Machine Learning Systems): “pipeline jungles often appear as data preparation evolves organically… testing such pipelines requires expensive end-to-end integration tests.” That is what happens when no one drew the flow at the start and it grew by accretion.
Data swamp (Nick Heudecker, Gartner 2014): “lakes turn into swamps when there is no metadata, governance, or quality control.” Lake became a folder of files dumped anywhere.
Schema drift: fields change without warning between runs, downstream contracts break silently.
Lineage gaps: nobody knows where the dashboard number came from.
Reverse-ETL chaos: data flows back from the warehouse to SaaS without governance, becomes a secret source of truth no one audits.

How the big ones document their own flow

Companies running real data in production publish the architecture. Worth reading.

Company	Doc	Anchor
Netflix	Maestro: Netflix’s Workflow Orchestrator (TechBlog, Jul 2024)	Orchestrates hundreds of thousands of workflows per day, WAP (Write-Audit-Publish) pattern over Iceberg
Uber	Uber’s Big Data Platform (Eng Blog, Oct 2018)	Hudi cut ingestion latency from 24h to under 1h on 100+ PB
Airbnb	Democratizing Data at Airbnb (May 2017)	Dataportal indexes 200K+ data assets with explicit lineage
Stripe	Online migrations at scale (Eng Blog, Feb 2017)	Dual-write + backfill + reconciliation to migrate financial data without loss
Slack	How We Built Slack’s Data Warehouse (Sep 2023)	Presto+Hive to Trino+Iceberg migration, 60K queries per day

Common pattern: each one documented the flow before building the next tool. The tool was born from the diagram, not the other way around.

Anti-patterns to avoid

Forcing streaming because it sounds modern. If the SLA is daily, batch solves it with 10% of the complexity.
Building a pipeline without drawing the flow first. Pipeline jungle is literally this: growing without a map.
Accepting the lake as “throw it all in, I will organize later”. Becomes a swamp in 6 months.
Ignoring schema contracts. Schema drift breaks downstream silently. Use Schema Registry or versioned SQL contracts.
Keeping two parallel implementations (Lambda). Maintenance cost doubles, behaviors diverge, no one trusts either.
Skipping lineage. Lineage is not a luxury. It is the only way to answer “where did this number come from” without opening 12 jobs.

Where to start

Can you draw, on a napkin, the data flow of your most critical pipeline? Exact source, main transformations, destinations, SLA per stage.

If yes, you are ahead of most. If not, start there. Before Spark, before dbt, before any new tool.

The next episodes of the Zero to Expert series will go into each layer in depth: ingestion (formats, idempotency, CDC), transformation (SQL vs Python vs Spark), destination (warehouse vs lake vs lakehouse), orchestration. Each episode with a concrete case and a decision at the center, not theory.

If there is a specific concept you want covered, send it to me on LinkedIn or subscribe to the newsletter to get the next episodes.

SLA, not trend: when batch, when streaming, when both

Sat, 30 May 2026 00:00:00 +0000

I watched a marketing team do what every team does once: adopt streaming because it sounded modern. Managed Kafka, 24x7 workers, exactly-once guarantees. To process events arriving every 10 minutes. Nightly batch would solve the same. It cost a tenth. It took six months until someone measured.

The pattern repeats. I have walked through the same decision in four different domains: finance pipelines, industrial processes, marketing, analytics. The discussion always starts wrong. “Let’s go streaming because it is more modern.” Or “let’s keep batch because it is what we always did.” Both miss the right question.

The right question is one: what is the real SLA of the consumer that will use this data?

The right question is not “which is more modern”

Martin Kleppmann formalizes in chapter 11 of Designing Data-Intensive Applications the distinction that organizes any data architecture in 2026. Bounded data (finite set, known size) versus unbounded (a stream that never ends). Every decision starts there.

But the bounded/unbounded distinction is technical, not behavioral. Real data is rarely just one thing. Application logs are unbounded by nature. If I aggregate them in 1-hour batches to feed a dashboard nobody looks at more than once an hour, the consumer is treating it as bounded. Data is what the consumption decides.

Tyler Akidau and team at Google published in 2015 the paper that became the industry standard, The Dataflow Model. The central line:

A practical approach to balancing the inherent tension between correctness, latency, and cost in massive-scale, unbounded, out-of-order data.

Translation: streaming is right on three variables at the same time. Correctness, latency and cost. You pick two, you pay for the third. Batch is simpler precisely because it does not try to optimize latency.

Decision table: SLA × technology

For most pipelines I see, the table above resolves the decision in 30 seconds. SLA above 1 hour is batch territory. SLA below 1 minute requires streaming. The middle is micro-batch, and most cases land there, not at the extremes.

When batch wins (even in 2026)

Spotify runs recommendations in nightly batch on BigQuery. Netflix has Maestro orchestrating hundreds of thousands of workflows per day with the Write-Audit-Publish pattern over Iceberg. Neither is “late”. They chose batch where batch solves better.

Batch wins when:

Consumer SLA is hourly or daily (accounting report, closing, historical snapshot, ML training)
Input data is stable enough that you can reprocess whenever you want
Your team has more ease debugging Python that runs once a night than a 24x7 stream processor

Cost matters a lot. A nightly batch Spark cluster stays off during the day. Infrastructure when no job is running: zero. Managed Kafka is always on. Confluent Cloud Standard starts at $1k to $3k per month, and egress can hit $47k per month at 300 MiB/s outbound. The difference over a year is the salary of a mid-level engineer in Curitiba.

When streaming is the only answer

Pix has an SLA under 10 seconds, 24x7. BACEN publishes this. Daily batch does not work. Not optional. Point-of-sale fraud detection is the same: either identify before the transaction closes or it serves nothing. Call center ops dashboard, same logic: the agent needs to see the customer updated the moment they answer.

These cases do not allow batch. Streaming is the only answer.

For them, Flink delivers latency under 100 milliseconds. Spark Structured Streaming sits at 100 milliseconds to 1 second (micro-batch). Kafka Streams runs embedded in the application, without its own cluster, and processes around 1 million events per second. Choosing between the three is another post.

Uber is the most interesting case. Adopted streaming without going 100% streaming. Added Hudi for incremental processing and brought ingestion latency from 24 hours to under 1 hour on more than 100 PB. Their Flink IngestionNext consumes 25% less compute than the old batch. Streaming done right also saves, as long as it solves the right problem.

When “both” is the right answer

Jay Kreps published in 2014 the essay that killed Lambda Architecture. Lambda keeps two parallel pipelines to produce the same result: one batch and reliable, one streaming and fast. The line that stuck:

The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems.

Kreps proposed Kappa: single log (Kafka) as source of truth, reprocessing via replay. Batch becomes a special case of streaming over the history.

Lakehouse was a step further. The Databricks 2021 paper proposes a metadata layer (Delta, Iceberg, Hudi) that serves both natures. The same data can be consumed in batch by the BI team and in streaming by the fraud application. No 2 stacks. One contract.

“Both” is not technical cowardice. It is conscious design when you have consumers with different SLAs over the same data.

Questions that decide the case

Before opening Terraform or docker-compose, answer this honestly:

What is the real SLA of the consumer that will read this data? Not the SLA you imagine. What they actually need.
Is this SLA different per consumer? If yes, consider Lakehouse with a single contract, not 2 parallel pipelines.
How much does it cost to run 1 month of streaming vs batch at this volume? Do the math before, not after the invoice.
Does your team have maturity to debug exactly-once, watermarks and distributed state? If not, the learning cost comes embedded in the project.
Do you already have batch or streaming infrastructure running? Reusing reduces risk. Greenfield lets you pick better.

If you answered honestly and still landed on streaming, great. Streaming makes sense. If you landed on batch, great too. Batch solves most cases.

The mistake is not picking streaming. The mistake is picking streaming without answering them.

Which pipeline did you pick wrong and had to redo later? Tell me on LinkedIn or reply to this email. I want to see how many cases match.

Airflow for 2 years: what I would do differently

Sun, 24 May 2026 00:00:00 +0000

It was 2 a.m. when the alert came. The monthly report DAG had failed on step 8 of 12. Financial data, 6 a.m. deadline, and I spent the next 4 hours trying to understand if the task really failed, if it was a silent timeout, or if the worker had died without telling anyone. When I found out it was the third, 40 minutes were left.

This scenario is routine in teams running Airflow in production. Airflow works. And it also creates work nobody warns you about in the first tutorial.

This post is not to convince anyone to drop Airflow. It is about what is worth changing before the problem shows up.

Context: what it is and who uses it

Airflow was created by Maxime Beauchemin at Airbnb in October 2014 to orchestrate data pipelines with complex dependencies. It went open source in June 2015 and became an Apache Foundation top-level project in January 2019.

It is today the most used data orchestrator in the world: 320 million downloads in 2024 alone, ten times more than the second place. Uber runs 200,000 pipelines with 750,000 task runs per day. Shopify has 10,000 active DAGs. Stripe processes 150,000 daily tasks.

Real adoption, not hype.

But the same report that shows those numbers also reveals that 46% of users say that when Airflow has a problem, the entire operation stops. That is the tension nobody tells you about in the first tutorial.

What Airflow solves well

Dependencies between tasks are guaranteed. You define the graph in Python. Airflow guarantees that task B only runs when task A finishes successfully. With 50 interdependent tasks in a finance pipeline, having that guaranteed by an orchestrator avoids rewriting retry and dependency logic in every DAG, and removes the whole category of “task ran before time because cron fired” bugs.

Retry with backoff is native. Two lines and your task retries automatically. In pipelines depending on unstable external APIs, this kills 2 a.m. alerts for transient errors.

The execution history is auditable. Every run, every task, every log gets recorded. When compliance asks “was the March report generated with data from 03/31 or 04/01”, you open Airflow and answer in seconds.

Backfill works. Pipeline down for three days? You reprocess the historical runs with one command. For pipelines that need complete and consistent history, that matters.

Where Airflow gets complicated

The scheduler parses your whole code every 30 seconds

The scheduler needs to run the Python code of each DAG file repeatedly to understand what exists and what the dependencies are. With 200 DAGs, that parse cycle can take minutes.

What makes it critical: 98% of scheduler slowness cases come from heavy imports at the module level. A file that does import pandas as pd at the top, outside any function, makes the scheduler run that import every cycle. With 200 DAGs and heavy imports, that becomes minutes of parsing before any task runs.

# Wrong: pandas is imported every scheduler cycle
import pandas as pd

@dag
def pipeline():
    ...

# Right: import only when the task runs
@task
def process():
    import pandas as pd
    ...

XCom has a hard limit nobody warns you about

XCom is Airflow’s mechanism for tasks to communicate. The problem: it was designed for small messages, not data.

In PostgreSQL, the default row limit is 8KB. A 1,000-row DataFrame will blow up XCom. In production, the error shows up as a timeout or silent crash of the metadata database, not as a clear “data too big” message.

The pattern used in production: pass only the S3 path via XCom, never the data itself.

catchup=True has already triggered unwanted backfills in many teams

By default in old versions, if you redeploy a DAG with start_date in the past and catchup=True, Airflow will create and try to execute every historical run since start_date. With a monthly DAG and start_date two years ago, that is 24 runs fired at once.

DoubleVerify documented that after migrating to a setup with catchup=False as the cluster default and other changes, incidents dropped 80%.

Renaming a DAG loses the whole history

There is no rename operation in Airflow. Renaming a DAG creates a new entry in the metadata database and loses the whole execution history. In production, that means you cannot compare current behavior to past behavior, and any alert that depends on history breaks.

Business logic inside the operator becomes a problem later

The temptation is to put transformations and business rules directly inside PythonOperator. Works in the beginning. After six months, you have untestable logic stuck inside infrastructure, the same rule duplicated across three different operators, and a DAG you can only debug by bringing up the whole Airflow.

The right pattern: the operator is infrastructure and calls testable functions that live outside the DAG.

What I would do differently

TaskFlow API from day one. Released in Airflow 2.0, it lets you write DAGs with Python decorators instead of instantiating operators manually. The code is cleaner, dependencies are implicit in the flow, and it is easier to test. I spent too long writing in the old style before migrating.

catchup=False as the cluster default from initial configuration. One line in airflow.cfg that avoids dozens of incidents.

Resource pools from the first DAG. By default Airflow does not limit how many tasks of a DAG run in parallel. A heavy DAG can consume all the slots and block the others. Configure pools before the first problem, not after.

No multi-tenant on a single instance. Sharing one Airflow instance between different teams creates Python dependency conflicts, lack of resource isolation, and upgrade paralysis: one team cannot update without coordinating with all the others. One instance per team is the recommended pattern.

Monitor the scheduler, not just the tasks. The scheduler is the heart of Airflow and can degrade silently. Grafana on the scheduler heartbeat catches problems before the tasks start failing.

About Airflow 3.0

In April 2025 Airflow released version 3.0, the biggest release in the project’s history. It solves problems the community documented for years: Task Execution API that removes the need for workers to access the metadata database directly, native DAG Versioning, rebuilt React UI, and support for tasks in languages beyond Python.

If you are starting a new project, evaluate Airflow 3.0 before picking the version to install. The changes are breaking, so migrating an existing cluster takes planning.

When to evaluate alternatives

Airflow has 320 million downloads for a reason: it works, has the biggest integration ecosystem in the market, and the community is vast.

But there are cases where other tools solve it better:

Prefect or Dagster for smaller teams that value simple local development, event-driven workflows, and richer observability without the operational overhead of Airflow.

dbt Cloud when most pipelines are SQL transformations in a warehouse. Native orchestration is simpler for that specific case.

Managed Airflow (Astronomer, Amazon MWAA, Google Cloud Composer) if the cost fits and you do not want to maintain the infrastructure. Removes a significant chunk of the operational pain.

What does not pay off is picking by popularity without evaluating whether the problem Airflow solves is your problem.

What stays

Airflow works well for what it was made for: orchestrating batch pipelines with complex dependencies, auditable history and reliable retries.

The problems I ran into were almost all avoidable with the right configuration from the start: imports outside functions, XCom for big data, catchup without control, business logic inside operators.

If you are starting: imports inside functions, catchup=False on the cluster, XCom only for coordination, business logic in separate testable modules. Four decisions that avoid most of the problems I ran into.

What was the most annoying problem you have seen with Airflow? Tell me on LinkedIn or subscribe to the newsletter.

Twenty AI concepts you need to understand in 2026

Sat, 23 May 2026 00:00:00 +0000

Every week a new AI term shows up. Agent, RAG, fine-tuning, embedding, top-p, RLHF. You open LinkedIn and three people are already “building autonomous agents” before breakfast. Over on Twitter someone complains their RAG hallucinates while the next post debates whether it’s worth fine-tuning Llama 3. Then you head to the API docs you were going to test for something simple and walk into a hundred-word glossary before the first useful call.

The problem isn’t the number of terms. It’s that nobody stops to draw how they connect.

This infographic is my attempt at a map. Twenty concepts, six sections, a sequence that makes sense if you go from the base to the frontier. It’s nowhere near everything that exists in AI. But you can open it in a technical meeting in 2026 and understand what people are talking about, or read the code of an agentic system and identify what each piece is doing in the flow.

How AI works (1 to 4)

It all starts with neural networks. Layers of neurons connected by weights, adjusted during training to make predictions. That’s the only primitive of all this. Models that see images, models that write text, models that understand audio: all variations of the same thing, with different architectural choices on top.

For language to enter that network, it needs to become a number. That’s what tokenization does: break text into chunks the model can chew on. AI doesn’t read words. It reads tokens. Then each token becomes a vector in a space of hundreds of dimensions, and that’s an embedding. Similar meanings sit close together. It’s what makes semantic search, recommendation, and RAG work.

On top of those three comes attention. The mechanism that lets each word look at every other word in the input and decide what matters to it. Before attention, models read text in sequence and forgot the beginning by the middle of the sentence. Attention broke that bottleneck. Without it, the rest of contemporary AI simply wouldn’t exist in the form we know today.

The magic behind it (5 to 8)

Transformers are the architecture that packaged attention into something trainable in parallel. Before them, language models were slow and short. After them, they became GPT, Claude, Gemini.

But architecture without data is nothing. Pre-training is the phase where the model reads the equivalent of the Library of Alexandria. Trillions of tokens. This is where it absorbs syntax, grammar, facts about the world, and the patterns of reasoning that humans left in writing. Fine-tuning is what comes next: take that general model and specialize it on specific tasks with specific data. And RLHF is the stage that took models that could answer anything and taught them to answer in a way that’s actually useful to someone. Real people compare outputs, say which one is better, the model learns the preference. It’s what separates “a model that knows a lot” from “a model that converses well.”

Beyond the models (9 to 12)

No model goes to production on its own. Around it sits a layer of safeguards: filters and classifiers built on explicit rules, to keep the system from saying something that hurts someone or reproduces an obvious bias. That’s the boring part nobody wants to build and that every serious product needs to have.

And when the model needs to know something that wasn’t in pre-training, in comes RAG. Retrieval-Augmented Generation. The system fetches relevant documents, injects them into the context, and the model answers grounded in them. RAG depends on two close relatives: vector databases (which store embeddings in a way that lets you find the nearest match in milliseconds) and chunking (which breaks large documents into indexable pieces). RAG without good chunking is RAG that hallucinates elegantly.

How AI generates output (13 to 14)

When the model answers, it doesn’t write the whole sentence at once. It predicts one token, then the next, then the next. That’s decoding. And how it picks each next token completely changes the character of the output. High temperature gives creativity and variation. Low top-p sharpens focus on the most likely tokens. Tuning these two parameters is the difference between a model that writes poetry and a model that writes technical documentation.

How AI acts (15 to 16)

Up to here the model only responds. Agents are the next step: it decides and acts. Receives an objective, breaks it into steps, picks which tool to use, executes, observes the result, adjusts the next step. Tools and functions are the hands we give to that agent: API, calculator, search, code execution, database access. Without them, the agent gets stuck in its own head talking to itself. The part that actually matters about agentic systems starts when the model can finally call something that changes state in the real world.

Improvement and evaluation (17 to 20)

Agentic systems without explicit planning turn into chaos fast. Without rigorous evaluation, any claim about the model just became cheerleading. Iterative improvement is what separates a pretty prototype from a system that survives in production: test, measure, adjust, repeat. And bias and fairness has an inconvenient property: if you ignore it at design time, it will find you in the incident.

Closing

AI isn’t magic. It’s math with data on top, logic around it, and iteration at the center. People who understand these twenty concepts read agentic system architecture without getting lost in the glossary. They can debug weird model behavior from real hypotheses instead of guesses. And in a technical conversation, they speak like someone who took part in the build, not like someone who read the release.

Take the infographic. Save it on your phone, print it and put it on the wall, drop it in Notion. Come back to it every time a term that feels new shows up. And more important than any of that: build something with it. You only discover what each of these words really means when you try to make a RAG actually work.

This is the first post in the AI Foundations track at VazDEng. Three posts a week on data engineering in Portuguese (and English), at the senior level Brazil was missing.

When the model should say 'I don't know'

Sun, 17 May 2026 00:00:00 +0000

In September 1998, Long-Term Capital Management lost $4.6 billion in a few weeks. The spread models had been trained on normal-times correlations. The Russian default and the subsequent flight-to-quality made correlations historically around 0.3 converge to 1 within days. In When Genius Failed, Lowenstein cites the fund’s internal calculation of the probability of what happened:

“An event so freakish as to be unlikely to occur even once over the entire life of the universe.”

The models were technically correct. They were just extrapolating confidence into a region of the space they had never seen. They had no “I don’t know” button.

My quant agent had the same problem, at incomparably smaller scale but with the same nature. I solved it this week.

In one sentence

Conservative degradation is the principle that says a model must have the right to abstain. When data is outside what it has seen, returning “I don’t know” is more useful than returning a spurious classification with mathematically high confidence.

The blind spot left after the data leakage fix

The previous post closed the chapter on Sharpe -1.14. The posterior became causal, the data leakage went away, the number became honest. But there was a blind spot the Sharpe didn’t show.

The 3-state Gaussian HMM always classifies. It receives a candle, computes the posterior over BULL/SIDEWAYS/BEAR, and returns the one with highest probability. By construction. If the features are in the normal training zone, fine. If they’re completely outside, it keeps classifying, and the posterior keeps summing to 1.

Concrete scenario: daily ATR 4x above the 90-day average, funding rate in historical extreme negative, volume 10x above normal. A spike the model simply has no reference point for. The HMM returns something like “BULL with 73% confidence”, because one of the three classes has to win.

Mathematically legitimate. Operationally dangerous.

What the literature calls this

I looked at the literature before implementing anything. Three threads converge.

Out-of-Distribution detection (computer vision, classical ML). The lineage starts with Hendrycks & Gimpel 2017 (“A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks”), showing that maximum softmax probability is already a reasonable confidence signal. Liang et al 2018 (ODIN) adds temperature scaling and adversarial perturbation, reducing false positive rate from 34.7% to 4.3%. Lee et al 2018 proposes Mahalanobis distance in feature space to capture covariance between dimensions. The three are the OOD canon.

Selective classification (statistics, pattern recognition). Chow 1957 already formalized the reject option in IRE Trans. Electronic Computers. In 1970 he derived the optimal error-reject curve. In 2017, Geifman and El-Yaniv brought the concept to deep learning with formal risk guarantee:

“We can achieve a target coverage with a guaranteed level of risk.”

The canonical metric for evaluating abstention is AURC (Area Under Risk-Coverage curve): shows how error falls as the model is allowed to reject more cases.

Critical systems with conservative degradation. Aviation has explicit regulation (FAA AC 25.1329-1B): autopilot must alert when envelope protection is invoked and disengage in off-nominal conditions. SAE J3016 (autonomous driving) defines Operational Design Domain (ODD) and requires the system to exit operation or request takeover when operating outside it. The principle is the same: a model trained for conditions X does not operate in Y, it alerts and returns control.

Trading benefits from this vocabulary. It was what was missing.

Someone has done this in finance

Two precedents to anchor on.

Kritzman and Li 2010 (“Skulls, Financial Turbulence, and Risk Management”, Financial Analysts Journal). They define the Turbulence Index as the multivariate Mahalanobis distance of returns against historical mean and covariance. Central quote:

“The more asset returns, volatilities and correlations differ from their historical norms, the more likely it is that these differences result from a significant market event rather than from random noise.”

Empirically the index aligns with 1987, the 1998 Russian default, 9/11, and the 2008 crisis. Turbulence is persistent, which justifies abstaining by windows, not by isolated tick.

Chalkidis et al 2021 (“Trading via Selective Classification”, ACM ICAIF, arXiv 2110.14914). This paper is the direct case of what I did. A binary up/down classifier becomes a strategy that only takes a position when it’s confident, and abstains when it’s not. Empirical result: smaller coverage with same risk improves Sharpe. The abstract’s quote:

“Selective classifiers give rise to trading strategies that do not take a trading position when the classifier abstains.”

Selective classification in trading is not my insight. It’s a documented topic at ACM. What was missing was bringing it to my HMM.

How I implemented it

The HMM features pass through StandardScaler before training. In the scaled space, each feature’s mean is zero and standard deviation is one. Any new candle with one feature at very high absolute z-score is, by definition, outside the distribution the model has seen.

Threshold at 5 sigmas (conservative, crypto has fat tails). Static method on MarketRegimeHMM:

@staticmethod
def is_ood(x_scaled_row, threshold=OOD_SIGMA_THRESHOLD):
    if x_scaled_row.size == 0:
        return False
    return bool(np.nanmax(np.abs(x_scaled_row)) > threshold)

And predict_state checks before calling the posterior:

if self.is_ood(last_features):
    logger.warning("OOD detected: max |z| = %.2f > %.1f. Abstaining.",
                   max_dev, OOD_SIGMA_THRESHOLD)
    return REGIME_OOD, 0.0, {REGIME_OOD: 1.0}

The downstream decision (decide_position in layer 4) already had a lookup in REGIME_MULTIPLIER. I added "OOD": 0.0 as defense in depth, plus an explicit “ABSTAIN” log to make it visible whenever the system chose not to operate.

70 tests passed, plus 2 new ones covering the OOD path. Full suite in 6 seconds.

Scenario	Before	After
Features inside distribution	Classifies BULL/SIDEWAYS/BEAR with real posterior	Same
Features 5+ sigmas outside (rare)	Classifies anyway, with spurious posterior	Returns OOD, sizing zeros
Log of the OOD tick	No distinction	“ABSTAIN: regime without playbook (OOD, conf=0.000)”
Trade opened in anomalous condition	Possible, with 2% cap	Impossible

Why 5 sigmas, not 3

Threshold choice is where theory meets real crypto data. In perfectly Gaussian features, 3 sigmas would cover 99.73% and be reasonable. Crypto is not Gaussian. Realized volatility, funding rate, and DI spread have heavy tails. Bulla 2011 (Quantitative Finance) already showed that Gaussian HMM underestimates tails in financial returns, proposing Student-t instead.

At 5 sigmas, the detector fires only when the tick is in genuinely unprecedented region. At 3, it would fire on big but historical moves, generating excessive abstention. The next iteration is to swap univariate z-score for multivariate Mahalanobis (captures correlation between features), which is exactly what Kritzman-Li did in 2010 for returns.

What changed in my sleep

The most useful number for me isn’t the increase or decrease in Sharpe (I’ll measure in backtest next week). It’s this:

Before, when the agent took a position overnight and I woke up with Telegram blinking, I needed to open the auditor and read decision by decision to understand if the model had any logic at that moment or if it was guessing in chaotic market.

Now, if the system abstains, the log says ABSTAIN. If it operates, it’s because it was in territory it has seen. The question “does this decision have a basis?” became binary: there’s an ABSTAIN log before it, or there isn’t.

Nick Leeson, Jérôme Kerviel, LTCM, Knight Capital. The history of operational losses in finance almost always has the same pattern: a system continuing to make decisions when it shouldn’t. The cost of “I don’t know” has always been cheaper than the cost of “I thought it was”.

Anti-patterns to avoid

Accepting high posterior as evidence of good decision. An HMM’s posterior always sums to 1. Confidence is intra-model metric, not evidence that the model understands what it’s seeing.
Using OOD threshold based on intuition, not on distribution. 3 sigmas works in pure Gaussian. Crypto is not Gaussian. Measure the real tail of your data first.
Abstaining on isolated tick and going back to operating on the next. Turbulence is persistent. Good design abstains by window, not by candle.
Adding OOD without touching the decider. A detector that doesn’t change downstream behavior is decoration. REGIME_MULTIPLIER is where the effect happens.
Hiding the abstention from the log. If the system preferred not to operate, that’s a decision. It must appear in the audit trail with reason, not silently.

The next chapter

The current version uses a single criterion (absolute z-score per feature). Two extensions are already in the backlog: Mahalanobis distance in the full space (captures covariance, which is what Kritzman-Li implemented for returns in 2010) and tick likelihood under the trained HMM (more sensitive, more expensive).

For now, what’s in production is the simple version. And it has already changed what I look at when I wake up.

Have you ever had a model return high confidence on a decision that shouldn’t have been made? Tell me on LinkedIn or subscribe to the newsletter to receive the next posts.

Instrumenting lineage from scratch with Unity Catalog

Wed, 13 May 2026 00:00:00 +0000

When someone asks me “where does this number come from?”, I have two possible answers.

The first is to open the code, manually trace which job read from which table, work out which transformations were applied, and walk back to the source. In pipelines with 20 steps, that can take hours.

The second is to open Unity Catalog, click on the column in question, and see the full graph: source, transformations, intermediate tables, destination. In seconds.

That difference is what lineage solves in practice. But Unity Catalog doesn’t capture everything automatically. Understanding what it covers and what needs extra work is what separates a real implementation from one that gives a false sense of security.

What Unity Catalog captures automatically

Unity Catalog intercepts Spark execution plans at runtime and registers every read and write on metastore tables. No extra code configuration required.

Table lineage works for any SELECT, CREATE TABLE AS SELECT, INSERT INTO SELECT operation in any language: Python, SQL, Scala, R. For each operation, the system records which table was read, which was written, in which job, in which notebook, by which user, at what time.

Column lineage goes further: it maps which source columns feed which destination columns. Requires Databricks Runtime 11.3 LTS or higher for regular jobs. For Delta Live Tables, requires 13.3 LTS or higher.

This information is accessible two ways: via Catalog Explorer with a visual interface, and via the system tables system.access.table_lineage and system.access.column_lineage for those who need it programmatically.

What isn’t captured and where most people get it wrong

The official docs are clear but discreet about the limitations. I’ve seen these limitations bite in production more than once.

UPDATE, DELETE, and INSERT VALUES don’t generate lineage edges. This is the most critical limitation for anyone working with CDC, SCD Type 2, or any pipeline with in-place updates. The data was modified, but Unity Catalog doesn’t record that relationship.

MERGE INTO doesn’t capture lineage by default. It can be enabled with spark.databricks.dataLineage.mergeIntoV2Enabled, but it requires explicit configuration on each cluster or job.

RDDs aren’t supported. The Unity Catalog API doesn’t work with RDDs, so any pipeline using Spark’s low-level API stays completely outside tracking.

Renamed objects lose history permanently. If you rename a table, schema, or catalog, historical lineage breaks. There’s no automatic migration of the graph when the object changes name.

JDBC connections bypass entirely. Data read or written via JDBC doesn’t pass through Unity Catalog’s capture mechanism.

Path-referenced tables (s3://…) don’t capture column lineage. Table lineage via path works, but column mapping doesn’t.

And a practical detail: system tables only have data starting September 2024. If you need lineage history before that date, it doesn’t exist in the system tables.

Multi-hop lineage: what Catalog Explorer doesn’t show

The Catalog Explorer visualizer shows only one hop in each direction: one upstream table and one immediate downstream table. If the data went through five transformations, you only see the adjacent one.

To trace the full chain, the approach is iterative queries on the system tables:

-- Find all ancestors of a table (multi-hop)
WITH RECURSIVE lineage AS (
  SELECT source_table_name, target_table_name, 1 as hop
  FROM system.access.table_lineage
  WHERE target_table_name = 'my_gold_table'

  UNION ALL

  SELECT l.source_table_name, tl.target_table_name, lineage.hop + 1
  FROM system.access.table_lineage tl
  JOIN lineage l ON tl.target_table_name = l.source_table_name
)
SELECT * FROM lineage ORDER BY hop;

Databricks doesn’t support native recursive CTE on system tables. In practice, this needs iterative logic in Python that queries level by level.

OpenLineage as a complement

For pipelines that leave the Databricks ecosystem (Airflow orchestrating external jobs, dbt running on a different warehouse, Python scripts with pandas), OpenLineage is the most used alternative to unify cross-platform lineage.

OpenLineage integrates via OpenLineageSparkListener and captures lineage from S3, GCS, JDBC, Redshift, and BigQuery. The integration exists, but has documented bugs with Databricks Spark 3.4+: generated payloads sometimes contain only inputs without outputs, and there are incompatibilities between the OpenLineage Spark 3.3 agent and Databricks’ 3.4.1 implementation.

If OpenLineage is critical to your setup, verify version compatibility before going to production.

What to instrument manually

To have complete lineage in real pipelines, these are the gaps that need extra work:

BI tools (Tableau, Power BI, Looker) need an explicit connector or manual registration via the External Lineage API, which is in Public Preview. The limit is 10,000 external objects and 100,000 relationships per metastore.

External orchestrators (Airflow, Prefect) need integration via API so jobs appear in the lineage graph.

Pipelines with extensive UPDATE/DELETE need complementary logging via system.query.history for auditing, since automatic lineage doesn’t cover those operations.

Where to start from scratch

If you’re instrumenting lineage for the first time in a Databricks environment:

First, confirm that clusters and jobs are in workspaces with Unity Catalog enabled. Without it, no automatic capture works.

Second, validate Databricks Runtime: 11.3 LTS or higher for column lineage in regular jobs. Older projects running on runtimes below that won’t have column lineage even with Unity Catalog active.

Third, map which pipelines extensively use UPDATE/DELETE/MERGE. For those, define from the start what the complementary auditing strategy will be, whether via system.query.history or via explicit logging in code.

Fourth, build a validation query that runs weekly against the system tables and checks whether critical tables have lineage registered. Missing lineage on an important table is a sign that something fell outside capture scope.

Lineage isn’t a feature you turn on and forget. I use it as a continuous practice: for every new pipeline, I validate what Unity Catalog captured and what fell outside.

What part of lineage gives you the most trouble today? Tell me on LinkedIn or subscribe to the newsletter.

When Medallion Architecture gets in the way more than it helps

Tue, 12 May 2026 00:00:00 +0000

There’s an architecture pattern I’ve watched grow since 2020, created by Databricks, adopted by Microsoft as the official standard for the Fabric platform in 2023, and that today shows up in almost every conversation about data engineering: Medallion Architecture.

Bronze, Silver, Gold. Raw data, clean data, aggregated data.

The problem isn’t the pattern. The problem is that it became the automatic answer. And when any architecture becomes the automatic answer, it starts creating more problems than it solves.

Databricks itself is clear in the official docs: “Following the medallion architecture is a recommended best practice but not a requirement.”

That rarely shows up in the presentations.

What Medallion Architecture actually is

Databricks defines it like this: a design pattern that organizes data in a lakehouse into layers that progressively improve the structure and quality of the data, from Bronze to Silver to Gold.

Bronze stores data exactly as it came from the source, with no transformation. It’s the immutable historical archive. If something goes wrong in later layers, you come back here.

Silver applies the minimum transformation needed to create a consistent enterprise view: cleansing, standardization, deduplication, joins across sources. It’s where data becomes trusted information.

Gold organizes data for specific consumption: analytics dashboards, ML models, financial reports. Denormalized, optimized for reads, designed for the end user.

Worth a historical note: the layered pipeline concept isn’t new. Data warehousing in the 1990s already used staging, cleansed, and presentation layers. What Databricks created in 2020 was the Bronze/Silver/Gold terminology and the “Medallion” branding, not the principle itself. That doesn’t make the pattern invalid, it just helps separate innovation from naming.

When Medallion works well

The pattern solves three real problems, and solves them well.

First: reprocessing without loss. When a bug shows up in a Silver transformation, you go back to Bronze and reprocess without having to fetch the data from the source again. In systems where the source only keeps the last 90 days of history, that protection can be the difference between fixing a problem and losing two years of data.

Second: multiple teams with different needs. The analytics team needs monthly totals. The data science team needs the data at the finest grain for model training. Both share Silver, each builds its own Gold layer independently. No duplicated cleansing work, no inconsistency across views.

Third: separation of responsibility in large teams. The ingestion team owns Bronze without needing to know business rules. The transformation team owns Silver without depending on the ingestion team. In organizations with more than 20 data professionals working in parallel, this reduces coupling and blockers.

When these three problems exist, Medallion is a solid choice. When they don’t, you’re adding complexity without a return.

Where Medallion starts to get in the way

When there’s a single consumer

You have a pipeline that ingests payroll data to feed a single HR dashboard. One team consuming, one purpose, one transformation.

Applying Medallion here means creating Bronze, Silver, and Gold to serve exactly the same thing. The data goes through three layers of reads and writes, three sets of jobs to monitor, and three times the latency. For zero gain.

The practical signal: if Gold is identical to Silver plus one grouping, you don’t need three layers. A single direct transformation from source to consumed table does the same work with half the infrastructure.

A case documented by a data architect: a customer had 4.2 billion rows in Bronze accumulated over six years of data, but Silver only consumed the last 90 days. 97% of stored data was never used. The storage cost was real, the benefit wasn’t.

When latency matters more than quality

Each transition Bronze to Silver, Silver to Gold, is a separate job. In Spark pipelines, that’s usually 20 to 40 minutes per layer. Three layers in sequence and total latency tops one hour before data reaches anywhere.

Analyses with real practitioner data show overhead of 53% or more in simple cases: 23 minutes with Medallion versus 15 minutes with direct transformation, for the same result.

When the business needs data in 30 minutes to make a decision, an architecture with 80 minutes of latency isn’t a code problem. It’s an architecture problem.

For data that needs to arrive in real time or near it, Databricks is explicit: it recommends micro-batch (latency in seconds to a few minutes) for Medallion, and explicitly advises that when ingestion comes from a message broker like Kafka, reading directly without an intermediate stage reduces complexity and latency. For sub-second, the documentation itself flags limitations in real-time mode that negatively affect throughput.

When it’s a prototype or short-lived analysis

A quick data exploration. A model that will exist for three months. A one-off analysis that will turn into a number on a slide and never be consumed again.

Forcing Medallion onto a prototype creates tables that will never be maintained, jobs nobody will monitor, and structure that will be abandoned in two weeks. The team spends time and energy organizing what was supposed to be disposable.

A prototype needs to be quick to build and easy to throw away. Three layers make both harder.

When the team is small and the data is simple

A startup with 3 data engineers processing 500 GB doesn’t have the same problems as a bank with 50 engineers and 50 TB. The operational overhead of maintaining Bronze, Silver, and Gold, with all the tables, jobs, documentation, and monitoring that requires, can be unjustifiable when the real benefit is small.

For small teams with one or two use cases, two layers (raw data and consumable data) or a solution with dbt directly on the source solve the problem without the extra complexity.

The anti-pattern nobody talks about

I’ve seen one specific problem appear more than any other when Medallion doesn’t work well: Bronze gets exposed as a data product.

Elliott Cordo, a data engineer with published work on data architecture, documents this as a direct anti-pattern: exposing the Bronze layer to consumers creates strong coupling between those using the data and the internal details of how it’s stored. When the source changes, every consumer breaks together.

The second documented problem: when Silver is Bronze with a renamed field, and Gold is Silver with a GROUP BY, the intermediate layers add no real value. Analysts end up writing complex SQL in Gold or building parallel spreadsheets to compensate. Multiple teams implement the same metric in different ways, and the numbers start to diverge.

In those cases, the pattern isn’t being applied, it’s being imitated.

The right question before deciding

Three questions define whether Medallion is the right architecture:

Are there multiple consumers with different needs? If yes, a shared layer between them makes sense. If not, you’re creating separation without benefit.

Is reprocessing data from the source expensive or impossible? If yes, immutable Bronze is real protection. If you can reprocess without cost or history loss, the benefit shrinks.

Does the latency of each layer fit the deadline the business demands? If yes, Medallion works. If not, you need a different architecture for that use case.

Three “yes”: Medallion is a solid choice. Two or fewer: worth questioning how many layers you actually need.

What large companies actually use

An important detail that rarely shows up in the discussions: Netflix and Uber, two of the most referenced companies in data engineering, don’t use Bronze/Silver/Gold terminology.

Netflix uses the WAP pattern (Write-Audit-Publish) with Apache Iceberg: data is written to a hidden snapshot, audited automatically, published if approved. The problem solved is the same (quality before exposure), but the implementation is different and doesn’t use Medallion’s three layers.

Uber uses a transactional data lake with Apache Hudi, with raw, derived, and aggregated tables. The migration from full batch to incremental ETL cut pipeline time by 82% and cost by 78%, according to the Uber Engineering Blog in March 2023. But those numbers are from incremental ETL, not from the layered pattern itself.

Microsoft adopted Medallion as Fabric’s official architecture in 2023 and is today the largest public case of institutional adoption. Even so, Microsoft’s own documentation guides: before building complex pipelines between layers, evaluate Materialized Lake Views, which manage transformations automatically without operational overhead.

What stays

Medallion Architecture is a good pattern for the right problems: large teams, multiple consumers, critical data that needs protected history and progressive quality.

It isn’t required. It isn’t universal. And when applied where it doesn’t fit, the cost is real: unnecessary latency, wasted storage, operational complexity without benefit.

Architecture choices should start from the problem, not from the pattern. What does this pipeline need to solve? Who will consume it? What’s the acceptable deadline? Is reprocessing from the source expensive?

If the answers point to Medallion, great. If they don’t, a simpler architecture will work better.

Have you ever implemented Medallion somewhere it didn’t belong? What happened next? Tell me on LinkedIn or subscribe to the newsletter for the next posts.

LGPD and ML models: what to do with data that has already become model weights

Sat, 02 May 2026 00:00:00 +0000

A data subject requested deletion. You deleted the row from the database. And the model?

The weights of an ML model trained on personal data hold, in a non-explicit form, the contribution of every training record. Deleting the original data doesn’t erase that influence. Membership inference research can determine, with some probability, whether a specific CPF was part of a model’s training set. That qualifies as personal data under LGPD.

I’ve seen most teams without a process for this scenario. Not for lack of intent: nobody set up the flow before training the first model.

What Article 18 actually requires

Article 18, IV of LGPD grants the data subject the right to request anonymization, blocking, or erasure of data that is “unnecessary, excessive, or processed outside compliance.”

The interpretation ANPD has been signaling in its public consultations on AI is that ML models are processors of personal data when the training data was personal at the time of processing. The production model inherits that classification.

If a data subject requested deletion and you can demonstrate that their data was used in training, the right to erasure applies to the model too. Not just to the dataset.

The law doesn’t specify how to execute that erasure. It specifies the expected result: the data subject should no longer have influence over the model’s decisions. How you get there is your technical problem.

The real technical problem

Three scenarios with different difficulty levels, that I’ve seen in practice.

Genuinely anonymized data before training: if you applied real anonymization, not pseudonymization, before any ML processing, you’re outside LGPD’s scope for that data. Article 12 is clear: anonymized data isn’t personal data. But anonymization needs to be irreversible. K-anonymity with k=3 on financial transactions isn’t real anonymization.

Pseudonymized data in training: you replaced the CPF with a token but kept the mapping. The data remains personal. The model was trained with that data and is now in production. A deletion request activates the full problem.

Raw data in training, no treatment: the most common scenario in older models, trained before any regulatory concern. Also the hardest to solve.

What teams do in practice

Three reference approaches I use, with real trade-offs, none free.

Full retraining without the data: you remove the record from the dataset, retrain from scratch or from an earlier checkpoint. It’s the cleanest legally, the most defensible in an audit, and the most expensive computationally. For models that take weeks to train, it’s impractical as a routine response.

Selective machine unlearning: techniques that try to remove the influence of specific records without full retraining. SISA training (Sharded, Isolated, Sliced, Aggregated) and gradient-based unlearning reduce cost. The problem: most production implementations still lack formal certification that the erasure was effective. In a dispute with ANPD, “we used machine unlearning” without measurable evidence doesn’t settle it.

Documenting impracticability and mitigating risk: LGPD allows, in some cases, continued processing when erasure is impossible and there’s a residual legal basis. Documenting that the model was trained with data that had a legal basis at the time, that retraining is technically unfeasible, and that mitigation measures were implemented can be the legally defensible answer. This needs legal opinion, not just technical analysis.

How to architect before training

The right moment to solve this is before the first model goes to production, not after the first deletion request.

Dataset versioning by data subject: maintain an index of which records were used in which training version. Without that index, you don’t even know which models need action when a data subject requests deletion.

Separation of training data by consent: if part of the dataset came from explicit consent and part from legitimate interest, treat them as separate datasets from the start. When consent is revoked, you know exactly which subset is affected.

Checkpoints labeled by dataset composition: if you use modular training, keep checkpoints with metadata on which shards were used. That reduces selective retraining cost from weeks to hours.

The decision every team will have to make

The scenario will show up: a data subject sends a deletion request, you delete the data, and someone asks what to do with the credit scoring model that used that CPF in training.

The honest answer today is: it depends on which model, when it was trained, how the dataset was managed, and what the original legal basis for processing was.

What’s no longer acceptable is not having the answer. ANPD is building its position on AI and LGPD. Teams that have already documented their architectural decisions will be in a far better position than those improvising when guidance arrives.

Delta Lake or Parquet? You're asking the wrong question

Thu, 30 Apr 2026 00:00:00 +0000

The question comes up every week in my team’s Slack: “should we use Delta Lake or Parquet?”

Delta Lake isn’t a competing file format to Parquet. It’s a transactional management layer that stores data in Parquet files. You aren’t choosing between two formats. You’re deciding whether you need a transactional layer on top of your files.

That distinction changes the decision criteria completely. And confusing the two in production costs real money.

What Parquet doesn’t do

Parquet solves one specific problem very well: storing data in a columnar, compressed format that’s efficient for analytical reads. It’s the right format for that.

What Parquet doesn’t do: concurrency control. If two jobs write to the same partition at the same time, the result is non-deterministic. No transactions, no rollback, no conflict detection. The last writer wins. The other one disappears.

At a fintech where I worked, with distributed ingestion pipelines, this wasn’t theoretical. It was the default scenario every time a streaming job and a backfill job ran together on the same table.

In pipelines with simultaneous streaming and backfill, the scenario shows up without warning. The symptom is subtle: row counts look right, but values diverge from the previous day with no error in the log. The last writer overwrote the previous one. Silent, no rollback.

What Delta Lake adds

Delta Lake solves the concurrency problem with _delta_log: a directory of JSON commits and Parquet checkpoints that records every transaction. Every writer registers what was added, what was removed, and the resulting version. Readers see consistent states, never partial ones.

That enables four capabilities pure Parquet can’t offer:

UPDATE, DELETE, and MERGE operations without rewriting the entire table. Delta marks affected files as removed and adds new ones. Old data remains accessible via time travel (SELECT * FROM table VERSION AS OF 10), but doesn’t appear in current queries.

Schema enforcement. If a pipeline tries to write a column with an incompatible type, the write fails before contaminating the table. With pure Parquet, you discover the problem at the consumer, not at the source.

Controlled compaction via OPTIMIZE. Streaming ingestion generates dozens of small files per hour. Delta consolidates these fragments without downtime, keeping the transaction log intact.

Data skipping using min/max statistics per file. In a 2 TB table with 10,000 Parquet files, a date-filtered query potentially has to open every file to check metadata. Delta keeps min/max per column in the log and skips whole files without reading them.

When Delta Lake is overkill

Delta Lake has a cost. The _delta_log adds overhead on small writes. Checkpoints are generated every 10 commits by default. For immutable datasets, that cost has no return.

Three scenarios where Parquet is the right choice:

Reference datasets that never change. BACEN code tables, calendar tables, historical data sealed after processing. No concurrent writers, no updates. Pure Parquet, no log overhead.

Export pipelines to external systems. You’re generating files to send to a partner, a legacy system, or an S3 bucket consumed by a tool that doesn’t read Delta. Parquet is the interoperability standard.

Experiments and ephemeral data. A notebook that reads a CSV and saves a result. No need for versioning or transactions. Delta’s overhead adds nothing here.

The decision in three questions

Before choosing the format, answer:

Does more than one process write to this table at the same time, or will it in the future? If yes, Delta Lake.
Is the data updated, deleted, or subject to audit requirements? If yes, Delta Lake.
Is the table read-only and never modified after writing? Parquet is enough.

Most operational tables in a productive lakehouse answer “yes” to question one or two. Most lookup tables answer “yes” to question three.

In the context of BACEN 521 compliance, which takes effect in October 2026, audit tables for financial transactions need time travel and schema enforcement. Using pure Parquet on those tables isn’t just inefficient. It’s a regulatory risk.

The real architectural decision

Delta Lake isn’t an improved version of Parquet. It’s a different layer that solves a different problem.

Parquet solves: how to store data efficiently for analytical reads.

Delta Lake solves: how to guarantee consistency when multiple processes access the same data at the same time.

The right question isn’t “which format should I use”. It’s “does this data need transactional control?” If it does, Delta Lake. If it doesn’t, Parquet. I’ve gone both ways across different projects. Picking the wrong one cost me on both sides.

If you’ve already hit silent corruption from concurrency in Parquet, or chose Delta on something that later felt excessive, share the context in the comments.

Sharpe Ratio -1.14 is an Engineering Win, Not a Failure

Thu, 23 Apr 2026 00:00:00 +0000

For 6 months, I built a quant agent for BTC/USDT trading.

Goal: maximize returns.

Result: Sharpe ratio of -1.14. Not good.

The system didn’t fail. It failed at one objective (alpha) and excelled at another (capital preservation).

Architecture by layers

Quant trading is complex. It’s not “buy here, sell there.” It’s this:

L1: Ingestion        (real data)
L2: Processing       (signals)
L3: Intelligence     (predictions)
L4: Decision         (sizing)
L5: Execution        (minimize impact)
L6: Evaluation       (backtests)
L7: Compliance       (audit)

Each layer is independent. Each has fallbacks.

L1: Ingestion

- BinanceFetcher: OHLCV, funding rates, open interest, order book
- MacroFetcher: DXY, S&P 500 via yfinance
- GlassnodeFetcher: on-chain metrics

Why 3 sources? Triangulation. If Binance goes down, you still have macro + on-chain.

L2: Processing

32+ technical indicators:
- RSI, MACD, Bollinger Bands (classics)
- ATR, Stochastic, Williams %R (volatility)
- Volume profile, Time-weighted moving average
- On-chain: MVRV, SOPR, Cumulative delta
- Macro: VIX-like crypto index

Everything normalized (z-score, min-max).
Everything temporally aligned (no forward-looking bias).

L3: Intelligence

Gaussian HMM (Hidden Markov Model) with 3 states:

BULL (uptrend)    → RSI > 60 + momentum + macro positive
SIDEWAYS (range)  → RSI 40-60 + low volatility
BEAR (downtrend)  → RSI < 40 + momentum negative

LightGBM regressor predicts returns on the next 4 candles (walk-forward).

You don’t need 60% accuracy to have alpha. You need consistency. A model that’s right 45% of the time but with low drawdown beats one that’s 70% accurate with 30% max DD.

L4: Decision

Quarter Kelly sizing. Not full Kelly (too aggressive).

Position size = (edge * odds) / odds_ratio
Capped at 2% of portfolio (max risk per trade)

Guardrails (non-negotiable):
- Max drawdown: 15%
- Circuit breaker: 3 consecutive losses = pause
- Kill switch: manual override always available

L5: Execution

Almgren-Chriss (minimize market impact):

Don't execute 100% in 1 candle.
Break it into 5-10 small orders.
Use TWAP/VWAP for better timing.
Check liquidity before each order.

L6: Evaluation

Walk-forward backtesting (no data leakage):

Train window: 60 days
Test window: 5 days
Roll forward: shift 5 days, repeat

Metrics:
- Sharpe, Sortino, Calmar ratios
- Max drawdown
- Win rate
- Recovery factor

L7: Compliance

- KillSwitch thread-safe (emergency)
- Auditor append-only in JSONL (immutable)
- Telegram notifications (real-time alerts)
- 202 tests (Python, pytest)
- CI/CD (GitHub Actions)

The insight: Quant engineering isn’t about “predicting prices.” It’s about building a system that’s tested, auditable, and fails gracefully (minimal drawdown).

The Bug That Revealed Everything

Initially, Sharpe was +0.66. Looked good.

Then I found data leakage in the HMM: the model was seeing the future during training.

A simple oversight:

# WRONG: trains with all data (future data leaks)
hmm.fit(all_indicators)

# RIGHT: trains only with past up to time T
hmm.fit(indicators_until_date_T)

After fixing: Sharpe dropped to -1.14.

That moment was crucial: real » spurious.

I could have:

Ignored the bug and shipped (risk: fraud)
Abandoned the project (risk: missed learning)

Instead, I documented the fix, rewrote the tests, and asked the right question: “What does this system actually solve?”

The Tradeoff: Alpha vs Capital Preservation

Let’s look at the numbers (out-of-sample, walk-forward):

Metric	Quant Agent	Buy & Hold
Sharpe ratio	-1.14	-0.04
Max drawdown	0.29%	26.24%
Win rate	1/7 windows	4/7 windows

Read that again.

The agent has no alpha. But it reduces drawdown by ~90x.

Ask yourself: which scenario would you prefer?

Scenario 1: You buy and hold. In one year, there’s one day where you lose 26% of everything. The next day, you recover 15%. Do you sleep?

Scenario 2: You’re running the agent. Max loss is 0.29% on any given day. You sleep better.

Capital preservation > chasing alpha.

Framework vs Outcome

The code didn’t “fail.” It solved a different problem than planned.

Systems thinking:

Original goal: Generate positive returns (alpha)
Problem discovered: Alpha is rare (even for professionals)
Emergent solution: Risk management is consistent
Actual result: A capital preservation system

Sometimes, failing at your original goal is the universe’s way of showing you the real one.

The Technical Stack

For devs, here’s what worked:

What worked:

Python + SQLAlchemy (robust ORM)
asyncio (true concurrency, non-blocking I/O)
pytest (202 tests passing)
Postgres (append-only auditing, compliance)
Windows Task Scheduler (low-cost orchestration)

What was challenging:

HMM on non-stationary data (quant is hard)
Market microstructure (Almgren-Chriss is complex)
Real-time data latency (lag = real slippage)

Final stack:

Data ingestion:  Binance API + Glassnode + yfinance
ML stack:        scikit-learn (HMM), LightGBM (regression)
Backend:         FastAPI (optional, current: local scheduler)
Database:        Postgres 16 + JSONL audit trail
Notifications:   Telegram bot + Discord webhook
Infrastructure:  Cheap VPS (1 vCPU, 4GB RAM, 50GB NVMe)

Runs on a cheap machine. No Kubernetes, no scary AWS bills.

Lasting lessons

1. Test First (TDD)

202 tests = confidence. You refactor without fear.

No tests? Silent failures. You discover them in production.

Each feature has an associated test:
- test_hmmpredict.py (model validation)
- test_kelly_sizing.py (risk management)
- test_market_impact.py (execution)
- test_audit_trail.py (compliance)

2. Auditing is Design

JSONL append-only logs saved me when I questioned results.

{"timestamp": "2026-04-22T10:30:00", "action": "BUY", "size": 0.05, "price": 65000, "reason": "BULL_regime_high_momentum"}
{"timestamp": "2026-04-22T11:45:00", "action": "CLOSE", "pnl": 50, "drawdown": 0.0015}

You can trace why each decision was made.

3. Constraints Generate Innovation

Quarter Kelly sizing is more conservative than full Kelly. But it was more effective.

Constraints (2% max risk, 15% max DD) forced creativity in decision-making.

Too much freedom = overfitting.

4. Real-Time is Different from Backtesting

Walk-forward validation prevents surprises.

Your model might be 70% accurate in backtest, but in production? 45%. Why?

Slippage (you don’t get the exact price)
Latency (0.5s delay = different price)
Spread (bid/ask widens in volatility)

Real-time doesn’t forgive.

5. Failure is Learning

Data leakage (-1.14 vs +0.66) was the most valuable discovery.

Fixing that bug = I learned more than from 10 books on quant.

Don’t fear “failures” that teach.

6. Simplicity > Complexity

3 states in the HMM worked better than 10+ features.

6 months building. Result: simple.

Time inversion: 95% building, 5% simplifying. But that 5% = the code that actually runs in production.

7. Capital Preservation > Chasing Alpha

Your goal should be: “Don’t lose money.”

Alpha (extra returns) is a bonus.

Most quants invert it: “I’ll chase alpha, tolerate losses.”

Wrong.

What Comes Next

This agent won’t generate overnight wealth.

(If anyone promises that, run.)

But it solves a real problem:

“How do I build a robust decision system in Python?”

Next steps for you:

The code: project closed for now. The architecture described above (HMM + LightGBM + Kelly + HRP, train/production separation, event-based vs polling) is what matters to replicate the approach.
Adapt it: For stocks, commodities, crypto (framework is agnostic)
Realize: How hard quant is. Respect those who do it well.

What’s Your Metric?

Sharpe is useful. But maybe you optimize for something else:

Maximum wealth in minimum time? (time allocated)
Minimum drawdown? (peace of mind)
Minimum capital needed? (accessibility)

Pick your metric. Build for it. Validate with real data.

Not his choice. Not the trend. Yours.

Sharpe -1.14 is a marketing failure. But it’s an engineering win.

If the goal was to learn how to build a robust, tested, auditable, scalable system, mission accomplished.

Your next objective is yours.

Reply on LinkedIn or subscribe to the Substack newsletter to get the next posts.

Real data engineering content in Portuguese is rare. I'm going to help change that.

Sat, 18 Apr 2026 00:00:00 +0000

The kind of data engineering content in Portuguese where you can tell the person actually lived what they’re writing about, that’s hard to find.

Search right now. You’ll find a lot of solid material to start with: translated articles from English blogs, tutorials grounded in the official docs, courses teaching Pandas on simple datasets. All of that has its place, it’s where most people start, and the people producing it are doing important work.

What’s still hard to find is someone telling you how they decided to use Delta Lake instead of Parquet in an environment processing hundreds of millions of daily transactions. Or when Medallion Architecture helps and when it just gets in the way. Or how LGPD (Brazil’s data privacy law) actually changes the way you design an ingestion layer.

That’s the gap I want to help fill.

Who I am, by what I’ve built

I won’t list certificates. I’ll tell you what I’ve shipped.

I’m a senior data engineer with 8+ years of experience. I started in data quality at a major Brazilian bank, then moved to a global-scale Brazilian fintech building ETL pipelines, worked on a big-tech project in Silicon Valley through an international tech consultancy, and today I’m back in the Brazilian banking sector. (Full résumé on the /sobre/ page.)

My core stack is Databricks. Not because I read the docs. Because it’s what runs in production where I’ve worked.

In 2026 I started a master’s in applied computational methods. My research is on AI-driven predictive monitoring for critical operational systems. Everything I learn there I plan to bring here, translated into something useful for engineers working with real data.

Why crypto entered the story

A few years ago I started studying on-chain analytics. And I noticed something that few people seem to be saying clearly: crypto, in large part, is a data engineering problem that’s still poorly solved.

The data is all there. On-chain, open, public. But most people investing in crypto don’t know how to process it, and many data engineers still aren’t looking at it.

So I decided to build a crypto AI agent from scratch. In public, documenting every architecture decision. Using the same tools I use at work: real pipelines, rigorous backtesting, actual statistical models. No hype, no get-rich-quick promises.

What you’ll find here

Three tracks, one newsletter.

The first is production data engineering: Databricks, Delta Lake, Spark, dbt, Airflow. Real architecture decisions, mistakes I made and what I learned, Brazilian context where it’s relevant (LGPD in practice, cloud cost reality, what data actually looks like inside financial institutions).

The second is the crypto AI agent, built in public. Architecture, code, backtesting, on-chain analysis. Every step documented. If something breaks, you’ll know why.

The third is the master’s research translated to practice. What academic research has to say about the problems you face every day. No filter, no academic jargon.

Published in Portuguese and English, every week.

Hit reply and tell me: what’s the hardest data problem you’re dealing with right now? I read everything.

Thais Vaz

Newsletter on Substack →

LGPD at the ingestion layer: 4 principles that change your architecture

Thu, 16 Apr 2026 00:00:00 +0000

Most data teams treat privacy law as something to solve “later”.

First the pipeline gets built, the data lands in the lake, the dashboards start shipping. Then one day a data subject request shows up asking for deletion of personal data. And the team finds out it doesn’t know where that ID lives, how many copies sit in Bronze, how many ML models were trained on it.

That’s too late.

LGPD (Brazil’s data privacy law, similar in spirit to GDPR) isn’t compliance at the end of the pipeline. It’s a design constraint that starts at the first byte you ingest. There are four principles that, if you build them into the ingestion layer, prevent almost every downstream pain.

Principle 1: minimize at the source, not at the destination

Art. 6, III of LGPD requires necessity: only process data that is adequate and limited to the purpose.

The practical translation is simple. Don’t ingest what you won’t use.

Sounds obvious, but it isn’t. Most pipelines ingest entire tables (including IDs, phone numbers, addresses, emails) “because it’s in the source”. Then compliance shows up, asks for the mapping of these fields, and discovers 80% of them were never consumed by anyone.

The right pattern is to apply schema filtering before persistence. In the ingestion pipeline, you explicitly define which fields enter the lake. Whatever doesn’t enter never becomes your retention problem, anonymization problem, audit problem.

The question worth asking before each field is: what concrete use case needs this data?. If the answer is “I dunno, could be useful”, then it isn’t needed.

Principle 2: pseudonymize from the first byte

Three terms that look alike and aren’t.

Anonymization is data made irreversible. Nobody can be identified anymore. It’s the only state LGPD treats as out of scope (Art. 12).

Pseudonymization is identity replaced by a code, but reversible via a separate mapping table. Still personal data (Art. 13, §4). Reduces risk, but doesn’t remove the obligation.

Tokenization is a specific pseudonymization pattern with deterministic tokens, useful for preserving joins without exposing the original data.

The pattern that works is to tokenize at ingestion. Bronze never sees raw data. It sees the deterministic token. The token ↔ original mapping lives in an isolated table, with encryption at rest, audited access, and its own retention policy.

This solves three problems at once. You can join tables in the lake without exposing the original data. Right to erasure becomes a DELETE in the mapping, no need to touch Bronze. And analysts and ML models work with pseudonymized data by default, reducing the risk surface.

Principle 3: lineage is a requirement, not a feature

When a data subject request shows up (Art. 18, right of access, correction, deletion), you have 15 days to respond. Without complete lineage, that deadline becomes a nightmare.

Real lineage answers three questions for any personal data. Where did it come from? Source system, original field, ingestion timestamp. What transformations did it go through? Pipeline steps, applied rules, derivations. Where is it now? Tables, trained models, dashboards that consume it.

Tools like OpenLineage, DataHub and Databricks Unity Catalog deliver this, but only if you instrument from ingestion onward. Adding lineage after the pipeline is already running is ten times more expensive than adding it before.

The practical test is direct: can you, in under an hour, list every table and model that contains the ID 123.456.789-00? If you can’t, your lineage isn’t LGPD-ready.

Principle 4: retention by purpose, not by table

Art. 15 says processing ends when the purpose is fulfilled. Art. 16 completes: after that, data must be deleted.

In data engineering practice, this means each piece of data has its own clock. You can’t define a single “retention equals 5 years” policy for all tables. Some purposes require months, others years, others are indefinite (under different legal bases).

Patterns that work: tables partitioned by processing date, with VACUUM or TRUNCATE PARTITION at the end of the cycle. A purpose map documented in code, a YAML that defines, per table and per field, which purpose justifies it, which legal basis, which deadline. And automated expiration jobs, no relying on manual process: configure retention policies that run themselves.

Delta Lake, BigQuery and Snowflake all have mechanisms for this. The real work is translating legal purpose into technical configuration, and that’s the work nobody wants to do, but it determines whether you clash with the regulator or not.

What data engineers need to align with legal

Three conversations engineering can’t outsource.

The first is the legal basis of each data. Consent? Legitimate interest? Contract execution? Each has different technical implications. Right to revoke, for example, only exists under consent.

The second is the concrete purpose of each pipeline. “Analytics” doesn’t count. Which business decision does this data support?

The third is the response process for subject requests. Who receives? What’s the flow? What’s the internal SLA? This must be documented, tested, and have an owner.

If these three conversations haven’t happened yet, your personal-data pipeline is running on compliance debt.

What stays

LGPD isn’t a checklist at the end. It’s a design constraint that changes four things. What you ingest (minimization). How you ingest (pseudonymization). What you track (lineage). How long you keep (retention by purpose).

Teams that treat it as “we’ll solve it later” pay the entire tech debt on the first subject request that arrives. Teams that treat it as a design constraint from the first byte don’t even notice it’s there, because it’s just how things work.

The difference between the two isn’t legal. It’s engineering.

What’s the trickiest data subject request your team has ever dealt with? Reply on LinkedIn or subscribe to the Substack for the next posts.

Vaz · Information engineering

Prompt caching: the one-line change that cuts 90% of LLM cost in production

Why LLM cost in production is prefix

How Anthropic’s cache works

A real benchmark from my daily news pipeline

Where it shines, where it doesn’t

It’s not micro-optimization. It’s architecture.

SQL is still the most important language in data engineering in 2026

The coder outsourced the understanding

In a real pipeline, the abstraction doesn’t fit

AI generates bad SQL at scale

The execution plan is where the difference lives

The decision comes before the next feature

YouTube rate-limits its caption endpoint. Audio stays free.

The problem transcription pipelines ignore

Why googlevideo doesn’t fall with it

A pipeline that handles real batch loads

CPU benchmark (Intel i7 12th gen, 16 GB, int8)

What about SaaS with Whisper fallback?

Honest verdict

Anti-patterns I saw along the way

What this changes for you

Code Review of My Own Old Repo. Five Things I'd Change Today.

The database credentials were inside the function

The pipeline lost parallelism for free

Input data was baked into the Docker image

The pipeline ran daily on static data

The fillna(0) erased an important signal

The discomfort of reviewing your own code

Data Flows Ep01: the concept that comes before any tool

In one sentence

Where this idea came from

Bounded vs unbounded: the decision that defines everything

When batch, when streaming

What goes wrong when the flow is ignored

How the big ones document their own flow

Anti-patterns to avoid

Where to start

SLA, not trend: when batch, when streaming, when both

The right question is not “which is more modern”

Decision table: SLA × technology

When batch wins (even in 2026)

When streaming is the only answer

When “both” is the right answer

Questions that decide the case

Airflow for 2 years: what I would do differently

Context: what it is and who uses it

What Airflow solves well

Where Airflow gets complicated

The scheduler parses your whole code every 30 seconds

XCom has a hard limit nobody warns you about

catchup=True has already triggered unwanted backfills in many teams

Renaming a DAG loses the whole history

Business logic inside the operator becomes a problem later

What I would do differently

About Airflow 3.0

When to evaluate alternatives

What stays

Twenty AI concepts you need to understand in 2026

How AI works (1 to 4)

The magic behind it (5 to 8)

Beyond the models (9 to 12)

How AI generates output (13 to 14)

How AI acts (15 to 16)

Improvement and evaluation (17 to 20)

Closing

When the model should say 'I don't know'

In one sentence

The blind spot left after the data leakage fix

What the literature calls this

Someone has done this in finance

How I implemented it

Why 5 sigmas, not 3

What changed in my sleep

Anti-patterns to avoid

The next chapter

Instrumenting lineage from scratch with Unity Catalog

What Unity Catalog captures automatically

What isn’t captured and where most people get it wrong

Multi-hop lineage: what Catalog Explorer doesn’t show

Why `googlevideo` doesn’t fall with it

The `fillna(0)` erased an important signal