
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Vaz · Information engineering</title>
    <link>https://vazdeng.pages.dev/en/</link>
    <description>Engenharia de dados de produção. Agente de IA para cripto. Mestrado traduzido para a prática.</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Thu, 11 Jun 2026 00:00:00 +0000</lastBuildDate>
    
	  <atom:link href="https://vazdeng.pages.dev/en/index.xml" rel="self" type="application/rss+xml" />
    
    
      
      
    
    
    <item>
      <title>Prompt caching: the one-line change that cuts 90% of LLM cost in production</title>
      <link>https://vazdeng.pages.dev/en/2026/06/11/prompt-caching-corte-90-custo-llm/</link>
      <pubDate>Thu, 11 Jun 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/06/11/prompt-caching-corte-90-custo-llm/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/06/11/prompt-caching-corte-90-custo-llm/cover_hu_236ef434b3576c83.png" width="640" height="336"/>]]>
        
        &lt;p&gt;18 thousand tokens. That was the cost of every run of my news pipeline with 6 parallel sub-agents. After one line of code, it became 4,500. Same model. Same prompt. Same output. I just turned on the cache.&lt;/p&gt;
&lt;p&gt;The feature has been in the Anthropic API for over a year. Most teams running LLMs in production still haven&amp;rsquo;t turned it on. I myself ran for months paying full price before actually reading my invoice. It&amp;rsquo;s the highest return per minute of work I know of today.&lt;/p&gt;
&lt;h2&gt;Why LLM cost in production is prefix&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;why-llm-cost-in-production-is-prefix&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#why-llm-cost-in-production-is-prefix&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Every API call sends 4 things: system prompt, few-shots, context, and the question. In a real pipeline, the first 3 add up to 80 to 95 percent of the tokens, and they repeat on every call. The question changes. The rest is prefix.&lt;/p&gt;
&lt;p&gt;Without cache, you pay for the entire prefix every time. In a pipeline running dozens or hundreds of times an hour, that becomes the bill. In a pipeline with parallel fan-out (several sub-agents sharing the same system prompt), it becomes the bill times the number of sub-agents.&lt;/p&gt;
&lt;p&gt;With cache, you pay for the prefix once (cache write), then only the delta of each new call (cache read). A cache read costs about 10% of the normal input price.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/11/prompt-caching-corte-90-custo-llm/images/01-custo-prefixo.png&#34; alt=&#34;Anatomy of a call: system prompt, few-shots and context are 80-95% of the tokens and repeat; only the question changes&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;How Anthropic&amp;rsquo;s cache works&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;how-anthropics-cache-works&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#how-anthropics-cache-works&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;You mark a block of the prompt with &lt;code&gt;cache_control: ephemeral&lt;/code&gt;. Simplified example:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;system&amp;#34;&lt;/span&gt;&lt;span class=&#34;err&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nt&#34;&gt;&amp;#34;type&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nt&#34;&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;&amp;lt;long, stable system prompt here&amp;gt;&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;nt&#34;&gt;&amp;#34;cache_control&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;#34;type&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;ephemeral&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Default TTL is 5 minutes. Next call inside that window: the cached prefix is read at 10% of the normal price. Anthropic also offers a 1-hour TTL as a paid option, useful for more spaced-out workflows.&lt;/p&gt;
&lt;p&gt;The API returns 2 metrics you need to monitor:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;cache_creation_input_tokens&lt;/code&gt;: you paid the write.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cache_read_input_tokens&lt;/code&gt;: you paid only the read (90% discount).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No model change, no prompt rewrite. Just flag what&amp;rsquo;s cacheable.&lt;/p&gt;
&lt;h2&gt;A real benchmark from my daily news pipeline&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;a-real-benchmark-from-my-daily-news-pipeline&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#a-real-benchmark-from-my-daily-news-pipeline&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The number in the opening comes from a pipeline I built and maintain: my daily news skill, running every day at 8am. It fires 6 parallel sub-agents: data engineering, AI, investing, crypto, local politics, international politics. Each one carries a fixed system prompt of roughly 3 thousand tokens with tone rules, output format, prioritized sources, and synthesis style.&lt;/p&gt;
&lt;p&gt;Without cache, the bill I was paying is direct math:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 sub-agents × 3 thousand prefix tokens = 18 thousand tokens paid per run.&lt;/li&gt;
&lt;li&gt;Times 1 run per day = 540 thousand tokens a month on prefix alone.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With cache:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 initial cache write (3 thousand tokens) + 5 cache reads (with a delta of ~300 tokens each) = ~4,500 effective tokens.&lt;/li&gt;
&lt;li&gt;Roughly a 75% cut in prefix cost, with zero quality loss and not a comma changed in the output.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a more aggressive production pipeline (running dozens of times an hour with larger prefixes), the cut reaches 90%.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/11/prompt-caching-corte-90-custo-llm/images/02-bench-real.png&#34; alt=&#34;Real benchmark: 18 thousand tokens per run without cache vs 4,500 with cache, a 75% cut&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;Where it shines, where it doesn&amp;rsquo;t&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-it-shines-where-it-doesnt&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-it-shines-where-it-doesnt&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Shines:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large, fixed system prompt (rules, format spec, examples).&lt;/li&gt;
&lt;li&gt;Fan-out: several sub-agents with the same prefix in the same session.&lt;/li&gt;
&lt;li&gt;Agents looping over the same context.&lt;/li&gt;
&lt;li&gt;Chat with a large attached document and several consecutive questions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Doesn&amp;rsquo;t shine:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One-shot calls with no repeated pattern.&lt;/li&gt;
&lt;li&gt;Prompts that change significantly on every call.&lt;/li&gt;
&lt;li&gt;Workflows with more than 5 minutes between calls (the cache expired).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/11/prompt-caching-corte-90-custo-llm/images/03-brilha-nao-brilha.png&#34; alt=&#34;Where the cache shines: fixed prefix, fan-out, loops, large documents. Where it doesn’t: one-shot, unstable prompt, cadence beyond the TTL&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Caveats that kill the gain if you don&amp;rsquo;t know them:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A cache write is slower than a normal call. You pay once in latency, you win on every call after. In a nightly pipeline that&amp;rsquo;s irrelevant. In an interactive chat, it matters.&lt;/li&gt;
&lt;li&gt;Don&amp;rsquo;t cache PII or sensitive data without auditing first. Anthropic&amp;rsquo;s cache is per-account, but the principle stands.&lt;/li&gt;
&lt;li&gt;The 5-minute TTL is a short window. If your job re-runs the pipeline every 10 minutes, the cache never hits. For those cases, use the 1-hour TTL.&lt;/li&gt;
&lt;li&gt;You only see the gain if you monitor the 2 metrics. A timestamp at the top of the system prompt is enough for the prefix to never cache, and without watching &lt;code&gt;cache_read&lt;/code&gt; you think you turned it on and you didn&amp;rsquo;t.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;It&amp;rsquo;s not micro-optimization. It&amp;rsquo;s architecture.&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;its-not-micro-optimization-its-architecture&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#its-not-micro-optimization-its-architecture&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Whoever is paying 100% of the price of every call because &amp;ldquo;there was no time to configure it&amp;rdquo; is accumulating debt with Anthropic every month. In a production pipeline with serious volume, that becomes thousands of dollars a year. For one line of code.&lt;/p&gt;
&lt;p&gt;The rule I now follow in everything I build: structure the prompt in layers. Stable first (cacheable), volatile last. Mark the stable part with &lt;code&gt;cache_control: ephemeral&lt;/code&gt;. Monitor &lt;code&gt;cache_creation&lt;/code&gt; and &lt;code&gt;cache_read&lt;/code&gt;. Pay once, read many.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s the ABC. And there are still teams calling this &amp;ldquo;advanced optimization&amp;rdquo;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>SQL is still the most important language in data engineering in 2026</title>
      <link>https://vazdeng.pages.dev/en/2026/06/10/sql-ainda-importa-2026/</link>
      <pubDate>Wed, 10 Jun 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/06/10/sql-ainda-importa-2026/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/06/10/sql-ainda-importa-2026/cover_hu_9181cf658c307756.png" width="640" height="336"/>]]>
        
        &lt;p&gt;There are devs onboarding into senior teams right now who have never written a &lt;code&gt;GROUP BY&lt;/code&gt; in their lives. They learned the ORM before SQL. They think &lt;code&gt;df.groupby()&lt;/code&gt; covers it. When a query hangs because the execution plan turned into a full scan over an 80-million-row table, they paste the error into ChatGPT, paste the answer back, and when it hangs again, they paste again. Infinite loop.&lt;/p&gt;
&lt;p&gt;That dev is what Akita calls a coder, as opposed to an engineer. And AI is accelerating his extinction.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/10/sql-ainda-importa-2026/coder-loop.png&#34; alt=&#34;The coder’s loop: query hangs, copy the error, paste the answer, it hangs again&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;The coder outsourced the understanding&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-coder-outsourced-the-understanding&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-coder-outsourced-the-understanding&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I learned SQL before any framework, because it was the only way to talk to the database. Today it&amp;rsquo;s the opposite. Framework before SQL. ORM before SQL. pandas before SQL. Layer upon layer of abstraction hiding the query that will actually run.&lt;/p&gt;
&lt;p&gt;The problem with abstraction is not the abstraction. It&amp;rsquo;s that it hides the cost. You assume &lt;code&gt;User.objects.filter().select_related().prefetch_related()&lt;/code&gt; is cheap. It isn&amp;rsquo;t. It&amp;rsquo;s a JOIN that can blow up memory if you don&amp;rsquo;t know why it&amp;rsquo;s a JOIN, across how many tables, with what cardinality. The ORM writes the right query in 70% of cases. The other 30% destroy your cluster.&lt;/p&gt;
&lt;h2&gt;In a real pipeline, the abstraction doesn&amp;rsquo;t fit&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;in-a-real-pipeline-the-abstraction-doesnt-fit&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#in-a-real-pipeline-the-abstraction-doesnt-fit&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A modern data pipeline processes billions of rows a day. Every query decision costs minutes times cluster times DBU times day times month. The gap between a well-written query and one generated by an unprepared ORM is a 10x to 100x factor on the final bill.&lt;/p&gt;
&lt;p&gt;A concrete case from a consulting engagement: an accounting close pipeline at a Brazilian fintech. The ORM was generating 47 subqueries for something native SQL solves in 1 CTE with a window function. Databricks/Snowflake bill: about USD 1,600 a month. After someone finally wrote the query in plain SQL: USD 160 a month. Same business result, 10x difference.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/10/sql-ainda-importa-2026/abstraction-cost.png&#34; alt=&#34;Real accounting-close case: unprepared ORM at R$ 8,000/month vs plain SQL at R$ 800/month, 10x cheaper&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It wasn&amp;rsquo;t an isolated case. It&amp;rsquo;s the pattern. Wherever there&amp;rsquo;s a large pipeline generated through abstraction, there&amp;rsquo;s a 10x fat factor waiting for someone to read the execution plan.&lt;/p&gt;
&lt;h2&gt;AI generates bad SQL at scale&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;ai-generates-bad-sql-at-scale&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#ai-generates-bad-sql-at-scale&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Every generative AI today produces fluent SQL. It compiles, runs, and returns the right number on the first try. The problem is not correctness. It&amp;rsquo;s efficiency.&lt;/p&gt;
&lt;p&gt;Patterns I keep seeing in LLM-generated SQL that nobody reviewed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SELECT *&lt;/code&gt; in stacked CTEs, dragging columns nobody will use through the whole pipeline.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WHERE column IN (SELECT ...)&lt;/code&gt; instead of a JOIN, in cases where the JOIN would be 100x faster.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WHERE UPPER(column) = &#39;X&#39;&lt;/code&gt; on an indexed column, killing the index.&lt;/li&gt;
&lt;li&gt;No partition hint in Spark or Snowflake, scanning the whole table when one day of data was needed.&lt;/li&gt;
&lt;li&gt;Window functions with the wrong &lt;code&gt;PARTITION BY&lt;/code&gt;, computing the wrong thing without throwing an error.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/10/sql-ainda-importa-2026/llm-antipatterns.png&#34; alt=&#34;Antipatterns in unreviewed LLM-generated SQL: SELECT * in stacked CTEs, IN instead of JOIN, function on indexed column, no partition hint, WINDOW without proper PARTITION BY&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Of these five patterns, there isn&amp;rsquo;t one I haven&amp;rsquo;t seen in generated queries. If you don&amp;rsquo;t read execution plans, you don&amp;rsquo;t see any of this. It ships to production and you pay the interest at the end of the month. Technical debt with AI is not the same debt as 5 years ago. You take it on 10x faster, convinced you&amp;rsquo;re getting ahead.&lt;/p&gt;
&lt;h2&gt;The execution plan is where the difference lives&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-execution-plan-is-where-the-difference-lives&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-execution-plan-is-where-the-difference-lives&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; in Postgres. &lt;code&gt;EXPLAIN COST&lt;/code&gt; in Snowflake. The physical plan in the Spark UI. It&amp;rsquo;s the first thing I look at before letting a new query run at scale. They all tell you the same thing: how many rows the engine will scan, which joins it picked, where the shuffle is, where the broadcast is, where the queue is.&lt;/p&gt;
&lt;p&gt;A coder looks at the plan and doesn&amp;rsquo;t understand it. An engineer reads it and knows whether it&amp;rsquo;s fit for production or needs a rewrite. It&amp;rsquo;s not memorization. It&amp;rsquo;s reading from cause to cost.&lt;/p&gt;
&lt;p&gt;When you ask an LLM to generate SQL, also ask for the estimated plan, ask it to compare against an alternative version, ask it to discuss the partition vs broadcast trade-off. If you can&amp;rsquo;t evaluate the answer, you&amp;rsquo;re not doing engineering yet. You&amp;rsquo;re outsourcing the decision.&lt;/p&gt;
&lt;h2&gt;The decision comes before the next feature&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-decision-comes-before-the-next-feature&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-decision-comes-before-the-next-feature&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;SQL didn&amp;rsquo;t die. The people who pretended to know it did.&lt;/p&gt;
&lt;p&gt;AI is professional darwinism. Whoever truly learns SQL becomes 10x more productive with it, because they can evaluate what it generates. Whoever outsources ORM plus AI accumulates debt that will break production in 18 months, and on that day there will be nobody left to debug it, because nobody reads execution plans anymore.&lt;/p&gt;
&lt;p&gt;The choice happens before the next feature. Will you learn what&amp;rsquo;s actually running, or bet that AI covers your gap? It&amp;rsquo;s a bad bet.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>YouTube rate-limits its caption endpoint. Audio stays free.</title>
      <link>https://vazdeng.pages.dev/en/2026/06/04/youtube-timedtext-bloqueia-googlevideo-nao/</link>
      <pubDate>Thu, 04 Jun 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/06/04/youtube-timedtext-bloqueia-googlevideo-nao/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/06/04/youtube-timedtext-bloqueia-googlevideo-nao/cover_hu_7177a10538791195.png" width="640" height="336"/>]]>
        
        &lt;p&gt;Hit HTTP 429 on 14 consecutive YouTube videos. I tried &lt;code&gt;--sleep-subtitles 60&lt;/code&gt;, exponential backoff up to 45s, browser cookies, yt-dlp pre-release. Nothing helped. Every &lt;code&gt;timedtext&lt;/code&gt; request came back 429.&lt;/p&gt;
&lt;p&gt;Switched to the audio endpoint. Zero 429.&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;In one sentence: YouTube&amp;rsquo;s &lt;code&gt;timedtext&lt;/code&gt; (captions) and &lt;code&gt;googlevideo&lt;/code&gt; (audio/video) are different endpoints. Only the first is aggressively rate-limited in 2026. Downloading audio and transcribing locally is cheaper than insisting on captions.&lt;/p&gt;

&lt;/blockquote&gt;
&lt;h2&gt;The problem transcription pipelines ignore&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-problem-transcription-pipelines-ignore&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-problem-transcription-pipelines-ignore&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The &lt;code&gt;timedtext&lt;/code&gt; rate limit became common enough in 2026 that yt-dlp has 3 open issues (#7123, #13770, #13831) with no definitive fix. The official advice is caching and using the YouTube Data API with OAuth. Both work but shift the problem rather than solving it. Anyone who scheduled 50 URLs and saw half come back empty knows the symptom.&lt;/p&gt;
&lt;h2&gt;Why &lt;code&gt;googlevideo&lt;/code&gt; doesn&amp;rsquo;t fall with it&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;why-googlevideo-doesnt-fall-with-it&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#why-googlevideo-doesnt-fall-with-it&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The discovery that took me too long lives in the two distinct layers YouTube exposes. &lt;code&gt;timedtext&lt;/code&gt; is an API layer: serves small XML/VTT under a global per-IP, per-day quota, with heavy caching and bot detection hardened in 2025. Every request counts. &lt;code&gt;googlevideo&lt;/code&gt; is the CDN that serves audio and video via DASH segments from Google Global Cache edges, peering directly with your ISP. Its billing layer is aggregated bandwidth at the server serving your ISP, not per-request. The rate limit there only fires on clearly robotic patterns.&lt;/p&gt;
&lt;p&gt;In practice I saw this: 60 requests in 5 minutes against &lt;code&gt;timedtext&lt;/code&gt; results in guaranteed 429. The same 60 downloads on &lt;code&gt;googlevideo&lt;/code&gt; with a natural interval go through with no warning. That detail isn&amp;rsquo;t documented in any obvious place. I figured it out when my cron broke and I opened Wireshark.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/04/youtube-timedtext-bloqueia-googlevideo-nao/images/01-dois-endpoints.png&#34; alt=&#34;Two endpoints, only one blocks: timedtext is an API with per-IP quota and guaranteed 429; googlevideo is a CDN billed by bandwidth, zero 429&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;A pipeline that handles real batch loads&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;a-pipeline-that-handles-real-batch-loads&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#a-pipeline-that-handles-real-batch-loads&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I packaged the logic in an open source Python CLI called &lt;a href=&#34;https://github.com/thaiscvaz/yt-nota&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;yt-nota&lt;/a&gt;. Combines 3 tools.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Step&lt;/th&gt;
          &lt;th&gt;Tool&lt;/th&gt;
          &lt;th&gt;Cost&lt;/th&gt;
          &lt;th&gt;Where it fails&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Metadata + caption URL&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;yt-dlp&lt;/code&gt; (Python API)&lt;/td&gt;
          &lt;td&gt;$0&lt;/td&gt;
          &lt;td&gt;Private video, region lock&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Audio fallback&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;yt-dlp&lt;/code&gt; format 139 (m4a 49kbps)&lt;/td&gt;
          &lt;td&gt;$0&lt;/td&gt;
          &lt;td&gt;Members-only without cookie&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Local transcription&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;faster-whisper&lt;/code&gt; int8 CPU&lt;/td&gt;
          &lt;td&gt;$0&lt;/td&gt;
          &lt;td&gt;Video &amp;gt; 1h on weak hardware&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;faster-whisper&lt;/code&gt; is 4x faster than &lt;code&gt;openai-whisper&lt;/code&gt; on the same model, with the same accuracy (same weights). My CLI&amp;rsquo;s API looks like this:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;result&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;extract_transcript&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;url&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;whisper_fallback&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;   &lt;span class=&#34;c1&#34;&gt;# default on&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;whisper_model&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;small&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;   &lt;span class=&#34;c1&#34;&gt;# or tiny/base/medium&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;On 429, it drops to &lt;code&gt;googlevideo&lt;/code&gt;, downloads only the audio, transcribes, and returns the same format. The caller doesn&amp;rsquo;t know if the transcript came from &lt;code&gt;timedtext&lt;/code&gt; or Whisper.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/04/youtube-timedtext-bloqueia-googlevideo-nao/images/02-fallback-audio.png&#34; alt=&#34;3-step fallback: caption via timedtext, audio via googlevideo on 429, local transcription with faster-whisper, same output format&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;CPU benchmark (Intel i7 12th gen, 16 GB, int8)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;cpu-benchmark-intel-i7-12th-gen-16-gb-int8&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#cpu-benchmark-intel-i7-12th-gen-16-gb-int8&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I ran the pipeline on real videos of varying length to measure wall-clock time. No GPU.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Video duration&lt;/th&gt;
          &lt;th&gt;&lt;code&gt;base&lt;/code&gt; (74 MB)&lt;/th&gt;
          &lt;th&gt;&lt;code&gt;small&lt;/code&gt; (244 MB)&lt;/th&gt;
          &lt;th&gt;&lt;code&gt;medium&lt;/code&gt; (769 MB)&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;5 min&lt;/td&gt;
          &lt;td&gt;35 s&lt;/td&gt;
          &lt;td&gt;1 min 30 s&lt;/td&gt;
          &lt;td&gt;5 min&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;13 min&lt;/td&gt;
          &lt;td&gt;1 min 50 s&lt;/td&gt;
          &lt;td&gt;4 min&lt;/td&gt;
          &lt;td&gt;13 min&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;45 min&lt;/td&gt;
          &lt;td&gt;6 min&lt;/td&gt;
          &lt;td&gt;14 min&lt;/td&gt;
          &lt;td&gt;45 min&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;On accuracy for technical Portuguese, I did comparative reading over ~14 hours of lecture audio. The &lt;code&gt;base&lt;/code&gt; model confuses 1 in every 6 technical terms (95% readable but needs human review). The &lt;code&gt;small&lt;/code&gt; confuses 1 in every 20 (default for a reason: the downstream LLM corrects rare errors from context). The &lt;code&gt;medium&lt;/code&gt; gets close to zero errors but doubles the time. For my flow (transcript → synthesis via Claude Code), &lt;code&gt;small&lt;/code&gt; is the sweet spot.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/04/youtube-timedtext-bloqueia-googlevideo-nao/images/03-sweet-spot-small.png&#34; alt=&#34;Which Whisper on CPU: base misses 1 in 6 technical terms, small misses 1 in 20 and is the sweet spot, medium near zero but runs at real-time speed&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;What about SaaS with Whisper fallback?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-about-saas-with-whisper-fallback&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-about-saas-with-whisper-fallback&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;They exist. Two main ones in 2026.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Solution&lt;/th&gt;
          &lt;th&gt;Price&lt;/th&gt;
          &lt;th&gt;When it makes sense&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Supadata&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;From $0.001/min, free tier 1000 req/month&lt;/td&gt;
          &lt;td&gt;Company with SLA, doesn&amp;rsquo;t want to maintain infra&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Apify YouTube Transcript Scraper&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;$0.40 per 1000 actor runs + compute&lt;/td&gt;
          &lt;td&gt;Pipeline already on Apify&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;yt-nota self-host&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;250 MB deps + 244 MB model&lt;/td&gt;
          &lt;td&gt;Privacy, academic batch, full control&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The call is trivial for me: learning notes and Obsidian vault don&amp;rsquo;t go through third-party APIs. If it were a corporate pipeline with SLA and audit, Supadata wins on operational simplicity. Self-host only makes sense when you are the customer of the data.&lt;/p&gt;
&lt;h2&gt;Honest verdict&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;honest-verdict&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#honest-verdict&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;What works: batch of 50+ videos without crashing midway, zero recurring cost after the initial 500 MB, quality on technical Portuguese good enough for an LLM to digest downstream.&lt;/p&gt;
&lt;p&gt;What it costs: first install is heavy (&lt;code&gt;pip install yt-nota[whisper]&lt;/code&gt;), &lt;code&gt;small&lt;/code&gt; model can confuse exotic terminology (for critical audio, bump to &lt;code&gt;medium&lt;/code&gt;), and CPU becomes a bottleneck on videos longer than 1h.&lt;/p&gt;
&lt;p&gt;When NOT to use it: volume of 10,000 hours per month with tight SLA (OpenAI&amp;rsquo;s Whisper API at $0.006/min ends up cheaper per engineer-hour than running local infra), or audio with music and multiple simultaneous voices (faster-whisper doesn&amp;rsquo;t do diarization, pyannote does).&lt;/p&gt;
&lt;h2&gt;Anti-patterns I saw along the way&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;anti-patterns-i-saw-along-the-way&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#anti-patterns-i-saw-along-the-way&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Trusting &lt;code&gt;--sleep-subtitles 60&lt;/code&gt; as a silver bullet. I tested it: it doesn&amp;rsquo;t trigger before the request, it triggers after the first 429. Game over. Reaching for a paid API before trying the local pipeline is also a trap. $36k/year on transcription (the public faster-whisper benchmark) is money that should buy you a mid-range GPU. And deleting the raw audio after transcribing is the mistake of someone who never wanted to re-run with a better model 6 months later. I keep mine.&lt;/p&gt;
&lt;h2&gt;What this changes for you&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-this-changes-for-you&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-this-changes-for-you&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;If you use YouTube as a learning source, RAG input, or note-taking pipeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; Does your current pipeline handle 50 URLs in a row without crashing?&lt;/li&gt;
&lt;li&gt;&lt;input disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; Can you tell a 429 from &lt;code&gt;timedtext&lt;/code&gt; apart from a 429 from &lt;code&gt;googlevideo&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;&lt;input disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; Do you have automatic fallback or do you handle each failure manually?&lt;/li&gt;
&lt;li&gt;&lt;input disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; Does your monthly transcription bill still fit, or has it passed the cost of an amortized GPU?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you said &amp;ldquo;no&amp;rdquo; to more than one, it&amp;rsquo;s worth an afternoon of refactoring.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Code Review of My Own Old Repo. Five Things I&#39;d Change Today.</title>
      <link>https://vazdeng.pages.dev/en/2026/06/02/code-review-of-my-own-old-repo/</link>
      <pubDate>Tue, 02 Jun 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/06/02/code-review-of-my-own-old-repo/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/06/02/code-review-meu-codigo-antigo/cover_hu_997d5e46c0010a94.png" width="640" height="336"/>]]>
        
        &lt;p&gt;I opened a two-year-old repo of mine. It was still public on GitHub, I cited it in interviews, and I had never re-read the code since I submitted it. This weekend I sat down to re-read it.&lt;/p&gt;
&lt;p&gt;I found five anti-patterns. In my own code, written by me. But the kind of problem I see show up in real production pipelines at large companies, not just in interview projects.&lt;/p&gt;
&lt;p&gt;I decided to write about it because it&amp;rsquo;s more honest to critique my own code than to point fingers at someone else&amp;rsquo;s repo. And because if you have a public repo from two years ago that you still cite in your portfolio, you probably also have at least three of these five.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/02/code-review-meu-codigo-antigo/images/01-cinco-antipadroes.png&#34; alt=&#34;The 5 anti-patterns from the review: credential inside the function, serial tasks for no reason, data baked into the Docker image, daily DAG over static data, fillna(0) erasing signal&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;The database credentials were inside the function&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-database-credentials-were-inside-the-function&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-database-credentials-were-inside-the-function&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;load_data_to_snowflake&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;df_merged&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;conn&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;snowflake&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;connector&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;connect&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;user&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;thaiscxxx&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;password&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;xxx*&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;n&#34;&gt;account&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;xxx&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I masked it with &lt;code&gt;xxx&lt;/code&gt; before pushing, but the design pattern is the problem, not the string. Credentials inside the function mean each task that talks to Snowflake duplicates the connection, rotating the password requires touching code, and auditing means grepping the entire repo to figure out who connects where.&lt;/p&gt;
&lt;p&gt;The honest version would use a Hook (&lt;code&gt;SnowflakeHook&lt;/code&gt;) or environment variable, with the connection managed outside the code:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;airflow.providers.snowflake.hooks.snowflake&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SnowflakeHook&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;hook&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;SnowflakeHook&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;snowflake_conn_id&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;snowflake_default&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Encrypted, traceable, and never shows up in a pull request.&lt;/p&gt;
&lt;h2&gt;The pipeline lost parallelism for free&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-pipeline-lost-parallelism-for-free&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-pipeline-lost-parallelism-for-free&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;t1&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t2&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t3&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;code&gt;t1&lt;/code&gt; validated &lt;code&gt;students.json&lt;/code&gt;. &lt;code&gt;t2&lt;/code&gt; validated &lt;code&gt;missed_days.json&lt;/code&gt;. I chained them sequentially, but they&amp;rsquo;re independent. No reason for &lt;code&gt;t2&lt;/code&gt; to wait on &lt;code&gt;t1&lt;/code&gt;. With a tiny file, it barely matters. When the JSON weighs gigabytes and validation takes minutes, parallelizing cuts the duration in half.&lt;/p&gt;
&lt;p&gt;The correct version:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;t1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t2&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t3&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;t4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Whoever reads the pipeline now understands validation runs in parallel and then joins. Whoever read the original would assume there&amp;rsquo;s some hidden dependency that doesn&amp;rsquo;t exist.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/02/code-review-meu-codigo-antigo/images/02-serie-vs-paralelo.png&#34; alt=&#34;How I wrote it vs how it should be: t1 » t2 » t3 » t4 in series against [t1, t2] » t3 » t4 with validations in parallel&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;Input data was baked into the Docker image&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;input-data-was-baked-into-the-docker-image&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#input-data-was-baked-into-the-docker-image&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In the Dockerfile:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;COPY files/students.json /students.json
COPY files/missed_days.json /missed_days.json&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I embedded the input data into the image. Every rebuild assumes the same data. To run the pipeline with a different JSON, I&amp;rsquo;d have to rebuild the image or change the code. Coupling between execution artifact and input data, in the same place.&lt;/p&gt;
&lt;p&gt;The rule I&amp;rsquo;d preach to others but ignored in my own repo: images are immutable, data is mutable. Data comes in through a mounted volume, S3, GCS, or runtime parameter. Never inside the image.&lt;/p&gt;
&lt;h2&gt;The pipeline ran daily on static data&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-pipeline-ran-daily-on-static-data&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-pipeline-ran-daily-on-static-data&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;with&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;DAG&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;migrate_student_data_to_snowflake&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;         &lt;span class=&#34;n&#34;&gt;schedule_interval&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;timedelta&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;days&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;         &lt;span class=&#34;n&#34;&gt;catchup&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;dag&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I scheduled the pipeline to run every day. The input data is two static JSONs baked into the image (the anti-pattern above). Running daily means processing the exact same files, generating the exact same records, and trying to insert them all again into the same table. On the second run, &lt;code&gt;write_pandas&lt;/code&gt; would duplicate the rows. On the third, duplicate again.&lt;/p&gt;
&lt;p&gt;The data is static. The correct choice would be &lt;code&gt;schedule_interval=None&lt;/code&gt; (manual or external trigger only) or a sensor that detects a new file in the bucket. Scheduling a pipeline without a mutable source is ceremony: it burns a worker slot every day, fires alerts when it breaks, pollutes the execution history. And when you actually need to run it with new data, the operation becomes indistinguishable from the background noise.&lt;/p&gt;
&lt;p&gt;It was meant to run once. I scheduled it to run daily. Subtle, but the kind of choice that produces ceremonial DAGs in production: pipelines that exist without a reason to exist on that cadence.&lt;/p&gt;
&lt;h2&gt;The &lt;code&gt;fillna(0)&lt;/code&gt; erased an important signal&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-fillna0-erased-an-important-signal&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-fillna0-erased-an-important-signal&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;df_merged&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;missed_days&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fillna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;inplace&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;When a student appears in &lt;code&gt;students.json&lt;/code&gt; but not in &lt;code&gt;missed_days.json&lt;/code&gt;, the join leaves &lt;code&gt;missed_days&lt;/code&gt; null. I replaced it with zero. It seemed right at the time.&lt;/p&gt;
&lt;p&gt;Zero absences carries business meaning: the student showed up every day. A missing record carries another meaning: the school didn&amp;rsquo;t report this student&amp;rsquo;s attendance. Conflating the two masks an upstream data quality issue. A dashboard filtering &amp;ldquo;students with zero absences&amp;rdquo; will surface as model students precisely the kids whose data never arrived.&lt;/p&gt;
&lt;p&gt;The honest version leaves null and opens a new column marking whether the record exists:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;df_merged&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;missed_data_source&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;df_merged&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;missed_days&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;notna&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;map&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;True&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;reported&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s1&#34;&gt;&amp;#39;not_reported&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Small change, completely changes what the dashboard shows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/06/02/code-review-meu-codigo-antigo/images/03-zero-nao-e-nulo.png&#34; alt=&#34;Zero is not null: 0 means the student attended every day, NULL means the data never arrived, fillna(0) mixes the two&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;The discomfort of reviewing your own code&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-discomfort-of-reviewing-your-own-code&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-discomfort-of-reviewing-your-own-code&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Rewriting these five snippets today would take an hour. The discomfort of publicly admitting they were wrong is bigger than the hour. But the repo stayed public with the defects, and I cite that repo in my portfolio. Keeping the repo intact and doing an honest review on top is more useful for someone learning than deleting the history and pretending I always wrote clean code.&lt;/p&gt;
&lt;p&gt;If you have an old public repo still listed in your portfolio, open it this week. You&amp;rsquo;ll find at least three of these five.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Data Flows Ep01: the concept that comes before any tool</title>
      <link>https://vazdeng.pages.dev/en/2026/05/30/data-flows-ep01-the-concept-that-comes-before-any-tool/</link>
      <pubDate>Sat, 30 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/30/data-flows-ep01-the-concept-that-comes-before-any-tool/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/30/data-flows-ep01/cover_hu_5dccf7c4c79d4e55.png" width="640" height="336"/>]]>
        
        &lt;p&gt;On August 1st, 2012, Knight Capital lost $440 million in 45 minutes.&lt;/p&gt;
&lt;p&gt;Not an algorithm bug. Not a market crash. One server out of eight received the new deploy, while another kept an old flag reactivated (Power Peg, 2003 code). The two ran in parallel. The result was a cascade of automated orders nobody could stop.&lt;/p&gt;
&lt;p&gt;The SEC documented the case (Release No. 70694, October 2013): the root cause was not a trading logic error. It was state inconsistency between servers that should have been in sync. In data engineering language, a broken data flow.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/30/data-flows-ep01/images/01-knight-timeline.png&#34; alt=&#34;Timeline August 1st, 2012: Knight Capital loses 440 million dollars in 45 minutes from divergent state between servers&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Knight Capital had sophisticated algorithms. Over a decade of operation. What it did not have was a clear mental model of where the data was born, where it traveled, and where it had to arrive consistently.&lt;/p&gt;
&lt;p&gt;That mental model defines everything else. I have worked with data long enough to have seen, at smaller scales, variations of the same failure. Before Apache Spark, before dbt, before Snowflake, before any tool, there is a concept that separates a robust pipeline from a fragile one.&lt;/p&gt;
&lt;h2&gt;In one sentence&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;in-one-sentence&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#in-one-sentence&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote&gt;
  &lt;p&gt;A data flow is the path data travels from source to destination, with every transformation in the middle. Getting that path right is an architectural decision. Getting it wrong is expensive.&lt;/p&gt;

&lt;/blockquote&gt;
&lt;h2&gt;Where this idea came from&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-this-idea-came-from&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-this-idea-came-from&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;It is not new. Bill Inmon published &lt;em&gt;Building the Data Warehouse&lt;/em&gt; in 1992 defending top-down, normalized, enterprise-wide architecture. Ralph Kimball replied in 1996 with &lt;em&gt;The Data Warehouse Toolkit&lt;/em&gt;: bottom-up, dimensional modeling, data marts composing the whole. The Inmon vs Kimball debate dominated the 90s and still shows up in any architecture review.&lt;/p&gt;
&lt;p&gt;What changed between 1996 and 2026 was not the concept, it was the scale. In 2017, Martin Kleppmann published &lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt; and formalized in chapter 11 the distinction that organizes modern data engineering:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;&amp;ldquo;A stream refers to data that is incrementally made available over time&amp;hellip; in contrast to batch processing, where the input is a known, finite size.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;Bounded vs unbounded. A dataset with known size (batch) versus one that never ends (stream). Every data architecture decision starts here.&lt;/p&gt;
&lt;p&gt;In 2021, the Lakehouse paper (Armbrust, Ghodsi, Xin, Zaharia, CIDR) proposed unifying warehouse and lake via a metadata layer (Delta, Iceberg, Hudi). In 2020, the dbt Labs team popularized ELT over ETL: transformation inside the warehouse, not before. Each wave changed the tooling, not the principle.&lt;/p&gt;
&lt;h2&gt;Bounded vs unbounded: the decision that defines everything&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;bounded-vs-unbounded-the-decision-that-defines-everything&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#bounded-vs-unbounded-the-decision-that-defines-everything&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Every pipeline decision starts here. Practical summary in a table:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/30/data-flows-ep01/images/02-data-flow-overview.png&#34; alt=&#34;Data flow diagram: source, transformation, destination, with batch, micro-batch and streaming by SLA&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Type&lt;/th&gt;
          &lt;th&gt;Trait&lt;/th&gt;
          &lt;th&gt;When to use&lt;/th&gt;
          &lt;th&gt;Cost&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Batch&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Finite dataset, processed in a defined window&lt;/td&gt;
          &lt;td&gt;SLA in hours, accounting reports, historical snapshots&lt;/td&gt;
          &lt;td&gt;Simple to build, debug, recover&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Infinite dataset, event processed on arrival&lt;/td&gt;
          &lt;td&gt;SLA from seconds to a few minutes, real-time fraud, ops dashboards&lt;/td&gt;
          &lt;td&gt;Complex, requires watermarks, exactly-once, heavy observability&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Micro-batch&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;Streaming in short windows (seconds to minutes)&lt;/td&gt;
          &lt;td&gt;Middle ground: minute-level dashboards, ML feature stores near real-time&lt;/td&gt;
          &lt;td&gt;Spark Structured Streaming, Flink mini-batches&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Tyler Akidau and team (Google) published in VLDB 2015 &lt;em&gt;The Dataflow Model&lt;/em&gt; paper that formalized the modern vocabulary: event time, processing time, watermarks, triggers, windowing. The central line:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;&amp;ldquo;A practical approach to balancing the inherent tension between correctness, latency, and cost in massive-scale, unbounded, out-of-order data.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;Translation: streaming is right on three variables at the same time. You do not maximize the three, you pick two and pay for the third.&lt;/p&gt;
&lt;h2&gt;When batch, when streaming&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-batch-when-streaming&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-batch-when-streaming&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The practical rule I use is simple: acceptable latency SLA defines the answer.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SLA above 1h&lt;/strong&gt; leans to batch. Simple reprocessing, direct debugging, cheap infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SLA below 1 minute&lt;/strong&gt; demands streaming. Whoever tries to force batch in that scenario creates windows so short that it reinvents streaming with the worst of both worlds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SLA between 1 minute and 1h&lt;/strong&gt; is micro-batch territory. Spark Structured Streaming or Flink mini-batches solve it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Jay Kreps, Confluent founder, wrote in 2014 the essay &lt;em&gt;Questioning the Lambda Architecture&lt;/em&gt; attacking the model proposed by Nathan Marz, which kept two parallel layers (batch + speed). The line that stuck:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;&amp;ldquo;The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;Kreps proposed Kappa: unified log (Kafka) as source of truth, reprocessing via replay. Kappa became standard among teams running serious streaming.&lt;/p&gt;
&lt;p&gt;The most common mistake I see is forcing streaming because &amp;ldquo;it sounds modern&amp;rdquo;. Streaming is not a better version of batch. It is a different contract, different cost, different mental model. When the decision is taken by trend instead of by SLA, the team spends months building complexity the problem never asked for, and I have walked into that trap more than once.&lt;/p&gt;
&lt;h2&gt;What goes wrong when the flow is ignored&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-goes-wrong-when-the-flow-is-ignored&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-goes-wrong-when-the-flow-is-ignored&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Knight Capital was not an isolated accident. The pattern repeats at other scales.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GitHub, October 2018&lt;/strong&gt;: 24-hour outage. Root cause documented by Jason Warner (official post-mortem): 43 seconds of network partition between US East data centers caused divergence in MySQL Orchestrator failover, replication storm and cross-DC inconsistency. Pure data flow failure at the replication layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Airbnb, before Minerva&lt;/strong&gt;: different teams calculated &amp;ldquo;active user&amp;rdquo; with divergent queries on the same Spark cluster. Metrics collided in executive meetings. The fix was not another dashboard, it was a single metric definition layer with explicit lineage from source to destination. Minerva indexes over 200K data assets today.&lt;/p&gt;
&lt;p&gt;These cases fit named patterns in the literature. Worth knowing each:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pipeline jungle&lt;/strong&gt; (Sculley et al, NeurIPS 2015, &lt;em&gt;Hidden Technical Debt in Machine Learning Systems&lt;/em&gt;): &lt;em&gt;&amp;ldquo;pipeline jungles often appear as data preparation evolves organically&amp;hellip; testing such pipelines requires expensive end-to-end integration tests.&amp;rdquo;&lt;/em&gt; That is what happens when no one drew the flow at the start and it grew by accretion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data swamp&lt;/strong&gt; (Nick Heudecker, Gartner 2014): &lt;em&gt;&amp;ldquo;lakes turn into swamps when there is no metadata, governance, or quality control.&amp;rdquo;&lt;/em&gt; Lake became a folder of files dumped anywhere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema drift&lt;/strong&gt;: fields change without warning between runs, downstream contracts break silently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lineage gaps&lt;/strong&gt;: nobody knows where the dashboard number came from.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reverse-ETL chaos&lt;/strong&gt;: data flows back from the warehouse to SaaS without governance, becomes a secret source of truth no one audits.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;How the big ones document their own flow&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;how-the-big-ones-document-their-own-flow&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#how-the-big-ones-document-their-own-flow&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Companies running real data in production publish the architecture. Worth reading.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Company&lt;/th&gt;
          &lt;th&gt;Doc&lt;/th&gt;
          &lt;th&gt;Anchor&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Netflix&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;em&gt;Maestro: Netflix&amp;rsquo;s Workflow Orchestrator&lt;/em&gt; (TechBlog, Jul 2024)&lt;/td&gt;
          &lt;td&gt;Orchestrates hundreds of thousands of workflows per day, WAP (Write-Audit-Publish) pattern over Iceberg&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Uber&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;em&gt;Uber&amp;rsquo;s Big Data Platform&lt;/em&gt; (Eng Blog, Oct 2018)&lt;/td&gt;
          &lt;td&gt;Hudi cut ingestion latency from 24h to under 1h on 100+ PB&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Airbnb&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;em&gt;Democratizing Data at Airbnb&lt;/em&gt; (May 2017)&lt;/td&gt;
          &lt;td&gt;Dataportal indexes 200K+ data assets with explicit lineage&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Stripe&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;em&gt;Online migrations at scale&lt;/em&gt; (Eng Blog, Feb 2017)&lt;/td&gt;
          &lt;td&gt;Dual-write + backfill + reconciliation to migrate financial data without loss&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;&lt;em&gt;How We Built Slack&amp;rsquo;s Data Warehouse&lt;/em&gt; (Sep 2023)&lt;/td&gt;
          &lt;td&gt;Presto+Hive to Trino+Iceberg migration, 60K queries per day&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Common pattern: each one documented the flow before building the next tool. The tool was born from the diagram, not the other way around.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/30/data-flows-ep01/images/03-quote-ferramenta.png&#34; alt=&#34;Quote: the tool was born from the diagram, not the other way around&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;Anti-patterns to avoid&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;anti-patterns-to-avoid&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#anti-patterns-to-avoid&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Forcing streaming because it sounds modern&lt;/strong&gt;. If the SLA is daily, batch solves it with 10% of the complexity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Building a pipeline without drawing the flow first&lt;/strong&gt;. Pipeline jungle is literally this: growing without a map.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accepting the lake as &amp;ldquo;throw it all in, I will organize later&amp;rdquo;&lt;/strong&gt;. Becomes a swamp in 6 months.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ignoring schema contracts&lt;/strong&gt;. Schema drift breaks downstream silently. Use Schema Registry or versioned SQL contracts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keeping two parallel implementations (Lambda)&lt;/strong&gt;. Maintenance cost doubles, behaviors diverge, no one trusts either.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skipping lineage&lt;/strong&gt;. Lineage is not a luxury. It is the only way to answer &amp;ldquo;where did this number come from&amp;rdquo; without opening 12 jobs.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Where to start&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-to-start&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-to-start&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Can you draw, on a napkin, the data flow of your most critical pipeline? Exact source, main transformations, destinations, SLA per stage.&lt;/p&gt;
&lt;p&gt;If yes, you are ahead of most. If not, start there. Before Spark, before dbt, before any new tool.&lt;/p&gt;
&lt;p&gt;The next episodes of the Zero to Expert series will go into each layer in depth: ingestion (formats, idempotency, CDC), transformation (SQL vs Python vs Spark), destination (warehouse vs lake vs lakehouse), orchestration. Each episode with a concrete case and a decision at the center, not theory.&lt;/p&gt;
&lt;p&gt;If there is a specific concept you want covered, send it to me on &lt;a href=&#34;https://linkedin.com/in/thaisvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or subscribe to the &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;newsletter&lt;/a&gt; to get the next episodes.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>SLA, not trend: when batch, when streaming, when both</title>
      <link>https://vazdeng.pages.dev/en/2026/05/30/sla-not-trend-when-batch-when-streaming-when-both/</link>
      <pubDate>Sat, 30 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/30/sla-not-trend-when-batch-when-streaming-when-both/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/30/sla-nao-moda-batch-streaming/cover_hu_ecfe8a72c8fe9d0c.png" width="640" height="336"/>]]>
        
        &lt;p&gt;I watched a marketing team do what every team does once: adopt streaming because it sounded modern. Managed Kafka, 24x7 workers, exactly-once guarantees. To process events arriving every 10 minutes. Nightly batch would solve the same. It cost a tenth. It took six months until someone measured.&lt;/p&gt;
&lt;p&gt;The pattern repeats. I have walked through the same decision in four different domains: finance pipelines, industrial processes, marketing, analytics. The discussion always starts wrong. &amp;ldquo;Let&amp;rsquo;s go streaming because it is more modern.&amp;rdquo; Or &amp;ldquo;let&amp;rsquo;s keep batch because it is what we always did.&amp;rdquo; Both miss the right question.&lt;/p&gt;
&lt;p&gt;The right question is one: what is the real SLA of the consumer that will use this data?&lt;/p&gt;
&lt;h2&gt;The right question is not &amp;ldquo;which is more modern&amp;rdquo;&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-right-question-is-not-which-is-more-modern&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-right-question-is-not-which-is-more-modern&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Martin Kleppmann formalizes in chapter 11 of &lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt; the distinction that organizes any data architecture in 2026. &lt;strong&gt;Bounded&lt;/strong&gt; data (finite set, known size) versus &lt;strong&gt;unbounded&lt;/strong&gt; (a stream that never ends). Every decision starts there.&lt;/p&gt;
&lt;p&gt;But the bounded/unbounded distinction is technical, not behavioral. Real data is rarely just one thing. Application logs are unbounded by nature. If I aggregate them in 1-hour batches to feed a dashboard nobody looks at more than once an hour, the consumer is treating it as bounded. Data is what the consumption decides.&lt;/p&gt;
&lt;p&gt;Tyler Akidau and team at Google published in 2015 the paper that became the industry standard, &lt;em&gt;The Dataflow Model&lt;/em&gt;. The central line:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;A practical approach to balancing the inherent tension between correctness, latency, and cost in massive-scale, unbounded, out-of-order data.&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;Translation: streaming is right on three variables at the same time. Correctness, latency and cost. You pick two, you pay for the third. Batch is simpler precisely because it does not try to optimize latency.&lt;/p&gt;
&lt;h2&gt;Decision table: SLA × technology&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;decision-table-sla--technology&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#decision-table-sla--technology&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/30/sla-nao-moda-batch-streaming/images/01-tabela-sla-tecnologia.png&#34; alt=&#34;Visual table SLA versus recommended technology&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;For most pipelines I see, the table above resolves the decision in 30 seconds. SLA above 1 hour is batch territory. SLA below 1 minute requires streaming. The middle is micro-batch, and most cases land there, not at the extremes.&lt;/p&gt;
&lt;h2&gt;When batch wins (even in 2026)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-batch-wins-even-in-2026&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-batch-wins-even-in-2026&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Spotify runs recommendations in nightly batch on BigQuery. Netflix has Maestro orchestrating hundreds of thousands of workflows per day with the Write-Audit-Publish pattern over Iceberg. Neither is &amp;ldquo;late&amp;rdquo;. They chose batch where batch solves better.&lt;/p&gt;
&lt;p&gt;Batch wins when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Consumer SLA is hourly or daily (accounting report, closing, historical snapshot, ML training)&lt;/li&gt;
&lt;li&gt;Input data is stable enough that you can reprocess whenever you want&lt;/li&gt;
&lt;li&gt;Your team has more ease debugging Python that runs once a night than a 24x7 stream processor&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Cost matters a lot. A nightly batch Spark cluster stays off during the day. Infrastructure when no job is running: zero. Managed Kafka is always on. Confluent Cloud Standard starts at $1k to $3k per month, and egress can hit $47k per month at 300 MiB/s outbound. The difference over a year is the salary of a mid-level engineer in Curitiba.&lt;/p&gt;
&lt;h2&gt;When streaming is the only answer&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-streaming-is-the-only-answer&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-streaming-is-the-only-answer&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Pix has an SLA under 10 seconds, 24x7. BACEN publishes this. Daily batch does not work. Not optional. Point-of-sale fraud detection is the same: either identify before the transaction closes or it serves nothing. Call center ops dashboard, same logic: the agent needs to see the customer updated the moment they answer.&lt;/p&gt;
&lt;p&gt;These cases do not allow batch. Streaming is the only answer.&lt;/p&gt;
&lt;p&gt;For them, Flink delivers latency under 100 milliseconds. Spark Structured Streaming sits at 100 milliseconds to 1 second (micro-batch). Kafka Streams runs embedded in the application, without its own cluster, and processes around 1 million events per second. Choosing between the three is another post.&lt;/p&gt;
&lt;p&gt;Uber is the most interesting case. Adopted streaming without going 100% streaming. Added Hudi for incremental processing and brought ingestion latency from 24 hours to under 1 hour on more than 100 PB. Their Flink IngestionNext consumes 25% less compute than the old batch. Streaming done right also saves, as long as it solves the right problem.&lt;/p&gt;
&lt;h2&gt;When &amp;ldquo;both&amp;rdquo; is the right answer&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-both-is-the-right-answer&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-both-is-the-right-answer&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Jay Kreps published in 2014 the essay that killed Lambda Architecture. Lambda keeps two parallel pipelines to produce the same result: one batch and reliable, one streaming and fast. The line that stuck:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems.&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/30/sla-nao-moda-batch-streaming/images/02-quote-kreps-lambda.png&#34; alt=&#34;Kreps quote on Lambda Architecture&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Kreps proposed Kappa: single log (Kafka) as source of truth, reprocessing via replay. Batch becomes a special case of streaming over the history.&lt;/p&gt;
&lt;p&gt;Lakehouse was a step further. The Databricks 2021 paper proposes a metadata layer (Delta, Iceberg, Hudi) that serves both natures. The same data can be consumed in batch by the BI team and in streaming by the fraud application. No 2 stacks. One contract.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Both&amp;rdquo; is not technical cowardice. It is conscious design when you have consumers with different SLAs over the same data.&lt;/p&gt;
&lt;h2&gt;Questions that decide the case&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;questions-that-decide-the-case&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#questions-that-decide-the-case&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before opening Terraform or docker-compose, answer this honestly:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What is the real SLA of the consumer that will read this data?&lt;/strong&gt; Not the SLA you imagine. What they actually need.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is this SLA different per consumer?&lt;/strong&gt; If yes, consider Lakehouse with a single contract, not 2 parallel pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How much does it cost to run 1 month of streaming vs batch at this volume?&lt;/strong&gt; Do the math before, not after the invoice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does your team have maturity to debug exactly-once, watermarks and distributed state?&lt;/strong&gt; If not, the learning cost comes embedded in the project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Do you already have batch or streaming infrastructure running?&lt;/strong&gt; Reusing reduces risk. Greenfield lets you pick better.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/30/sla-nao-moda-batch-streaming/images/03-decision-tree-5-perguntas.png&#34; alt=&#34;Decision tree visual of the questions&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you answered honestly and still landed on streaming, great. Streaming makes sense. If you landed on batch, great too. Batch solves most cases.&lt;/p&gt;
&lt;p&gt;The mistake is not picking streaming. The mistake is picking streaming without answering them.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Which pipeline did you pick wrong and had to redo later? Tell me on &lt;a href=&#34;https://linkedin.com/in/thaisvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or reply to this email. I want to see how many cases match.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Airflow for 2 years: what I would do differently</title>
      <link>https://vazdeng.pages.dev/en/2026/05/24/airflow-for-2-years-what-i-would-do-differently/</link>
      <pubDate>Sun, 24 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/24/airflow-for-2-years-what-i-would-do-differently/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/24/airflow-2-anos-o-que-faria-diferente/cover_hu_f193dea45ae9e3c3.png" width="640" height="336"/>]]>
        
        &lt;p&gt;It was 2 a.m. when the alert came. The monthly report DAG had failed on step 8 of 12. Financial data, 6 a.m. deadline, and I spent the next 4 hours trying to understand if the task really failed, if it was a silent timeout, or if the worker had died without telling anyone. When I found out it was the third, 40 minutes were left.&lt;/p&gt;
&lt;p&gt;This scenario is routine in teams running Airflow in production. Airflow works. And it also creates work nobody warns you about in the first tutorial.&lt;/p&gt;
&lt;p&gt;This post is not to convince anyone to drop Airflow. It is about what is worth changing before the problem shows up.&lt;/p&gt;
&lt;h2&gt;Context: what it is and who uses it&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;context-what-it-is-and-who-uses-it&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#context-what-it-is-and-who-uses-it&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Airflow was created by Maxime Beauchemin at Airbnb in October 2014 to orchestrate data pipelines with complex dependencies. It went open source in June 2015 and became an Apache Foundation top-level project in January 2019.&lt;/p&gt;
&lt;p&gt;It is today the most used data orchestrator in the world: 320 million downloads in 2024 alone, ten times more than the second place. Uber runs 200,000 pipelines with 750,000 task runs per day. Shopify has 10,000 active DAGs. Stripe processes 150,000 daily tasks.&lt;/p&gt;
&lt;p&gt;Real adoption, not hype.&lt;/p&gt;
&lt;p&gt;But the same report that shows those numbers also reveals that 46% of users say that when Airflow has a problem, the entire operation stops. That is the tension nobody tells you about in the first tutorial.&lt;/p&gt;
&lt;h2&gt;What Airflow solves well&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-airflow-solves-well&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-airflow-solves-well&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Dependencies between tasks are guaranteed.&lt;/strong&gt; You define the graph in Python. Airflow guarantees that task B only runs when task A finishes successfully. With 50 interdependent tasks in a finance pipeline, having that guaranteed by an orchestrator avoids rewriting retry and dependency logic in every DAG, and removes the whole category of &amp;ldquo;task ran before time because cron fired&amp;rdquo; bugs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Retry with backoff is native.&lt;/strong&gt; Two lines and your task retries automatically. In pipelines depending on unstable external APIs, this kills 2 a.m. alerts for transient errors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The execution history is auditable.&lt;/strong&gt; Every run, every task, every log gets recorded. When compliance asks &amp;ldquo;was the March report generated with data from 03/31 or 04/01&amp;rdquo;, you open Airflow and answer in seconds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backfill works.&lt;/strong&gt; Pipeline down for three days? You reprocess the historical runs with one command. For pipelines that need complete and consistent history, that matters.&lt;/p&gt;
&lt;h2&gt;Where Airflow gets complicated&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-airflow-gets-complicated&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-airflow-gets-complicated&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/24/airflow-2-anos-o-que-faria-diferente/images/01-diagrama-arquitetura.png&#34; alt=&#34;Diagram of Airflow’s 3 components and where each problem happens&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h3&gt;The scheduler parses your whole code every 30 seconds&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-scheduler-parses-your-whole-code-every-30-seconds&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-scheduler-parses-your-whole-code-every-30-seconds&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The scheduler needs to run the Python code of each DAG file repeatedly to understand what exists and what the dependencies are. With 200 DAGs, that parse cycle can take minutes.&lt;/p&gt;
&lt;p&gt;What makes it critical: 98% of scheduler slowness cases come from heavy imports at the module level. A file that does &lt;code&gt;import pandas as pd&lt;/code&gt; at the top, outside any function, makes the scheduler run that import every cycle. With 200 DAGs and heavy imports, that becomes minutes of parsing before any task runs.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Wrong: pandas is imported every scheduler cycle&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nd&#34;&gt;@dag&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;pipeline&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Right: import only when the task runs&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nd&#34;&gt;@task&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;process&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pandas&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;pd&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;o&#34;&gt;...&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/24/airflow-2-anos-o-que-faria-diferente/images/02-antes-depois-imports.png&#34; alt=&#34;Visual comparison: pandas import at the top of the DAG vs inside the task&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h3&gt;XCom has a hard limit nobody warns you about&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;xcom-has-a-hard-limit-nobody-warns-you-about&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#xcom-has-a-hard-limit-nobody-warns-you-about&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;XCom is Airflow&amp;rsquo;s mechanism for tasks to communicate. The problem: it was designed for small messages, not data.&lt;/p&gt;
&lt;p&gt;In PostgreSQL, the default row limit is 8KB. A 1,000-row DataFrame will blow up XCom. In production, the error shows up as a timeout or silent crash of the metadata database, not as a clear &amp;ldquo;data too big&amp;rdquo; message.&lt;/p&gt;
&lt;p&gt;The pattern used in production: pass only the S3 path via XCom, never the data itself.&lt;/p&gt;
&lt;h3&gt;catchup=True has already triggered unwanted backfills in many teams&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;catchuptrue-has-already-triggered-unwanted-backfills-in-many-teams&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#catchuptrue-has-already-triggered-unwanted-backfills-in-many-teams&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;By default in old versions, if you redeploy a DAG with &lt;code&gt;start_date&lt;/code&gt; in the past and &lt;code&gt;catchup=True&lt;/code&gt;, Airflow will create and try to execute every historical run since &lt;code&gt;start_date&lt;/code&gt;. With a monthly DAG and &lt;code&gt;start_date&lt;/code&gt; two years ago, that is 24 runs fired at once.&lt;/p&gt;
&lt;p&gt;DoubleVerify documented that after migrating to a setup with &lt;code&gt;catchup=False&lt;/code&gt; as the cluster default and other changes, incidents dropped 80%.&lt;/p&gt;
&lt;h3&gt;Renaming a DAG loses the whole history&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;renaming-a-dag-loses-the-whole-history&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#renaming-a-dag-loses-the-whole-history&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;There is no rename operation in Airflow. Renaming a DAG creates a new entry in the metadata database and loses the whole execution history. In production, that means you cannot compare current behavior to past behavior, and any alert that depends on history breaks.&lt;/p&gt;
&lt;h3&gt;Business logic inside the operator becomes a problem later&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;business-logic-inside-the-operator-becomes-a-problem-later&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#business-logic-inside-the-operator-becomes-a-problem-later&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;The temptation is to put transformations and business rules directly inside &lt;code&gt;PythonOperator&lt;/code&gt;. Works in the beginning. After six months, you have untestable logic stuck inside infrastructure, the same rule duplicated across three different operators, and a DAG you can only debug by bringing up the whole Airflow.&lt;/p&gt;
&lt;p&gt;The right pattern: the operator is infrastructure and calls testable functions that live outside the DAG.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/24/airflow-2-anos-o-que-faria-diferente/images/03-quote-card.png&#34; alt=&#34;Business logic inside the operator becomes a problem later&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;What I would do differently&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-i-would-do-differently&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-i-would-do-differently&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;TaskFlow API from day one.&lt;/strong&gt; Released in Airflow 2.0, it lets you write DAGs with Python decorators instead of instantiating operators manually. The code is cleaner, dependencies are implicit in the flow, and it is easier to test. I spent too long writing in the old style before migrating.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;catchup=False&lt;/code&gt; as the cluster default from initial configuration.&lt;/strong&gt; One line in &lt;code&gt;airflow.cfg&lt;/code&gt; that avoids dozens of incidents.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Resource pools from the first DAG.&lt;/strong&gt; By default Airflow does not limit how many tasks of a DAG run in parallel. A heavy DAG can consume all the slots and block the others. Configure pools before the first problem, not after.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No multi-tenant on a single instance.&lt;/strong&gt; Sharing one Airflow instance between different teams creates Python dependency conflicts, lack of resource isolation, and upgrade paralysis: one team cannot update without coordinating with all the others. One instance per team is the recommended pattern.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monitor the scheduler, not just the tasks.&lt;/strong&gt; The scheduler is the heart of Airflow and can degrade silently. Grafana on the scheduler heartbeat catches problems before the tasks start failing.&lt;/p&gt;
&lt;h2&gt;About Airflow 3.0&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;about-airflow-30&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#about-airflow-30&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;In April 2025 Airflow released version 3.0, the biggest release in the project&amp;rsquo;s history. It solves problems the community documented for years: Task Execution API that removes the need for workers to access the metadata database directly, native DAG Versioning, rebuilt React UI, and support for tasks in languages beyond Python.&lt;/p&gt;
&lt;p&gt;If you are starting a new project, evaluate Airflow 3.0 before picking the version to install. The changes are breaking, so migrating an existing cluster takes planning.&lt;/p&gt;
&lt;h2&gt;When to evaluate alternatives&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-to-evaluate-alternatives&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-to-evaluate-alternatives&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Airflow has 320 million downloads for a reason: it works, has the biggest integration ecosystem in the market, and the community is vast.&lt;/p&gt;
&lt;p&gt;But there are cases where other tools solve it better:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prefect or Dagster&lt;/strong&gt; for smaller teams that value simple local development, event-driven workflows, and richer observability without the operational overhead of Airflow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;dbt Cloud&lt;/strong&gt; when most pipelines are SQL transformations in a warehouse. Native orchestration is simpler for that specific case.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Managed Airflow&lt;/strong&gt; (Astronomer, Amazon MWAA, Google Cloud Composer) if the cost fits and you do not want to maintain the infrastructure. Removes a significant chunk of the operational pain.&lt;/p&gt;
&lt;p&gt;What does not pay off is picking by popularity without evaluating whether the problem Airflow solves is your problem.&lt;/p&gt;
&lt;h2&gt;What stays&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-stays&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-stays&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Airflow works well for what it was made for: orchestrating batch pipelines with complex dependencies, auditable history and reliable retries.&lt;/p&gt;
&lt;p&gt;The problems I ran into were almost all avoidable with the right configuration from the start: imports outside functions, XCom for big data, catchup without control, business logic inside operators.&lt;/p&gt;
&lt;p&gt;If you are starting: imports inside functions, &lt;code&gt;catchup=False&lt;/code&gt; on the cluster, XCom only for coordination, business logic in separate testable modules. Four decisions that avoid most of the problems I ran into.&lt;/p&gt;
&lt;p&gt;What was the most annoying problem you have seen with Airflow? Tell me on &lt;a href=&#34;https://linkedin.com/in/thaisvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or subscribe to the &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;newsletter&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Twenty AI concepts you need to understand in 2026</title>
      <link>https://vazdeng.pages.dev/en/2026/05/23/twenty-ai-concepts-you-need-to-understand-in-2026/</link>
      <pubDate>Sat, 23 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/23/twenty-ai-concepts-you-need-to-understand-in-2026/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/23/20-conceitos-ia-2026/cover_hu_e9316444b961433b.png" width="640" height="336"/>]]>
        
        &lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/23/20-conceitos-ia-2026/images/infografo.png&#34; alt=&#34;Twenty AI concepts you need to understand in 2026&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Every week a new AI term shows up. Agent, RAG, fine-tuning, embedding, top-p, RLHF. You open LinkedIn and three people are already &amp;ldquo;building autonomous agents&amp;rdquo; before breakfast. Over on Twitter someone complains their RAG hallucinates while the next post debates whether it&amp;rsquo;s worth fine-tuning Llama 3. Then you head to the API docs you were going to test for something simple and walk into a hundred-word glossary before the first useful call.&lt;/p&gt;
&lt;p&gt;The problem isn&amp;rsquo;t the number of terms. It&amp;rsquo;s that nobody stops to draw how they connect.&lt;/p&gt;
&lt;p&gt;This infographic is my attempt at a map. Twenty concepts, six sections, a sequence that makes sense if you go from the base to the frontier. It&amp;rsquo;s nowhere near everything that exists in AI. But you can open it in a technical meeting in 2026 and understand what people are talking about, or read the code of an agentic system and identify what each piece is doing in the flow.&lt;/p&gt;
&lt;h2&gt;How AI works (1 to 4)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;how-ai-works-1-to-4&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#how-ai-works-1-to-4&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;It all starts with &lt;strong&gt;neural networks&lt;/strong&gt;. Layers of neurons connected by weights, adjusted during training to make predictions. That&amp;rsquo;s the only primitive of all this. Models that see images, models that write text, models that understand audio: all variations of the same thing, with different architectural choices on top.&lt;/p&gt;
&lt;p&gt;For language to enter that network, it needs to become a number. That&amp;rsquo;s what &lt;strong&gt;tokenization&lt;/strong&gt; does: break text into chunks the model can chew on. AI doesn&amp;rsquo;t read words. It reads tokens. Then each token becomes a vector in a space of hundreds of dimensions, and that&amp;rsquo;s an &lt;strong&gt;embedding&lt;/strong&gt;. Similar meanings sit close together. It&amp;rsquo;s what makes semantic search, recommendation, and RAG work.&lt;/p&gt;
&lt;p&gt;On top of those three comes &lt;strong&gt;attention&lt;/strong&gt;. The mechanism that lets each word look at every other word in the input and decide what matters to it. Before attention, models read text in sequence and forgot the beginning by the middle of the sentence. Attention broke that bottleneck. Without it, the rest of contemporary AI simply wouldn&amp;rsquo;t exist in the form we know today.&lt;/p&gt;
&lt;h2&gt;The magic behind it (5 to 8)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-magic-behind-it-5-to-8&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-magic-behind-it-5-to-8&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Transformers&lt;/strong&gt; are the architecture that packaged attention into something trainable in parallel. Before them, language models were slow and short. After them, they became GPT, Claude, Gemini.&lt;/p&gt;
&lt;p&gt;But architecture without data is nothing. &lt;strong&gt;Pre-training&lt;/strong&gt; is the phase where the model reads the equivalent of the Library of Alexandria. Trillions of tokens. This is where it absorbs syntax, grammar, facts about the world, and the patterns of reasoning that humans left in writing. &lt;strong&gt;Fine-tuning&lt;/strong&gt; is what comes next: take that general model and specialize it on specific tasks with specific data. And &lt;strong&gt;RLHF&lt;/strong&gt; is the stage that took models that could answer anything and taught them to answer in a way that&amp;rsquo;s actually useful to someone. Real people compare outputs, say which one is better, the model learns the preference. It&amp;rsquo;s what separates &amp;ldquo;a model that knows a lot&amp;rdquo; from &amp;ldquo;a model that converses well.&amp;rdquo;&lt;/p&gt;
&lt;h2&gt;Beyond the models (9 to 12)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;beyond-the-models-9-to-12&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#beyond-the-models-9-to-12&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;No model goes to production on its own. Around it sits a layer of &lt;strong&gt;safeguards&lt;/strong&gt;: filters and classifiers built on explicit rules, to keep the system from saying something that hurts someone or reproduces an obvious bias. That&amp;rsquo;s the boring part nobody wants to build and that every serious product needs to have.&lt;/p&gt;
&lt;p&gt;And when the model needs to know something that wasn&amp;rsquo;t in pre-training, in comes &lt;strong&gt;RAG&lt;/strong&gt;. Retrieval-Augmented Generation. The system fetches relevant documents, injects them into the context, and the model answers grounded in them. RAG depends on two close relatives: &lt;strong&gt;vector databases&lt;/strong&gt; (which store embeddings in a way that lets you find the nearest match in milliseconds) and &lt;strong&gt;chunking&lt;/strong&gt; (which breaks large documents into indexable pieces). RAG without good chunking is RAG that hallucinates elegantly.&lt;/p&gt;
&lt;h2&gt;How AI generates output (13 to 14)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;how-ai-generates-output-13-to-14&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#how-ai-generates-output-13-to-14&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When the model answers, it doesn&amp;rsquo;t write the whole sentence at once. It predicts one token, then the next, then the next. That&amp;rsquo;s &lt;strong&gt;decoding&lt;/strong&gt;. And how it picks each next token completely changes the character of the output. High &lt;strong&gt;temperature&lt;/strong&gt; gives creativity and variation. Low &lt;strong&gt;top-p&lt;/strong&gt; sharpens focus on the most likely tokens. Tuning these two parameters is the difference between a model that writes poetry and a model that writes technical documentation.&lt;/p&gt;
&lt;h2&gt;How AI acts (15 to 16)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;how-ai-acts-15-to-16&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#how-ai-acts-15-to-16&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Up to here the model only responds. &lt;strong&gt;Agents&lt;/strong&gt; are the next step: it decides and acts. Receives an objective, breaks it into steps, picks which tool to use, executes, observes the result, adjusts the next step. &lt;strong&gt;Tools and functions&lt;/strong&gt; are the hands we give to that agent: API, calculator, search, code execution, database access. Without them, the agent gets stuck in its own head talking to itself. The part that actually matters about agentic systems starts when the model can finally call something that changes state in the real world.&lt;/p&gt;
&lt;h2&gt;Improvement and evaluation (17 to 20)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;improvement-and-evaluation-17-to-20&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#improvement-and-evaluation-17-to-20&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Agentic systems without explicit &lt;strong&gt;planning&lt;/strong&gt; turn into chaos fast. Without rigorous &lt;strong&gt;evaluation&lt;/strong&gt;, any claim about the model just became cheerleading. &lt;strong&gt;Iterative improvement&lt;/strong&gt; is what separates a pretty prototype from a system that survives in production: test, measure, adjust, repeat. And &lt;strong&gt;bias and fairness&lt;/strong&gt; has an inconvenient property: if you ignore it at design time, it will find you in the incident.&lt;/p&gt;
&lt;h2&gt;Closing&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;closing&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#closing&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;AI isn&amp;rsquo;t magic. It&amp;rsquo;s math with data on top, logic around it, and iteration at the center. People who understand these twenty concepts read agentic system architecture without getting lost in the glossary. They can debug weird model behavior from real hypotheses instead of guesses. And in a technical conversation, they speak like someone who took part in the build, not like someone who read the release.&lt;/p&gt;
&lt;p&gt;Take the infographic. Save it on your phone, print it and put it on the wall, drop it in Notion. Come back to it every time a term that feels new shows up. And more important than any of that: build something with it. You only discover what each of these words really means when you try to make a RAG actually work.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;This is the first post in the AI Foundations track at &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;VazDEng&lt;/a&gt;. Three posts a week on data engineering in Portuguese (and English), at the senior level Brazil was missing.&lt;/em&gt;&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>When the model should say &#39;I don&#39;t know&#39;</title>
      <link>https://vazdeng.pages.dev/en/2026/05/17/when-the-model-should-say-i-dont-know/</link>
      <pubDate>Sun, 17 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/17/when-the-model-should-say-i-dont-know/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/17/hmm-aprende-a-abster/cover_hu_3e18e146efd31b6a.png" width="640" height="336"/>]]>
        
        &lt;p&gt;In September 1998, Long-Term Capital Management lost $4.6 billion in a few weeks. The spread models had been trained on normal-times correlations. The Russian default and the subsequent flight-to-quality made correlations historically around 0.3 converge to 1 within days. In &lt;em&gt;When Genius Failed&lt;/em&gt;, Lowenstein cites the fund&amp;rsquo;s internal calculation of the probability of what happened:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;&amp;ldquo;An event so freakish as to be unlikely to occur even once over the entire life of the universe.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;The models were technically correct. They were just extrapolating confidence into a region of the space they had never seen. They had no &amp;ldquo;I don&amp;rsquo;t know&amp;rdquo; button.&lt;/p&gt;
&lt;p&gt;My quant agent had the same problem, at incomparably smaller scale but with the same nature. I solved it this week.&lt;/p&gt;
&lt;h2&gt;In one sentence&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;in-one-sentence&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#in-one-sentence&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;blockquote&gt;
  &lt;p&gt;Conservative degradation is the principle that says a model must have the right to abstain. When data is outside what it has seen, returning &amp;ldquo;I don&amp;rsquo;t know&amp;rdquo; is more useful than returning a spurious classification with mathematically high confidence.&lt;/p&gt;

&lt;/blockquote&gt;
&lt;h2&gt;The blind spot left after the data leakage fix&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-blind-spot-left-after-the-data-leakage-fix&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-blind-spot-left-after-the-data-leakage-fix&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The previous post closed the chapter on Sharpe -1.14. The posterior became causal, the data leakage went away, the number became honest. But there was a blind spot the Sharpe didn&amp;rsquo;t show.&lt;/p&gt;
&lt;p&gt;The 3-state Gaussian HMM always classifies. It receives a candle, computes the posterior over BULL/SIDEWAYS/BEAR, and returns the one with highest probability. By construction. If the features are in the normal training zone, fine. If they&amp;rsquo;re completely outside, it keeps classifying, and the posterior keeps summing to 1.&lt;/p&gt;
&lt;p&gt;Concrete scenario: daily ATR 4x above the 90-day average, funding rate in historical extreme negative, volume 10x above normal. A spike the model simply has no reference point for. The HMM returns something like &amp;ldquo;BULL with 73% confidence&amp;rdquo;, because one of the three classes has to win.&lt;/p&gt;
&lt;p&gt;Mathematically legitimate. Operationally dangerous.&lt;/p&gt;
&lt;h2&gt;What the literature calls this&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-the-literature-calls-this&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-the-literature-calls-this&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I looked at the literature before implementing anything. Three threads converge.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Out-of-Distribution detection (computer vision, classical ML).&lt;/strong&gt; The lineage starts with Hendrycks &amp;amp; Gimpel 2017 (&amp;ldquo;A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks&amp;rdquo;), showing that maximum softmax probability is already a reasonable confidence signal. Liang et al 2018 (ODIN) adds temperature scaling and adversarial perturbation, reducing false positive rate from 34.7% to 4.3%. Lee et al 2018 proposes Mahalanobis distance in feature space to capture covariance between dimensions. The three are the OOD canon.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Selective classification (statistics, pattern recognition).&lt;/strong&gt; Chow 1957 already formalized the reject option in IRE Trans. Electronic Computers. In 1970 he derived the optimal error-reject curve. In 2017, Geifman and El-Yaniv brought the concept to deep learning with formal risk guarantee:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;&amp;ldquo;We can achieve a target coverage with a guaranteed level of risk.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;The canonical metric for evaluating abstention is AURC (Area Under Risk-Coverage curve): shows how error falls as the model is allowed to reject more cases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Critical systems with conservative degradation.&lt;/strong&gt; Aviation has explicit regulation (FAA AC 25.1329-1B): autopilot must alert when envelope protection is invoked and disengage in off-nominal conditions. SAE J3016 (autonomous driving) defines Operational Design Domain (ODD) and requires the system to exit operation or request takeover when operating outside it. The principle is the same: a model trained for conditions X does not operate in Y, it alerts and returns control.&lt;/p&gt;
&lt;p&gt;Trading benefits from this vocabulary. It was what was missing.&lt;/p&gt;
&lt;h2&gt;Someone has done this in finance&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;someone-has-done-this-in-finance&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#someone-has-done-this-in-finance&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Two precedents to anchor on.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kritzman and Li 2010&lt;/strong&gt; (&amp;ldquo;Skulls, Financial Turbulence, and Risk Management&amp;rdquo;, Financial Analysts Journal). They define the Turbulence Index as the multivariate Mahalanobis distance of returns against historical mean and covariance. Central quote:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;&amp;ldquo;The more asset returns, volatilities and correlations differ from their historical norms, the more likely it is that these differences result from a significant market event rather than from random noise.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;Empirically the index aligns with 1987, the 1998 Russian default, 9/11, and the 2008 crisis. Turbulence is persistent, which justifies abstaining by windows, not by isolated tick.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Chalkidis et al 2021&lt;/strong&gt; (&amp;ldquo;Trading via Selective Classification&amp;rdquo;, ACM ICAIF, arXiv 2110.14914). This paper is the direct case of what I did. A binary up/down classifier becomes a strategy that only takes a position when it&amp;rsquo;s confident, and abstains when it&amp;rsquo;s not. Empirical result: smaller coverage with same risk improves Sharpe. The abstract&amp;rsquo;s quote:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;&amp;ldquo;Selective classifiers give rise to trading strategies that do not take a trading position when the classifier abstains.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;Selective classification in trading is not my insight. It&amp;rsquo;s a documented topic at ACM. What was missing was bringing it to my HMM.&lt;/p&gt;
&lt;h2&gt;How I implemented it&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;how-i-implemented-it&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#how-i-implemented-it&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The HMM features pass through &lt;code&gt;StandardScaler&lt;/code&gt; before training. In the scaled space, each feature&amp;rsquo;s mean is zero and standard deviation is one. Any new candle with one feature at very high absolute z-score is, by definition, outside the distribution the model has seen.&lt;/p&gt;
&lt;p&gt;Threshold at 5 sigmas (conservative, crypto has fat tails). Static method on &lt;code&gt;MarketRegimeHMM&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nd&#34;&gt;@staticmethod&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;nf&#34;&gt;is_ood&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x_scaled_row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;threshold&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;OOD_SIGMA_THRESHOLD&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;x_scaled_row&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;size&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;kc&#34;&gt;False&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;nb&#34;&gt;bool&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;nanmax&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;np&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;abs&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;x_scaled_row&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;))&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;threshold&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;And &lt;code&gt;predict_state&lt;/code&gt; checks before calling the posterior:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;bp&#34;&gt;self&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;is_ood&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;last_features&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;n&#34;&gt;logger&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;warning&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;OOD detected: max |z| = &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;%.2f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt; &amp;gt; &lt;/span&gt;&lt;span class=&#34;si&#34;&gt;%.1f&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;. Abstaining.&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                   &lt;span class=&#34;n&#34;&gt;max_dev&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;OOD_SIGMA_THRESHOLD&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    &lt;span class=&#34;k&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;REGIME_OOD&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;REGIME_OOD&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;1.0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The downstream decision (&lt;code&gt;decide_position&lt;/code&gt; in layer 4) already had a lookup in &lt;code&gt;REGIME_MULTIPLIER&lt;/code&gt;. I added &lt;code&gt;&amp;quot;OOD&amp;quot;: 0.0&lt;/code&gt; as defense in depth, plus an explicit &amp;ldquo;ABSTAIN&amp;rdquo; log to make it visible whenever the system chose not to operate.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/17/hmm-aprende-a-abster/images/01-gate-ood.png&#34; alt=&#34;The abstention gate: a new candle goes through the max |z| above 5 sigmas check, inside it classifies and operates, outside it returns OOD and zeros the sizing&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;70 tests passed, plus 2 new ones covering the OOD path. Full suite in 6 seconds.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Scenario&lt;/th&gt;
          &lt;th&gt;Before&lt;/th&gt;
          &lt;th&gt;After&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Features inside distribution&lt;/td&gt;
          &lt;td&gt;Classifies BULL/SIDEWAYS/BEAR with real posterior&lt;/td&gt;
          &lt;td&gt;Same&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Features 5+ sigmas outside (rare)&lt;/td&gt;
          &lt;td&gt;Classifies anyway, with spurious posterior&lt;/td&gt;
          &lt;td&gt;Returns OOD, sizing zeros&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Log of the OOD tick&lt;/td&gt;
          &lt;td&gt;No distinction&lt;/td&gt;
          &lt;td&gt;&amp;ldquo;ABSTAIN: regime without playbook (OOD, conf=0.000)&amp;rdquo;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Trade opened in anomalous condition&lt;/td&gt;
          &lt;td&gt;Possible, with 2% cap&lt;/td&gt;
          &lt;td&gt;Impossible&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Why 5 sigmas, not 3&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;why-5-sigmas-not-3&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#why-5-sigmas-not-3&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Threshold choice is where theory meets real crypto data. In perfectly Gaussian features, 3 sigmas would cover 99.73% and be reasonable. Crypto is not Gaussian. Realized volatility, funding rate, and DI spread have heavy tails. Bulla 2011 (Quantitative Finance) already showed that Gaussian HMM underestimates tails in financial returns, proposing Student-t instead.&lt;/p&gt;
&lt;p&gt;At 5 sigmas, the detector fires only when the tick is in genuinely unprecedented region. At 3, it would fire on big but historical moves, generating excessive abstention. The next iteration is to swap univariate z-score for multivariate Mahalanobis (captures correlation between features), which is exactly what Kritzman-Li did in 2010 for returns.&lt;/p&gt;
&lt;h2&gt;What changed in my sleep&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-changed-in-my-sleep&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-changed-in-my-sleep&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The most useful number for me isn&amp;rsquo;t the increase or decrease in Sharpe (I&amp;rsquo;ll measure in backtest next week). It&amp;rsquo;s this:&lt;/p&gt;
&lt;p&gt;Before, when the agent took a position overnight and I woke up with Telegram blinking, I needed to open the auditor and read decision by decision to understand if the model had any logic at that moment or if it was guessing in chaotic market.&lt;/p&gt;
&lt;p&gt;Now, if the system abstains, the log says &lt;code&gt;ABSTAIN&lt;/code&gt;. If it operates, it&amp;rsquo;s because it was in territory it has seen. The question &amp;ldquo;does this decision have a basis?&amp;rdquo; became binary: there&amp;rsquo;s an ABSTAIN log before it, or there isn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Nick Leeson, Jérôme Kerviel, LTCM, Knight Capital. The history of operational losses in finance almost always has the same pattern: a system continuing to make decisions when it shouldn&amp;rsquo;t. The cost of &amp;ldquo;I don&amp;rsquo;t know&amp;rdquo; has always been cheaper than the cost of &amp;ldquo;I thought it was&amp;rdquo;.&lt;/p&gt;
&lt;h2&gt;Anti-patterns to avoid&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;anti-patterns-to-avoid&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#anti-patterns-to-avoid&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Accepting high posterior as evidence of good decision.&lt;/strong&gt; An HMM&amp;rsquo;s posterior always sums to 1. Confidence is intra-model metric, not evidence that the model understands what it&amp;rsquo;s seeing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Using OOD threshold based on intuition, not on distribution.&lt;/strong&gt; 3 sigmas works in pure Gaussian. Crypto is not Gaussian. Measure the real tail of your data first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Abstaining on isolated tick and going back to operating on the next.&lt;/strong&gt; Turbulence is persistent. Good design abstains by window, not by candle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adding OOD without touching the decider.&lt;/strong&gt; A detector that doesn&amp;rsquo;t change downstream behavior is decoration. &lt;code&gt;REGIME_MULTIPLIER&lt;/code&gt; is where the effect happens.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hiding the abstention from the log.&lt;/strong&gt; If the system preferred not to operate, that&amp;rsquo;s a decision. It must appear in the audit trail with reason, not silently.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/17/hmm-aprende-a-abster/images/02-anti-padroes.png&#34; alt=&#34;The 5 abstention anti-patterns: high posterior as evidence, threshold by intuition, abstaining on isolated ticks, OOD without touching the decider, abstention hidden from the log&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;The next chapter&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-next-chapter&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-next-chapter&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The current version uses a single criterion (absolute z-score per feature). Two extensions are already in the backlog: Mahalanobis distance in the full space (captures covariance, which is what Kritzman-Li implemented for returns in 2010) and tick likelihood under the trained HMM (more sensitive, more expensive).&lt;/p&gt;
&lt;p&gt;For now, what&amp;rsquo;s in production is the simple version. And it has already changed what I look at when I wake up.&lt;/p&gt;
&lt;p&gt;Have you ever had a model return high confidence on a decision that shouldn&amp;rsquo;t have been made? Tell me on &lt;a href=&#34;https://linkedin.com/in/thaisvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or subscribe to the &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;newsletter&lt;/a&gt; to receive the next posts.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Instrumenting lineage from scratch with Unity Catalog</title>
      <link>https://vazdeng.pages.dev/en/2026/05/13/instrumenting-lineage-from-scratch-with-unity-catalog/</link>
      <pubDate>Wed, 13 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/13/instrumenting-lineage-from-scratch-with-unity-catalog/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/13/databricks-unity-catalog-lineage/cover_hu_54642d929bdf4bb.png" width="640" height="336"/>]]>
        
        &lt;p&gt;When someone asks me &amp;ldquo;where does this number come from?&amp;rdquo;, I have two possible answers.&lt;/p&gt;
&lt;p&gt;The first is to open the code, manually trace which job read from which table, work out which transformations were applied, and walk back to the source. In pipelines with 20 steps, that can take hours.&lt;/p&gt;
&lt;p&gt;The second is to open Unity Catalog, click on the column in question, and see the full graph: source, transformations, intermediate tables, destination. In seconds.&lt;/p&gt;
&lt;p&gt;That difference is what lineage solves in practice. But Unity Catalog doesn&amp;rsquo;t capture everything automatically. Understanding what it covers and what needs extra work is what separates a real implementation from one that gives a false sense of security.&lt;/p&gt;
&lt;h2&gt;What Unity Catalog captures automatically&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-unity-catalog-captures-automatically&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-unity-catalog-captures-automatically&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Unity Catalog intercepts Spark execution plans at runtime and registers every read and write on metastore tables. No extra code configuration required.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table lineage&lt;/strong&gt; works for any SELECT, CREATE TABLE AS SELECT, INSERT INTO SELECT operation in any language: Python, SQL, Scala, R. For each operation, the system records which table was read, which was written, in which job, in which notebook, by which user, at what time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column lineage&lt;/strong&gt; goes further: it maps which source columns feed which destination columns. Requires Databricks Runtime 11.3 LTS or higher for regular jobs. For Delta Live Tables, requires 13.3 LTS or higher.&lt;/p&gt;
&lt;p&gt;This information is accessible two ways: via Catalog Explorer with a visual interface, and via the system tables &lt;code&gt;system.access.table_lineage&lt;/code&gt; and &lt;code&gt;system.access.column_lineage&lt;/code&gt; for those who need it programmatically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/13/databricks-unity-catalog-lineage/images/unity-catalog-cobertura.png&#34; alt=&#34;Unity Catalog coverage: what it captures automatically vs where most people get it wrong&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;What isn&amp;rsquo;t captured and where most people get it wrong&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-isnt-captured-and-where-most-people-get-it-wrong&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-isnt-captured-and-where-most-people-get-it-wrong&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The official docs are clear but discreet about the limitations. I&amp;rsquo;ve seen these limitations bite in production more than once.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;UPDATE, DELETE, and INSERT VALUES don&amp;rsquo;t generate lineage edges.&lt;/strong&gt; This is the most critical limitation for anyone working with CDC, SCD Type 2, or any pipeline with in-place updates. The data was modified, but Unity Catalog doesn&amp;rsquo;t record that relationship.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MERGE INTO doesn&amp;rsquo;t capture lineage by default.&lt;/strong&gt; It can be enabled with &lt;code&gt;spark.databricks.dataLineage.mergeIntoV2Enabled&lt;/code&gt;, but it requires explicit configuration on each cluster or job.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RDDs aren&amp;rsquo;t supported.&lt;/strong&gt; The Unity Catalog API doesn&amp;rsquo;t work with RDDs, so any pipeline using Spark&amp;rsquo;s low-level API stays completely outside tracking.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Renamed objects lose history permanently.&lt;/strong&gt; If you rename a table, schema, or catalog, historical lineage breaks. There&amp;rsquo;s no automatic migration of the graph when the object changes name.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JDBC connections bypass entirely.&lt;/strong&gt; Data read or written via JDBC doesn&amp;rsquo;t pass through Unity Catalog&amp;rsquo;s capture mechanism.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Path-referenced tables (s3://&amp;hellip;) don&amp;rsquo;t capture column lineage.&lt;/strong&gt; Table lineage via path works, but column mapping doesn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;And a practical detail: system tables only have data starting September 2024. If you need lineage history before that date, it doesn&amp;rsquo;t exist in the system tables.&lt;/p&gt;
&lt;h2&gt;Multi-hop lineage: what Catalog Explorer doesn&amp;rsquo;t show&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;multi-hop-lineage-what-catalog-explorer-doesnt-show&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#multi-hop-lineage-what-catalog-explorer-doesnt-show&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The Catalog Explorer visualizer shows only one hop in each direction: one upstream table and one immediate downstream table. If the data went through five transformations, you only see the adjacent one.&lt;/p&gt;
&lt;p&gt;To trace the full chain, the approach is iterative queries on the system tables:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;-- Find all ancestors of a table (multi-hop)
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;WITH&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;RECURSIVE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lineage&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;SELECT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;source_table_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;target_table_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;as&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;hop&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;FROM&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;access&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;table_lineage&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;WHERE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;target_table_name&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;my_gold_table&amp;#39;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;UNION&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ALL&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;SELECT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;source_table_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tl&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;target_table_name&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lineage&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;hop&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;FROM&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;system&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;access&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;table_lineage&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tl&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;JOIN&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lineage&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ON&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tl&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;target_table_name&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;l&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;source_table_name&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;SELECT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;*&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;FROM&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;lineage&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ORDER&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;BY&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;hop&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Databricks doesn&amp;rsquo;t support native recursive CTE on system tables. In practice, this needs iterative logic in Python that queries level by level.&lt;/p&gt;
&lt;h2&gt;OpenLineage as a complement&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;openlineage-as-a-complement&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#openlineage-as-a-complement&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;For pipelines that leave the Databricks ecosystem (Airflow orchestrating external jobs, dbt running on a different warehouse, Python scripts with pandas), OpenLineage is the most used alternative to unify cross-platform lineage.&lt;/p&gt;
&lt;p&gt;OpenLineage integrates via &lt;code&gt;OpenLineageSparkListener&lt;/code&gt; and captures lineage from S3, GCS, JDBC, Redshift, and BigQuery. The integration exists, but has documented bugs with Databricks Spark 3.4+: generated payloads sometimes contain only inputs without outputs, and there are incompatibilities between the OpenLineage Spark 3.3 agent and Databricks&amp;rsquo; 3.4.1 implementation.&lt;/p&gt;
&lt;p&gt;If OpenLineage is critical to your setup, verify version compatibility before going to production.&lt;/p&gt;
&lt;h2&gt;What to instrument manually&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-to-instrument-manually&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-to-instrument-manually&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;To have complete lineage in real pipelines, these are the gaps that need extra work:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BI tools&lt;/strong&gt; (Tableau, Power BI, Looker) need an explicit connector or manual registration via the External Lineage API, which is in Public Preview. The limit is 10,000 external objects and 100,000 relationships per metastore.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;External orchestrators&lt;/strong&gt; (Airflow, Prefect) need integration via API so jobs appear in the lineage graph.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pipelines with extensive UPDATE/DELETE&lt;/strong&gt; need complementary logging via &lt;code&gt;system.query.history&lt;/code&gt; for auditing, since automatic lineage doesn&amp;rsquo;t cover those operations.&lt;/p&gt;
&lt;h2&gt;Where to start from scratch&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-to-start-from-scratch&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-to-start-from-scratch&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;If you&amp;rsquo;re instrumenting lineage for the first time in a Databricks environment:&lt;/p&gt;
&lt;p&gt;First, confirm that clusters and jobs are in workspaces with Unity Catalog enabled. Without it, no automatic capture works.&lt;/p&gt;
&lt;p&gt;Second, validate Databricks Runtime: 11.3 LTS or higher for column lineage in regular jobs. Older projects running on runtimes below that won&amp;rsquo;t have column lineage even with Unity Catalog active.&lt;/p&gt;
&lt;p&gt;Third, map which pipelines extensively use UPDATE/DELETE/MERGE. For those, define from the start what the complementary auditing strategy will be, whether via &lt;code&gt;system.query.history&lt;/code&gt; or via explicit logging in code.&lt;/p&gt;
&lt;p&gt;Fourth, build a validation query that runs weekly against the system tables and checks whether critical tables have lineage registered. Missing lineage on an important table is a sign that something fell outside capture scope.&lt;/p&gt;
&lt;p&gt;Lineage isn&amp;rsquo;t a feature you turn on and forget. I use it as a continuous practice: for every new pipeline, I validate what Unity Catalog captured and what fell outside.&lt;/p&gt;
&lt;p&gt;What part of lineage gives you the most trouble today? Tell me on &lt;a href=&#34;https://linkedin.com/in/thaisvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or subscribe to the &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;newsletter&lt;/a&gt;.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>When Medallion Architecture gets in the way more than it helps</title>
      <link>https://vazdeng.pages.dev/en/2026/05/12/when-medallion-architecture-gets-in-the-way-more-than-it-helps/</link>
      <pubDate>Tue, 12 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/12/when-medallion-architecture-gets-in-the-way-more-than-it-helps/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/12/medallion-architecture-quando-nao/cover_hu_a0e26d2a2c83edec.png" width="640" height="336"/>]]>
        
        &lt;p&gt;There&amp;rsquo;s an architecture pattern I&amp;rsquo;ve watched grow since 2020, created by Databricks, adopted by Microsoft as the official standard for the Fabric platform in 2023, and that today shows up in almost every conversation about data engineering: Medallion Architecture.&lt;/p&gt;
&lt;p&gt;Bronze, Silver, Gold. Raw data, clean data, aggregated data.&lt;/p&gt;
&lt;p&gt;The problem isn&amp;rsquo;t the pattern. The problem is that it became the automatic answer. And when any architecture becomes the automatic answer, it starts creating more problems than it solves.&lt;/p&gt;
&lt;p&gt;Databricks itself is clear in the official docs: &lt;em&gt;&amp;ldquo;Following the medallion architecture is a recommended best practice but not a requirement.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;That rarely shows up in the presentations.&lt;/p&gt;
&lt;h2&gt;What Medallion Architecture actually is&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-medallion-architecture-actually-is&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-medallion-architecture-actually-is&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Databricks defines it like this: a design pattern that organizes data in a lakehouse into layers that &lt;em&gt;progressively improve&lt;/em&gt; the structure and quality of the data, from Bronze to Silver to Gold.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bronze&lt;/strong&gt; stores data exactly as it came from the source, with no transformation. It&amp;rsquo;s the immutable historical archive. If something goes wrong in later layers, you come back here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Silver&lt;/strong&gt; applies the minimum transformation needed to create a consistent enterprise view: cleansing, standardization, deduplication, joins across sources. It&amp;rsquo;s where data becomes trusted information.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Gold&lt;/strong&gt; organizes data for specific consumption: analytics dashboards, ML models, financial reports. Denormalized, optimized for reads, designed for the end user.&lt;/p&gt;
&lt;p&gt;Worth a historical note: the layered pipeline concept isn&amp;rsquo;t new. Data warehousing in the 1990s already used staging, cleansed, and presentation layers. What Databricks created in 2020 was the Bronze/Silver/Gold terminology and the &amp;ldquo;Medallion&amp;rdquo; branding, not the principle itself. That doesn&amp;rsquo;t make the pattern invalid, it just helps separate innovation from naming.&lt;/p&gt;
&lt;h2&gt;When Medallion works well&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-medallion-works-well&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-medallion-works-well&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The pattern solves three real problems, and solves them well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;First: reprocessing without loss.&lt;/strong&gt; When a bug shows up in a Silver transformation, you go back to Bronze and reprocess without having to fetch the data from the source again. In systems where the source only keeps the last 90 days of history, that protection can be the difference between fixing a problem and losing two years of data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Second: multiple teams with different needs.&lt;/strong&gt; The analytics team needs monthly totals. The data science team needs the data at the finest grain for model training. Both share Silver, each builds its own Gold layer independently. No duplicated cleansing work, no inconsistency across views.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Third: separation of responsibility in large teams.&lt;/strong&gt; The ingestion team owns Bronze without needing to know business rules. The transformation team owns Silver without depending on the ingestion team. In organizations with more than 20 data professionals working in parallel, this reduces coupling and blockers.&lt;/p&gt;
&lt;p&gt;When these three problems exist, Medallion is a solid choice. When they don&amp;rsquo;t, you&amp;rsquo;re adding complexity without a return.&lt;/p&gt;
&lt;h2&gt;Where Medallion starts to get in the way&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;where-medallion-starts-to-get-in-the-way&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#where-medallion-starts-to-get-in-the-way&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;h3&gt;When there&amp;rsquo;s a single consumer&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-theres-a-single-consumer&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-theres-a-single-consumer&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;You have a pipeline that ingests payroll data to feed a single HR dashboard. One team consuming, one purpose, one transformation.&lt;/p&gt;
&lt;p&gt;Applying Medallion here means creating Bronze, Silver, and Gold to serve exactly the same thing. The data goes through three layers of reads and writes, three sets of jobs to monitor, and three times the latency. For zero gain.&lt;/p&gt;
&lt;p&gt;The practical signal: if Gold is identical to Silver plus one grouping, you don&amp;rsquo;t need three layers. A single direct transformation from source to consumed table does the same work with half the infrastructure.&lt;/p&gt;
&lt;p&gt;A case documented by a data architect: a customer had 4.2 billion rows in Bronze accumulated over six years of data, but Silver only consumed the last 90 days. 97% of stored data was never used. The storage cost was real, the benefit wasn&amp;rsquo;t.&lt;/p&gt;
&lt;h3&gt;When latency matters more than quality&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-latency-matters-more-than-quality&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-latency-matters-more-than-quality&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Each transition Bronze to Silver, Silver to Gold, is a separate job. In Spark pipelines, that&amp;rsquo;s usually 20 to 40 minutes per layer. Three layers in sequence and total latency tops one hour before data reaches anywhere.&lt;/p&gt;
&lt;p&gt;Analyses with real practitioner data show overhead of 53% or more in simple cases: 23 minutes with Medallion versus 15 minutes with direct transformation, for the same result.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/12/medallion-architecture-quando-nao/images/medallion_latency_infographic.png&#34; alt=&#34;Latency comparison: direct transformation 15min vs Medallion 23min&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;When the business needs data in 30 minutes to make a decision, an architecture with 80 minutes of latency isn&amp;rsquo;t a code problem. It&amp;rsquo;s an architecture problem.&lt;/p&gt;
&lt;p&gt;For data that needs to arrive in real time or near it, Databricks is explicit: it recommends micro-batch (latency in seconds to a few minutes) for Medallion, and explicitly advises that when ingestion comes from a message broker like Kafka, reading directly without an intermediate stage reduces complexity and latency. For sub-second, the documentation itself flags limitations in real-time mode that negatively affect throughput.&lt;/p&gt;
&lt;h3&gt;When it&amp;rsquo;s a prototype or short-lived analysis&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-its-a-prototype-or-short-lived-analysis&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-its-a-prototype-or-short-lived-analysis&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;A quick data exploration. A model that will exist for three months. A one-off analysis that will turn into a number on a slide and never be consumed again.&lt;/p&gt;
&lt;p&gt;Forcing Medallion onto a prototype creates tables that will never be maintained, jobs nobody will monitor, and structure that will be abandoned in two weeks. The team spends time and energy organizing what was supposed to be disposable.&lt;/p&gt;
&lt;p&gt;A prototype needs to be quick to build and easy to throw away. Three layers make both harder.&lt;/p&gt;
&lt;h3&gt;When the team is small and the data is simple&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-the-team-is-small-and-the-data-is-simple&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-the-team-is-small-and-the-data-is-simple&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;A startup with 3 data engineers processing 500 GB doesn&amp;rsquo;t have the same problems as a bank with 50 engineers and 50 TB. The operational overhead of maintaining Bronze, Silver, and Gold, with all the tables, jobs, documentation, and monitoring that requires, can be unjustifiable when the real benefit is small.&lt;/p&gt;
&lt;p&gt;For small teams with one or two use cases, two layers (raw data and consumable data) or a solution with dbt directly on the source solve the problem without the extra complexity.&lt;/p&gt;
&lt;h2&gt;The anti-pattern nobody talks about&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-anti-pattern-nobody-talks-about&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-anti-pattern-nobody-talks-about&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I&amp;rsquo;ve seen one specific problem appear more than any other when Medallion doesn&amp;rsquo;t work well: Bronze gets exposed as a data product.&lt;/p&gt;
&lt;p&gt;Elliott Cordo, a data engineer with published work on data architecture, documents this as a direct anti-pattern: exposing the Bronze layer to consumers creates strong coupling between those using the data and the internal details of how it&amp;rsquo;s stored. When the source changes, every consumer breaks together.&lt;/p&gt;
&lt;p&gt;The second documented problem: when Silver is Bronze with a renamed field, and Gold is Silver with a GROUP BY, the intermediate layers add no real value. Analysts end up writing complex SQL in Gold or building parallel spreadsheets to compensate. Multiple teams implement the same metric in different ways, and the numbers start to diverge.&lt;/p&gt;
&lt;p&gt;In those cases, the pattern isn&amp;rsquo;t being applied, it&amp;rsquo;s being imitated.&lt;/p&gt;
&lt;h2&gt;The right question before deciding&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-right-question-before-deciding&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-right-question-before-deciding&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Three questions define whether Medallion is the right architecture:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Are there multiple consumers with different needs?&lt;/strong&gt; If yes, a shared layer between them makes sense. If not, you&amp;rsquo;re creating separation without benefit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is reprocessing data from the source expensive or impossible?&lt;/strong&gt; If yes, immutable Bronze is real protection. If you can reprocess without cost or history loss, the benefit shrinks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Does the latency of each layer fit the deadline the business demands?&lt;/strong&gt; If yes, Medallion works. If not, you need a different architecture for that use case.&lt;/p&gt;
&lt;p&gt;Three &amp;ldquo;yes&amp;rdquo;: Medallion is a solid choice. Two or fewer: worth questioning how many layers you actually need.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/12/medallion-architecture-quando-nao/images/decision-tree.png&#34; alt=&#34;Decision diagram: when to use Medallion Architecture&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;What large companies actually use&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-large-companies-actually-use&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-large-companies-actually-use&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;An important detail that rarely shows up in the discussions: Netflix and Uber, two of the most referenced companies in data engineering, don&amp;rsquo;t use Bronze/Silver/Gold terminology.&lt;/p&gt;
&lt;p&gt;Netflix uses the WAP pattern (Write-Audit-Publish) with Apache Iceberg: data is written to a hidden snapshot, audited automatically, published if approved. The problem solved is the same (quality before exposure), but the implementation is different and doesn&amp;rsquo;t use Medallion&amp;rsquo;s three layers.&lt;/p&gt;
&lt;p&gt;Uber uses a transactional data lake with Apache Hudi, with raw, derived, and aggregated tables. The migration from full batch to incremental ETL cut pipeline time by 82% and cost by 78%, according to the Uber Engineering Blog in March 2023. But those numbers are from incremental ETL, not from the layered pattern itself.&lt;/p&gt;
&lt;p&gt;Microsoft adopted Medallion as Fabric&amp;rsquo;s official architecture in 2023 and is today the largest public case of institutional adoption. Even so, Microsoft&amp;rsquo;s own documentation guides: before building complex pipelines between layers, evaluate Materialized Lake Views, which manage transformations automatically without operational overhead.&lt;/p&gt;
&lt;h2&gt;What stays&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-stays&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-stays&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Medallion Architecture is a good pattern for the right problems: large teams, multiple consumers, critical data that needs protected history and progressive quality.&lt;/p&gt;
&lt;p&gt;It isn&amp;rsquo;t required. It isn&amp;rsquo;t universal. And when applied where it doesn&amp;rsquo;t fit, the cost is real: unnecessary latency, wasted storage, operational complexity without benefit.&lt;/p&gt;
&lt;p&gt;Architecture choices should start from the problem, not from the pattern. What does this pipeline need to solve? Who will consume it? What&amp;rsquo;s the acceptable deadline? Is reprocessing from the source expensive?&lt;/p&gt;
&lt;p&gt;If the answers point to Medallion, great. If they don&amp;rsquo;t, a simpler architecture will work better.&lt;/p&gt;
&lt;p&gt;Have you ever implemented Medallion somewhere it didn&amp;rsquo;t belong? What happened next? Tell me on &lt;a href=&#34;https://linkedin.com/in/thaisvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or subscribe to the &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;newsletter&lt;/a&gt; for the next posts.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>LGPD and ML models: what to do with data that has already become model weights</title>
      <link>https://vazdeng.pages.dev/en/2026/05/02/lgpd-and-ml-models-what-to-do-with-data-that-has-already-become-model-weights/</link>
      <pubDate>Sat, 02 May 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/05/02/lgpd-and-ml-models-what-to-do-with-data-that-has-already-become-model-weights/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/05/02/lgpd-ml-modelos-treinados/cover_hu_739f7067a1194401.png" width="640" height="336"/>]]>
        
        &lt;p&gt;A data subject requested deletion. You deleted the row from the database. And the model?&lt;/p&gt;
&lt;p&gt;The weights of an ML model trained on personal data hold, in a non-explicit form, the contribution of every training record. Deleting the original data doesn&amp;rsquo;t erase that influence. Membership inference research can determine, with some probability, whether a specific CPF was part of a model&amp;rsquo;s training set. That qualifies as personal data under LGPD.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve seen most teams without a process for this scenario. Not for lack of intent: nobody set up the flow before training the first model.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/02/lgpd-ml-modelos-treinados/images/01-deletar-nao-apaga.png&#34; alt=&#34;Deleting the row doesn’t erase the model: the CPF enters training, the influence stays in the production weights, and Article 18 reaches the model&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;What Article 18 actually requires&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-article-18-actually-requires&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-article-18-actually-requires&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Article 18, IV of LGPD grants the data subject the right to request anonymization, blocking, or erasure of data that is &amp;ldquo;unnecessary, excessive, or processed outside compliance.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The interpretation ANPD has been signaling in its public consultations on AI is that ML models are processors of personal data when the training data was personal at the time of processing. The production model inherits that classification.&lt;/p&gt;
&lt;p&gt;If a data subject requested deletion and you can demonstrate that their data was used in training, the right to erasure applies to the model too. Not just to the dataset.&lt;/p&gt;
&lt;p&gt;The law doesn&amp;rsquo;t specify how to execute that erasure. It specifies the expected result: the data subject should no longer have influence over the model&amp;rsquo;s decisions. How you get there is your technical problem.&lt;/p&gt;
&lt;h2&gt;The real technical problem&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-real-technical-problem&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-real-technical-problem&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Three scenarios with different difficulty levels, that I&amp;rsquo;ve seen in practice.&lt;/p&gt;
&lt;p&gt;Genuinely anonymized data before training: if you applied real anonymization, not pseudonymization, before any ML processing, you&amp;rsquo;re outside LGPD&amp;rsquo;s scope for that data. Article 12 is clear: anonymized data isn&amp;rsquo;t personal data. But anonymization needs to be irreversible. K-anonymity with k=3 on financial transactions isn&amp;rsquo;t real anonymization.&lt;/p&gt;
&lt;p&gt;Pseudonymized data in training: you replaced the CPF with a token but kept the mapping. The data remains personal. The model was trained with that data and is now in production. A deletion request activates the full problem.&lt;/p&gt;
&lt;p&gt;Raw data in training, no treatment: the most common scenario in older models, trained before any regulatory concern. Also the hardest to solve.&lt;/p&gt;
&lt;h2&gt;What teams do in practice&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-teams-do-in-practice&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-teams-do-in-practice&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Three reference approaches I use, with real trade-offs, none free.&lt;/p&gt;
&lt;p&gt;Full retraining without the data: you remove the record from the dataset, retrain from scratch or from an earlier checkpoint. It&amp;rsquo;s the cleanest legally, the most defensible in an audit, and the most expensive computationally. For models that take weeks to train, it&amp;rsquo;s impractical as a routine response.&lt;/p&gt;
&lt;p&gt;Selective machine unlearning: techniques that try to remove the influence of specific records without full retraining. SISA training (Sharded, Isolated, Sliced, Aggregated) and gradient-based unlearning reduce cost. The problem: most production implementations still lack formal certification that the erasure was effective. In a dispute with ANPD, &amp;ldquo;we used machine unlearning&amp;rdquo; without measurable evidence doesn&amp;rsquo;t settle it.&lt;/p&gt;
&lt;p&gt;Documenting impracticability and mitigating risk: LGPD allows, in some cases, continued processing when erasure is impossible and there&amp;rsquo;s a residual legal basis. Documenting that the model was trained with data that had a legal basis at the time, that retraining is technically unfeasible, and that mitigation measures were implemented can be the legally defensible answer. This needs legal opinion, not just technical analysis.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/02/lgpd-ml-modelos-treinados/images/02-tres-respostas.png&#34; alt=&#34;Three answers, none free: full retraining, selective machine unlearning, documenting impracticability, pros and cons of each&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;How to architect before training&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;how-to-architect-before-training&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#how-to-architect-before-training&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The right moment to solve this is before the first model goes to production, not after the first deletion request.&lt;/p&gt;
&lt;p&gt;Dataset versioning by data subject: maintain an index of which records were used in which training version. Without that index, you don&amp;rsquo;t even know which models need action when a data subject requests deletion.&lt;/p&gt;
&lt;p&gt;Separation of training data by consent: if part of the dataset came from explicit consent and part from legitimate interest, treat them as separate datasets from the start. When consent is revoked, you know exactly which subset is affected.&lt;/p&gt;
&lt;p&gt;Checkpoints labeled by dataset composition: if you use modular training, keep checkpoints with metadata on which shards were used. That reduces selective retraining cost from weeks to hours.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/05/02/lgpd-ml-modelos-treinados/images/03-antes-de-treinar.png&#34; alt=&#34;Architect before the first training run: dataset versioning by data subject, datasets split by legal basis, checkpoints labeled by composition, retraining drops from weeks to hours&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;The decision every team will have to make&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-decision-every-team-will-have-to-make&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-decision-every-team-will-have-to-make&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The scenario will show up: a data subject sends a deletion request, you delete the data, and someone asks what to do with the credit scoring model that used that CPF in training.&lt;/p&gt;
&lt;p&gt;The honest answer today is: it depends on which model, when it was trained, how the dataset was managed, and what the original legal basis for processing was.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s no longer acceptable is not having the answer. ANPD is building its position on AI and LGPD. Teams that have already documented their architectural decisions will be in a far better position than those improvising when guidance arrives.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Delta Lake or Parquet? You&#39;re asking the wrong question</title>
      <link>https://vazdeng.pages.dev/en/2026/04/30/delta-lake-or-parquet-youre-asking-the-wrong-question/</link>
      <pubDate>Thu, 30 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/04/30/delta-lake-or-parquet-youre-asking-the-wrong-question/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/04/30/delta-lake-vs-parquet/cover_hu_481920f272d5ed99.png" width="640" height="336"/>]]>
        
        &lt;p&gt;The question comes up every week in my team&amp;rsquo;s Slack: &amp;ldquo;should we use Delta Lake or Parquet?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Delta Lake isn&amp;rsquo;t a competing file format to Parquet. It&amp;rsquo;s a transactional management layer that stores data in Parquet files. You aren&amp;rsquo;t choosing between two formats. You&amp;rsquo;re deciding whether you need a transactional layer on top of your files.&lt;/p&gt;
&lt;p&gt;That distinction changes the decision criteria completely. And confusing the two in production costs real money.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/30/delta-lake-vs-parquet/images/01-camada-nao-formato.png&#34; alt=&#34;Delta Lake is not a format: it’s the transactional _delta_log on top of your Parquet files, the two layers together&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;What Parquet doesn&amp;rsquo;t do&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-parquet-doesnt-do&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-parquet-doesnt-do&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Parquet solves one specific problem very well: storing data in a columnar, compressed format that&amp;rsquo;s efficient for analytical reads. It&amp;rsquo;s the right format for that.&lt;/p&gt;
&lt;p&gt;What Parquet doesn&amp;rsquo;t do: concurrency control. If two jobs write to the same partition at the same time, the result is non-deterministic. No transactions, no rollback, no conflict detection. The last writer wins. The other one disappears.&lt;/p&gt;
&lt;p&gt;At a fintech where I worked, with distributed ingestion pipelines, this wasn&amp;rsquo;t theoretical. It was the default scenario every time a streaming job and a backfill job ran together on the same table.&lt;/p&gt;
&lt;p&gt;In pipelines with simultaneous streaming and backfill, the scenario shows up without warning. The symptom is subtle: row counts look right, but values diverge from the previous day with no error in the log. The last writer overwrote the previous one. Silent, no rollback.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/30/delta-lake-vs-parquet/images/02-corrupcao-silenciosa.png&#34; alt=&#34;Silent corruption in pure Parquet: streaming and backfill on the same partition, non-deterministic result, zero errors in the log&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;What Delta Lake adds&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-delta-lake-adds&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-delta-lake-adds&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Delta Lake solves the concurrency problem with &lt;code&gt;_delta_log&lt;/code&gt;: a directory of JSON commits and Parquet checkpoints that records every transaction. Every writer registers what was added, what was removed, and the resulting version. Readers see consistent states, never partial ones.&lt;/p&gt;
&lt;p&gt;That enables four capabilities pure Parquet can&amp;rsquo;t offer:&lt;/p&gt;
&lt;p&gt;UPDATE, DELETE, and MERGE operations without rewriting the entire table. Delta marks affected files as removed and adds new ones. Old data remains accessible via time travel (&lt;code&gt;SELECT * FROM table VERSION AS OF 10&lt;/code&gt;), but doesn&amp;rsquo;t appear in current queries.&lt;/p&gt;
&lt;p&gt;Schema enforcement. If a pipeline tries to write a column with an incompatible type, the write fails before contaminating the table. With pure Parquet, you discover the problem at the consumer, not at the source.&lt;/p&gt;
&lt;p&gt;Controlled compaction via &lt;code&gt;OPTIMIZE&lt;/code&gt;. Streaming ingestion generates dozens of small files per hour. Delta consolidates these fragments without downtime, keeping the transaction log intact.&lt;/p&gt;
&lt;p&gt;Data skipping using min/max statistics per file. In a 2 TB table with 10,000 Parquet files, a date-filtered query potentially has to open every file to check metadata. Delta keeps min/max per column in the log and skips whole files without reading them.&lt;/p&gt;
&lt;h2&gt;When Delta Lake is overkill&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;when-delta-lake-is-overkill&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#when-delta-lake-is-overkill&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Delta Lake has a cost. The &lt;code&gt;_delta_log&lt;/code&gt; adds overhead on small writes. Checkpoints are generated every 10 commits by default. For immutable datasets, that cost has no return.&lt;/p&gt;
&lt;p&gt;Three scenarios where Parquet is the right choice:&lt;/p&gt;
&lt;p&gt;Reference datasets that never change. BACEN code tables, calendar tables, historical data sealed after processing. No concurrent writers, no updates. Pure Parquet, no log overhead.&lt;/p&gt;
&lt;p&gt;Export pipelines to external systems. You&amp;rsquo;re generating files to send to a partner, a legacy system, or an S3 bucket consumed by a tool that doesn&amp;rsquo;t read Delta. Parquet is the interoperability standard.&lt;/p&gt;
&lt;p&gt;Experiments and ephemeral data. A notebook that reads a CSV and saves a result. No need for versioning or transactions. Delta&amp;rsquo;s overhead adds nothing here.&lt;/p&gt;
&lt;h2&gt;The decision in three questions&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-decision-in-three-questions&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-decision-in-three-questions&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Before choosing the format, answer:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Does more than one process write to this table at the same time, or will it in the future? If yes, Delta Lake.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is the data updated, deleted, or subject to audit requirements? If yes, Delta Lake.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Is the table read-only and never modified after writing? Parquet is enough.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Most operational tables in a productive lakehouse answer &amp;ldquo;yes&amp;rdquo; to question one or two. Most lookup tables answer &amp;ldquo;yes&amp;rdquo; to question three.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/30/delta-lake-vs-parquet/images/03-decisao-3-perguntas.png&#34; alt=&#34;The decision in 3 questions: concurrent writes or update/audit requirements lead to Delta Lake; read-only tables stay on Parquet&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In the context of BACEN 521 compliance, which takes effect in October 2026, audit tables for financial transactions need time travel and schema enforcement. Using pure Parquet on those tables isn&amp;rsquo;t just inefficient. It&amp;rsquo;s a regulatory risk.&lt;/p&gt;
&lt;h2&gt;The real architectural decision&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-real-architectural-decision&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-real-architectural-decision&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Delta Lake isn&amp;rsquo;t an improved version of Parquet. It&amp;rsquo;s a different layer that solves a different problem.&lt;/p&gt;
&lt;p&gt;Parquet solves: how to store data efficiently for analytical reads.&lt;/p&gt;
&lt;p&gt;Delta Lake solves: how to guarantee consistency when multiple processes access the same data at the same time.&lt;/p&gt;
&lt;p&gt;The right question isn&amp;rsquo;t &amp;ldquo;which format should I use&amp;rdquo;. It&amp;rsquo;s &amp;ldquo;does this data need transactional control?&amp;rdquo; If it does, Delta Lake. If it doesn&amp;rsquo;t, Parquet. I&amp;rsquo;ve gone both ways across different projects. Picking the wrong one cost me on both sides.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve already hit silent corruption from concurrency in Parquet, or chose Delta on something that later felt excessive, share the context in the comments.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Sharpe Ratio -1.14 is an Engineering Win, Not a Failure</title>
      <link>https://vazdeng.pages.dev/en/2026/04/23/quant-agent-negative-sharpe-engineering/</link>
      <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/04/23/quant-agent-negative-sharpe-engineering/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/04/23/sharpe-negativo-sucesso-engenharia/cover_hu_ccaca7f9ebe447b3.png" width="640" height="336"/>]]>
        
        &lt;p&gt;For 6 months, I built a quant agent for BTC/USDT trading.&lt;/p&gt;
&lt;p&gt;Goal: maximize returns.&lt;/p&gt;
&lt;p&gt;Result: Sharpe ratio of &lt;strong&gt;-1.14&lt;/strong&gt;. Not good.&lt;/p&gt;
&lt;p&gt;The system didn&amp;rsquo;t fail. It failed at one objective (alpha) and excelled at another (capital preservation).&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Architecture by layers&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;architecture-by-layers&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#architecture-by-layers&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Quant trading is complex. It&amp;rsquo;s not &amp;ldquo;buy here, sell there.&amp;rdquo; It&amp;rsquo;s this:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;L1: Ingestion        (real data)
L2: Processing       (signals)
L3: Intelligence     (predictions)
L4: Decision         (sizing)
L5: Execution        (minimize impact)
L6: Evaluation       (backtests)
L7: Compliance       (audit)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Each layer is independent. Each has fallbacks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/23/sharpe-negativo-sucesso-engenharia/images/01-sete-camadas.png&#34; alt=&#34;The quant agent’s 7 layers: ingestion, processing, intelligence, decision, execution, evaluation and compliance, each independent with fallbacks&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h3&gt;L1: Ingestion&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;l1-ingestion&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#l1-ingestion&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;- BinanceFetcher: OHLCV, funding rates, open interest, order book
- MacroFetcher: DXY, S&amp;amp;P 500 via yfinance
- GlassnodeFetcher: on-chain metrics&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Why 3 sources?&lt;/strong&gt; Triangulation. If Binance goes down, you still have macro + on-chain.&lt;/p&gt;
&lt;h3&gt;L2: Processing&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;l2-processing&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#l2-processing&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;32&amp;#43; technical indicators:
- RSI, MACD, Bollinger Bands (classics)
- ATR, Stochastic, Williams %R (volatility)
- Volume profile, Time-weighted moving average
- On-chain: MVRV, SOPR, Cumulative delta
- Macro: VIX-like crypto index

Everything normalized (z-score, min-max).
Everything temporally aligned (no forward-looking bias).&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;L3: Intelligence&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;l3-intelligence&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#l3-intelligence&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Gaussian HMM (Hidden Markov Model) with 3 states:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;BULL (uptrend)    → RSI &amp;gt; 60 &amp;#43; momentum &amp;#43; macro positive
SIDEWAYS (range)  → RSI 40-60 &amp;#43; low volatility
BEAR (downtrend)  → RSI &amp;lt; 40 &amp;#43; momentum negative&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;LightGBM regressor predicts returns on the next 4 candles (walk-forward).&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t need 60% accuracy to have alpha. You need &lt;em&gt;consistency&lt;/em&gt;. A model that&amp;rsquo;s right 45% of the time but with low drawdown beats one that&amp;rsquo;s 70% accurate with 30% max DD.&lt;/p&gt;
&lt;h3&gt;L4: Decision&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;l4-decision&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#l4-decision&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Quarter Kelly sizing. Not full Kelly (too aggressive).&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;Position size = (edge * odds) / odds_ratio
Capped at 2% of portfolio (max risk per trade)

Guardrails (non-negotiable):
- Max drawdown: 15%
- Circuit breaker: 3 consecutive losses = pause
- Kill switch: manual override always available&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;L5: Execution&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;l5-execution&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#l5-execution&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Almgren-Chriss (minimize market impact):&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;Don&amp;#39;t execute 100% in 1 candle.
Break it into 5-10 small orders.
Use TWAP/VWAP for better timing.
Check liquidity before each order.&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;L6: Evaluation&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;l6-evaluation&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#l6-evaluation&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Walk-forward backtesting (no data leakage):&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;Train window: 60 days
Test window: 5 days
Roll forward: shift 5 days, repeat

Metrics:
- Sharpe, Sortino, Calmar ratios
- Max drawdown
- Win rate
- Recovery factor&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;L7: Compliance&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;l7-compliance&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#l7-compliance&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;- KillSwitch thread-safe (emergency)
- Auditor append-only in JSONL (immutable)
- Telegram notifications (real-time alerts)
- 202 tests (Python, pytest)
- CI/CD (GitHub Actions)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; Quant engineering isn&amp;rsquo;t about &amp;ldquo;predicting prices.&amp;rdquo; It&amp;rsquo;s about building a &lt;strong&gt;system&lt;/strong&gt; that&amp;rsquo;s tested, auditable, and fails gracefully (minimal drawdown).&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Bug That Revealed Everything&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-bug-that-revealed-everything&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-bug-that-revealed-everything&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Initially, Sharpe was &lt;strong&gt;+0.66&lt;/strong&gt;. Looked good.&lt;/p&gt;
&lt;p&gt;Then I found &lt;strong&gt;data leakage in the HMM&lt;/strong&gt;: the model was seeing the future during training.&lt;/p&gt;
&lt;p&gt;A simple oversight:&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# WRONG: trains with all data (future data leaks)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;hmm&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;all_indicators&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# RIGHT: trains only with past up to time T&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;hmm&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;indicators_until_date_T&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;After fixing: Sharpe dropped to &lt;strong&gt;-1.14&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;That moment was crucial: &lt;strong&gt;real &amp;raquo; spurious&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/23/sharpe-negativo-sucesso-engenharia/images/02-leakage-sharpe.png&#34; alt=&#34;Data leakage: a spurious &amp;#43;0.66 Sharpe with the model seeing the future becomes a real -1.14 after fixing one line of the fit&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I could have:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ignored the bug and shipped (risk: fraud)&lt;/li&gt;
&lt;li&gt;Abandoned the project (risk: missed learning)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Instead, I documented the fix, rewrote the tests, and asked the right question: &amp;ldquo;What does this system &lt;em&gt;actually&lt;/em&gt; solve?&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Tradeoff: Alpha vs Capital Preservation&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-tradeoff-alpha-vs-capital-preservation&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-tradeoff-alpha-vs-capital-preservation&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Let&amp;rsquo;s look at the numbers (out-of-sample, walk-forward):&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Metric&lt;/th&gt;
          &lt;th&gt;Quant Agent&lt;/th&gt;
          &lt;th&gt;Buy &amp;amp; Hold&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Sharpe ratio&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;-1.14&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;-0.04&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Max drawdown&lt;/td&gt;
          &lt;td&gt;&lt;strong&gt;0.29%&lt;/strong&gt;&lt;/td&gt;
          &lt;td&gt;26.24%&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Win rate&lt;/td&gt;
          &lt;td&gt;1/7 windows&lt;/td&gt;
          &lt;td&gt;4/7 windows&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Read that again.&lt;/p&gt;
&lt;p&gt;The agent has no alpha. But it reduces drawdown by &lt;strong&gt;~90x&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/23/sharpe-negativo-sucesso-engenharia/images/03-drawdown-90x.png&#34; alt=&#34;Out-of-sample max drawdown: buy &amp; hold at 26.24% against the quant agent’s 0.29%, 90x less drawdown, capital preservation over alpha&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Ask yourself: which scenario would you prefer?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scenario 1:&lt;/strong&gt; You buy and hold. In one year, there&amp;rsquo;s one day where you lose 26% of everything. The next day, you recover 15%. Do you sleep?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scenario 2:&lt;/strong&gt; You&amp;rsquo;re running the agent. Max loss is 0.29% on any given day. You sleep better.&lt;/p&gt;
&lt;p&gt;Capital preservation &amp;gt; chasing alpha.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Framework vs Outcome&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;framework-vs-outcome&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#framework-vs-outcome&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;The code didn&amp;rsquo;t &amp;ldquo;fail.&amp;rdquo; It &lt;em&gt;solved a different problem&lt;/em&gt; than planned.&lt;/p&gt;
&lt;p&gt;Systems thinking:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Original goal:&lt;/strong&gt; Generate positive returns (alpha)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Problem discovered:&lt;/strong&gt; Alpha is rare (even for professionals)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Emergent solution:&lt;/strong&gt; Risk management is consistent&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actual result:&lt;/strong&gt; A capital preservation system&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Sometimes, failing at your original goal is the universe&amp;rsquo;s way of showing you the real one.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The Technical Stack&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;the-technical-stack&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#the-technical-stack&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;For devs, here&amp;rsquo;s what worked:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What worked:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Python + SQLAlchemy (robust ORM)&lt;/li&gt;
&lt;li&gt;asyncio (true concurrency, non-blocking I/O)&lt;/li&gt;
&lt;li&gt;pytest (202 tests passing)&lt;/li&gt;
&lt;li&gt;Postgres (append-only auditing, compliance)&lt;/li&gt;
&lt;li&gt;Windows Task Scheduler (low-cost orchestration)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;What was challenging:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;HMM on non-stationary data (quant is &lt;em&gt;hard&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Market microstructure (Almgren-Chriss is complex)&lt;/li&gt;
&lt;li&gt;Real-time data latency (lag = real slippage)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Final stack:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;Data ingestion:  Binance API &amp;#43; Glassnode &amp;#43; yfinance
ML stack:        scikit-learn (HMM), LightGBM (regression)
Backend:         FastAPI (optional, current: local scheduler)
Database:        Postgres 16 &amp;#43; JSONL audit trail
Notifications:   Telegram bot &amp;#43; Discord webhook
Infrastructure:  Cheap VPS (1 vCPU, 4GB RAM, 50GB NVMe)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Runs on &lt;strong&gt;a cheap machine&lt;/strong&gt;. No Kubernetes, no scary AWS bills.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Lasting lessons&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;lasting-lessons&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#lasting-lessons&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;h3&gt;1. Test First (TDD)&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;1-test-first-tdd&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#1-test-first-tdd&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;202 tests = confidence. You refactor without fear.&lt;/p&gt;
&lt;p&gt;No tests? Silent failures. You discover them in production.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;pre&gt;&lt;code&gt;Each feature has an associated test:
- test_hmmpredict.py (model validation)
- test_kelly_sizing.py (risk management)
- test_market_impact.py (execution)
- test_audit_trail.py (compliance)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3&gt;2. Auditing is Design&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;2-auditing-is-design&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#2-auditing-is-design&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;JSONL append-only logs saved me when I questioned results.&lt;/p&gt;
&lt;div class=&#34;hextra-code-block hx:relative hx:mt-6 hx:first:mt-0 hx:group/code&#34;&gt;

&lt;div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;#34;timestamp&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;2026-04-22T10:30:00&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;action&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;BUY&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;size&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.05&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;price&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;65000&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;reason&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;BULL_regime_high_momentum&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nt&#34;&gt;&amp;#34;timestamp&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;2026-04-22T11:45:00&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;action&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;s2&#34;&gt;&amp;#34;CLOSE&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;pnl&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;50&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt; &lt;span class=&#34;nt&#34;&gt;&amp;#34;drawdown&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt; &lt;span class=&#34;mf&#34;&gt;0.0015&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&#34;hextra-code-copy-btn-container hx:opacity-0 hx:transition hx:group-hover/code:opacity-100 hx:flex hx:gap-1 hx:absolute hx:m-[11px] hx:right-0 hx:top-0&#34;&gt;
  &lt;button
    class=&#34;hextra-code-copy-btn hx:group/copybtn hx:cursor-pointer hx:transition-all hx:active:opacity-50 hx:bg-primary-700/5 hx:border hx:border-black/5 hx:text-gray-600 hx:hover:text-gray-900 hx:rounded-md hx:p-1.5 hx:dark:bg-primary-300/10 hx:dark:border-white/10 hx:dark:text-gray-400 hx:dark:hover:text-gray-50&#34;
    title=&#34;Copy code&#34;
    aria-label=&#34;Copy code&#34;
    data-copied-label=&#34;Copied!&#34;
  &gt;
    &lt;div class=&#34;hextra-copy-icon hx:group-[.copied]/copybtn:hidden hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
&lt;div class=&#34;hextra-success-icon hx:hidden hx:group-[.copied]/copybtn:block hx:pointer-events-none hx:h-4 hx:w-4&#34;&gt;&lt;/div&gt;
  &lt;/button&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;You can trace &lt;em&gt;why&lt;/em&gt; each decision was made.&lt;/p&gt;
&lt;h3&gt;3. Constraints Generate Innovation&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;3-constraints-generate-innovation&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#3-constraints-generate-innovation&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Quarter Kelly sizing is more conservative than full Kelly. But it was more effective.&lt;/p&gt;
&lt;p&gt;Constraints (2% max risk, 15% max DD) forced creativity in decision-making.&lt;/p&gt;
&lt;p&gt;Too much freedom = overfitting.&lt;/p&gt;
&lt;h3&gt;4. Real-Time is Different from Backtesting&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;4-real-time-is-different-from-backtesting&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#4-real-time-is-different-from-backtesting&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Walk-forward validation prevents surprises.&lt;/p&gt;
&lt;p&gt;Your model might be 70% accurate in backtest, but in production? 45%. Why?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slippage (you don&amp;rsquo;t get the exact price)&lt;/li&gt;
&lt;li&gt;Latency (0.5s delay = different price)&lt;/li&gt;
&lt;li&gt;Spread (bid/ask widens in volatility)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Real-time doesn&amp;rsquo;t forgive.&lt;/p&gt;
&lt;h3&gt;5. Failure is Learning&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;5-failure-is-learning&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#5-failure-is-learning&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Data leakage (-1.14 vs +0.66) was the most valuable discovery.&lt;/p&gt;
&lt;p&gt;Fixing that bug = I learned more than from 10 books on quant.&lt;/p&gt;
&lt;p&gt;Don&amp;rsquo;t fear &amp;ldquo;failures&amp;rdquo; that teach.&lt;/p&gt;
&lt;h3&gt;6. Simplicity &amp;gt; Complexity&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;6-simplicity--complexity&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#6-simplicity--complexity&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;3 states in the HMM worked better than 10+ features.&lt;/p&gt;
&lt;p&gt;6 months building. Result: simple.&lt;/p&gt;
&lt;p&gt;Time inversion: 95% building, 5% simplifying. But that 5% = the code that actually runs in production.&lt;/p&gt;
&lt;h3&gt;7. Capital Preservation &amp;gt; Chasing Alpha&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;7-capital-preservation--chasing-alpha&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#7-capital-preservation--chasing-alpha&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Your goal should be: &amp;ldquo;Don&amp;rsquo;t lose money.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Alpha (extra returns) is a bonus.&lt;/p&gt;
&lt;p&gt;Most quants invert it: &amp;ldquo;I&amp;rsquo;ll chase alpha, tolerate losses.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Wrong.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What Comes Next&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-comes-next&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-comes-next&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;This agent won&amp;rsquo;t generate overnight wealth.&lt;/p&gt;
&lt;p&gt;(If anyone promises that, run.)&lt;/p&gt;
&lt;p&gt;But it solves a real problem:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&amp;ldquo;How do I build a robust decision system in Python?&amp;rdquo;&lt;/p&gt;

&lt;/blockquote&gt;
&lt;p&gt;Next steps for you:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The code:&lt;/strong&gt; project closed for now. The architecture described above (HMM + LightGBM + Kelly + HRP, train/production separation, event-based vs polling) is what matters to replicate the approach.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adapt it:&lt;/strong&gt; For stocks, commodities, crypto (framework is agnostic)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Realize:&lt;/strong&gt; How hard quant is. Respect those who do it well.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h2&gt;What&amp;rsquo;s Your Metric?&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;whats-your-metric&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#whats-your-metric&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Sharpe is useful. But maybe you optimize for something else:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Maximum wealth in minimum time?&lt;/strong&gt; (time allocated)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimum drawdown?&lt;/strong&gt; (peace of mind)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimum capital needed?&lt;/strong&gt; (accessibility)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Pick your metric. Build for it. Validate with real data.&lt;/p&gt;
&lt;p&gt;Not his choice. Not the trend. Yours.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Sharpe -1.14 is a marketing failure. But it&amp;rsquo;s an engineering win.&lt;/p&gt;
&lt;p&gt;If the goal was to learn how to build a robust, tested, auditable, scalable system, mission accomplished.&lt;/p&gt;
&lt;p&gt;Your next objective is yours.&lt;/p&gt;
&lt;p&gt;Reply on &lt;a href=&#34;https://linkedin.com/in/thacvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or subscribe to the &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Substack newsletter&lt;/a&gt; to get the next posts.&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>Real data engineering content in Portuguese is rare. I&#39;m going to help change that.</title>
      <link>https://vazdeng.pages.dev/en/2026/04/18/real-data-engineering-portuguese/</link>
      <pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/04/18/real-data-engineering-portuguese/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/04/18/engenharia-dados-em-portugues/cover_hu_70018923dc641386.png" width="640" height="336"/>]]>
        
        &lt;p&gt;The kind of data engineering content in Portuguese where you can tell the person actually lived what they&amp;rsquo;re writing about, that&amp;rsquo;s hard to find.&lt;/p&gt;
&lt;p&gt;Search right now. You&amp;rsquo;ll find a lot of solid material to start with: translated articles from English blogs, tutorials grounded in the official docs, courses teaching Pandas on simple datasets. All of that has its place, it&amp;rsquo;s where most people start, and the people producing it are doing important work.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s still hard to find is someone telling you how they decided to use Delta Lake instead of Parquet in an environment processing hundreds of millions of daily transactions. Or when Medallion Architecture helps and when it just gets in the way. Or how LGPD (Brazil&amp;rsquo;s data privacy law) actually changes the way you design an ingestion layer.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the gap I want to help fill.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/18/engenharia-dados-em-portugues/bookshelf.png&#34; alt=&#34;Empty bookshelf labeled “Data engineering · Português” with a silhouetted figure placing the first book&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;h2&gt;Who I am, by what I&amp;rsquo;ve built&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;who-i-am-by-what-ive-built&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#who-i-am-by-what-ive-built&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;I won&amp;rsquo;t list certificates. I&amp;rsquo;ll tell you what I&amp;rsquo;ve shipped.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a senior data engineer with 8+ years of experience. I started in data quality at a major Brazilian bank, then moved to a global-scale Brazilian fintech building ETL pipelines, worked on a big-tech project in Silicon Valley through an international tech consultancy, and today I&amp;rsquo;m back in the Brazilian banking sector. (Full résumé on the &lt;a href=&#34;https://vazdeng.pages.dev/en/sobre/&#34;&gt;/sobre/&lt;/a&gt; page.)&lt;/p&gt;
&lt;p&gt;My core stack is Databricks. Not because I read the docs. Because it&amp;rsquo;s what runs in production where I&amp;rsquo;ve worked.&lt;/p&gt;
&lt;p&gt;In 2026 I started a master&amp;rsquo;s in applied computational methods. My research is on AI-driven predictive monitoring for critical operational systems. Everything I learn there I plan to bring here, translated into something useful for engineers working with real data.&lt;/p&gt;
&lt;h2&gt;Why crypto entered the story&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;why-crypto-entered-the-story&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#why-crypto-entered-the-story&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;A few years ago I started studying on-chain analytics. And I noticed something that few people seem to be saying clearly: crypto, in large part, is a data engineering problem that&amp;rsquo;s still poorly solved.&lt;/p&gt;
&lt;p&gt;The data is all there. On-chain, open, public. But most people investing in crypto don&amp;rsquo;t know how to process it, and many data engineers still aren&amp;rsquo;t looking at it.&lt;/p&gt;
&lt;p&gt;So I decided to build a crypto AI agent from scratch. In public, documenting every architecture decision. Using the same tools I use at work: real pipelines, rigorous backtesting, actual statistical models. No hype, no get-rich-quick promises.&lt;/p&gt;
&lt;h2&gt;What you&amp;rsquo;ll find here&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-youll-find-here&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-youll-find-here&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Three tracks, one newsletter.&lt;/p&gt;
&lt;p&gt;The first is &lt;strong&gt;production data engineering&lt;/strong&gt;: Databricks, Delta Lake, Spark, dbt, Airflow. Real architecture decisions, mistakes I made and what I learned, Brazilian context where it&amp;rsquo;s relevant (LGPD in practice, cloud cost reality, what data actually looks like inside financial institutions).&lt;/p&gt;
&lt;p&gt;The second is the &lt;strong&gt;crypto AI agent&lt;/strong&gt;, built in public. Architecture, code, backtesting, on-chain analysis. Every step documented. If something breaks, you&amp;rsquo;ll know why.&lt;/p&gt;
&lt;p&gt;The third is the &lt;strong&gt;master&amp;rsquo;s research translated to practice&lt;/strong&gt;. What academic research has to say about the problems you face every day. No filter, no academic jargon.&lt;/p&gt;
&lt;p&gt;Published in Portuguese and English, every week.&lt;/p&gt;
&lt;p&gt;Hit reply and tell me: what&amp;rsquo;s the hardest data problem you&amp;rsquo;re dealing with right now? I read everything.&lt;/p&gt;
&lt;p&gt;Thais Vaz&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Newsletter on Substack →&lt;/a&gt;&lt;/p&gt;

      </description>
    </item>
    
    <item>
      <title>LGPD at the ingestion layer: 4 principles that change your architecture</title>
      <link>https://vazdeng.pages.dev/en/2026/04/16/lgpd-at-ingestion-layer/</link>
      <pubDate>Thu, 16 Apr 2026 00:00:00 +0000</pubDate>
      
      <guid>https://vazdeng.pages.dev/en/2026/04/16/lgpd-at-ingestion-layer/</guid>
      <description>
        
        
        
        <![CDATA[<img src="https://vazdeng.pages.dev/2026/04/16/lgpd-ingestao-de-dados/cover_hu_8477b73067869427.png" width="640" height="336"/>]]>
        
        &lt;p&gt;Most data teams treat privacy law as something to solve &amp;ldquo;later&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;First the pipeline gets built, the data lands in the lake, the dashboards start shipping. Then one day a data subject request shows up asking for deletion of personal data. And the team finds out it doesn&amp;rsquo;t know where that ID lives, how many copies sit in Bronze, how many ML models were trained on it.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s too late.&lt;/p&gt;
&lt;p&gt;LGPD (Brazil&amp;rsquo;s data privacy law, similar in spirit to GDPR) isn&amp;rsquo;t compliance at the end of the pipeline. It&amp;rsquo;s a design constraint that starts at the first byte you ingest. There are four principles that, if you build them into the ingestion layer, prevent almost every downstream pain.&lt;/p&gt;
&lt;h2&gt;Principle 1: minimize at the source, not at the destination&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;principle-1-minimize-at-the-source-not-at-the-destination&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#principle-1-minimize-at-the-source-not-at-the-destination&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Art. 6, III of LGPD requires &lt;strong&gt;necessity&lt;/strong&gt;: only process data that is adequate and limited to the purpose.&lt;/p&gt;
&lt;p&gt;The practical translation is simple. Don&amp;rsquo;t ingest what you won&amp;rsquo;t use.&lt;/p&gt;
&lt;p&gt;Sounds obvious, but it isn&amp;rsquo;t. Most pipelines ingest entire tables (including IDs, phone numbers, addresses, emails) &amp;ldquo;because it&amp;rsquo;s in the source&amp;rdquo;. Then compliance shows up, asks for the mapping of these fields, and discovers 80% of them were never consumed by anyone.&lt;/p&gt;
&lt;p&gt;The right pattern is to apply schema filtering before persistence. In the ingestion pipeline, you explicitly define which fields enter the lake. Whatever doesn&amp;rsquo;t enter never becomes your retention problem, anonymization problem, audit problem.&lt;/p&gt;
&lt;p&gt;The question worth asking before each field is: &lt;em&gt;what concrete use case needs this data?&lt;/em&gt;. If the answer is &amp;ldquo;I dunno, could be useful&amp;rdquo;, then it isn&amp;rsquo;t needed.&lt;/p&gt;
&lt;h2&gt;Principle 2: pseudonymize from the first byte&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;principle-2-pseudonymize-from-the-first-byte&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#principle-2-pseudonymize-from-the-first-byte&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Three terms that look alike and aren&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Anonymization&lt;/strong&gt; is data made irreversible. Nobody can be identified anymore. It&amp;rsquo;s the only state LGPD treats as out of scope (Art. 12).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pseudonymization&lt;/strong&gt; is identity replaced by a code, but reversible via a separate mapping table. Still personal data (Art. 13, §4). Reduces risk, but doesn&amp;rsquo;t remove the obligation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tokenization&lt;/strong&gt; is a specific pseudonymization pattern with deterministic tokens, useful for preserving joins without exposing the original data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://vazdeng.pages.dev/2026/04/16/lgpd-ingestao-de-dados/lifecycle.png&#34; alt=&#34;Lifecycle of a personal data field passing through anonymization, pseudonymization and tokenization&#34;  loading=&#34;lazy&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The pattern that works is to tokenize at ingestion. Bronze never sees raw data. It sees the deterministic token. The &lt;code&gt;token ↔ original&lt;/code&gt; mapping lives in an isolated table, with encryption at rest, audited access, and its own retention policy.&lt;/p&gt;
&lt;p&gt;This solves three problems at once. You can join tables in the lake without exposing the original data. Right to erasure becomes a &lt;code&gt;DELETE&lt;/code&gt; in the mapping, no need to touch Bronze. And analysts and ML models work with pseudonymized data by default, reducing the risk surface.&lt;/p&gt;
&lt;h2&gt;Principle 3: lineage is a requirement, not a feature&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;principle-3-lineage-is-a-requirement-not-a-feature&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#principle-3-lineage-is-a-requirement-not-a-feature&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;When a data subject request shows up (Art. 18, right of access, correction, deletion), you have 15 days to respond. Without complete lineage, that deadline becomes a nightmare.&lt;/p&gt;
&lt;p&gt;Real lineage answers three questions for any personal data. Where did it come from? Source system, original field, ingestion timestamp. What transformations did it go through? Pipeline steps, applied rules, derivations. Where is it now? Tables, trained models, dashboards that consume it.&lt;/p&gt;
&lt;p&gt;Tools like OpenLineage, DataHub and Databricks Unity Catalog deliver this, but only if you instrument from ingestion onward. Adding lineage after the pipeline is already running is ten times more expensive than adding it before.&lt;/p&gt;
&lt;p&gt;The practical test is direct: can you, in under an hour, list every table and model that contains the ID &lt;code&gt;123.456.789-00&lt;/code&gt;? If you can&amp;rsquo;t, your lineage isn&amp;rsquo;t LGPD-ready.&lt;/p&gt;
&lt;h2&gt;Principle 4: retention by purpose, not by table&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;principle-4-retention-by-purpose-not-by-table&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#principle-4-retention-by-purpose-not-by-table&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Art. 15 says processing ends when the purpose is fulfilled. Art. 16 completes: after that, data must be deleted.&lt;/p&gt;
&lt;p&gt;In data engineering practice, this means each piece of data has its own clock. You can&amp;rsquo;t define a single &amp;ldquo;retention equals 5 years&amp;rdquo; policy for all tables. Some purposes require months, others years, others are indefinite (under different legal bases).&lt;/p&gt;
&lt;p&gt;Patterns that work: tables partitioned by processing date, with &lt;code&gt;VACUUM&lt;/code&gt; or &lt;code&gt;TRUNCATE PARTITION&lt;/code&gt; at the end of the cycle. A purpose map documented in code, a YAML that defines, per table and per field, which purpose justifies it, which legal basis, which deadline. And automated expiration jobs, no relying on manual process: configure retention policies that run themselves.&lt;/p&gt;
&lt;p&gt;Delta Lake, BigQuery and Snowflake all have mechanisms for this. The real work is translating legal purpose into technical configuration, and that&amp;rsquo;s the work nobody wants to do, but it determines whether you clash with the regulator or not.&lt;/p&gt;
&lt;h2&gt;What data engineers need to align with legal&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-data-engineers-need-to-align-with-legal&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-data-engineers-need-to-align-with-legal&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Three conversations engineering can&amp;rsquo;t outsource.&lt;/p&gt;
&lt;p&gt;The first is the legal basis of each data. Consent? Legitimate interest? Contract execution? Each has different technical implications. Right to revoke, for example, only exists under consent.&lt;/p&gt;
&lt;p&gt;The second is the concrete purpose of each pipeline. &amp;ldquo;Analytics&amp;rdquo; doesn&amp;rsquo;t count. Which business decision does this data support?&lt;/p&gt;
&lt;p&gt;The third is the response process for subject requests. Who receives? What&amp;rsquo;s the flow? What&amp;rsquo;s the internal SLA? This must be documented, tested, and have an owner.&lt;/p&gt;
&lt;p&gt;If these three conversations haven&amp;rsquo;t happened yet, your personal-data pipeline is running on compliance debt.&lt;/p&gt;
&lt;h2&gt;What stays&lt;span class=&#34;hx:absolute hx:-mt-20&#34; id=&#34;what-stays&#34;&gt;&lt;/span&gt;
    &lt;a href=&#34;#what-stays&#34; class=&#34;subheading-anchor&#34; aria-label=&#34;Permalink for this section&#34;&gt;&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;LGPD isn&amp;rsquo;t a checklist at the end. It&amp;rsquo;s a design constraint that changes four things. What you ingest (minimization). How you ingest (pseudonymization). What you track (lineage). How long you keep (retention by purpose).&lt;/p&gt;
&lt;p&gt;Teams that treat it as &amp;ldquo;we&amp;rsquo;ll solve it later&amp;rdquo; pay the entire tech debt on the first subject request that arrives. Teams that treat it as a design constraint from the first byte don&amp;rsquo;t even notice it&amp;rsquo;s there, because it&amp;rsquo;s just how things work.&lt;/p&gt;
&lt;p&gt;The difference between the two isn&amp;rsquo;t legal. It&amp;rsquo;s engineering.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s the trickiest data subject request your team has ever dealt with? Reply on &lt;a href=&#34;https://linkedin.com/in/thacvaz&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LinkedIn&lt;/a&gt; or subscribe to the &lt;a href=&#34;https://vazdeng.substack.com&#34;target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Substack&lt;/a&gt; for the next posts.&lt;/p&gt;

      </description>
    </item>
    
  </channel>
</rss>
