<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mikelsagardia.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mikelsagardia.io/" rel="alternate" type="text/html" /><updated>2026-03-13T10:34:36+00:00</updated><id>https://mikelsagardia.io/feed.xml</id><title type="html">Mikel Sagardia</title><subtitle>This site chronicles my observations in the fast-evolving landscape of data science, covering topics related to  AI/ML, computer vision, NLP, 3D, robotics... and more!</subtitle><entry><title type="html">Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)</title><link href="https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html" rel="alternate" type="text/html" title="Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)" /><published>2026-03-06T08:30:00+00:00</published><updated>2026-03-06T08:30:00+00:00</updated><id>https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning</id><content type="html" xml:base="https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html"><![CDATA[<!--
Blog Post 1: How Are Large Language Models (LLMs) Built?
Subtitle: A Conceptual Guide for Developers

Blog Post 2: Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)
Subtitle: When We Need to Adapt LLMs to Specific Tasks and Domains
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Conceptual Guide for Developers &amp; ML Practitioners
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/llms/scifi_parrots_dalle3.png" alt="Two cheerful macaw parrots dressed in Star Wars and Star Trek outfits." width="1000" />
<small style="color:grey">Two <a href="https://dl.acm.org/doi/10.1145/3442188.3445922">stochastic parrots</a> dressed up like Star Wars and Star Trek characters; same parrot, different costumes and roles. Image generated using <a href="https://openai.com/index/dall-e-3/">Dall-E 3</a>; prompt: <i> Wide landscape cartoon illustration of two red-blue-yellow macaws with sunglasses on tree branches in a bright green forest. Left parrot dressed as a Jedi with robe and blue lightsaber, right parrot dressed as a classic Star Trek Vulcan officer in a gold uniform. Bold, vibrant vector style.</i>
</small>
</p>

<p>In my <a href="https://mikelsagardia.io/blog/how-are-llms-built.html">previous post</a> I explained how LLMs are built, and how they work. In this post, I will try to explain how to adapt LLMs easily to specific <em>tasks</em> and <em>domains</em> using <a href="https://github.com/huggingface/peft">HuggingFace’s <code class="language-plaintext highlighter-rouge">peft</code> library</a>. As explained on the official site, <a href="https://huggingface.co/docs/peft/en/index">PEFT (Parameter-Efficient Fine-Tuning)</a> is a family of techniques that</p>

<blockquote>
  <p>“only fine-tune a small number of (extra) model parameters — significantly decreasing computational and storage costs — while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware.”</p>
</blockquote>

<p>In summary, I cover the following topics in this post:</p>

<ul>
  <li>What <em>task</em> and <em>domain</em> adaptation of LLMs is, and which techniques are commonly used for it.</li>
  <li>How PEFT/LoRA works, and how it reduces the number of trainable parameters by orders of magnitude.</li>
  <li>Explanation of a <a href="https://github.com/mxagar/llm_peft_fine_tuning_example/blob/main/llm_peft.ipynb">Jupyter Notebook</a> that implements PEFT/LoRA on a <a href="https://huggingface.co/docs/transformers/en/model_doc/distilbert">DistilBERT model</a> for a text classification task, using the <a href="https://huggingface.co/datasets/fancyzhx/ag_news">AG News</a> dataset.</li>
</ul>

<p>Let’s start!</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
You can find this post's accompanying code in <a href="https://github.com/mxagar/llm_peft_fine_tuning_example">this GitHub repository</a>. If you are not familiar with how LLMs work or what embeddings are, I recommend reading my previous post <a href="https://mikelsagardia.io/blog/how-are-llms-built.html">How Are Large Language Models (LLMs) Built?</a> before diving into this one.
</strong>
</div>
<div style="height: 30px;"></div>

<h2 id="why-and-how-should-we-adapt-llms">Why and How Should We Adapt LLMs?</h2>

<p>First of all, we should define some terminology:</p>

<ul>
  <li>A <em>Task</em>: a specific problem we want to solve. The task is usually defined by the <em>input</em> and the <em>output</em> formats. Typically, LLMs are trained on the general task of <em>language modeling</em>: predicting the next word/token given an input sequence (i.e., the context); as such, they are able to generate coherent text related to the input. However, we can change their output layers (also known as <em>heads</em>) to perform other tasks, such as <em>text classification</em> (e.g., <em>sentiment analysis</em> and <em>topic classification</em>), <em>token classification</em> (e.g., <em>named entity recognition</em> or NER), etc.</li>
  <li>A <em>Domain</em>: the specific area or context to which the training texts belong and in which the task needs to be performed. Typically, LLMs are trained on a wide variety of texts from the Internet, which makes them generalists. However, we may want to adapt them to specific domains, such as <em>medicine</em>, <em>finance</em>, <em>legal</em>, etc. The more niche the domain, the more we may need to adapt the LLM to it to learn style, jargon, and specific knowledge.</li>
</ul>

<p>This <em>task</em> and <em>domain</em> adaptation, although referred to as <em>fine-tuning</em> in the LLM world, is known as <em>transfer learning</em> in the context of computer vision. <a href="https://arxiv.org/abs/1801.06146">Howard and Ruder (2018)</a> showed that a language model trained on a large corpus can be adapted to smaller corpora and other downstream tasks.</p>

<p>One common approach in the <a href="https://huggingface.co/docs/peft/en/index">PEFT</a> library is the <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (or LoRA, introduced by Hu et al., 2021)</a>, which I cover in more detail in the next section. In a nutshell: LoRA freezes the pre-trained weight matrices $W$ and adds to them new matrices $dW$, which are the ones that are trained. These $dW$ matrices are factored as the multiplication of two low-rank matrices; that trick reduces trainable parameters by orders of magnitude and maintains or matches full fine-tuning performance on many benchmarks.</p>

<p>There are other ways to adapt LLMs which I won’t cover here, such as:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2203.02155">RLHF (Reinforcement Learning from Human Feedback)</a>: This technique was used to align the initial ChatGPT model (GPT-3.5) with human preferences. Initially, human annotators ranked outputs of a GPT model. Then, these annotations were used to train a reward model (RM) to automatically predict the output score. And finally, the GPT model (<em>policy</em>) was trained using the <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization (PPO) algorithm</a>, based on the conversation history (<em>state</em>) and the outputs it produced (<em>actions</em>), and using the reward model (<em>reward</em>) as the evaluator.</li>
  <li><a href="https://arxiv.org/abs/2005.11401">RAG (Retrieval Augmented Generation)</a>: This method consists in outsourcing the domain-specific memory of LLMs. In an offline ingestion phase, the knowledge is chunked and indexed, often as embedding vectors. In the real-time generation phase, the user asks a question, which is encoded and used to retrieve the most similar indexed chunks; then, the LLM is prompted to answer the question by using the found similar chunks, i.e., the retrieved data is injected in the query. RAGs reduce hallucinations and have been extensively implemented recently.</li>
</ul>
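<p>The retrieval phase of RAG can be illustrated with a tiny sketch: below, the chunk texts and embedding values are purely made up, and in practice the embeddings would come from an embedding model rather than being hard-coded.</p>

```python
import numpy as np

# Toy "index": pre-computed embeddings for three document chunks
# (illustrative values; a real system embeds the chunks with a model)
chunks = ["LoRA freezes W and trains A, B.",
          "AG News has four topic classes.",
          "Transformers use self-attention."]
chunk_embeddings = np.array([[0.9, 0.1, 0.0],
                             [0.1, 0.9, 0.1],
                             [0.0, 0.2, 0.9]])

def retrieve(query_embedding, k=1):
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarities, one per chunk
    top = np.argsort(scores)[::-1][:k]  # indices of the best-scoring chunks
    return [chunks[i] for i in top]

# A query whose embedding points towards the first chunk retrieves it
print(retrieve(np.array([0.8, 0.2, 0.1])))
```

The retrieved chunks would then be injected into the LLM prompt alongside the user question.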

<p>In my experience, PEFT/LoRA and RAG are the most commonly used techniques, and they can be used in combination:</p>

<ul>
  <li>PEFT/LoRA makes sense when we need to approach a task different than <em>language modeling</em> (i.e., next token prediction), or when we have a very specific domain, such as <em>medicine</em> or <em>finance</em>, which is not well represented in the general training data of the LLM.</li>
  <li>RAG is more useful when we have a task that can be solved by retrieving specific information, such as <em>question answering</em> or <em>summarization</em>, and when we have a large amount of domain-specific data that changes constantly. Most chatbots that are used in production for customer support, for instance, are RAG-based.</li>
</ul>

<h3 id="how-does-peftlora-work">How Does PEFT/LoRA Work?</h3>

<p>When we apply Low-Rank Adaptation (LoRA), we basically decompose a weight matrix into a multiplication of low-rank matrices that have fewer parameters.</p>

<p>Let’s consider a pre-trained weight matrix $W$; instead of changing it directly, we add to it a weight offset $dW$ as follows:</p>

\[\hat{W} = W + dW,\]

<p>where</p>

<ul>
  <li>$\hat{W}$ represents the adapted weight matrix, of shape $(d, f)$,</li>
  <li>and $dW$ is a weight offset to be learned, of shape $(d, f)$.</li>
</ul>

<p>However, we do not operate directly with the weight offset $dW$; instead, we factor it as the multiplication of two low-rank matrices:</p>

\[dW = A \cdot B,\]

<p>where</p>

<ul>
  <li>$A$ is of shape $(d, r)$,</li>
  <li>$B$ is of shape $(r, f)$,</li>
  <li>and $r \ll d, f$.</li>
</ul>

<p>The key idea is that during training we freeze $W$ while we learn $dW$; however, instead of learning the full-sized $dW$, we learn the much smaller matrices $A$ and $B$. The forward pass of the model is modified as follows:</p>

\[y = x \cdot \hat{W} = x \cdot (W + dW) = x \cdot (W + A \cdot B).\]

<p>The proportion of weights in $dW$ as compared to $W$ is the following:</p>

<ul>
  <li>Weights of $W$: $d \cdot f$</li>
  <li>Weights of $A$ and $B$: $r \cdot (d + f)$</li>
  <li>Proportion: $r\cdot\frac{d + f}{d \cdot f}$</li>
</ul>

<p>Note that the number of trainable parameters is controlled by the rank $r$; for instance, for a weight matrix of size $(4096, 4096)$ and $r=4$, the reduction factor is $\frac{4096 \cdot 4096}{4 \cdot (4096 + 4096)} = 512$, i.e., more than a <code class="language-plaintext highlighter-rouge">100x</code> reduction in trainable parameters.</p>
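<p>The arithmetic behind the parameter-count reduction is easy to verify:</p>

```python
# Trainable-parameter reduction for a (4096, 4096) weight matrix with rank r=4
d, f, r = 4096, 4096, 4

full = d * f         # parameters of the full offset dW: 16,777,216
lora = r * (d + f)   # parameters of the factors A (d, r) and B (r, f): 32,768

print(full, lora, full / lora)  # 16777216 32768 512.0
```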

<p>LoRA is not applied to all weight matrices; we select the target modules ourselves (or rely on the <code class="language-plaintext highlighter-rouge">peft</code> library's per-architecture defaults), typically the query and value projection matrices $Q$ and $V$ in the attention blocks, and sometimes the MLP layers. And, after training, we can merge $W + dW$ into a single matrix, so there is no added inference latency!</p>

<p>In practice, LoRA assumes that the task-specific update to a large weight matrix lies in a low-dimensional subspace — and therefore can be efficiently represented with low-rank matrices.</p>
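<p>The modified forward pass can be sketched in a few lines of NumPy; the dimensions are arbitrary toy values. Note that LoRA initializes $B$ to zero, so at the start of training the adapted model behaves exactly like the frozen base model:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, f, r = 8, 6, 2   # toy dimensions, with r << d, f

W = rng.normal(size=(d, f))          # pre-trained weight: frozen during training
A = rng.normal(size=(d, r)) * 0.01   # low-rank factor, trainable
B = np.zeros((r, f))                 # zero-initialized, so dW = A @ B starts at 0

def forward(x):
    # y = x (W + A B): base output plus the low-rank adaptation
    return x @ (W + A @ B)

x = rng.normal(size=(1, d))
# With B = 0, the adapted model reproduces the frozen base model exactly
assert np.allclose(forward(x), x @ W)
```

During training, gradients flow only into <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, while <code class="language-plaintext highlighter-rouge">W</code> stays untouched.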

<p>In addition to LoRA, <strong>quantization</strong> is often applied to further reduce the model size and speed up inference. Quantization consists in reducing the precision of the weights from 32-bit floating point values to 16-bit or even 4-bit representations (as in QLoRA); in other words, high-precision floats are approximated using only <code class="language-plaintext highlighter-rouge">k</code> bits. This is achieved by scaling and mapping the original values to a smaller discrete set, sometimes combined with truncating less significant information. Quantization can be easily applied using the library <a href="https://github.com/bitsandbytes-foundation/bitsandbytes">bitsandbytes</a>, which is very well integrated with the HuggingFace ecosystem.</p>
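<p>A deliberately simplistic sketch of the idea behind quantization follows: floats are affinely mapped to <code class="language-plaintext highlighter-rouge">k</code>-bit integer codes and back. This is <em>not</em> the <code class="language-plaintext highlighter-rouge">nf4</code> scheme that <code class="language-plaintext highlighter-rouge">bitsandbytes</code> implements, just an illustration of the scale-and-round principle:</p>

```python
import numpy as np

def quantize(w, k=4):
    """Map float weights to k-bit unsigned integer codes via affine scaling."""
    levels = 2 ** k - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    q = np.round((w - lo) / scale).astype(np.uint8)  # codes in [0, 2^k - 1]
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

w = np.array([-0.51, -0.10, 0.02, 0.33, 0.49])
q, scale, lo = quantize(w, k=4)
w_hat = dequantize(q, scale, lo)
# All codes fit in 4 bits; the reconstruction error is bounded by scale/2
assert q.max() <= 15
assert np.all(np.abs(w - w_hat) <= scale / 2 + 1e-9)
```

Storing the codes plus a scale and offset per block is what shrinks the model, at the cost of a bounded rounding error per weight.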

<h2 id="implementation-notebook">Implementation Notebook</h2>

<p>Thanks to the <a href="https://github.com/huggingface/peft"><code class="language-plaintext highlighter-rouge">peft</code></a> library, applying PEFT/LoRA to an LLM is very easy. The <a href="https://github.com/mxagar/llm_peft_fine_tuning_example">GitHub repository</a> I have prepared contains the Jupyter Notebook <a href="https://github.com/mxagar/llm_peft_fine_tuning_example/blob/main/llm_peft.ipynb"><code class="language-plaintext highlighter-rouge">llm_peft.ipynb</code></a>, in which I provide an example.</p>

<p>There, I fine-tune the <a href="https://arxiv.org/abs/1910.01108">DistilBERT</a> pre-trained model; DistilBERT is a smaller version of the encoder-only <a href="https://arxiv.org/abs/1810.04805">BERT</a> that has been distilled to reduce its size and computational requirements, while maintaining good performance. An alternative could have been <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a>, which was trained roughly on <code class="language-plaintext highlighter-rouge">10x</code> more data than BERT and has approximately twice the parameters of DistilBERT. We could use other models, too, e.g., generative decoder transformers like <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT-2</a>, although in general RoBERTa seems to have better performance for classification tasks. GPT-2 is similar in size to RoBERTa.</p>

<p>The dataset I use is <a href="https://huggingface.co/datasets/fancyzhx/ag_news"><code class="language-plaintext highlighter-rouge">ag_news</code></a>, which consists of roughly 127,600 news texts, each of them with a label related to its associated topic: <code class="language-plaintext highlighter-rouge">'World', 'Sports', 'Business', 'Sci/Tech'</code> (perfectly balanced). Thus, the <em>task</em> head is <em>text classification</em> (with 4 mutually exclusive categories) and the <em>domain</em> is <em>news</em>.</p>
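<p>The four topic labels translate into the <code class="language-plaintext highlighter-rouge">id2label</code> / <code class="language-plaintext highlighter-rouge">label2id</code> mappings that a HuggingFace classification model expects; a minimal sketch:</p>

```python
# Label <-> id mappings for the four AG News topics, in the form expected
# by AutoModelForSequenceClassification
classes = ["World", "Sports", "Business", "Sci/Tech"]
id2label = {i: label for i, label in enumerate(classes)}
label2id = {label: i for i, label in enumerate(classes)}

print(id2label)            # {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
print(label2id["Sports"])  # 1
```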

<p>The notebook is structured in clear sections and comments, which I won’t fully reproduce here; the core steps are the following:</p>

<ul>
  <li>Dataset splitting: I divide the 127,600 samples into the sets <code class="language-plaintext highlighter-rouge">train</code> (108k samples), <code class="language-plaintext highlighter-rouge">test</code> (7.6k), and <code class="language-plaintext highlighter-rouge">validation</code> (12k).</li>
  <li>Tokenization: The <code class="language-plaintext highlighter-rouge">AutoTokenizer</code> is instantiated with the <code class="language-plaintext highlighter-rouge">distilbert-base-uncased</code> pre-trained subword tokenizer.</li>
  <li>Feature exploration: some exploratory data analysis is performed.</li>
  <li>Model setup: the <code class="language-plaintext highlighter-rouge">AutoModelForSequenceClassification</code> is instantiated with the <code class="language-plaintext highlighter-rouge">distilbert-base-uncased</code> pre-trained model, and the <code class="language-plaintext highlighter-rouge">PeftModel</code> is instantiated with the LoRA configuration.</li>
  <li>Training: the <code class="language-plaintext highlighter-rouge">Trainer</code> class is instantiated with the model, the <code class="language-plaintext highlighter-rouge">TrainingArguments</code>, and the datasets; then, the <code class="language-plaintext highlighter-rouge">train()</code> method is called to start training.</li>
  <li>Evaluation: we use the <code class="language-plaintext highlighter-rouge">evaluate()</code> method of the <code class="language-plaintext highlighter-rouge">Trainer</code> to evaluate the model on the test set, and we compute our custom metrics (accuracy, precision, recall, and F1), as defined in <code class="language-plaintext highlighter-rouge">compute_metrics()</code>.</li>
</ul>
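<p>The <code class="language-plaintext highlighter-rouge">compute_metrics()</code> step can be approximated with NumPy alone; this is a self-contained sketch matching the <code class="language-plaintext highlighter-rouge">Trainer</code> interface (a <code class="language-plaintext highlighter-rouge">(logits, labels)</code> pair in, a metrics dict out), not necessarily the exact implementation in the notebook, which may use dedicated metric libraries:</p>

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy and macro-averaged precision/recall/F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = (preds == labels).mean()
    precisions, recalls = [], []
    for c in np.unique(labels):
        tp = np.sum((preds == c) & (labels == c))
        precisions.append(tp / max(np.sum(preds == c), 1))  # guard: no predictions for c
        recalls.append(tp / max(np.sum(labels == c), 1))
    precision, recall = np.mean(precisions), np.mean(recalls)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy check: 3 of 4 predictions correct
logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.0], [0.1, 0.9]])
labels = np.array([0, 1, 0, 0])
print(compute_metrics((logits, labels))["accuracy"])  # 0.75
```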

<p>The feature exploration reveals that learning the classification task is going to be quite easy for the model. The function <code class="language-plaintext highlighter-rouge">extract_hidden_states()</code> is used to extract the last hidden states computed by the model, after each sample is passed through it. Then, these sample embeddings are mapped to 2D using <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>, and plotted in a hexagonal plot colored by class. As we can see, each class occupies a different region in the embedding space without any fine-tuning — that is, the model already has a good understanding of the differences between the classes.</p>

<p align="center">
<img src="/assets/llms/ag_news_embedding_class_plot.png" alt="Hexagonal plot of the AG News embeddings according to their classes." width="1000" />
<small style="color:grey">A hexagonal plot of the embeddings from the <a href="https://huggingface.co/datasets/fancyzhx/ag_news">AG News dataset</a> according to their classes. The embeddings are the last hidden states of the <a href="https://huggingface.co/docs/transformers/en/model_doc/distilbert">DistilBERT</a> model, and they were reduced to 2D using <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>. Image by the author.</small>
</p>

<p>The key aspect is the model setup for training, which is very straightforward thanks to the HuggingFace ecosystem. The code snippet below shows all the steps:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Quantization config (4-bit for minimal memory usage)
# WARNING: This requires the `bitsandbytes` library to be installed 
# and Intel CPU and/or 'cuda', 'mps', 'hpu', 'xpu', 'npu'
</span><span class="n">bnb_config</span> <span class="o">=</span> <span class="n">BitsAndBytesConfig</span><span class="p">(</span>
    <span class="n">load_in_4bit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>                      <span class="c1"># Activate 4-bit quantization
</span>    <span class="n">bnb_4bit_use_double_quant</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>         <span class="c1"># Use double quantization for better accuracy
</span>    <span class="n">bnb_4bit_compute_dtype</span><span class="o">=</span><span class="s">"bfloat16"</span><span class="p">,</span>      <span class="c1"># Use bf16 if supported, else float16
</span>    <span class="n">bnb_4bit_quant_type</span><span class="o">=</span><span class="s">"nf4"</span><span class="p">,</span>              <span class="c1"># Quantization type: 'nf4' is best for LLMs
</span><span class="p">)</span>

<span class="c1"># Transformer model: we re-instantiate it to apply LoRA
# We should get a warning about the model weights not being initialized for some layers
# This is because we have appended the classifier head and we haven't trained the model yet
</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForSequenceClassification</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"distilbert-base-uncased"</span><span class="p">,</span>
    <span class="n">num_labels</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">id2label</span><span class="p">),</span>
    <span class="n">id2label</span><span class="o">=</span><span class="n">id2label</span><span class="p">,</span>
    <span class="n">label2id</span><span class="o">=</span><span class="n">label2id</span><span class="p">,</span>
    <span class="n">quantization_config</span><span class="o">=</span><span class="n">bnb_config</span><span class="p">,</span>
    <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span>  <span class="c1"># Optional: distributes across GPUs if available
</span><span class="p">)</span>

<span class="c1"># LoRA configuration
# We need to check the target modules for the specific model we are using (see below)
# - For distilbert-base-uncased, we use "q_lin" and "v_lin" for the attention layers
# - For bert-base-uncased, we would use "query" and "value"
# The A*B weights are scaled with lora_alpha/r
</span><span class="n">lora_config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
    <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>                                   <span class="c1"># Low-rank dimensionality
</span>    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>                          <span class="c1"># Scaling factor
</span>    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span><span class="s">"q_lin"</span><span class="p">,</span> <span class="s">"v_lin"</span><span class="p">],</span>      <span class="c1"># Submodules to apply LoRA to (model-specific)
</span>    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>                       <span class="c1"># Dropout for LoRA layers
</span>    <span class="n">bias</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span>                            <span class="c1"># Do not train bias
</span>    <span class="n">task_type</span><span class="o">=</span><span class="n">TaskType</span><span class="p">.</span><span class="n">SEQ_CLS</span>              <span class="c1"># Task type: sequence classification
</span><span class="p">)</span>

<span class="c1"># Get the PEFT model with LoRA
</span><span class="n">lora_model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">lora_config</span><span class="p">)</span>

<span class="c1"># Define training arguments
</span><span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
    <span class="n">learning_rate</span><span class="o">=</span><span class="mf">2e-3</span><span class="p">,</span>
    <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span>
    <span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">eval_strategy</span><span class="o">=</span><span class="s">"steps"</span><span class="p">,</span>
    <span class="n">save_strategy</span><span class="o">=</span><span class="s">"steps"</span><span class="p">,</span>
    <span class="n">eval_steps</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
    <span class="n">save_steps</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
    <span class="c1"># This seems to be a bug for PEFT models: we need to specify 'labels', not 'label'
</span>    <span class="c1"># as the explicit label column name
</span>    <span class="c1"># If we are not using PEFT, we can ignore this argument
</span>    <span class="n">label_names</span><span class="o">=</span><span class="p">[</span><span class="s">"labels"</span><span class="p">],</span>  <span class="c1"># explicitly specify label column name
</span>    <span class="n">output_dir</span><span class="o">=</span><span class="s">"./checkpoints"</span><span class="p">,</span>
    <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
    <span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
    <span class="n">load_best_model_at_end</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">logging_dir</span><span class="o">=</span><span class="s">"./logs"</span><span class="p">,</span>
    <span class="n">report_to</span><span class="o">=</span><span class="s">"tensorboard"</span><span class="p">,</span>  <span class="c1"># enable TensorBoard, if desired
</span><span class="p">)</span>

<span class="c1"># Initialize the Trainer
</span><span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">lora_model</span><span class="p">,</span>  <span class="c1"># Transformer + Adapter (LoRA)
</span>    <span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
    <span class="n">train_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
    <span class="n">eval_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"validation"</span><span class="p">],</span>
    <span class="n">processing_class</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">compute_metrics</span><span class="o">=</span><span class="n">compute_metrics</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>After training, the model achieves an F1 score of <code class="language-plaintext highlighter-rouge">0.90</code> on the test set (compared to <code class="language-plaintext highlighter-rouge">0.16</code> before fine-tuning), which is a very good result for this task.</p>

<p>Other aspects are covered in the notebook, such as:</p>

<ul>
  <li>The training can be monitored using <a href="https://www.tensorflow.org/tensorboard">TensorBoard</a>.</li>
  <li>A <code class="language-plaintext highlighter-rouge">predict()</code> custom function is provided, which takes an input text, tokenizes it, passes it through the model, and decodes the predicted label.</li>
  <li>LoRA weights are merged and the model is persisted. Merging the LoRA weights consists in computing every $dW$ and adding them to the corresponding $W$; as mentioned before, after merging, the model can be used for inference without any latency increase.</li>
  <li>Some error analysis is performed by looking at the misclassified samples.</li>
  <li>Finally, model packaging is addressed using ONNX. This is also straightforward thanks to the HuggingFace &amp; PyTorch ecosystem, yet essential to be able to deploy the model in production.</li>
</ul>
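<p>The weight-merging step can be sketched in NumPy with toy dimensions: the adapter is folded into the base weight once, so inference afterwards is a single matmul, exactly as in the unadapted model. Note that <code class="language-plaintext highlighter-rouge">peft</code> scales the adapter by <code class="language-plaintext highlighter-rouge">lora_alpha / r</code>, as the notebook's comments point out:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
d, f, r, alpha = 8, 8, 2, 32   # toy dimensions and LoRA scaling factor

W = rng.normal(size=(d, f))    # frozen pre-trained weight
A = rng.normal(size=(d, r))    # learned low-rank factors
B = rng.normal(size=(r, f))

# Merging folds the scaled adapter into the base weight
W_merged = W + (alpha / r) * (A @ B)

x = rng.normal(size=(1, d))
y_adapter = x @ W + (alpha / r) * (x @ A @ B)  # forward with a separate adapter
y_merged = x @ W_merged                        # forward after merging
assert np.allclose(y_adapter, y_merged)        # identical outputs, no extra latency
```

In the <code class="language-plaintext highlighter-rouge">peft</code> library, this is what calling <code class="language-plaintext highlighter-rouge">merge_and_unload()</code> on the <code class="language-plaintext highlighter-rouge">PeftModel</code> does for every adapted matrix.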

<h2 id="summary-and-conclusion">Summary and Conclusion</h2>

<p>In this post, I have explained how to adapt LLMs to specific tasks and domains using Parameter-Efficient Fine-Tuning (PEFT), and more concretely, <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (or LoRA, introduced by Hu et al., 2021)</a>. This technique allows us to train only a small number of parameters while maintaining good performance, which makes it accessible to train and store large language models on consumer hardware.</p>

<p>I have used the classification task applied to the <a href="https://huggingface.co/datasets/fancyzhx/ag_news">AG News</a> dataset, but many more tasks are possible: token classification (e.g., named entity recognition), question answering, summarization, etc.</p>

<p><br /></p>

<blockquote>
  <p>Which task and domain would you like to adapt an LLM to?</p>
</blockquote>

<p><br /></p>

<p>I think that the <a href="https://huggingface.co/">HuggingFace</a> ecosystem is incredible, as it offers a plethora of pre-trained models, datasets, and libraries that make it very easy to work with LLMs, from research to production.</p>

<p>If you would like to deepen your understanding of the topic, consider checking these additional resources:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)</a></li>
  <li><a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora">Hugging Face LoRA conceptual guide</a></li>
  <li><a href="https://github.com/mxagar/tool_guides/tree/master/hugging_face">HuggingFace Guide: <code class="language-plaintext highlighter-rouge">mxagar/tool_guides/hugging_face</code></a></li>
  <li>My personal notes on the O’Reilly book <a href="https://github.com/mxagar/nlp_with_transformers_nbs">Natural Language Processing with Transformers, by Lewis Tunstall, Leandro von Werra and Thomas Wolf (O’Reilly)</a></li>
  <li>My personal notes and guide for the <a href="https://github.com/mxagar/generative_ai_udacity/">Generative AI Nanodegree from Udacity</a></li>
</ul>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="large" /><category term="language" /><category term="models," /><category term="llm," /><category term="machine" /><category term="learning," /><category term="text" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="attention," /><category term="fine-tuning," /><category term="PEFT," /><category term="LoRA," /><category term="quantization" /><summary type="html"><![CDATA[&lt;!– Blog Post 1: How Are Large Language Models (LLMs) Built? Subtitle: A Conceptual Guide for Developers]]></summary></entry><entry><title type="html">How Are Large Language Models (LLMs) Built?</title><link href="https://mikelsagardia.io/blog/how-are-llms-built.html" rel="alternate" type="text/html" title="How Are Large Language Models (LLMs) Built?" /><published>2026-02-28T08:30:00+00:00</published><updated>2026-02-28T08:30:00+00:00</updated><id>https://mikelsagardia.io/blog/how-are-llms-built</id><content type="html" xml:base="https://mikelsagardia.io/blog/how-are-llms-built.html"><![CDATA[<!--
Blog Post 1: How Are Large Language Models (LLMs) Built?
Subtitle: A Conceptual Guide for Developers

Blog Post 2: Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)
Subtitle: When We Need to Adapt LLMs to Specific Tasks and Domains
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Conceptual Guide for Developers &amp; ML Practitioners
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/llms/stochastic_parrot_dalle3.png" alt="A cheerful macaw parrot wearing sunglasses says 42." width="1000" />
<small style="color:grey">Large Language Models (LLMs) have been called <a href="https://dl.acm.org/doi/10.1145/3442188.3445922">stochastic parrots</a> by some; in any case, they seem to be here to stay &mdash; and to be honest, I find them quite useful, if properly used. Image generated using <a href="https://openai.com/index/dall-e-3/">Dall-E 3</a>; prompt: <i> Wide, landscape cartoon illustration of a happy, confident red-blue-yellow macaw wearing black sunglasses, perched on a tree branch in a green forest, with a white comic speech bubble saying <a href="https://simple.wikipedia.org/wiki/42_(answer)">"42"</a>
.</i>
</small>
</p>

<p>The release of <a href="https://openai.com/blog/chatgpt">ChatGPT</a> in November 2022 revolutionized everyday life in much of the developed world. In a similar way that Google convinced us the Internet was truly useful — and that we needed their search engine — or Apple introduced the first genuinely usable smartphone that made the digital world ubiquitous, OpenAI came up with the next logical step: assistant chatbots based on Large Language Models (LLMs). Language models already existed, but OpenAI’s chat-based user interface, combined with the emergent capabilities of their huge models, led to the perfect killer app: an ever-ready genie that <em>seems</em> to confidently know the answer to everything.</p>

<p><br /></p>

<blockquote>
  <p>It feels like <em>“ask ChatGPT”</em> has become the new <em>“google it”</em>.</p>
</blockquote>

<p><br /></p>

<p>Current LLMs are based on the <strong>Transformer</strong> architecture, introduced by Google in the seminal work <a href="https://arxiv.org/abs/1706.03762"><em>Attention Is All You Need</em> (Vaswani et al. 2017)</a>. Before that, <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTMs or Long short-term memory networks (Hochreiter &amp; Schmidhuber, 1997)</a> were the state-of-the-art sequence models for Natural Language Processing (NLP). In fact, many of the concepts exploited by the Transformer were developed using LSTMs as the backbone, and one could argue that LSTMs are, in some respects, more sophisticated models than the Transformer itself — if you’d like an example of an LSTM-based language modeler, you can check this <a href="https://mikelsagardia.io/blog/text-generation-rnn.html">TV script generator of mine</a>.</p>

<p>However, the Transformer presented some major <em>practical advantages</em> that enabled a paradigm shift:</p>

<ul>
  <li>Its <em>self-attention</em> mechanism made it possible to convert inherently sequential tasks into <em>parallelizable</em> ones.</li>
  <li>Its uncomplicated, modular architecture made it easy to scale up and adapt to <em>many different tasks</em>.</li>
</ul>

<p>Simultaneously, <a href="https://arxiv.org/abs/1801.06146">Howard &amp; Ruder (2018)</a> demonstrated that <em>transfer learning</em> worked not only in computer vision, but also for NLP: they showed that a language model pre-trained on a large corpus could be fine-tuned for smaller corpora and other downstream tasks.</p>

<p>And that’s how the way to the current LLMs was paved. Nowadays, Transformer-based LLMs excel in <em>everything</em> NLP-related: text generation, summarization, question answering, code generation, translation, and so on.</p>

<h2 id="the-original-transformer-its-inputs-components-and-siblings">The Original Transformer: Its Inputs, Components and Siblings</h2>

<p>Before describing the components of the Transformer, we need to explain how text is represented for computers. In practice, text is converted into a <strong>sequence of feature vectors</strong> $\lbrace x_1, x_2, \dots \rbrace$, each of dimension $m$ (the <em>embedding size</em> or <em>dimension</em>). This is done in the following steps:</p>

<ol>
  <li><strong><a href="https://en.wikipedia.org/wiki/Large_language_model#Tokenization">Tokenization</a></strong>: The text is split into discrete elements called <em>tokens</em>. Tokens are units with an identifiable meaning for the model and typically include words or sub-words, as well as punctuation and special symbols.</li>
  <li><strong>Vocabulary construction</strong>: A vocabulary containing all $n$ unique tokens is defined. It provides a mapping between each token string and a numerical identifier (token ID).</li>
  <li><strong><a href="https://en.wikipedia.org/wiki/One-hot">One-hot vectors</a></strong>: Each token is mapped to its token ID. Conceptually, this corresponds to a one-hot vector of size $n$, although in practice models operate directly on token IDs. In a one-hot vector, all cells have the value $0$ except the cell which corresponds to the token ID of the represented word, which contains the value $1$.</li>
  <li><strong><a href="https://en.wikipedia.org/wiki/Word_embedding">Embedding vectors</a></strong>: Token IDs (i.e., one-hot vectors) are mapped to dense embedding vectors using an embedding layer. This layer acts as a learnable lookup table (or equivalently, a linear projection of a one-hot vector), producing vectors of size $m$, with $m \ll n$. These embedding vectors are simply arrays of floating-point values. Typical reference values are $n \approx 100{,}000$ and $m \approx 500$.</li>
</ol>
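<p>The four steps above can be sketched in a few lines of NumPy; the vocabulary, sentence, and sizes below are toy values chosen for illustration:</p>

```python
import numpy as np

# 1. + 2. Toy tokenizer and vocabulary (token string -> token ID);
# real tokenizers use sub-word schemes such as BPE or WordPiece
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, ".": 6}
n = len(vocab)  # vocabulary size

def tokenize(text):
    return text.lower().replace(".", " .").split()

# 3. Map tokens to IDs (conceptually, one-hot vectors of size n)
tokens = tokenize("The cat sat on the mat.")
token_ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]

# 4. Embedding layer: a learnable lookup table of shape (n, m);
# looking up a row is equivalent to multiplying a one-hot vector with the table
m = 4  # embedding size (toy value; hundreds to thousands in practice)
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(n, m))
embeddings = embedding_table[token_ids]  # shape: (seq_len, m) = (7, 4)
```

In a trained model, the embedding table is learned jointly with the rest of the network, so the rows end up encoding meaning.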

<p align="center">
<img src="/assets/llms/text_embeddings.png" alt="Text Embeddings" width="1000" />
<small style="color:grey">A word/token can be represented as a one-hot vector (sparse) or as an embedding vector (dense). Embedding vectors can capture semantics in their directions and enable more efficient processing. Image by the author.
</small>
</p>

<p>By the way, embeddings can be created for images, too, as I explain in <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">this post on diffusion models</a>. In general, they have some very nice properties:</p>

<ul>
  <li>They build up a compact space, in contrast to the sparse one-hot vector space.</li>
  <li>They are continuous and differentiable.</li>
  <li>If semantics are properly captured, words with similar meanings will point in similar directions. As a consequence, we can perform arithmetic operations with them, such that algebraic operations (<code class="language-plaintext highlighter-rouge">+, -</code>) can be applied to words; for instance, the word <code class="language-plaintext highlighter-rouge">queen</code> is expected to be close to <code class="language-plaintext highlighter-rouge">king - man + woman</code>.</li>
</ul>
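<p>A toy example of this arithmetic, with hand-crafted 3-D vectors (real embeddings are learned from data and have hundreds of dimensions):</p>

```python
import numpy as np

# Hand-crafted 3-D toy embeddings; the last axis plays the role of a
# "gender direction" (made up for illustration)
emb = {
    "king":  np.array([0.9, 0.8, -0.7]),
    "man":   np.array([0.1, 0.2, -0.8]),
    "woman": np.array([0.1, 0.2,  0.8]),
    "queen": np.array([0.9, 0.8,  0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
result = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(result, emb[w]))
print(best)  # queen
```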

<p align="center">
<img src="/assets/llms/text_image_embeddings.png" alt="Arithmetics with Text and Image Embeddings" width="1000" />
<small style="color:grey">Embeddings can be computed for every modality (image, text, audio, video, etc.); we can even create multi-modal embedding spaces. If the embedding vectors capture meaning properly, similar concepts will have vectors pointing in similar directions. As a consequence, we can apply some algebraic operations to them. Image by the author.
</small>
</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>The original Transformer was designed for language translation and has two parts:</p>

<ul>
  <li>The <strong>encoder</strong>, which converts the input sequence (e.g., a sentence in English) into hidden states or context.</li>
  <li>The <strong>decoder</strong>, which generates an output sequence (e.g., the translated sentence in Spanish) using as guidance some of the output hidden states of the encoder.</li>
</ul>

<p align="center">
<img src="/assets/llms/llm_simplified.png" alt="LLM Simplified Architecture" width="1000" />
<small style="color:grey">Simplified architecture of the original <a href="https://arxiv.org/abs/1706.03762">Transformer</a> designed for language translation. Highlighted: inputs (sentence in English), outputs (hidden states and translated sentence in Spanish), and main parts (the encoder and the decoder).
</small>
</p>

<p>Using as reference the figure above, here’s how the Transformer works:</p>

<ul>
  <li>
<p>The encoder and the decoder are subdivided into <code class="language-plaintext highlighter-rouge">N</code> <em>encoder/decoder blocks</em> each; each block passes its hidden-state outputs as inputs to the next one.</p>
  </li>
  <li>
<p>The inputs of the first encoder block are the embedding vectors of the input text sequence. <em>Positional encodings</em> are added at the beginning to inject information about token order, since the self-attention layers inside the blocks (see next section) are position-agnostic. In the original paper, positional encoding vectors were $\mathbf{R} \rightarrow \mathbf{R}^m$ sinusoidal mappings: each position (a scalar) is mapped to a unique vector by systematically applying sinusoidal functions to it. In practice, however, learned positional embeddings are often used instead.</p>
  </li>
  <li>
    <p>For the translation task the encoder input contains the representation of the full original text sequence; meanwhile, the decoder produces the output sequence token by token, but it always has access to the full and final encoder hidden states (the context).</p>
  </li>
  <li>
    <p>The <em>decoder blocks</em> work in a similar way as the <em>encoder blocks</em>; the last <em>decoder block</em> produces the final set of hidden states, which are mapped to output token probabilities using a linear layer followed by a softmax function (i.e., we have a classification head over the vocabulary).</p>
  </li>
</ul>
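<p>As a minimal sketch, the sinusoidal positional encodings of the original paper can be computed as follows (the sizes are example values):</p>

```python
import numpy as np

def positional_encoding(seq_len, m):
    """Sinusoidal positional encodings (Vaswani et al., 2017):
    PE[pos, 2i] = sin(pos / 10000^(2i/m)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, m, 2)[None, :]            # even dimension indices, (1, m/2)
    angles = pos / np.power(10000.0, i / m)    # (seq_len, m/2)
    pe = np.zeros((seq_len, m))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Each position gets a unique vector, which is simply added to the
# token embeddings: x = token_embeddings + pe
pe = positional_encoding(seq_len=128, m=512)
print(pe.shape)  # (128, 512)
```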

<p>Soon after the publication of the original encoder-decoder Transformer designed for the language translation task, two related, important Transformers were introduced:</p>

<ul>
  <li><a href="https://arxiv.org/abs/1810.04805"><strong>BERT</strong>: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)</a>, which is an implementation of the <strong>encoder-only</strong> part of the original Transformer.</li>
  <li><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf"><strong>GPT</strong>: Improving Language Understanding by Generative Pre-Training (Radford et al., 2018)</a>, an implementation of the <strong>decoder-only</strong> part of the original Transformer.</li>
</ul>

<p>BERT-like <em>encoder-only</em> transformers are commonly used to generate <em>feature vectors</em> $x$ of texts, which can be used in downstream applications such as text or token/word classification. When the encoder is trained on its own, the text sequence is shown to the model with some tokens masked, and the model must predict them. This scheme is called <em>masked language modeling</em>.</p>

<p>GPT-like <em>decoder-only</em> transformers are commonly used as <em>generative models</em> to predict the next token in a sequence, given all the previous tokens (i.e., the context, which includes the prompt). During training, the model is shown sequences of text and learns to predict each token based on the preceding ones.</p>

<p>The full <em>encoder-decoder</em> architecture is not as common as the other two currently, but it is used in some specific models for text-to-text tasks, such as summarization and translation. Examples include <a href="https://arxiv.org/abs/1910.10683">T5 (Raffel et al., 2019)</a> and <a href="https://arxiv.org/pdf/1910.13461">BART (Lewis et al., 2019)</a>.</p>

<h2 id="deep-dive-into-the-transformer-architecture">Deep Dive into the Transformer Architecture</h2>

<p>So far, we’ve seen the big picture of the Transformer architecture and its subtypes (encoder-decoder, encoder-only, decoder-only).</p>

<blockquote>
  <p>But what’s inside those encoder and decoder blocks? Just Attention, normalization, and linear mappings. Let’s see them in detail.</p>
</blockquote>

<p align="center">
<img src="/assets/llms/transformer_annotated.png" alt="Transformer Architecture, Annotated" width="1000" />
<small style="color:grey">The Transformer architecture with all its components. Image from the original paper by <a href="https://arxiv.org/abs/1706.03762">Vaswani et al. (2017)</a>, modified by the author.
</small>
</p>

<p>As we can see in the figure above, each of the <code class="language-plaintext highlighter-rouge">N</code> encoder and decoder blocks are composed of the following sub-components:</p>

<ul>
  <li><strong>Multi-Head Self-Attention modules</strong>: The core component of the Transformer. It allows the model to focus on different parts of the input sequence when processing each token. Multiple attention heads enable the model to capture various relationships and dependencies in the data. More on this below.</li>
  <li><strong>Skip connections, Add &amp; Norm</strong>: These are <a href="https://arxiv.org/abs/1512.03385">residual (skip) connections</a> followed by <a href="https://en.wikipedia.org/wiki/Normalization_(machine_learning)#Layer_normalization">layer normalization</a>. Residual connections help to avoid vanishing gradients in deep networks by allowing gradients to flow directly through the skip connections. Normalizing the inputs across the features dimension stabilizes and accelerates training.</li>
  <li><strong>Feed-Forward Neural Network</strong> (FFNN, i.e., several concatenated linear mappings): A fully connected feed-forward network applied independently to each position. It consists of two linear transformations with a <a href="https://en.wikipedia.org/wiki/Rectified_linear_unit">ReLU</a> activation in between, allowing the model to learn complex representations.</li>
</ul>

<p>The key contribution of the Transformer architecture is the <strong>Self-Attention</strong> mechanism. Attention was introduced by <a href="https://arxiv.org/abs/1409.0473">Bahdanau et al. (2014)</a> and it allows the model to weigh the importance of different tokens in the input sequence when processing each token. In the Transformer, the pairwise similarities of the tokens in the sequence are computed simultaneously (via dot products) and used to weight and sum the embeddings in successive steps.</p>

<p>We can see there are different types of attention modules in the Transformer:</p>

<ul>
  <li>Self-Attention in the encoder blocks: Each token attends to <em>all</em> tokens in the input sequence. It’s called self-attention because only the similarities among the input tokens themselves are used, i.e., without any interaction with the decoder. For more information, keep reading below.</li>
  <li>Masked Self-Attention in the decoder blocks: Each token attends to <em>all previous</em> tokens in the output sequence (masked to prevent attending to future tokens).</li>
  <li>Encoder-Decoder Cross-Attention in the decoder blocks: Each token in the output sequence attends to <em>all tokens in the encoder-input sequence</em>. In other words, all final hidden states from the encoder are used in the attention computation.</li>
</ul>

<p>Additionally, each attention module is implemented as a <strong>Multi-Head Attention</strong> mechanism. This means that multiple attention heads are used in parallel. The following figure shows a brief overview of how this works.</p>

<p align="center">
<img src="/assets/llms/llm_attention_architecture.png" alt="LLM Attention Architecture" width="1000" />
<small style="color:grey">The LLM (Self-)Attention module, annotated. Image by the author.
</small>
</p>

<p>The <strong>Self-Attention Head</strong> is the core implementation of the attention mechanism in the Transformer. Each multi-head attention module contains $n$ self-attention heads, which operate in parallel. The input embedding sequence $Z$ is passed to each of these $n$ self-attention heads, where the following occurs:</p>

<ul>
  <li>We transform the original embeddings $Z$ into $Q$ (query), $K$ (key), and $V$ (value). The transformation is performed by linear/dense layers ($W_Q$, $W_K$, $W_V$), which consist of the learned weights. These <em>query</em>, <em>key</em>, and <em>value</em> variables come from classical <a href="https://en.wikipedia.org/wiki/Information_retrieval">information retrieval</a>; as described in <a href="https://www.oreilly.com/library/view/natural-language-processing/9781098136789/">NLP with Transformers (Tunstall et al., 2022)</a>, using the analogy to a recipe they can be interpreted as follows:
    <ul>
      <li>$Q$, <em>queries</em>: ingredients in the recipe.</li>
      <li>$K$, <em>keys</em>: the shelf-labels in the supermarket.</li>
      <li>$V$, <em>values</em>: the items in the shelf.</li>
    </ul>
  </li>
  <li>$Q$ and $K$ are used to compute similarity scores between token embeddings (<em>self</em> dot-product), and then we use those similarity scores to weight the values $V$, so the relevant information is amplified. This can be expressed mathematically with the popular and simple <em>attention</em> formula:
\(Y = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V,\)
where
    <ul>
      <li>$Y$ are the <em>contextualized embeddings</em>,</li>
      <li>and $d_k$ is the dimension of the key vectors (used for scaling), which is the same as the embedding size divided by the number of heads (head dimension).</li>
    </ul>
  </li>
</ul>

<p>Then, these $Y_1, \dots, Y_n$ contextualized embeddings are concatenated and linearly transformed to yield the final output of the multi-head attention module. The output of the first multi-head self-attention module is the input of the next one, and so on, until all $N$ blocks have processed the embedding sequence. Note that the output embeddings from each encoder block have the same size as the input embeddings, so the encoder block stack has the function of <em>transforming</em> those embeddings with the attention mechanism.</p>
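<p>The attention formula and the multi-head concatenation can be sketched in NumPy with toy sizes; here, random weight matrices stand in for the learned ones:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(Z, W_q, W_k, W_v):
    """One head: Y = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v        # project embeddings to Q, K, V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # pairwise similarities, rows sum to 1
    return weights @ V                         # contextualized embeddings

# Toy sizes: 5 tokens, embedding size 8, 2 heads -> head dimension 4
rng = np.random.default_rng(0)
seq_len, m, n_heads = 5, 8, 2
d_head = m // n_heads
Z = rng.normal(size=(seq_len, m))

# Multi-head attention: run the heads, concatenate, and project back to size m
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(m, d_head)) for _ in range(3))
    heads.append(self_attention_head(Z, W_q, W_k, W_v))
W_o = rng.normal(size=(m, m))
Y = np.concatenate(heads, axis=-1) @ W_o  # same shape as the input Z: (5, 8)
```

Note how the output has the same shape as the input, which is what allows the blocks to be stacked.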

<p><br /></p>

<blockquote>
  <p>I hope it is now clear why the Transformer paper is titled <em>Attention Is All You Need</em>: It turns out that successively focusing and transforming the embeddings via the attention mechanism produces the magic in the LLMs.</p>
</blockquote>

<p><br /></p>

<p>Finally, let’s see some typical size values, for reference:</p>

<ul>
  <li>Embedding size: 768, 1024, …, 2048.</li>
  <li>Sequence length (context, number of tokens): 128, 256, …, 8192.</li>
  <li>Number of layers/blocks, $N$: 12, 24, 36, 48.</li>
  <li>Number of attention heads, $n$: 12, 16, 20, 32.</li>
  <li>Head dimension: typically, embedding size divided by number of heads.</li>
  <li>Feed-Forward Network (FFN) inner dimension: 2048, 4096, …, 10240.</li>
  <li>Vocabulary size: 30,000; 50,000; 100,000; 200,000.</li>
  <li>Total number of parameters: from 110 million (e.g., BERT-base) to 175 billion (e.g., GPT-3), and much more!</li>
</ul>
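<p>With these reference sizes, we can do a back-of-the-envelope parameter count for a BERT-base-like encoder; this simplified estimate ignores biases, positional embeddings, and layer-norm parameters:</p>

```python
# Back-of-the-envelope parameter count with BERT-base-like sizes
emb, n_blocks, ffn, vocab = 768, 12, 3072, 30000

embedding_params = vocab * emb           # token embedding table
attention_params = 4 * emb * emb         # W_Q, W_K, W_V + output projection, per block
ffn_params = 2 * emb * ffn               # two linear layers, per block
total = embedding_params + n_blocks * (attention_params + ffn_params)
print(f"{total / 1e6:.0f}M parameters")  # ~108M, close to BERT-base's ~110M
```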

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
If you are interested in an implementation of the Transformer, you can check <a href="https://github.com/mxagar/nlp_with_transformers_nbs/blob/main/03_transformer-anatomy.ipynb">this notebook</a>, where I modified the code from the official repository of the book <a href="https://www.oreilly.com/library/view/natural-language-processing/9781098136789/">NLP with Transformers (Tunstall et al., 2022)</a>. In the same <a href="https://github.com/mxagar/nlp_with_transformers_nbs/">repository</a>, you'll find many other notebooks related to NLP with Transformers.
</strong>
</div>
<div style="height: 30px;"></div>

<h2 id="using-the-transformer-outputs">Using the Transformer Outputs</h2>

<p>There are many ways in which the outputs of the Transformer can be used, depending on the task and the architecture (some of these ways were mentioned above already):</p>

<ul>
  <li>Encoder-decoder models (e.g., <a href="https://arxiv.org/abs/1910.10683">T5 (Raffel et al., 2019)</a>) have been used for <em>text-to-text</em> tasks, such as <em>translation</em> and <em>summarization</em>. However, big enough decoder-only models (e.g., <a href="https://arxiv.org/abs/2005.14165">GPT-3 (Brown et al., 2020)</a>) have shown remarkable performance in these tasks, too, and have become more popular nowadays.</li>
  <li>Encoder-only models (e.g., <a href="https://arxiv.org/abs/1810.04805">BERT (Devlin et al., 2018)</a>) are commonly used to generate <em>feature vectors</em> of texts, which can be used in downstream applications such as text or token/word classification, or even regression. We just need to attach the proper mapping head to the output of the encoder (e.g., a linear layer for classification) and fine-tune the model on the specific task.</li>
  <li>Decoder-only models (e.g., <a href="https://arxiv.org/abs/2005.14165">GPT-3 (Brown et al., 2020)</a>) are commonly used as <em>generative models</em> to predict the next token in a sequence, given all the previous tokens (i.e., the context, which includes the prompt).</li>
</ul>

<p>Probably, the most common way to interact with LLMs for the layman user is the latter: decoder-only <em>generative models</em>. As mentioned, these models generate one word/token at a time, so we feed their outputs back as inputs for successive generations (hence, they are called <em>autoregressive</em>). In that scheme, we need to consider the following questions:</p>

<ol>
  <li><em>Which tokens are considered as candidates at each generation step?</em> (token sampling)</li>
  <li><em>What strategy is used to select and chain the tokens?</em> (token search during decoding)</li>
</ol>

<p>Recall that the output of the generative model is an array of probabilities, specifically, a float value $p \in [0,1]$ for each item in the vocabulary set $V$. A naive approach would be to</p>

<ol>
  <li>consider all token probabilities as candidates $\lbrace p_1, p_2, \dots \rbrace$ (full distribution sampling),</li>
  <li>and select the token with the highest probability at each generation step: $\mathrm{token} = V(\mathrm{argmax} \lbrace p_1, p_2, \dots \rbrace)$ (greedy search decoding).</li>
</ol>

<p>However, such a naive approach often leads to repetitive and dull text generation, as described by <a href="https://arxiv.org/abs/1904.09751">Holtzman et al. (2019)</a>. To mitigate this issue, these parameters and strategies are often used:</p>

<ul>
  <li>Temperature: we apply the <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a> function to the output logits $z_i$ (log-probabilities up to a constant) using a <em>temperature</em> variable $T$ in the exponent (<a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann distribution</a>): $p_i' = \mathrm{softmax}(z_i / T)$. That changes the $p$ values as follows:
    <ul>
      <li>$T = 1$: no change, same as in the original output.</li>
      <li>$T &lt; 1$: small $p$-s become smaller, larger $p$-s become larger; that means we get a more peaked distribution, i.e., less creativity and more coherence, because the most likely words are going to be chosen.</li>
      <li>$T &gt; 1$: small $p$-s become bigger, larger $p$-s become smaller; that yields a more homogeneous distribution, which leads to more creativity and diversity, because any word/token could be chosen.</li>
    </ul>
  </li>
  <li>Top-$k$ and top-$p$: instead of considering all tokens, each with its probability (with or without $T$), we restrict the candidates to the $k$ most likely ones and sample from them using the resulting distribution; similarly, with top-$p$, we select the smallest set of most likely tokens whose cumulative probability reaches the threshold $p$ and sample from those.</li>
  <li>Beam search decoding (as opposed to greedy search): we select a number of beams $b$ and keep track of the most probable next tokens, building a tree of options. The most likely paths/beams are chosen by ranking the beams with their summed log probabilities. The higher the number of beams, the better the quality, but the computational effort explodes. Beam search sometimes suffers from repetitive generation; one way to avoid that is using an n-gram penalty, i.e., penalizing the repetition of n-grams. This is commonly used in summarization and machine translation.</li>
</ul>
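<p>A minimal sketch of temperature, top-$k$, and top-$p$ sampling; the logits and vocabulary size below are made up for illustration:</p>

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token ID applying temperature, top-k, and/or top-p (nucleus)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()

    if top_k is not None:  # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:  # keep the smallest set with cumulative probability >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, top_p)) + 1]
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked

    probs /= probs.sum()  # renormalize after truncation
    return int(rng.choice(len(probs), p=probs))

# Toy vocabulary of 5 tokens; low temperature -> (almost) always the argmax token
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
token = sample_next_token(logits, temperature=0.05, top_k=3)
```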

<p align="center">
<img src="/assets/llms/token_sampling.png" alt="Token Sampling" width="1000" />
<small style="color:grey">Token sampling strategies in LLMs. The LLM outputs a probability for each of the tokens in the vocabulary. If we apply a temperature <i>T</i>, top-<i>k</i>, or top-<i>p</i> strategy, we modify the distribution from which the next token is sampled. With <i>T &gt; 1</i>, the distribution becomes more uniform, leading to more diverse outputs (since tokens have a more similar probability); in contrast, with <i>T &lt; 1</i>, the distribution becomes more peaked, leading to more focused and coherent outputs. If we set top-<i>k</i> to be 3, we only consider the three most likely tokens for sampling; similarly, with a top-<i>p</i> threshold of 80%, we consider the smallest set of tokens whose cumulative probability is at least 80%. Image by the author.
</small>
</p>

<h2 id="additional-relevant-concepts">Additional Relevant Concepts</h2>

<p>My goal with this post was to explain in plain but still technical words how LLMs work internally. In that sense, I guess I have already given the best I could and I should finish the text. However, there are some additional details that probably fit nicely as appendices here. Thus, I have decided to include them with a brief description and some references, for the readers who optionally want to go deeper into the topic.</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p><strong>Context Size</strong> — This refers to the maximum number of words/tokens that the model can consider as input at once, i.e., the input sequence length or <code class="language-plaintext highlighter-rouge">seq_len</code>. If we look at the attention mechanism figure above, we will see that the learned weight matrices are independent of the context size; however, the attention computation itself scales quadratically with sequence length due to the $QK^T$ operation. This is a major bottleneck in terms of memory and speed, and it’s the main reason why the initial LLMs had a fixed and shorter context size ($512$ - $4,096$ tokens). In recent years, the research community has explored new methods to alleviate that limitation, introducing techniques such as <a href="https://arxiv.org/abs/2004.05150">sparse attention</a>, <a href="https://arxiv.org/abs/2006.16236">linearized attention</a>, <a href="https://arxiv.org/abs/2006.04768">low-rank approximations</a>, and other mathematical/architectural/system tricks. These enable larger context sizes (up to $1,000,000$ tokens in the case of <a href="https://gemini.google.com/app">Gemini Pro</a>).</p>

<p><strong>Distillation and Quantization</strong> — As their name indicates, Large Language Models are <em>large</em>, and that makes them difficult to deploy in production environments. Two techniques to overcome that are <em>distillation</em> and <em>quantization</em>. When we distill a model, we train a smaller student model to mimic the behavior of a larger, slower but better performing teacher (i.e., the original LLM). This is achieved, among other techniques, by using the teacher’s output probabilities as soft labels when training the student. A notable example of distillation is <a href="https://arxiv.org/abs/1910.01108">DistilBERT (Sanh et al., 2019)</a>, which achieves around 97% of BERT’s performance, but with 40% less memory and 60% faster inference. On the other hand, <em>quantization</em> consists in representing the weights with lower precision, i.e., <code class="language-plaintext highlighter-rouge">float32 -&gt; int8</code> ($32/8 = 4$ times smaller models). The models not only become smaller, but the operations can be done faster (even 100x faster), and the accuracy is sometimes similar.</p>
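<p>The quantization idea can be sketched with symmetric 8-bit quantization of a random weight matrix; this is a simplified illustration, as production libraries use per-channel scales and calibration data:</p>

```python
import numpy as np

# Symmetric 8-bit quantization of a weight matrix (simplified sketch)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)  # fp32 weights

scale = np.abs(W).max() / 127.0               # map the largest weight to the int8 range
W_int8 = np.round(W / scale).astype(np.int8)  # quantize: 1 byte per weight
W_dequant = W_int8.astype(np.float32) * scale # dequantize for (or during) inference

print(W.nbytes // W_int8.nbytes)              # 4 -> 4x smaller
print(float(np.abs(W - W_dequant).max()))     # rounding error bounded by scale / 2
```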

<p><strong>Emergent Abilities</strong> — As described by <a href="https://arxiv.org/abs/2206.07682">Wei et al. (2022)</a>, <em>“emergent abilities are those that are not present in smaller models, but appear in larger ones”</em>. In other words, they are capabilities that arise without being explicitly trained for. This is often referred to as <em>zero-shot</em> or <em>few-shot</em> learning, because the model can perform tasks with no or very few examples, as demonstrated by <a href="https://arxiv.org/abs/2005.14165">GPT-3 (Brown et al., 2020)</a>. These abilities start to appear in the 10-100 billion parameter range (GPT-3 had 175 billion parameters). Examples of emergent abilities include arithmetic, commonsense reasoning, and even some forms of creativity.</p>

<p><strong>Scaling Laws</strong> — Kaplan et al. published in 2020 the interesting paper <a href="https://arxiv.org/abs/2001.08361">Scaling Laws for Neural Language Models</a>, which describes how the performance of language models scales. They discovered a power-law relationship between the model’s performance measured in terms of loss $L$, the required compute $C$, the dataset size $D$, and the model size $N$ (number of parameters): $L(X) \sim X^{-\alpha}$, with $X \in \lbrace N, C, D \rbrace$ and $\alpha \in [0.05, 0.1]$. In other words, when model size $N$, dataset size $D$, or training compute $C$ is scaled independently (and the others are not bottlenecks), the training loss $L$ decreases approximately as a power law of that quantity. This means we can use scaling laws to extrapolate model performance without actually training the models! Similarly, for a fixed compute budget, there is an optimal trade-off between model size and dataset size. These insights led to the development of more efficient training strategies and architectures, such as the ones explored in the <a href="https://arxiv.org/abs/2203.15556">Chinchilla study (Hoffmann et al., 2022)</a>, which suggests that smaller models trained on more data can achieve better performance than larger models trained on less data. Finally, note that training compute is roughly proportional to $6 \times N \times D$, while inference compute scales linearly with model size and generated sequence length.</p>
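<p>A quick worked consequence of the power law, using $\alpha \approx 0.076$ as an example exponent (of the order of magnitude reported for model size):</p>

```python
# If loss follows L(N) = (N_c / N)^alpha, doubling the model size N
# multiplies the loss by the constant factor 2^(-alpha)
alpha = 0.076                # example exponent (order of magnitude for model size)
factor = 2 ** (-alpha)       # loss multiplier per doubling of N
print(round(factor, 3))      # 0.949 -> ~5% lower loss per doubling of model size
```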

<p><strong>RLHF: Reinforcement Learning with Human Feedback</strong> — OpenAI presented <a href="https://arxiv.org/abs/2203.02155">InstructGPT (Ouyang et al., 2022)</a> shortly before releasing their popular <a href="https://chatgpt.com">ChatGPT</a>. This paper explains how the initial chatbot model GPT-3.5 was aligned with human preferences using <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a>. They followed 3 major steps: (1) First, a GPT model was fine-tuned with human-written conversation input-output pairs. (2) Then, the GPT model produced several answers to a set of prompts and human annotators ranked these outputs from best to worst. These annotations were used to train a reward model (RM) to automatically predict the output score. (3) Finally, the GPT model (<em>policy</em>) was trained using the <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization (PPO) algorithm</a>, based on the conversation history (<em>state</em>) and the outputs it produced (<em>actions</em>), and using the reward model (<em>reward</em>) as the evaluator.</p>

<p><strong>PEFT: Parameter-Efficient Fine-Tuning</strong> — <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation of Large Language Models (or LoRA by Hu et al., 2021)</a> consists in applying a mathematical trick during the fine-tuning of LLMs to make the process much more efficient. The pre-trained weight matrices $W$ are frozen, and we add to them a trainable update $dW = A \cdot B$, where the factors $A$ and $B$ have a much lower rank, and thus far fewer parameters. The trick reduces trainable parameters by orders of magnitude and maintains or matches full fine-tuning performance on many benchmarks. Therefore, it has become a standard method for domain adaptation and instruction tuning. One popular implementation is the <a href="https://github.com/huggingface/peft"><code class="language-plaintext highlighter-rouge">peft</code></a> library from <a href="https://huggingface.co/docs/peft/index">HuggingFace</a>.</p>
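<p>The LoRA trick can be sketched in a few lines; the sizes are toy values, and the zero initialization of one factor (so the update is zero at the start of training) follows the paper:</p>

```python
import numpy as np

# LoRA sketch: W stays frozen; only the low-rank factors A and B are trained
d, k, r = 1024, 1024, 8                # layer dimensions and rank, r << min(d, k)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))            # frozen pre-trained weights
A = rng.normal(size=(d, r)) * 0.01     # trainable factor, d x r
B = np.zeros((r, k))                   # trainable factor, r x k (zero init -> dW = 0)

def forward(x):
    # Effective weight is W + A @ B; computing (x @ A) @ B avoids
    # materializing the d x k update matrix
    return x @ W + (x @ A) @ B

full_params = d * k                    # parameters updated by full fine-tuning
lora_params = d * r + r * k            # parameters updated by LoRA
print(full_params // lora_params)      # 64 -> 64x fewer trainable parameters
```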

<p><strong>RAG: Retrieval Augmented Generation</strong> — LLMs have humongous amounts of general knowledge encoded in their parameters, but need to be fine-tuned for specific domains. That process is cumbersome and often inefficient, particularly when domain-specific information changes frequently. The work <a href="https://arxiv.org/abs/2005.11401">Retrieval-Augmented Generation (RAG) for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)</a> addressed such settings by using non-parametric information, i.e., they outsource the domain-specific memory. It works as follows: in an offline ingestion phase, the knowledge is chunked and indexed, often as embedding vectors. In the real-time generation phase, the user asks a question, which is encoded and used to retrieve the most similar indexed chunks; then, the LLM is prompted to answer the question by using the found similar chunks, i.e., the retrieved data is injected in the query. RAGs reduce hallucinations and have been extensively implemented recently.</p>
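<p>Both RAG phases can be sketched with fake one-hot vectors standing in for a real embedding model (which a production system would pair with a vector database):</p>

```python
import numpy as np

# Offline ingestion: chunk the knowledge base and index one embedding per chunk
chunks = [
    "LoRA freezes W and trains low-rank updates.",
    "Transformers use self-attention.",
    "RAG retrieves relevant chunks and injects them into the prompt.",
]
index = np.eye(len(chunks), 16)  # fake chunk embeddings (one-hot rows)

def retrieve(query_embedding, k=1):
    # Cosine similarity between the query and every indexed chunk
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Online generation: embed the question, retrieve, and augment the prompt
rng = np.random.default_rng(0)
question_embedding = index[2] + 0.01 * rng.normal(size=16)  # close to chunk 2
context = retrieve(question_embedding, k=1)
prompt = f"Answer using this context: {context[0]}\nQuestion: What is RAG?"
```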

<p><strong>Reasoning Models</strong> — Wei et al. showed that <a href="https://arxiv.org/abs/2201.11903">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)</a>. In other words, prompting the model to <em>think step by step</em> improves its performance on math and reasoning tasks, suggesting that reasoning abilities are partly latent in large models. That simple yet powerful idea sparked research into prompting strategies and drove fine-tuning with objectives that encourage multi-step inference, structured thinking, and tool use. One of the first popular open-source reasoning models was <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek-R1</a>, but most current models have improved reasoning capabilities, either through scale or via fine-tuning with reasoning objectives.</p>
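<p>Chain-of-thought is literally a change in the prompt, not in the model. A minimal sketch of the prompt variants (the actual model call is omitted; the question is an illustrative classic):</p>

```python
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

# Direct prompting: ask for the answer immediately
direct_prompt = f"{question}\nAnswer:"

# Zero-shot chain-of-thought: nudge the model to reason before answering
cot_prompt = f"{question}\nLet's think step by step."

# Few-shot chain-of-thought: prepend a worked example showing the reasoning pattern
example = (
    "Q: I have 3 apples and buy 2 more. How many apples do I have?\n"
    "A: I start with 3 apples. Buying 2 more gives 3 + 2 = 5. The answer is 5.\n"
)
few_shot_cot_prompt = example + f"Q: {question}\nA:"
```

With the direct prompt, models often blurt out the intuitive (wrong) answer; the step-by-step variants make the intermediate arithmetic explicit, which is exactly the effect Wei et al. measured.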

<p><strong>Agents</strong> — The improvement of reasoning capabilities and tool usage boosted the development of the so-called <em>agentic workflows</em>. An agent is basically an LLM with tool access that is allowed to perform actions, e.g., read our emails and do some processing with them, such as classifying them or even answering the trivial ones. Libraries like <a href="https://www.langchain.com">LangChain</a> and <a href="https://www.langchain.com/langgraph">LangGraph</a> have made it relatively easy to build multi-agent systems that perform increasingly complex workflows, and frameworks like <a href="https://openclaw.ai">OpenClaw</a> have enabled the creation of personalized assistants. The future is being automated, and agents seem to be a key part of that — security issues aside.</p>
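<p>The control flow behind such an agent can be sketched as a simple loop in which the model either calls a tool or answers. In the sketch below, a hard-coded stub stands in for the LLM, and the email-counting tool is a hypothetical example:</p>

```python
def count_unread(mailbox: list) -> int:
    """A tool the agent is allowed to call."""
    return sum(1 for mail in mailbox if not mail["read"])

TOOLS = {"count_unread": count_unread}

def model(observation: str) -> dict:
    """Stub policy: a real agent would query an LLM here.
    It requests the tool once, then answers with the tool's result."""
    if "tool_result" not in observation:
        return {"action": "tool", "name": "count_unread"}
    return {"action": "answer",
            "text": f"You have {observation.split('=')[1]} unread emails."}

def run_agent(mailbox: list) -> str:
    observation = "task=summarize inbox"
    for _ in range(5):  # hard cap on steps, a common safeguard
        step = model(observation)
        if step["action"] == "tool":
            result = TOOLS[step["name"]](mailbox)   # execute the tool
            observation = f"tool_result={result}"   # feed the result back
        else:
            return step["text"]
    return "step limit reached"

inbox = [{"read": False}, {"read": True}, {"read": False}]
print(run_agent(inbox))  # -> "You have 2 unread emails."
```

Frameworks like LangGraph essentially generalize this loop into graphs of model calls, tools, and state.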

<h2 id="summary-and-some-final-thoughts">Summary and Some Final Thoughts</h2>

<p>Large Language Models are not magic. At their core, they are stacks of linear transformations, normalization layers, and attention mechanisms applied to sequences of embedding vectors. And yet, by scaling those simple components to unprecedented sizes (in terms of parameters, data, and compute) they exhibit capabilities that feel surprisingly powerful.</p>

<p>In this post, I have:</p>

<ul>
  <li>Reviewed how text is converted into embedding vectors.</li>
  <li>Described the original encoder-decoder Transformer and its encoder-only and decoder-only variants.</li>
  <li>Taken a closer look at the self-attention mechanism and multi-head attention.</li>
  <li>Discussed how decoding strategies such as temperature, top-$k$, top-$p$, and beam search influence text generation.</li>
  <li>Briefly touched on important side concepts such as context length, scaling laws, RLHF, PEFT/LoRA, RAG, reasoning models, and agents.</li>
</ul>

<p>If you want to go deeper into the topic, I recommend the following resources:</p>

<ul>
  <li><a href="https://arxiv.org/abs/1706.03762">The original paper: <em>Attention Is All You Need (Vaswani et al., 2017)</em></a></li>
  <li><a href="https://github.com/mxagar/nlp_with_transformers_nbs">My notes of the great book <em>NLP with Transformers (Tunstall et al., 2022)</em></a></li>
  <li><a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer (Jay Alammar)</a></li>
  <li><a href="https://nlp.seas.harvard.edu/annotated-transformer/">The Annotated Transformer (Harvard NLP)</a></li>
  <li><a href="https://github.com/karpathy/minGPT">A minimal PyTorch re-implementation of the OpenAI GPT (Andrej Karpathy)</a></li>
</ul>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>LLMs have increased my productivity significantly. I use them extensively for research, text editing, and programming. However, I still think that they are <em>expert systems <strong>for experts</strong></em>: when used without proper guidance, the quality of their output can be quite mediocre — and in some cases, even worse: I have seen many instances of dull AI-generated texts and bloated, unmaintainable code. I am aware, of course, that they are improving at a rapid pace.</p>

<p>Overall, I am optimistic. In the same way that the Internet increased our overall productivity — while replacing some jobs and creating new ones — I think LLMs will probably have a similar (or even greater) net-positive effect. For instance, I do not believe that Software Engineers will disappear. Rather, they will likely shift toward tasks related to architecture, orchestration, integration, and maintenance. Junior roles and inexperienced professionals seem to be the most affected at the moment, but they will also be able to learn faster with these tools than we did before. As the ecosystem stabilizes, they may end up being in even higher demand. And, at the end of the day, everyone will want a human responsible for any AI-generated outcome.</p>

<p>I am confident that we will find ways to mitigate risks such as <em>dependence</em>, <em>personal data harvesting</em> and <em>automated control/surveillance</em>, just as we invented gyms to stay fit and healthy, or engineered locks and cryptographic systems to protect our privacy. At the moment, it is hard for me to believe that a Transformer-based model can intentionally go rogue and cause harm on its own, because I cannot conceive of any <em>consciousness</em> in them — at least not in the sense of <em>“I am aware that I exist here and now, and I have some purpose and agency”</em>. I see LLMs as systems that simulate patterns from their training data. In contrast, humans maintain (and constantly update) a world model, and use language as a tool to interact with that world model. An LLM has no internal state and it can’t learn in real-time — it can only simulate <em>“small breaths”</em> by producing tokens; one could even argue that it <em>“dies”</em> after producing each token, or at least once the final <code class="language-plaintext highlighter-rouge">&lt;STOP&gt;</code> token is emitted.</p>

<p>At the same time, granting an LLM-based agent unrestricted access to personal information and powerful tools could indeed be irresponsible — perhaps comparable to giving a monkey a machine gun. Yet we have faced similar situations in the past: when technologies become powerful, we introduce safeguards, norms, and regulation to govern their use.</p>

<p>Let’s see what the future holds for us. I believe it will be exciting, and that we will find ways to navigate the limitations and risks of this technology, as we’ve done in the past.</p>

<p><br /></p>

<blockquote>
  <p>How have LLMs impacted your life? How do you think they will change the world in the next 5-10 years?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/how-are-llms-built.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/how-are-llms-built.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="large" /><category term="language" /><category term="models," /><category term="llm," /><category term="machine" /><category term="learning," /><category term="text" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="attention" /><summary type="html"><![CDATA[&lt;!– Blog Post 1: How Are Large Language Models (LLMs) Built? Subtitle: A Conceptual Guide for Developers]]></summary></entry><entry><title type="html">An Introduction to Image Generation with Diffusion Models (2/2)</title><link href="https://mikelsagardia.io/blog/diffusion-hands-on.html" rel="alternate" type="text/html" title="An Introduction to Image Generation with Diffusion Models (2/2)" /><published>2026-01-22T10:30:00+00:00</published><updated>2026-01-22T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/diffusion-hands-on</id><content type="html" xml:base="https://mikelsagardia.io/blog/diffusion-hands-on.html"><![CDATA[<!--
Blog Post 1  
Title: An Introduction to Image Generation with Diffusion Models (1/2)  
Subtitle: A Conceptual Guide for Developers & ML Practitioners

Blog Post 2  
Title: An Introduction to Image Generation with Diffusion Models (2/2)  
Subtitle: Hands-On Examples with Hugging Face
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  Hands-On Examples with HuggingFace
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/diffusion/ai_drawing_ai_dallev3.png" alt="An AI drawing an AI drawing an AI. Image generated using Dalle-E 3" width="1000" />
<small style="color:grey">An AI drawing an AI drawing an AI... Image generated using 
<a href="https://openai.com/index/dall-e-3/">Dall-E 3</a>. Prompt: <i>A friendly humanoid robot sits at a wooden table in a bright, sunlit room, happily drawing on a sketchbook. Soft light colors, landscape, peaceful, productive, and joyful atmosphere. The robot is drawing an image of itself drawing, creating a recursive effect. Large window in the background with greenery outside, warm natural lighting.</i>
</small>
</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
This is the second post of a series of two.
You can find the <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">first part here</a>.
Also, you can find the accompanying code in <a href="https://github.com/mxagar/diffusion-examples/tree/main/diffusers">this GitHub repository</a>.
</strong>
</div>
<div style="height: 30px;"></div>

<p>In just a few years, image generation has gone from <em>“cool demo”</em> to an almost ubiquitous tool. <a href="https://arxiv.org/abs/1312.6114">Variational Autoencoders (VAEs - Kingma &amp; Welling, 2013)</a> were followed by <a href="https://arxiv.org/abs/1406.2661">Generative Adversarial Networks (GANs - Goodfellow et al., 2014)</a>, and finally <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Models (DDPMs - Ho et al., 2020)</a> became the dominant approach, leading to systems like <a href="https://arxiv.org/pdf/2307.01952">Stable Diffusion XL (Podell et al., 2023)</a> or <a href="https://arxiv.org/abs/2205.11487">Imagen &amp; Nano Banana</a>.</p>

<p>In the <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">first post of this series</a>, I explain how these model families work and I walk through a minimal DDPM implementation.
That DDPM is trained on car images and produces outputs like these:</p>

<p align="center">
<img src="/assets/diffusion/car_generation_best_model.png" alt="Eight Samples Generated by a DDPM" width="1000" />
<small style="color:grey">
Output of a <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Model (Ho et al., 2020)</a> consisting of 54 million parameters, trained on the <a href="https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset">Stanford Cars Dataset</a> (16,185 color images resized to <code>64x64</code> pixels) for 300 epochs. Check the complete implementation <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">here</a>.
</small>
</p>

<p>In this second part, I’ll focus on <strong>the practical use of diffusion models</strong>, specifically, on using the invaluable tools provided by <a href="https://huggingface.co/">HuggingFace</a>. To that end, I’ve divided the post into three parts:</p>

<ol>
  <li>A brief <a href="#a-very-brief-introduction-to-huggingface">introduction to HuggingFace</a>.</li>
  <li>A hands-on dive into some examples with <a href="#huggingface-diffusers-in-practice">HuggingFace Diffusers</a>.</li>
  <li>A small <a href="#in-painting-application">in-painting application</a> that puts everything together.</li>
</ol>

<p>Let’s go!</p>

<h2 id="a-very-brief-introduction-to-huggingface">A Very Brief Introduction to HuggingFace</h2>

<p><a href="https://huggingface.co">HuggingFace</a> has become one of the most important hubs in the machine learning community. It provides a collaborative environment where state-of-the-art <strong>datasets</strong> and <strong>models</strong> can be <strong>shared</strong>, <strong>explored</strong> and even <strong>tried</strong> directly from the browser (via <em>Spaces</em>). Beyond models and datasets, HuggingFace offers two particularly powerful resources:</p>

<ul>
  <li><a href="https://huggingface.co/learn"><strong>courses</strong></a>, covering key domains and techniques such as computer vision, natural language processing, audio, agents, 3D processing, reinforcement learning, and more;</li>
  <li>and a rich ecosystem of <strong>libraries</strong> that let us work with datasets and models end-to-end, across modalities. The most relevant ones for this post are:
    <ul>
      <li><a href="https://huggingface.co/docs/datasets/en/index"><code class="language-plaintext highlighter-rouge">datasets</code></a>: access, share, and process audio, text, and image datasets.</li>
      <li><a href="https://huggingface.co/docs/transformers/en/index"><code class="language-plaintext highlighter-rouge">transformers</code></a>: training and inference for text, vision, audio, video, and multimodal models.</li>
      <li><a href="https://huggingface.co/docs/diffusers/en/index"><code class="language-plaintext highlighter-rouge">diffusers</code></a>: pre-trained diffusion models for generating images, videos, and audio.</li>
    </ul>
  </li>
</ul>

<p>All of these can be installed easily in a Python environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>datasets transformers diffusers[<span class="s2">"torch"</span><span class="o">]</span> accelerate gradio
</code></pre></div></div>

<p align="center">
<img src="/assets/diffusion/hugging_face_screenshot.png" alt="HuggingFace Screenshot" width="1000" />
<small style="color:grey">
Screenshot of the <a href="https://huggingface.co">HuggingFace</a> portal, showing available models sorted by their popularity.
</small>
</p>

<p>In practice, <em>discriminative</em> models (across all modalities) and <em>generative</em> models for text are usually handled via the <code class="language-plaintext highlighter-rouge">transformers</code> library. On the other hand, generative <em>diffusion</em> models are managed through <code class="language-plaintext highlighter-rouge">diffusers</code>.</p>

<p>Models can be browsed and selected directly from the HuggingFace website, where they can be filtered by several criteria. One of the most useful is the <a href="https://huggingface.co/docs/transformers/main/main_classes/pipelines#transformers.pipeline.task">task</a>, for example:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">sentiment-analysis</code></li>
  <li><code class="language-plaintext highlighter-rouge">text-generation</code></li>
  <li><code class="language-plaintext highlighter-rouge">summarization</code></li>
  <li><code class="language-plaintext highlighter-rouge">translation</code></li>
  <li><code class="language-plaintext highlighter-rouge">audio-classification</code></li>
  <li><code class="language-plaintext highlighter-rouge">image-to-text</code></li>
  <li><code class="language-plaintext highlighter-rouge">object-detection</code></li>
  <li><code class="language-plaintext highlighter-rouge">image-segmentation</code></li>
  <li>…</li>
</ul>

<p>If we click on a model we will land on its <strong>model card</strong> page, which typically includes evaluation metrics, references, licensing information, and often a short code snippet showing how to load and run the model.</p>

<h3 id="pipelines">Pipelines</h3>

<p>The easiest way to run inference with most HuggingFace models is through the <code class="language-plaintext highlighter-rouge">pipeline</code> interface. While each task has its own specifics, the overall pattern is remarkably consistent. As an example, here’s how a <code class="language-plaintext highlighter-rouge">text-generation</code> pipeline looks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">transformers</span>

<span class="c1"># Load the pipeline
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">transformers</span><span class="p">.</span><span class="n">pipeline</span><span class="p">(</span>
    <span class="s">"text-generation"</span><span class="p">,</span>  <span class="c1"># task
</span>    <span class="n">model</span><span class="o">=</span><span class="s">"Organization/ConcreteModel"</span><span class="p">,</span>  <span class="c1"># change to real model, e.g.: "openai-community/gpt2"
</span><span class="p">)</span>

<span class="c1"># Define the input (prompt)
</span><span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"You are an AI who can draw AIs."</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"What's the best technique to draw an AI?"</span><span class="p">},</span>
<span class="p">]</span>

<span class="c1"># Generate output (text)
</span><span class="n">outputs</span> <span class="o">=</span> <span class="n">pipe</span><span class="p">(</span>
    <span class="n">messages</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Display output (text)
</span><span class="k">print</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">"generated_text"</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>

<p><br /></p>

<p>From this (deliberately simplified) example, we can extract a common workflow:</p>

<ul>
  <li>First, a model pipeline is loaded, by defining the task family (e.g., <code class="language-plaintext highlighter-rouge">text-generation</code>) as well as the concrete model name (e.g., <code class="language-plaintext highlighter-rouge">openai-community/gpt2</code>) we want to use.</li>
  <li>Then, we need to define the input to the pipeline; the input depends on the task at hand: if we want to classify an image, we need to load an image; if we want to generate text, we need an initial prompt or conversation history, etc.</li>
  <li>Finally, we pass the input to the pipeline and collect the output. The output format again depends on the task.</li>
</ul>

<p>Instead of relying on the generic <code class="language-plaintext highlighter-rouge">pipeline</code> abstraction, we can also load a specific model class directly. This is particularly common when working with diffusion models. For example, a typical <code class="language-plaintext highlighter-rouge">text-to-image</code> setup using <code class="language-plaintext highlighter-rouge">diffusers</code> looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">diffusers</span> <span class="kn">import</span> <span class="n">ConcreteModel</span>   <span class="c1"># change to real model, e.g.: AutoPipelineForText2Image
</span>
<span class="c1"># Load the pipeline
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">ConcreteModel</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"Organization/ConcreteModel"</span><span class="p">,</span>  <span class="c1"># change to real model, e.g.: "stabilityai/sdxl-turbo"
</span>    <span class="p">...</span>
<span class="p">)</span>

<span class="c1"># Define the input (prompt)
</span><span class="n">prompt</span> <span class="o">=</span> <span class="s">"An AI drawing an AI"</span>

<span class="c1"># Generate output (image)
</span><span class="n">image</span> <span class="o">=</span> <span class="n">pipe</span><span class="p">(</span>
    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span>
    <span class="p">...</span>
<span class="p">).</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># Save output (image)
</span><span class="n">image</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">"example.png"</span><span class="p">)</span>
</code></pre></div></div>

<p><br /></p>

<h3 id="more-information-on-the-huggingface-ecosystem">More Information on the HuggingFace Ecosystem</h3>

<p>This brief overview barely scratches the surface of the HuggingFace ecosystem. In the next sections, I’ll focus on concrete, ready-to-use examples that build directly on these ideas.</p>

<p>If you’d like to explore further, here are some additional resources:</p>

<ul>
  <li><a href="https://github.com/mxagar/tool_guides/tree/master/hugging_face">My guide on HuggingFace</a>, which covers topics such as:
    <ul>
      <li>Combining models with Pytorch/Tensorflow code.</li>
      <li>More complex pre- and post-processing steps for each task/modality, e.g.: tokenization, encoding, etc.</li>
      <li>Fine-tuning pre-trained models for different tasks by adding custom heads.</li>
      <li>Saving/loading fine-tuned models locally, as well as exporting them as ONNX for production.</li>
      <li>Examples with generative models of all modalities and conditioning types: <code class="language-plaintext highlighter-rouge">text-generation</code>, <code class="language-plaintext highlighter-rouge">text-to-image</code>, <code class="language-plaintext highlighter-rouge">text-to-video</code>, etc.</li>
    </ul>
  </li>
  <li>A comprehensive example in which I <a href="https://github.com/mxagar/llm_peft_fine_tuning_example">fine-tune a Large Language Model (LLM)</a> to perform a custom text classification task.</li>
  <li>My notes on the exceptional book <a href="https://github.com/mxagar/nlp_with_transformers_nbs">Natural Language Processing (NLP) with Transformers (Tunstall, von Werra &amp; Wolf — O’Reilly)</a>, written by the co-founders of HuggingFace — highly recommended if you want to use <code class="language-plaintext highlighter-rouge">transformers</code> effectively.</li>
</ul>

<h2 id="huggingface-diffusers-in-practice">HuggingFace Diffusers in Practice</h2>

<p>Let’s now move from concepts to code and run a few concrete examples using the <code class="language-plaintext highlighter-rouge">diffusers</code> library. For this section, I’ve prepared a companion notebook:</p>

<p>:point_right: <a href="https://github.com/mxagar/diffusion-examples/blob/main/diffusers/diffusers_and_co.ipynb"><code class="language-plaintext highlighter-rouge">diffusers/diffusers_and_co.ipynb</code></a></p>

<p>In this post, I’ll focus on showing and discussing the results produced by different models. If you want to see the full (and commented) code, I recommend opening the notebook alongside the article.</p>

<blockquote>
  <p>:warning: <strong>Hardware note</strong>: To run the notebook locally, you’ll need a <a href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html">GPU setup with at least 12 GB of VRAM</a>. As an alternative, you can use a <a href="https://colab.research.google.com/">Google Colab instance</a> with a NVIDIA T4, or similar.</p>
</blockquote>

<h3 id="stable-diffusion-xl-turbo">Stable Diffusion XL Turbo</h3>

<p>The first example in the notebook covers a <em>conditioned</em> image generation task, specifically <code class="language-plaintext highlighter-rouge">text-to-image</code>, using the <a href="https://huggingface.co/stabilityai/sdxl-turbo">Stable Diffusion XL Turbo</a> model. The code closely follows the patterns introduced in the previous section:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">diffusers</span> <span class="kn">import</span> <span class="n">AutoPipelineForText2Image</span>

<span class="c1"># Load the SDXL-Turbo text-to-image pipeline
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">AutoPipelineForText2Image</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"stabilityai/sdxl-turbo"</span><span class="p">,</span> 
    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span> 
    <span class="n">variant</span><span class="o">=</span><span class="s">"fp16"</span>
<span class="p">)</span>

<span class="n">prompt</span> <span class="o">=</span> <span class="s">"""
A friendly humanoid robot sits at a wooden table in a bright, sunlit room, happily drawing on a sketchbook.
Soft light colors, landscape, peaceful, productive, and joyful atmosphere.
The robot is drawing an image of itself drawing, creating a recursive effect.
Large window in the background with greenery outside, warm natural lighting.
"""</span>

<span class="c1"># Seed for reproducibility
</span><span class="n">rand_gen</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">148607185</span><span class="p">)</span>

<span class="c1"># Generate an image based on the text prompt
</span><span class="n">image</span> <span class="o">=</span> <span class="n">pipe</span><span class="p">(</span>
    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span> 
    <span class="n">num_inference_steps</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="c1"># 1 for sdxl-turbo, 25-50 for SD
</span>    <span class="n">guidance_scale</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="c1"># 1 for sdxl-turbo, 6-10 for SD
</span>    <span class="n">negative_prompt</span><span class="o">=</span><span class="p">[</span><span class="s">"overexposed"</span><span class="p">,</span> <span class="s">"underexposed"</span><span class="p">],</span> 
    <span class="n">generator</span><span class="o">=</span><span class="n">rand_gen</span>
<span class="p">).</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p><br /></p>

<p>The result is already quite impressive, but it also clearly reveals its synthetic nature. Subtle artifacts appear in areas like eyes and fingers, and some mechanical structures lack global consistency or realism — typical issues when pushing generation speed to the extreme.</p>

<p align="center">
<img src="/assets/diffusion/robot_painting_sdxl_turbo.png" alt="A friendly humanoid robot drawing itself." width="1000" />
<small style="color:grey">
Image generated with <a href="https://huggingface.co/stabilityai/sdxl-turbo">SDXL Turbo</a>.
Prompt: <i>A friendly humanoid robot sits at a wooden table in a bright, sunlit room, happily drawing on a sketchbook. Soft light colors, landscape, peaceful, productive, and joyful atmosphere. The robot is drawing an image of itself drawing, creating a recursive effect. Large window in the background with greenery outside, warm natural lighting.</i>
</small>
</p>

<p><a href="https://huggingface.co/stabilityai/sdxl-turbo">Stable Diffusion XL Turbo</a> is a real-time <code class="language-plaintext highlighter-rouge">text-to-image</code> diffusion model derived from <a href="https://arxiv.org/pdf/2307.01952">Stable Diffusion XL (SDXL)</a>. Its key feature is that it can generate images in as few as one to four denoising steps. Unlike traditional diffusion models, which often require dozens of inference steps, SDXL Turbo prioritizes latency and interactivity, while still preserving much of SDXL’s visual quality.</p>

<p>This speedup is achieved through <a href="https://arxiv.org/abs/2311.17042">Adversarial Diffusion Distillation (ADD)</a>:</p>

<ul>
  <li>A large, high-quality SDXL model acts as a teacher.</li>
  <li>The Turbo model is trained to match the teacher’s output distribution.</li>
  <li>An adversarial objective helps close the quality gap introduced by aggressive step reduction.</li>
</ul>

<p>In short, a large model is distilled into a much faster one, enabling real-time image generation in creative tools and user interfaces.</p>

<h3 id="playground-v2">Playground V2</h3>

<p>An interesting alternative to SDXL Turbo is <a href="https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic">Playground V2</a>. This model also targets high-quality image generation with fewer inference steps, but it takes a different approach: it prioritizes visual quality and aesthetics and it does not rely on distillation during training. Using the same prompt, Playground V2 produces a different output:</p>

<p align="center">
<img src="/assets/diffusion/robot_painting_playground_v2.png" alt="A friendly humanoid robot drawing itself." width="1000" />
<small style="color:grey">
Model: <a href="https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic">Playground V2</a>.
Same prompt as before: <i>A friendly humanoid robot sits at a wooden table...</i>
</small>
</p>

<h3 id="combining-models">Combining Models</h3>

<p>Diffusion models don’t have to be used in isolation — they can also be chained together! In the next example, SDXL Turbo first generates an image of a puppy. That image is then used as conditioning input for the <code class="language-plaintext highlighter-rouge">image-to-image</code> model <a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-prior">Kandinsky 2.2</a>. The result is an exaggerated image of a dog, but I think it showcases the potential of building such compositional pipelines.</p>

<p align="center">
<img src="/assets/diffusion/dog_drawing_sdlx_turbo_kandinsky.png" alt="A friendly dog, as a child-style painting and as a photorealistic photo." width="1000" />
<small style="color:grey">
Left image generated by <a href="https://huggingface.co/stabilityai/sdxl-turbo">SDXL Turbo</a>.
Right image generated by <a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-prior">Kandinsky Prior 2.2</a>.
Left prompt: <i>A painting of a friendly dog painted by a child.</i>
Right prompt: <i>A photo of a friendly dog. High details, realistic (negative: low quality, bad quality).</i>
</small>
</p>

<p>Kandinsky is a multimodal diffusion model that separates <em>semantic understanding</em> from <em>image generation</em>. Unlike SDXL-style models, which directly condition image generation on text embeddings, Kandinsky uses a two-stage architecture:</p>

<ul>
  <li>Prior model, which maps text (and optionally images) into a shared latent space that represents high-level semantics.</li>
  <li>Decoder model (diffusion), which takes these semantic embeddings and generates the final image via a diffusion process.</li>
</ul>

<p>This explicit separation makes Kandinsky particularly well suited for compositional pipelines. One such pipeline is <em>in-painting</em>, i.e., we ask the model to regenerate a sub-region of a provided initial image. Here’s how it works:</p>

<ul>
  <li>Mask definition: A binary mask specifies which regions of the image should be regenerated (white) and which should remain fixed (black).</li>
  <li>Latent conditioning: The unmasked parts of the image are encoded and injected into the diffusion process, anchoring the generation spatially.</li>
  <li>Semantic guidance via the prior: Text prompts and optional image context guide what should appear in the masked regions.</li>
  <li>Diffusion-based regeneration: Noise is added only in the masked area, and the model denoises it while respecting both the surrounding visual context and the semantic intent from the prompt.</li>
</ul>
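<p>The masking mechanic in the list above can be sketched in a few lines of NumPy. This is a toy illustration of the idea, not Kandinsky's actual implementation, and all names are invented: noise is injected only where the mask is white, while the black region stays anchored to the original pixels.</p>

```python
import numpy as np

def noisy_masked_start(image: np.ndarray, mask: np.ndarray, rng=None) -> np.ndarray:
    """Toy in-painting initialization: replace the masked (white, mask=1)
    region with random noise while keeping the unmasked (black, mask=0)
    region fixed to the original pixels."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(image.shape).astype(image.dtype)
    mask3 = mask[..., None] if image.ndim == 3 else mask  # broadcast over channels
    return mask3 * noise + (1.0 - mask3) * image

# Example: a 4x4 grayscale "image" whose top-left 2x2 block should be regenerated
image = np.ones((4, 4), dtype=np.float64)
mask = np.zeros((4, 4), dtype=np.float64)
mask[:2, :2] = 1.0  # white = regenerate

x_start = noisy_masked_start(image, mask)
# Unmasked pixels are untouched; masked pixels now contain noise
# that the diffusion model would denoise, guided by the prompt.
```

In the real pipeline, this anchoring happens in latent space at every denoising step, which is what keeps lighting and perspective consistent with the untouched regions.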

<p>Because Kandinsky reasons at a semantic level first, inpainting results tend to be context-aware: lighting, perspective, and style are usually consistent with the original image, even when the prompt introduces new elements.</p>

<p>Here’s an example with the popular oil painting <a href="https://en.wikipedia.org/wiki/Girl_with_a_Pearl_Earring"><em>Girl with a Pearl Earring</em> by Vermeer</a>. Unfortunately, the <em>pearl earring</em> doesn’t survive the process :sweat_smile:</p>

<p align="center">
<img src="/assets/diffusion/vermeer_girl_mask_inpainting_kandinsky.png" alt="Vermeer's Girl with a Pearl Earring, in-painted to wear a surgical mask." width="1000" />
<small style="color:grey">
Model: <a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder-inpaint">Kandinsky Inpaint 2.2</a>.
Prompt: <i>Oil painting of a woman wearing a surgical mask, Vermeer (negative: bad anatomy, deformed, ugly, disfigured).</i>
I obtained the image from Wikipedia and drew the mask manually.
Check <a href="https://www.bbc.com/news/uk-england-bristol-52382500">this piece from Banksy</a> if you would like to know how this could be done differently.
</small>
</p>

<h2 id="building-proof-of-concept-applications-zero-shot-segmentation-and-in-painting">Building Proof-of-Concept Applications: Zero-Shot Segmentation and In-Painting</h2>

<p>As shown in the notebook <a href="https://github.com/mxagar/diffusion-examples/blob/main/diffusers/diffusers_and_co.ipynb"><code class="language-plaintext highlighter-rouge">diffusers/diffusers_and_co.ipynb</code></a>, running different models for isolated tasks is already quite straightforward. This naturally leads to the next question:</p>

<blockquote>
  <p>What if we combine several models to build small, interactive applications?</p>
</blockquote>

<p>Along these lines, I implemented a simple proof-of-concept: <a href="https://github.com/mxagar/diffusion-examples/tree/main/inpainting_app"><code class="language-plaintext highlighter-rouge">inpainting_app</code></a>. The idea behind it is to chain <strong>segmentation</strong> and <strong>diffusion-based in-painting</strong> into a single workflow:</p>

<ul>
  <li>First, we load an image and select a few points on the region we want to modify (typically the foreground).</li>
  <li>Next, the <a href="https://huggingface.co/docs/transformers/en/model_doc/sam">Segment Anything Model (SAM) from Meta</a> generates a segmentation mask for that region. Everything outside the mask is treated as background.
SAM is a vision transformer capable of zero-shot segmentation, but it still requires some minimal guidance (points or a bounding box) to specify the region of interest.</li>
  <li>Finally, we select either the foreground or the background region and run the <a href="https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1">in-painting version of SDXL</a>. The selected region is regenerated according to a text prompt, while remaining visually consistent with the rest of the image.</li>
</ul>

<p>As before, if you plan to run the app locally you’ll need a <a href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html">GPU setup with at least 12 GB of VRAM</a> :sweat_smile:.</p>

<h3 id="ui-and-application-structure">UI and Application Structure</h3>

<p>The application is built using <a href="https://www.gradio.app/">Gradio</a>, a Python library similar to <a href="https://streamlit.io/">Streamlit</a> that builds nice-looking, web-based GUIs. Since Gradio is developed by HuggingFace, it integrates seamlessly with the models used here.</p>

<p>If you want a deeper introduction to Gradio, you can check my <a href="https://github.com/mxagar/tool_guides/tree/master/gradio">Gradio Quickstart Guide</a>, where I cover the basics and several advanced patterns.</p>

<p>The structure of the app is intentionally simple:</p>

<ul>
  <li>The GUI and the app structure are controlled by <a href="https://github.com/mxagar/diffusion-examples/blob/main/inpainting_app/app.py"><code class="language-plaintext highlighter-rouge">app.py</code></a>. The entry point is <code class="language-plaintext highlighter-rouge">app.generate_app()</code>, which takes two functions as inputs:
    <ul>
      <li>a function that performs image segmentation given a set of user-selected points,</li>
      <li>and a function that runs in-painting given an image, a mask, and a prompt.</li>
    </ul>
  </li>
  <li>The notebook <a href="https://github.com/mxagar/diffusion-examples/blob/main/inpainting_app/inpainting.ipynb"><code class="language-plaintext highlighter-rouge">inpainting.ipynb</code></a> defines and prepares those input functions:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">run_segmentation(raw_image, input_points, processor, model, ...) -&gt; input_mask</code></li>
      <li><code class="language-plaintext highlighter-rouge">run_inpainting(raw_image, input_mask, prompt, pipeline, ...) -&gt; generated_image</code></li>
    </ul>
  </li>
  <li>Internally, <code class="language-plaintext highlighter-rouge">app.generate_app()</code> creates a <code class="language-plaintext highlighter-rouge">gradio.Blocks</code> layout, which is composed of <code class="language-plaintext highlighter-rouge">gradio.Row()</code> sections that contain the UI widgets: image canvases, sliders, text boxes, buttons, etc. These widgets are connected to callback functions; for instance: when we select points in the uploaded <code class="language-plaintext highlighter-rouge">raw_image</code>, the callback <code class="language-plaintext highlighter-rouge">on_select()</code> is invoked, which under the hood executes <code class="language-plaintext highlighter-rouge">run_segmentation()</code> using the uploaded <code class="language-plaintext highlighter-rouge">raw_image</code> and the selected <code class="language-plaintext highlighter-rouge">input_points</code>.</li>
</ul>

<p>While everything could be packaged into standalone modules, keeping part of the logic in a notebook makes experimentation much easier and encourages rapid iteration.</p>

<h3 id="the-result">The Result</h3>

<p>When the application is launched via <code class="language-plaintext highlighter-rouge">app.generate_app()</code>, the user sees the following UI at <code class="language-plaintext highlighter-rouge">http://localhost:8080</code>:</p>

<p align="center">
<img src="/assets/diffusion/app_gui.png" alt="App GUI." width="1000" />
<small style="color:grey">
The Graphical User Interface (GUI) of our application.
</small>
</p>

<p>So how does it perform? Let’s look at an example.</p>

<p align="center">
<img src="/assets/diffusion/monalisa_inpainting.png" alt="Mona Lisa In-Painting." width="1000" />
<small style="color:grey">
Mona Lisa re-imagined. <a href="https://huggingface.co/docs/transformers/en/model_doc/sam">SAM (Segment Anything Model)</a> is used to segment foreground (green) &amp; background (yellow), and <a href="https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1">Stable Diffusion XL Inpainting</a> to re-generate the selected region.
Prompt (applied to the background): <i>A fantasy landscape with flying dragons (negative: artifacts, low quality, distortion).</i>
</small>
</p>

<p>I think the result shows that the pipeline produces a visually coherent image: the new background blends naturally with the original painting’s lighting, perspective, and color palette.
Despite the strong semantic change introduced by the prompt, the Mona Lisa remains intact and consistent, which highlights how well segmentation and diffusion-based in-painting can work together even in artistic, non-photographic domains.</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>In this second post, we moved from theory to practice and explored how modern diffusion models can be used out of the box with <a href="https://huggingface.co/">HuggingFace</a> tools. I covered how to run state-of-the-art <code class="language-plaintext highlighter-rouge">text-to-image</code> models with diffusers, how different diffusion architectures trade off speed and quality, and how combining models enables more powerful workflows such as segmentation-aware in-painting.</p>

<p>Beyond individual examples, the main takeaway is how composable today’s generative models have become. By chaining pre-trained components—segmentation, conditioning, and diffusion—we can quickly prototype creative and practical applications without training models from scratch.</p>

<p>If you want to dig deeper, here are some useful starting points:</p>

<ul>
  <li>
    <p>:point_right: <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">Conceptual background on diffusion models (Part 1 of this series)</a></p>
  </li>
  <li>
    <p>:point_right: <a href="https://github.com/mxagar/diffusion-examples/tree/main/diffusers">Code for this post (Diffusers examples)</a></p>
  </li>
  <li>
    <p>:point_right: <a href="https://github.com/mxagar/diffusion-examples/tree/main/inpainting_app">In-painting application (SAM + SDXL)</a></p>
  </li>
  <li>
    <p>:point_right: <a href="https://github.com/mxagar/tool_guides/tree/master/hugging_face">My guide on HuggingFace</a></p>
  </li>
</ul>

<p>I’m curious to hear your thoughts:</p>

<p><br /></p>

<blockquote>
  <p>What real-world or creative use case do you think would benefit most from this kind of segmentation-guided in-painting app? Do you know some businesses using similar pipelines?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/diffusion-hands-on.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/diffusion-hands-on.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="diffusion," /><category term="machine" /><category term="learning," /><category term="image" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="model" /><category term="training," /><category term="inference," /><category term="hugging" /><category term="face," /><category term="diffusers," /><category term="sdxl," /><category term="stable" /><category term="in-painting," /><category term="sam," /><category term="segmentation" /><summary type="html"><![CDATA[&lt;!– Blog Post 1 Title: An Introduction to Image Generation with Diffusion Models (1/2) Subtitle: A Conceptual Guide for Developers &amp; ML Practitioners]]></summary></entry><entry><title type="html">An Introduction to Image Generation with Diffusion Models (1/2)</title><link href="https://mikelsagardia.io/blog/diffusion-for-developers.html" rel="alternate" type="text/html" title="An Introduction to Image Generation with Diffusion Models (1/2)" /><published>2026-01-20T10:30:00+00:00</published><updated>2026-01-20T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/diffusion-for-developers</id><content type="html" xml:base="https://mikelsagardia.io/blog/diffusion-for-developers.html"><![CDATA[<!--
Blog Post 1  
Title: An Introduction to Image Generation with Diffusion Models (1/2)  
Subtitle: A Conceptual Guide for Developers & ML Practitioners

Blog Post 2  
Title: An Introduction to Image Generation with Diffusion Models (2/2)  
Subtitle: Hands-On Examples with Hugging Face
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Conceptual Guide for Developers &amp; ML Practitioners
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/diffusion/felix-rottmann-0S6kUgMT-l8-unsplash.jpg" alt="Milky way over Dolomites mountains covered in snow: Photo by @felixrottmann from Unsplash" width="1000" />
<small style="color:grey">Fascinating Milky Way over Dolomite mountains. Was all this spontaneous generation? Which is its <i>latent space</i>? Photo by <a href="https://unsplash.com/photos/a-snowy-mountain-with-stars-in-the-sky-0S6kUgMT-l8">@felixrottmann from Unsplash</a>.</small>
</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
This is the first post of a series of two.
You can find the <a href="https://mikelsagardia.io/blog/diffusion-hands-on.html">second part here</a>.
Also, you can find the accompanying code in <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">this GitHub repository</a>.
</strong>
</div>
<div style="height: 30px;"></div>

<p>I still find it fascinating that machine learning models are able to learn from examples, even after working with them for more than a decade.<br />
But what truly feels like magic to me are image and video generation models.</p>

<p>Driven by that fascination, I decided to write a short series of posts explaining how these models work, both in theory and in practice.</p>

<p>In <strong>this first post</strong>, you will:</p>

<ul>
  <li>Learn how image <em>generation</em> differs from image <em>discrimination</em> (e.g., image classification).</li>
  <li>Understand how <em>Diffusion</em> models compare to <em>Generative Adversarial Networks</em> (GANs) and <em>Variational Autoencoders</em> (VAEs).</li>
  <li>Learn how <em>Denoising Diffusion Probabilistic Models</em> (DDPMs) work at both conceptual and mathematical levels.</li>
  <li>See a full, minimal PyTorch implementation of a DDPM that generates car images using a consumer-grade GPU.</li>
</ul>

<p>In the <a href="https://mikelsagardia.io/blog/diffusion-hands-on.html"><strong>second and final post</strong></a>, I’ll move on to practical examples using the Hugging Face libraries.</p>

<p>Let’s get started!</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>In machine learning, any sample or data instance can be represented as a feature vector $x$.
These features could be the RGB values of an image’s pixels or the words (tokens) of a text represented as vocabulary indices.
In deep learning, these vectors are often transformed into <em>embeddings</em> or <em>latent</em> vectors, which are compressed representations that still retain semantic meaning.</p>
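<p>As a toy illustration of how such embeddings support algebraic operations (the 2D vectors below are invented for the example, not real learned embeddings), we can reproduce the classic word-vector arithmetic with plain NumPy:</p>

```python
import numpy as np

# Toy 2D embeddings, invented for illustration:
# axis 0 roughly encodes "royalty", axis 1 roughly encodes "gender".
emb = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction in embedding space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Classic embedding arithmetic: king - man + woman ~ queen
result = emb["king"] - emb["man"] + emb["woman"]
closest = max(emb, key=lambda w: cosine(result, emb[w]))  # -> "queen"
```

Real embedding spaces have hundreds or thousands of dimensions, but the principle is the same: semantics are captured as directions and distances in the latent space.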

<p align="center">
<img src="/assets/diffusion/embeddings.png" alt="Image and Text Embeddings" width="1000" />
<small style="color:grey">Any sample or data point of any modality can be represented as an <i>n</i>-dimensional vector <i>x</i> in machine learning; in the figure, images and words (tokens) are represented as 2D vector embeddings. These embeddings contain conceptual information in a compressed form. When semantics and/or similarities between samples are captured, algebraic operations can be used with the vectors, resulting in coherent, logical outputs. Image by the author.
</small>
</p>

<p>Up until recently, mainly <strong>discriminative models</strong> have been used, which predict properties of those embeddings.
These models are trained with annotated data, for instance class labels: $x$ belongs to class $y = $ <em>cat</em> or <em>dog</em>.
The model then learns decision boundaries that allow it to predict the class of new, unseen samples.
Mathematically, we can represent that as $y = f(x)$, or better as $p(y \mid x)$,
i.e., the probability $p$ of each class $y$ given the instance/sample $x$.</p>

<p>On the other hand, in recent years, <strong>generative models</strong> have gained popularity.
These models do not explicitly capture decision boundaries; instead, they learn the probability distribution of the data.
As a result, they can sample from these distributions and generate new, unseen examples.
Following the same mathematical notation, we can say that these models learn $p(x)$, or $p(x, y)$ if the classes are considered.</p>
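<p>To make the distinction concrete, here is a small NumPy sketch (my own toy example, not from any library): fitting one Gaussian per class models $p(x \mid y)$; Bayes' rule on the fitted densities then gives the discriminative prediction $p(y \mid x)$, while sampling from the fitted Gaussians generates new, unseen instances.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Two classes of 2D samples x, drawn from well-separated Gaussians
x_cat = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))  # class y = "cat"
x_dog = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))  # class y = "dog"

# Fit a Gaussian per class (mean + covariance): this models p(x | y)
stats = {}
for name, data in [("cat", x_cat), ("dog", x_dog)]:
    stats[name] = (data.mean(axis=0), np.cov(data.T))

def gauss_pdf(x, mean, cov):
    """Density of a 2D Gaussian at point x."""
    d = x - mean
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def p_y_given_x(x):
    """Discriminative use: Bayes' rule with equal priors -> p(y | x)."""
    scores = {y: gauss_pdf(x, *stats[y]) for y in stats}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def sample_x(y, n=5):
    """Generative use: draw new, unseen samples from the fitted p(x | y)."""
    mean, cov = stats[y]
    return rng.multivariate_normal(mean, cov, size=n)

probs = p_y_given_x(np.array([0.2, -0.1]))  # a point near the "cat" cluster
new_dogs = sample_x("dog")                  # brand-new "dog" samples
```

The same fitted model is used both ways: querying the densities discriminates, sampling them generates.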

<p align="center">
<img src="/assets/diffusion/discriminative_vs_generative.png" alt="Discriminative vs. Generative Models" width="1000" />
<small style="color:grey">A dataset of 2D samples $x$ (i.e., 2 features) used to fit a discriminative and a generative model.
Discriminative models learn decision boundaries and are able to predict the class $y$ of new, unseen instances.
Generative models learn the data distribution $p(x)$ and are able to sample new unseen instances.
Image by the author.
</small>
</p>

<p>In terms of <em>ease of control</em>, generative models can be of two main types:</p>

<ul>
  <li><em>Unconditional</em>, $p(x)$: These models learn the data distribution $p(x)$ and blindly create samples from it, without much control. You can check <a href="https://thispersondoesnotexist.com/">these artificial faces</a> as an example.</li>
  <li><em>Conditional</em>, $p(x \mid \textrm{condition})$: They generate new samples conditioned on an input we provide, e.g., a class, a text prompt or an image.
In the realm of the text modality, probably the most well-known generative model is OpenAI’s <a href="https://openai.com/index/chatgpt/">(Chat)GPT</a>, which is able to produce words (tokens), and subsequently conversations, conditioned by a prompt or user instruction.
When it comes to the modality of images, it’s difficult to point to a single winner, but common models are
<a href="https://openai.com/index/dall-e-3/">Dall-E</a>, <a href="https://www.midjourney.com/">Midjourney</a>, or <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Stable Diffusion</a> — all of them are <code class="language-plaintext highlighter-rouge">text-to-image</code> conditional models.</li>
</ul>

<p>In terms of the <em>modalities</em> they can work with, generative models can be:</p>

<ul>
  <li><em>Uni-modal</em>: These models can handle/produce samples of a single modality, e.g., text or images.</li>
  <li><em>Multi-modal</em>: They are able to work with instances of different modalities simultaneously.
They can achieve that by creating a common <em>latent space</em> for all modalities, or mappings between them.
Latent spaces are compressed vector spaces that capture the semantics of the vectors that form them.
As a result, given a text-image multimodal model, we can ask questions about the content of an image.
Notable examples are <a href="https://openai.com/research/gpt-4v-system-card">GPT4-Vision</a> and <a href="https://huggingface.co/spaces/badayvedat/LLaVA">LLaVA</a>.</li>
</ul>

<blockquote>
  <p>Discriminative models learn to predict specific properties of a data sample (e.g., a class or a value), whereas generative models learn the data distribution and are able to sample it.
Additionally, this sampling can often be conditioned by a prompt.</p>
</blockquote>

<h2 id="why-diffusion-replaced-gans-for-image-generation">Why Diffusion Replaced GANs for Image Generation</h2>

<p>There are three main families of generative approaches for image generation:</p>

<ul>
  <li>Variational Autoencoders (VAEs)</li>
  <li>Generative Adversarial Networks (GANs)</li>
  <li>Denoising Diffusion Probabilistic Models (Diffusers)</li>
</ul>

<p><a href="https://en.wikipedia.org/wiki/Autoencoder"><strong>Autoencoders</strong></a> are architectures that compress the input $x$ into a lower-dimensional latent vector $z$ and then they expand it again to try to recreate $x$. The compression side is called <em>encoder</em>, the middle layer which produces the latent vector is the <em>bottleneck</em>, and the expansion side is named the <em>decoder</em>. As mentioned, the final output $x’$ tries to approximate $x$ as closely as possible; the gradient of the reconstruction error is used to update the weights of all layers. Many types of layers and configurations can be used for the encoder &amp; decoder parts; e.g., with images often <a href="https://en.wikipedia.org/wiki/Convolutional_layer">convolutional layers</a>, <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">pooling</a>, <a href="https://arxiv.org/abs/1207.0580">dropout</a>, and <a href="https://en.wikipedia.org/wiki/Normalization_(machine_learning)">batch normalization</a> are used to compress the image, whereas the expansion usually is implemented with <a href="https://d2l.ai/chapter_computer-vision/transposed-conv.html">transpose convolutions</a>.</p>

<p><a href="https://arxiv.org/abs/1312.6114"><strong>Variational Autoencoders (VAEs)</strong></a> are autoencoders in which the elements of the latent $z$ vector are Gaussian distributions, i.e., for each latent element, they produce a mean and a variance, and then a value is sampled from that element distribution to produce the latent values. The practical effect is that VAEs produce latent spaces in which interpolation results in much smaller discontinuities than in non-variational autoencoders.</p>
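<p>A minimal PyTorch sketch of that Gaussian bottleneck, assuming made-up feature and latent sizes (the convolutional encoder and decoder bodies are omitted): the encoder head outputs a mean and a log-variance per latent element, and the latent value is sampled as $z = \mu + \sigma \epsilon$, the so-called reparameterization trick, which keeps the sampling step differentiable.</p>

```python
import torch

torch.manual_seed(0)

class VAEBottleneck(torch.nn.Module):
    """Minimal VAE bottleneck: maps encoder features to (mu, logvar)
    and samples z with the reparameterization trick."""
    def __init__(self, feat_dim=128, latent_dim=16):
        super().__init__()
        self.to_mu = torch.nn.Linear(feat_dim, latent_dim)
        self.to_logvar = torch.nn.Linear(feat_dim, latent_dim)

    def forward(self, h):
        mu = self.to_mu(h)
        logvar = self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # noise sampled outside the gradient path
        z = mu + std * eps           # differentiable w.r.t. mu and logvar
        return z, mu, logvar

bottleneck = VAEBottleneck()
h = torch.randn(4, 128)              # fake encoder features, batch of 4
z, mu, logvar = bottleneck(h)
# KL term of the VAE loss, pushing the latent distribution towards N(0, I);
# this is what makes the latent space smooth and interpolation-friendly.
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
```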

<p>VAEs have been typically implemented for compression, denoising and anomaly detection; even though they can generate new samples using only their decoder, they usually produce less realistic results. However, they are fundamental to understand generative models, since they intuitively introduce many of the concepts later revisited by subsequent approaches. If you’d like to know more practical details, you can have a look at <a href="https://github.com/mxagar/generative_ai_book/tree/main/notebooks/03_vae">these examples</a>.</p>

<p><a href="https://arxiv.org/abs/1406.2661"><strong>Generative Adversarial Networks (GANs)</strong></a> were presented by Goodfellow et al. in 2014 and they represented a significant advancement in realistic image generation. They have two components: a <em>generator</em> (decoder-like) and a <em>discriminator</em> (encoder-like), but they are arranged and trained differently, as shown in the figure below.</p>

<p>The <em>generator</em> $G$ tries to generate realistic images as if they belonged to the real data distribution, starting with latent vector $z$ expanded from a noise seed. On the other hand, the <em>discriminator</em> $D$ tries to determine whether an image $x$ is real or fake (i.e., generated: $x’ = G(z)$). Usually, $D$ and $G$ have mirrored architectures and their layers are equivalent to the ones used in VAEs.</p>

<p>The training phase looks as follows:</p>

<ul>
  <li>First, $D$ is trained: we create batches of real images $x$ and batches of fake images $x’ = G(z)$, pass them to the discriminator $D$ (i.e., we get $D(x), D(G(z))$), and compute the error with respect to the correct labels. That error is backpropagated to update only the weights of $D$.</li>
  <li>Second, $G$ is trained: we create new fake images with the generator $G$ and pass them to the discriminator $D$. The prediction error is backpropagated to update only the weights of $G$.</li>
  <li>Both steps are alternated for several iterations, until the error metrics no longer improve.</li>
</ul>

<p>Once the model is trained, the inference is done with the generator $G$ alone.</p>
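<p>The alternating training scheme can be sketched with tiny MLPs on 2D toy data; this is a didactic sketch with arbitrary layer sizes and learning rates, not a production GAN:</p>

```python
import torch

torch.manual_seed(0)
latent_dim, data_dim = 8, 2

# Generator G: noise z -> fake sample; Discriminator D: sample -> real/fake logit
G = torch.nn.Sequential(torch.nn.Linear(latent_dim, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, data_dim))
D = torch.nn.Sequential(torch.nn.Linear(data_dim, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # "Real" data distribution: a Gaussian blob centered at (2, 2)
    return torch.randn(n, data_dim) * 0.3 + 2.0

for step in range(200):
    # --- 1) Train D: real -> 1, fake -> 0 (only D's weights are updated)
    x = real_batch()
    z = torch.randn(x.size(0), latent_dim)
    fake = G(z).detach()  # detach blocks gradients from flowing into G
    loss_d = bce(D(x), torch.ones(x.size(0), 1)) + \
             bce(D(fake), torch.zeros(x.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- 2) Train G: fool D into predicting "real" (only G's weights update)
    z = torch.randn(64, latent_dim)
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Inference: the generator alone maps noise to new samples
samples = G(torch.randn(1000, latent_dim))
```

Note how `detach()` in step 1 and the optimizer choice in step 2 implement the "update only $D$" / "update only $G$" alternation described above.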

<p>GANs are notoriously difficult to train, <a href="https://github.com/soumith/ganhacks">due to several factors</a> that are out of the scope of this blog post. Fortunately, <a href="https://arxiv.org/abs/1606.03498">guidelines</a> which aid the training process have been proposed. Also, method improvements have been presented, such as the <a href="https://arxiv.org/abs/1704.00028">Wasserstein GAN with Gradient Penalty</a>, which alleviates the major training difficulties, and <a href="https://arxiv.org/abs/1411.1784">conditional GANs</a>, which provide control to the user during generation (e.g., create male or female faces).</p>

<p align="center">
<img src="/assets/diffusion/vae_and_gan.png" alt="VAEs and GANs" width="1000" />
<small style="color:grey">Variational Autoencoders (VAEs, left) and Generative Adversarial Networks (GANs, right) were the most popular generative models up until the advent of Diffusers in the past years. VAEs learn to compress and decompress inputs with an <i>encoder-decoder</i> architecture which produces a <i>latent</i> space with the compressed samples. GANs learn to produce realistic samples adversarially: they generate fake samples (with the <i>generator</i>) and try to fool a binary classifier (the <i>discriminator</i>) which needs to differentiate between real and fake samples.
</small>
</p>

<p>Finally, we arrive at the <a href="https://arxiv.org/abs/2006.11239"><strong>Denoising Diffusion Probabilistic Models (DDPMs)</strong></a>, presented by Ho et al. in 2020.
In just a few years, they have outperformed GANs for image generation and have become the standard method for the task. The core idea is that we train a model which takes</p>

<ul>
  <li>a noisy image $x_t$ (in the beginning it is a pure random noise map)</li>
  <li>and an associated noise variance $\beta_t$ (in the beginning it will be a high variance value)</li>
</ul>

<p>and it predicts the noise map $\epsilon_t$ overlaid on the image, so that we can subtract it from the noisy image and progressively recover a clean sample $x_0$.
The process is performed in small, gradual steps, following a noise schedule that decreases the value of $\beta_t$.</p>

<p>As we can see in the figure below, two iterative phases are distinguished, which consist each of them in $T$ steps:</p>

<ol>
  <li><strong>Forward diffusion, used during training</strong> — Starting with a real clean image $x_0$, we add a noise map $\epsilon$ to it, generated from a variance value $\beta$. Then, we pass the noisy image through a <em>U-Net</em> model, which should predict the added noise map $\epsilon$. The error is backpropagated to update the weights. The image at step $t$ contains not only the noise added in the previous step, but also the noise accumulated from prior steps. The forward process is done gradually in around $T = 1000$ steps. Early DDPMs used linear schedules, while cosine schedules later became standard due to improved stability.</li>
  <li><strong>Reverse diffusion, used during inference</strong> — We perform the inference starting with a pure, random noise map. In each step, we pass the noisy image through the <em>U-Net</em> to predict the step noise map $\epsilon_t$, subtract it from the image $x_t$ and obtain the next, less noisy image $x_{t-1}$. The process is repeated for around $T \in [20,100]$ steps, until we get a clear new image $x_0$.</li>
</ol>
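<p>A convenient property of the forward process is that it has a closed form: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \leq t} \alpha_s$, we can jump from $x_0$ to any noisy $x_t$ in one shot as $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$, so training never needs to iterate through all intermediate steps. A minimal PyTorch sketch, assuming a linear schedule and made-up tensor sizes:</p>

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)     # linear variance schedule
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)  # cumulative product over steps

def q_sample(x0, t, eps):
    """Jump directly from the clean image x0 to the noisy image x_t."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)  # broadcast per batch element
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

torch.manual_seed(0)
x0 = torch.rand(2, 3, 32, 32)            # batch of 2 fake "clean" RGB images
t = torch.tensor([10, 900])              # one early and one late timestep
eps = torch.randn_like(x0)               # the noise the U-Net must predict
xt = q_sample(x0, t, eps)
# At t=10, x_t is still close to x0; at t=900 it is almost pure noise.
# During training, the loss would be the MSE between eps and the U-Net's
# prediction of it, given (xt, t).
```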

<p align="center">
<img src="/assets/diffusion/diffusion_idea.png" alt="Denoising Diffusion" width="1000" />
<small style="color:grey">In denoising diffusion models a <i>U-Net</i> encoder-decoder model is trained to predict the noise in an image. To that end, during training (forward diffusion), noise is gradually added to an image and we query the model to predict the noise map. During inference (reverse diffusion), we start with a pure noise map and query the model to remove the noise step by step &mdash; until we get a clean new image!
</small>
</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>So which of these approaches should we use?</p>

<p>To answer that question, we need to consider that generative models are usually evaluated in terms of three competing properties, which lead to a so-called <a href="https://arxiv.org/pdf/2112.07804">generative learning trilemma (Xiao et al., 2022)</a>:</p>

<ul>
  <li><strong>Quality</strong>: if the distributions of the generated images and real images are close, the quality is considered good. In practice, pre-trained CNNs can be used to create image embeddings, leading to vector distributions. Then, the difference between the distributions is measured with the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance metric</a>. GANs and Diffusers achieve particularly good quality, whereas VAEs often lag behind.</li>
  <li><strong>Coverage</strong>: this measures how diverse the captured distributions are, i.e., the number of modes or peaks in the learned distribution; for instance, in a dataset of dog images, we would expect as many dog breeds as possible, which would be represented as many dense regions distinguishable from each other. VAEs and Diffusers have good coverage, whereas GANs tend to deliver less diverse results.</li>
  <li><strong>Speed</strong>: this refers to the sampling speed, i.e., how fast we can create new images. GANs and VAEs are the fastest approaches, while Diffusers require longer computation times.</li>
</ul>

<p>As we can see, there seems to be no all-powerful method that wins in all three metrics. However, <a href="https://arxiv.org/abs/2112.10752">Rombach et al. (2021)</a> and <a href="https://arxiv.org/abs/2307.01952">Podell et al. (2023)</a> presented and improved the <strong>Stable Diffusion</strong> approach, which is a very good trade-off (arguably the best so far). This method applies diffusion in the latent space, achieving much faster sampling — I explain more about it in the next section.</p>

<p align="center">
<img src="/assets/diffusion/impossible_triangle.png" alt="Impossible Triangle" width="1000" />
<small style="color:grey">Generative learning trilemma: sample diversity coverage, generation quality and generation speed are competing properties of generative methods &mdash; or is <a href="https://arxiv.org/abs/2307.01952">Stable Diffusion</a> the solution to that trilemma?
Image reproduced by the author, but based on the work by <a href="https://arxiv.org/pdf/2112.07804">Xiao et al., 2022</a>.
</small>
</p>

<h2 id="how-ddpms-actually-work-forward-noise-reverse-denoising">How DDPMs Actually Work: Forward Noise, Reverse Denoising</h2>

<p>Now, let’s go deeper into the topic of <a href="https://arxiv.org/abs/2006.11239"><strong>Denoising Diffusion Probabilistic Models (Ho et al., 2020)</strong></a>.</p>

<p>I have already introduced the three main components of diffusion models:</p>

<ol>
  <li>The denoising <em>U-Net</em>: a model that learns to extract the noise map $\epsilon_t$ of a noisy image $x_t$.</li>
  <li>The <em>forward diffusion</em> phase used during <em>training</em>, in which we start with a real noise-free image $x_0$ and add noise $\epsilon$ to it step by step. At each step $t$, we train the <em>U-Net</em> to learn how to predict the approximation $\epsilon_{\theta, t}$ of the ground-truth noise $\epsilon_t$ we have added to the image: $\epsilon_{\theta, t} \approx \epsilon_t$.</li>
  <li>The <em>reverse diffusion</em> phase used during <em>inference</em>, in which we start with a random noise map $x_{T}$ and remove noise $\epsilon_{\theta}$ from it step by step using the trained <em>U-Net</em>. It is intuitively easy to understand why small steps are required in the reverse phase, too: it is much easier to improve a slightly noisy image than to reconstruct a clean image from pure randomness.</li>
</ol>

<p>Let’s unpack each one of them to better understand how diffusion works.</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
Note that this section has a dedicated repository in which all the models and formulae are implemented: <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">github.com/mxagar/diffusion-examples/ddpm</a>. Some comments and examples from the implementation are provided in the last section.
</strong>
</div>
<div style="height: 30px;"></div>

<h4 id="denoising-u-net">Denoising <em>U-Net</em></h4>

<p>The <a href="https://arxiv.org/abs/1505.04597">U-Net (Ronneberger et al., 2015)</a> was originally created for image segmentation tasks (specifically in the medicine domain): the architecture contracts the input image into a latent tensor, which is then expanded using symmetric decoder layers; as a result, the model outputs a map with the same size as the input image (width and height) in which we obtain values for each pixel, e.g., pixel-wise classification or image segmentation.</p>

<p>In the particular case of the denoising <em>U-Net</em>, we have these two <em>inputs</em>:</p>

<ul>
  <li>The noisy image $x_t$ at step $t$.</li>
  <li>The variance scalar $\beta_t$ at step $t$. The variance scalar is expanded into a vector using sinusoidal embeddings. Sinusoidal embeddings can be seen as an $\mathbf{R} \rightarrow \mathbf{R}^n$ mapping in which for each unique scalar we obtain a unique and different vector, thanks to systematically applying sinusoidal functions to the scalar. It is related to the sinusoidal embedding from the <a href="https://arxiv.org/abs/1706.03762">Transformers paper (Vaswani et al., 2017)</a>.</li>
</ul>
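<p>As an illustration, a sinusoidal embedding along these lines can be sketched in a few lines of NumPy; the embedding dimension and frequency range below are assumptions for demonstration, not values taken from the implementation:</p>

```python
import numpy as np

def sinusoidal_embedding(t, dim=32):
    """Map each scalar (e.g., the variance beta_t) to a unique dim-dimensional
    vector using sin/cos at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), half))  # assumed frequency range
    angles = np.asarray(t, dtype=float)[:, None] * freqs[None, :]  # (batch, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (batch, dim)

# two different variance scalars yield two clearly different embedding vectors
emb = sinusoidal_embedding([0.0001, 0.02], dim=32)
print(emb.shape)  # (2, 32)
```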

<p>On the other hand, the <em>output</em> of the model is the noise map at step $t$: $\epsilon_{\theta, t}$. Subtracting this estimate from the noisy image $x_t$ yields a denoised approximation that moves the sample one step closer to a clean image. Repeating this process over many small steps allows the model to gradually transform pure noise into a realistic sample $x_0$.</p>

<p align="center">
<img src="/assets/diffusion/denoising_unet.png" alt="Denoising U-Net" width="1000" />
<small style="color:grey">
Denoising <i>U-Net</i>.
Image reproduced by the author, but based on the book <a href="https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/">Generative Deep Learning (O'Reilly)</a> by David Foster.
</small>
</p>

<p>As in every <em>U-Net</em>, the initial tensor is progressively reduced in spatial size while its channels are increased; then, the reduced tensor is expanded to have a larger spatial size but fewer channels. The final tensor has the same shape as the input image. The architecture consists of these blocks:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ResidualBlock</code>: basic block used throughout the network that performs batch normalization and two convolutions, while adding a skip connection between input and output, as presented in the <a href="https://arxiv.org/abs/1512.03385">ResNet architecture (He et al., 2015)</a>. Residual blocks can learn the identity map and allow for deeper networks, since the vanishing gradient issue is alleviated.</li>
  <li><code class="language-plaintext highlighter-rouge">DownBlock</code>: two <code class="language-plaintext highlighter-rouge">ResidualBlocks</code> are used, as well as an average pooling so that the image size is decreased while increasing the number of channels.</li>
  <li><code class="language-plaintext highlighter-rouge">UpBlock</code>: upsampling is applied to the feature map to increase its spatial size and two <code class="language-plaintext highlighter-rouge">ResidualBlocks</code> are applied so that the channels are decreased.</li>
  <li>Skip connections: the output of each <code class="language-plaintext highlighter-rouge">ResidualBlock</code> in a <code class="language-plaintext highlighter-rouge">DownBlock</code> is passed to the associated <code class="language-plaintext highlighter-rouge">UpBlock</code> with the same tensor size, where the tensors are concatenated.</li>
</ul>
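<p>As an orientation, such a <code>ResidualBlock</code> can be sketched in PyTorch as follows — a simplified illustration, not the exact block from the linked implementation; the 1x1 convolution on the skip path is one common way to match channel counts:</p>

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Norm -> ReLU -> Conv, twice, plus a skip connection; a 1x1 conv
    adapts the skip path when the channel counts differ."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        h = self.conv1(torch.relu(self.norm1(x)))
        h = self.conv2(torch.relu(self.norm2(h)))
        return h + self.skip(x)  # skip connection between input and output

x = torch.randn(2, 32, 16, 16)   # (batch, channels, height, width)
y = ResidualBlock(32, 64)(x)
print(y.shape)  # torch.Size([2, 64, 16, 16])
```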

<p>Often two networks are maintained: the usual one with the weights computed during gradient descent and the <em>Exponential Moving Average (EMA)</em> network, which contains the EMA of the weights. The EMA network is less susceptible to training spikes and fluctuations.</p>
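<p>The EMA update itself is a simple interpolation applied after every optimizer step; here is a minimal sketch, where the decay value is a typical choice rather than one taken from any specific implementation:</p>

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """After each optimizer step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [np.zeros(3)]      # EMA weights start at 0
weights = [np.ones(3)]   # raw weights jump to 1 after a training step
ema = ema_update(ema, weights)
print(ema[0])  # [0.001 0.001 0.001] -- the EMA copy barely reacts to the spike
```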

<h4 id="forward-diffusion">Forward Diffusion</h4>

<p>In the <em>forward diffusion</em>, or training phase, we add noise $\epsilon_{t-1}$ to an image $x_{t-1}$ to obtain a noisier image $x_t$. The process is governed by this equation:</p>

\[x_t = q(x_t \mid x_{t-1})
= x_{t-1}\sqrt{1-\beta_t} + \epsilon_{t-1}\sqrt{\beta_t}
= \mathcal{N}(x_{t-1}\sqrt{1-\beta_t}, \beta_t I)\]

<p>where:</p>

<ul>
  <li>$\beta_t$ is the variance scalar at step $t$; typically $\beta \in [0.0001, 0.02]$,</li>
  <li>$\epsilon \sim \mathcal{N}(0,I)$, i.e., it is a 2D Gaussian map with mean 0 and variance 1,</li>
  <li>and $I$ is the identity matrix.</li>
</ul>

<p>As a further step, a <em>re-parametrization</em> of $\beta$ is carried out, which transforms the forward diffusion equation from $q(x_t \mid x_{t-1})$ into $q(x_t \mid x_0)$; that is, any noisy image $x_t$ can be computed directly from the original, noise-free image $x_0$.</p>

<p>That <em>re-parametrization</em> is defined as</p>

\[\bar{\alpha_t} = \prod_{i=0}^{t}{\alpha_i},\,\,\, \alpha_t = 1 - \beta_t\]

<p>Its interpretation is the following:</p>

<ul>
  <li>$\bar{\alpha}$ represents the fraction of variance due to the signal (the original image $x_0$);</li>
  <li>$1-\bar{\alpha}$ represents the fraction of variance due to the noise ($\epsilon$).</li>
</ul>

<p>By properly expressing $\beta$ as function of $\alpha$, we obtain the <strong><em>forward diffusion</em> equation used in practice</strong>:</p>

\[x_t = q(x_t \mid x_0)
= x_0\sqrt{\bar{\alpha_t}} + \epsilon\sqrt{1 - \bar{\alpha_t}}
= \mathcal{N}(x_0\sqrt{\bar{\alpha}_t}, (1-\bar{\alpha}_t) I)\]

<p>Given this equation:</p>

<ul>
  <li>We pick the real noise-free image $x_0$.</li>
  <li>We add noise at step $t$ using $\epsilon_t$ to obtain $x_t$.</li>
  <li>We let the <em>U-Net</em> predict $\epsilon_{\theta,t}$ as approximation to $\epsilon_t$, and backpropagate the error.</li>
</ul>
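<p>These steps can be sketched directly from the equation; the following NumPy illustration uses the schedule ranges mentioned in this post, while everything else (image size, step index) is for demonstration:</p>

```python
import numpy as np

def forward_diffusion(x0, t, alpha_bar, rng=None):
    """Jump from the clean image x0 directly to the noisy x_t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is the regression target for the U-Net

T = 512
betas = np.linspace(1e-4, 0.02, T)    # linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)   # re-parametrization: product of alphas

x0 = np.zeros((64, 64))               # stand-in for a normalized image
xt, eps = forward_diffusion(x0, t=256, alpha_bar=alpha_bar)
print(xt.shape)  # (64, 64)
```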

<p>Finally, note that diffusion schedules control the signal and noise ratios in such a way that during training</p>

<ul>
  <li>the signal ratio decreases to 0 following a linear or cosine-based function,</li>
  <li>and the noise ratio increases up to 1 following the complementary function.</li>
</ul>
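<p>For illustration, the signal ratio $\sqrt{\bar{\alpha}_t}$ can be computed for both kinds of schedules; this is a sketch, where the linear schedule follows the $\beta$ range given above and the squared-cosine variant uses commonly cited parameter values:</p>

```python
import numpy as np

def linear_signal_ratio(T=512, beta_min=1e-4, beta_max=0.02):
    """Signal ratio sqrt(alpha_bar_t) under a linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.sqrt(np.cumprod(1.0 - betas))

def cosine_signal_ratio(T=512, s=0.008):
    """Squared-cosine schedule: the signal ratio decays more smoothly."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return np.sqrt(f[1:] / f[0])

lin = linear_signal_ratio()
cos_ = cosine_signal_ratio()
# the noise ratio is the complement: sqrt(1 - alpha_bar_t)
noise = np.sqrt(1.0 - lin ** 2)
print(float(lin[0]), float(lin[-1]))  # starts near 1, decays toward 0
```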

<h4 id="reverse-diffusion">Reverse Diffusion</h4>

<p>During the <em>reverse diffusion</em> phase or inference, we generate images iteratively following a reverse schedule analogous to the one introduced in the previous section. The equation of the reverse process can be obtained by inverting the forward equation and has the following form:</p>

\[x_{t-1} = p(x_{t-1} \mid x_t) = 
\frac{1}{\sqrt{\alpha_t}} (x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha_t}}}\epsilon_{\theta}) + \sigma_t z\]

<p>Here,</p>

<ul>
  <li>$\epsilon_{\theta}$ is the noise map predicted by the <em>U-Net</em> for the pair ($x_t$, $\beta_t$);</li>
  <li>$z$ is a 2D Gaussian defined as $z \sim \mathcal{N}(0,I)$;</li>
  <li>$\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ is a noise variance that decreases as inference progresses.</li>
</ul>

<p>The term $\sigma_t z$ re-injects a controlled amount of random noise at each step, which provides control over the generation:</p>

<ul>
  <li>we allow more freedom to explore in the early steps, enabling a broader variation of images (more random noise),</li>
  <li>but then narrow down to the details.</li>
</ul>
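<p>A single reverse step can be sketched directly from this equation. The following NumPy illustration uses the same schedule as before; in practice $\epsilon_{\theta}$ is the output of the trained <em>U-Net</em>, which is replaced here by a placeholder:</p>

```python
import numpy as np

def reverse_step(xt, eps_pred, t, betas, alpha_bar, rng=None):
    """One DDPM sampling step: compute x_{t-1} from x_t and the predicted noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    alpha_t = 1.0 - betas[t]
    mean = (xt - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # no extra noise at the very last step
    sigma_t = np.sqrt((1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t])
    return mean + sigma_t * rng.standard_normal(xt.shape)  # the sigma_t * z term

T = 512
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

xt = np.random.default_rng(1).standard_normal((64, 64))  # x_T: pure noise
eps_pred = np.zeros_like(xt)                             # placeholder for the U-Net output
x_prev = reverse_step(xt, eps_pred, t=T - 1, betas=betas, alpha_bar=alpha_bar)
print(x_prev.shape)  # (64, 64)
```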

<h4 id="conditioning">Conditioning</h4>

<p>If we fit the model to a dataset of car images, we will be able to generate random car images. But what if we would like to control the type of cars we want to obtain, for instance, <em>red sports cars</em>? That can be achieved with <strong>conditioning</strong>.</p>

<p>The most common form of <strong>conditioning</strong> is done with <em>text</em>: we provide a prompt/description of the image we want to obtain. As a first step, that text is converted into an embedding vector using a text encoder trained with paired image-text data using a contrastive objective (e.g., <a href="https://arxiv.org/abs/2103.00020">CLIP by Radford et al., 2021</a>). Then, the resulting vector is provided to the <em>U-Net</em> at several stages:</p>

<ul>
  <li>During <em>training</em>, we inject the embedding vector into different layers of the <em>U-Net</em> using cross attention, reinforcing the conditioning. Additionally, we remove the text conditioning in some random steps so that the model learns unconditional generation.</li>
  <li>
    <p>During <em>inference</em>, the <em>U-Net</em> produces the noise map $\epsilon$ with and without text conditioning: $\epsilon_{\textrm{cond}}, \epsilon_{\textrm{uncond}}$. The difference added by the conditioned noise map is amplified (by a factor $\lambda$) to push the final prediction in the direction of the conditioning; mathematically, considering $\epsilon$ is a vector/tensor, this is expressed (and implemented) as follows:</p>

    <p>$\epsilon_{\textrm{final}} = \epsilon_{\textrm{uncond}} + \lambda \cdot (\epsilon_{\textrm{cond}} - \epsilon_{\textrm{uncond}})$</p>
  </li>
</ul>
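<p>This inference-time combination is commonly known as classifier-free guidance and amounts to one line of code; a minimal sketch, where the guidance scale value is a common choice rather than one taken from the text:</p>

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: amplify the direction the text conditioning
    adds to the unconditional noise prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.zeros((4, 4))   # placeholder unconditional U-Net output
eps_c = np.ones((4, 4))    # placeholder conditional U-Net output
eps_final = guided_noise(eps_u, eps_c, guidance_scale=7.5)
print(eps_final[0, 0])  # 7.5
# guidance_scale = 1 recovers the purely conditional prediction
```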

<p>Thanks to these modifications, we are able to obtain our red sports car instead of a green truck.</p>

<h4 id="stable-diffusion">Stable Diffusion</h4>

<p>Finally, let’s consider two practical aspects of diffusion models discussed so far:</p>

<ul>
  <li>Many forward passes are needed to generate our noise-free image.</li>
  <li>Denoising occurs in pixel-space, which has a relatively large dimensionality.</li>
</ul>

<p>What if we could apply forward and reverse diffusion in a smaller space to accelerate the process? That is exactly what is achieved by <a href="https://arxiv.org/abs/2112.10752">Rombach et al. (2021)</a> and <a href="https://arxiv.org/abs/2307.01952">Podell et al. (2023)</a>, who introduced and later improved a <strong>latent diffusion</strong> method, also known as <strong>Stable Diffusion</strong>. In essence, latent diffusion models are diffusion models wrapped by autoencoders:</p>

<ul>
  <li>The encoder creates a latent vector.</li>
  <li>In the forward diffusion phase, we add noise to the vector and learn to denoise it.</li>
  <li>In the reverse diffusion phase, we remove the noise with the trained <em>U-Net</em>.</li>
  <li>Then, finally, the decoder expands the denoised latent vector to get the image.</li>
</ul>

<p>Working in the latent space is much faster, because the sizes of the manipulated vectors are much smaller (around 16 times smaller, compared to images); thus, we also require smaller models.</p>

<p>Stable Diffusion is one of the most popular latent diffusion models; due to the <a href="https://stability.ai/news/stability-ai-sdxl-turbo">latest advances</a> by the team behind it, it has become one of the strongest approaches in terms of</p>

<ul>
  <li>Ease of conditioning</li>
  <li>Quality of output</li>
  <li>Diversity</li>
  <li>… and speed!</li>
</ul>

<h4 id="example-implementation-of-ddpm">Example Implementation of DDPM</h4>

<p>The implementation <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">repository</a> contains the code necessary to train a diffuser and use it to generate new images.</p>

<p>The dataset used in the example is the <a href="https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset">Stanford Cars Dataset</a>. It contains 16,185 color images categorized into 196 classes, which are resized to <code class="language-plaintext highlighter-rouge">64x64</code>.</p>

<p align="center">
<img src="/assets/diffusion/cars_dataset_samples.png" alt="Cars Dataset Samples" width="1000" />
<small style="color:grey">
In the example implementation, the <a href="https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset">Stanford Cars Dataset</a> is used.
The dataset consists of 16,185 color images across 196 classes; however, class labels are ignored and the images are resized to <code>64x64</code> pixels.
The figure shows 8 resized samples.
</small>
</p>

<p>The mini-project is composed of two main files:</p>

<ul>
  <li>The module <a href="https://github.com/mxagar/diffusion-examples/blob/main/ddpm/unet.py"><code class="language-plaintext highlighter-rouge">unet.py</code></a>, taken from <a href="https://github.com/labmlai/annotated_deep_learning_paper_implementations">labmlai/annotated_deep_learning_paper_implementations</a>. This module defines the <em>U-Net</em> model which is able to predict the noise of an image after training.</li>
  <li>The notebook <a href="https://github.com/mxagar/diffusion-examples/blob/main/ddpm/ddpm.ipynb"><code class="language-plaintext highlighter-rouge">ddpm.ipynb</code></a>, where the dataset preparation, model setup, and training are implemented. Some parts were modified from the course material of the <a href="https://www.udacity.com/course/generative-ai--nd608">Udacity Generative AI Nanodegree</a>.</li>
</ul>

<p>The formulas of the DDPM paper, as well as the forward and reverse diffusion algorithms, are implemented in a modular fashion and with plenty of comments and references.</p>

<p>As an example, the function <code class="language-plaintext highlighter-rouge">visualize_forward_diffusion()</code> produces these noisy images on a single car sample:</p>

<p align="center">
<img src="/assets/diffusion/cars_forward_diffusion.png" alt="Forward Diffusion on Car Sample" width="1000" />
<small style="color:grey">
A total of <code>T=512</code> steps are taken to iteratively add noise to a sample and train the <i>U-Net</i> to predict the added noise map. The figure shows 7 equally spaced stages of those steps.
</small>
</p>

<p>I trained a <em>U-Net</em> model of 54 million parameters using the following configuration:</p>

<ul>
  <li>Device: <a href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html">NVIDIA RTX 3060</a></li>
  <li>300 epochs (10 warm-up)</li>
  <li>A base learning rate of <code class="language-plaintext highlighter-rouge">0.0001</code> and cosine scheduling</li>
  <li>Batch size of 64</li>
  <li><code class="language-plaintext highlighter-rouge">T=512</code> diffusion steps</li>
  <li>A linearly increased noise variance $\beta$ in the range of <code class="language-plaintext highlighter-rouge">[0.0001, 0.02]</code></li>
</ul>

<p>The training process, run by <code class="language-plaintext highlighter-rouge">train()</code>, produces a denoised image strip at every epoch, generated from the same fixed noise map. In the following, the image strips of epochs 1, 5, 10, 100, 200 and 300 are shown.</p>

<p align="center">
<img src="/assets/diffusion/car_sample_epoch_001.png" alt="Inference at Epoch 1: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_005.png" alt="Inference at Epoch 5: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_010.png" alt="Inference at Epoch 10: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_100.png" alt="Inference at Epoch 100: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_200.png" alt="Inference at Epoch 200: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_300.png" alt="Inference at Epoch 300: Reverse Diffusion on Car Sample" width="1000" />
<small style="color:grey">
Inference or reverse diffusion during training; the performance for the same noise input is shown for epochs 1, 5, 10, 100, 200 and 300 (last epoch).
A total of <code>T=512</code> steps are taken to iteratively remove noise. The figures show 9 equally spaced stages of those steps at each epoch.
</small>
</p>

<p>The final model is able to generate new samples, as shown below:</p>

<p align="center">
<img src="/assets/diffusion/car_generation_best_model.png" alt="Eight Samples Generated by a DDPM" width="1000" />
<small style="color:grey">
Eight generated samples after 300 epochs of training.
</small>
</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>Diffusion models have become the standard approach for image generation by combining high-quality samples, good coverage, and relatively stable training. By framing generation as a gradual denoising process, DDPMs avoid many of the pitfalls of earlier generative models while remaining surprisingly intuitive.</p>

<p>This post focused on building intuition and connecting theory to practice through a minimal DDPM implementation. While the underlying math is rather simple, I still find it fascinating that such models can learn a representation of images rich enough to generate entirely new samples from pure noise — it often feels a bit like magic.</p>

<p>If you want to deepen your understanding, the best next step is to <a href="https://github.com/mxagar/diffusion-examples/blob/main/ddpm/ddpm.ipynb">run the notebook yourself</a>, visualize the diffusion process, and experiment with the model’s components. Small changes in schedules, architectures, or datasets can lead to very different behaviors.</p>

<p><a href="https://mikelsagardia.io/blog/diffusion-hands-on.html">In the next post</a>, I’ll move toward more practical diffusion workflows using Hugging Face Diffusers and modern text-to-image models. As always, comments, questions, and suggestions are more than welcome :smile:</p>

<p><br /></p>

<blockquote>
  <p>Does image generation (still) feel a bit magical to you? Were my technical explanations clear enough to understand what’s going on under the hood?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/diffusion-for-developers.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/diffusion-for-developers.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="diffusion," /><category term="machine" /><category term="learning," /><category term="image" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="model" /><category term="training," /><category term="inference" /><summary type="html"><![CDATA[&lt;!– Blog Post 1 Title: An Introduction to Image Generation with Diffusion Models (1/2) Subtitle: A Conceptual Guide for Developers &amp; ML Practitioners]]></summary></entry><entry><title type="html">My Personal eGPU Server Setup</title><link href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html" rel="alternate" type="text/html" title="My Personal eGPU Server Setup" /><published>2025-10-21T10:30:00+00:00</published><updated>2025-10-21T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu</id><content type="html" xml:base="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  How to Run and Train LLMs Locally with NVIDIA Chips from a Mac &amp; Linux Setup
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/linux_nvidia_egpu/workstation-dgx-spark-nvidia.jpg" alt="NVIDIA DGX Spark" width="1000" />
<small style="color:grey">This blog post is not about the <a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/">NVIDIA DGX Spark</a>. Instead, it's about my eGPU setup, the <i>personal supercomputer</i> I've been using the past 2 years. Image from <a href="https://nvidianews.nvidia.com/news/nvidia-dgx-spark-arrives-for-worlds-ai-developers">NVIDIA</a>.
</small>
</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
For a detailed setup guide, check <a href="https://github.com/mxagar/linux_nvidia_egpu">this GitHub repository</a>.
</strong>
</div>
<div style="height: 30px;"></div>

<p>You may have seen the release of the <a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/">NVIDIA DGX Spark</a>, the new <em>personal supercomputer</em> from NVIDIA.
With 128 GB of memory, 20 CPU cores, and a price tag of USD $3,999, it’s sure to land on many AI enthusiasts’ wish lists this Christmas.</p>

<p>This post presents my own, more modest alternative.
For the past two years, I’ve been using an NVIDIA eGPU (external GPU) connected to my MacBook Pro M1 — but running through a Linux machine that acts as a dedicated server.
After several colleagues and friends showed interest, I decided to document the entire setup on <a href="https://github.com/mxagar/linux_nvidia_egpu">GitHub</a> as the guide I once looked for but never fully found.
In this post, I’ll introduce the overall setup and explain the motivation behind it.</p>

<p>Here’s the schematic of my personal <em>supercomputer</em>:</p>

<p align="center">
<img src="/assets/linux_nvidia_egpu/egpu_linux.png" alt="eGPU Linux &amp; Mac Setup" width="1000" />
<small style="color:grey">My eGPU setup consists of a MacBook M1 and a Linux server with an NVIDIA eGPU.
</small>
</p>

<p>I mainly use the eGPU to train general Deep Learning models (with <a href="https://code.visualstudio.com/docs/remote/ssh">VS Code Remote Development</a>) and to run LLMs locally (with <a href="https://ollama.com/">Ollama</a>); as illustrated in the figure above:</p>

<ul>
  <li>I have a <a href="https://www.lenovo.com/gb/en/p/laptops/thinkpad/thinkpadp/p14s-amd-g1/22wsp144sa1">Lenovo ThinkPad P14s</a> with an integrated NVIDIA Quadro T500 graphics card, running Ubuntu.</li>
  <li>I attach to a Thunderbolt port of the Lenovo a <a href="https://www.razer.com/mena-en/gaming-laptops/razer-core-x">Razer Core X External Case</a>, which contains a <a href="https://www.gigabyte.com/Graphics-Card/GV-N3060GAMING-OC-12GD-rev-20">NVIDIA GeForce RTX 3060</a> (12GB of memory).</li>
  <li>I run applications which require GPU power on the Lenovo/Ubuntu but interface with them via my MacBook Pro M1.</li>
</ul>

<p>You might ask <em>why I would want to run and train models locally</em>, since we have many cloud services available that spare us the hassle. Here are my answers:</p>

<ul>
  <li>Many models (LLMs or any other DL networks) can be used locally for a <strong>fraction of the cost</strong> required by cloud providers; in fact, the <a href="https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3060-3060ti/">NVIDIA RTX 3060</a> with 12GB is quite similar to the <a href="https://www.nvidia.com/en-us/data-center/tesla-t4/">NVIDIA T4</a>, a commonly offered low-tier cloud GPU. Model deployment often requires private or public cloud services, but experimentation, prototyping, and small-scale training can be done locally.</li>
  <li>Local models allow you to process data <strong>confidentially</strong>: running models locally allows you to process sensitive or proprietary data (e.g., personal notes, internal reports, or corporate documents) without uploading them to third-party servers. This means full control over your data lifecycle, compliance with privacy policies, and peace of mind knowing that no external provider logs or stores your content.</li>
  <li>We <strong>avoid dependence</strong> on cloud services if we run models locally: while cloud platforms provide flexibility, they also create a point of failure and an ongoing dependency on external infrastructure and pricing. Outages like the <a href="https://techcrunch.com/2021/12/07/amazon-web-services-went-down-and-took-a-bunch-of-the-internet-with-it/">AWS downtime of December 2021</a> or the more recent <a href="https://www.wired.com/story/what-that-huge-aws-outage-reveals-about-the-internet/">AWS outage in October 2025</a> show how fragile these systems can be.</li>
  <li>Tinkering locally, we <strong>learn</strong> how to set up hardware, firmware, and software: managing your own GPU infrastructure provides a deeper understanding of the systems that power modern AI. From BIOS configuration and driver setup to Docker and Conda environments, each layer teaches valuable skills that translate directly into real-world MLOps and engineering practice.</li>
</ul>

<p><br /></p>

<blockquote>
  <p>Running models locally offers major advantages: it’s far cheaper than using cloud GPUs, keeps your data fully private, and works even when cloud services fail. It also helps you build hands-on expertise with the hardware and software stack that powers modern AI.</p>
</blockquote>

<p><br /></p>

<p>You might also ask <em>why not stick to a single computer, Ubuntu or MacOS, with an attached eGPU</em>. That question has several layers:</p>

<ul>
  <li>Even though I really like Ubuntu, in my opinion MacOS offers a more polished user experience overall.</li>
  <li>In the past, Intel-based Macs supported AMD eGPUs, but since the introduction of the Apple M1, that option seems to have vanished.</li>
  <li>Ideally, I’d use MacOS with NVIDIA eGPU support, because NVIDIA chips are the industry standard.</li>
  <li>Another option would be to upgrade my MacBook Pro M1 to a MacStudio M3 Ultra or similar, which comes with a very powerful processor — but why abandon a perfectly capable MacBook Pro M1?</li>
</ul>

<h2 id="setup-guide-a-summary">Setup Guide: A Summary</h2>

<p>The <a href="https://github.com/mxagar/linux_nvidia_egpu">GitHub repository I have created</a> answers all the key questions and walks you through the complete setup process for getting an NVIDIA eGPU up and running. It includes detailed guidance on:</p>

<ul>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-0-hardware-requirements">Hardware requirements</a>: <em>What components do you need for an eGPU setup? Which GPUs and enclosures are compatible? How much VRAM do typical ML models require?</em></li>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-1-install-ubuntu">Installation of Ubuntu</a> and <a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-3-install-and-configure-nvidia-and-gpu-related-libraries">NVIDIA libraries</a>: <em>How do you install and configure Ubuntu so it works seamlessly with my external NVIDIA GPU?</em></li>
</ul>

<p>Beyond the essentials, the guide also covers some practical extras that make the setup truly usable day to day:</p>

<ul>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-5-install-docker-with-nvidia-gpu-support">Installation of Docker with GPU support</a>: Containerization is now a must in AI/ML workflows. Unfortunately, enabling full GPU acceleration inside Docker images can be tricky — this section provides a simple, reliable recipe that works.</li>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-6-remote-access-configuration">Remote access configuration</a>: This section explains how to securely connect to the Ubuntu GPU machine from another device (e.g., a MacBook) within the same local network.</li>
</ul>

<p>After you’ve completed the setup, you can verify that your eGPU is correctly recognized by running a quick check in the Mac’s terminal.</p>

<p align="center">
<img src="/assets/linux_nvidia_egpu/mac_nvidia_smi.png" alt="MacOS NVIDIA SMI" width="1000" />
<small style="color:grey">Snapshot of the <code>nvidia-smi</code> output on the Ubuntu machine (hostname: <code>urgull</code>) but executed from my MacBook (hostname: <code>kasiopeia</code>). We can see the eGPU and its load: NVIDIA GeForce RTX 3060, 14W / 170W used, 26MiB / 12288MiB used.
</small>
</p>

<h2 id="using-the-egpu-remote-vs-code-and-ollama">Using the eGPU: Remote VS Code and Ollama</h2>

<p>Once we get the correct output from <code class="language-plaintext highlighter-rouge">nvidia-smi</code>, we can start using the eGPU. To that end, the <a href="https://github.com/mxagar/linux_nvidia_egpu">guide GitHub repository</a> contains a simple Jupyter notebook we can run: <a href="https://github.com/mxagar/linux_nvidia_egpu/blob/main/test_gpu.ipynb">test_gpu.ipynb</a>.</p>
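<p>For instance, a first cell along these lines verifies that PyTorch sees the GPU — a sketch of the kind of check performed in the notebook, not its exact code:</p>

```python
import torch

# Pick the GPU if PyTorch can see it (on the Ubuntu machine, the eGPU);
# otherwise, fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(torch.cuda.get_device_name(0))  # e.g., "NVIDIA GeForce RTX 3060"
else:
    device = torch.device("cpu")
    print("CUDA not available; falling back to CPU")

# a small matrix multiplication exercises the selected device
x = torch.randn(1024, 1024, device=device)
y = x @ x
print(y.shape)  # torch.Size([1024, 1024])
```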

<p>The way I prefer to run complete repositories remotely (i.e., on the Ubuntu machine but interfaced from the MacBook) is using a <a href="https://code.visualstudio.com/docs/remote/ssh"><strong>Remote VS Code</strong></a> instance. To start one, these are the preliminary steps we need to follow:</p>

<ol>
  <li>Open the MacBook Terminal (make sure no VPN connections are active).</li>
  <li>SSH to the Ubuntu machine with our credentials.</li>
  <li>Clone the <a href="https://github.com/mxagar/linux_nvidia_egpu">GitHub repository</a> with the notebook <a href="https://github.com/mxagar/linux_nvidia_egpu/blob/main/test_gpu.ipynb">test_gpu.ipynb</a>.</li>
  <li>Install the GPU Conda environment.</li>
</ol>

<p>Steps 1-4 are carried out with these commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># -- MacBook</span>
ssh &lt;username&gt;@&lt;hostname-ubuntu&gt;.local
ssh mikel@urgull.local

<span class="c"># -- Ubuntu via MacBook</span>
<span class="nb">cd</span> <span class="o">&amp;&amp;</span> <span class="nb">mkdir</span> <span class="nt">-p</span> git_repositories <span class="o">&amp;&amp;</span> <span class="nb">cd </span>git_repositories
git clone https://github.com/mxagar/linux_nvidia_egpu.git
conda <span class="nb">env </span>create <span class="nt">-f</span> conda.yaml  <span class="c"># Create the 'gpu' environment</span>
</code></pre></div></div>

<p>Then, we can start a remote VS Code instance:</p>

<ol>
  <li>We open VS Code on our MacBook.</li>
  <li>Click on <em>Open Remote Window</em> (bottom left corner) &gt; <em>Connect to Host</em>.</li>
  <li>We enter the user and host as in <code class="language-plaintext highlighter-rouge">&lt;username&gt;@&lt;hostname-ubuntu&gt;.local</code>, followed by the password.</li>
</ol>

<p>… <em>et voilà</em>: we already have a VS Code instance running on the Ubuntu machine, but interfaced through the MacBook UI! Now, we can open any folder, including the folder containing the notebook:</p>

<ol>
  <li>Click on <em>Explorer menu</em> (left menu bar) &gt; <em>Open Folder</em>.</li>
  <li>And, finally, we load our repository cloned in <code class="language-plaintext highlighter-rouge">~/git_repositories/linux_nvidia_egpu</code>.</li>
</ol>

<p>After selecting the <code class="language-plaintext highlighter-rouge">gpu</code> environment (kernel) for the <a href="https://github.com/mxagar/linux_nvidia_egpu/blob/main/test_gpu.ipynb">test_gpu.ipynb</a> notebook, we can start executing its cells.</p>

<p>Among other things, the notebook trains a simple CNN on the MNIST dataset (~45MB). In my tests, the NVIDIA RTX 3060 completed training in 37 seconds, while the MacBook Pro M1 took about 62 seconds; that is, the eGPU was roughly 1.7x faster.</p>
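<p>The same notebook code can run unchanged on both machines if the compute device is picked dynamically. Here is a minimal sketch (assuming PyTorch is installed, e.g., in the <code class="language-plaintext highlighter-rouge">gpu</code> environment; the function name is illustrative):</p>

```python
import torch

def pick_device() -> torch.device:
    """Return the best available device: CUDA (eGPU), MPS (Apple Silicon), or CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
# A dummy MNIST-shaped batch, moved to whichever device is available
model_input = torch.randn(8, 1, 28, 28).to(device)
print(device, model_input.shape)
```

<p>On the Ubuntu machine this resolves to <code class="language-plaintext highlighter-rouge">cuda</code>, on the MacBook to <code class="language-plaintext highlighter-rouge">mps</code>, so the same notebook benchmarks both setups.</p>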

<p>In terms of memory, my MacBook has 16GB of <em>unified memory</em>, vs. the 12GB of VRAM on the RTX 3060.
At first glance, the Mac’s chip seems superior to the NVIDIA one.
However, in practice, the NVIDIA GPU performs better for large models, because its VRAM is fully dedicated to GPU workloads, whereas the Mac’s unified memory is shared between CPU and GPU, which can lead to bottlenecks.</p>

<p align="center">
<!-- 80% of the viewport width, centered -->
<!-- Enable larger resolution images for Retina displays -->
<img src="/assets/linux_nvidia_egpu/mac_ubuntu_egpu_vscode@2x.png" srcset="/assets/linux_nvidia_egpu/mac_ubuntu_egpu_vscode@2x.png 2x, /assets/linux_nvidia_egpu/mac_ubuntu_egpu_vscode.png 1x" class="img-breakout" style="--w: 80vw" />
<small style="color:grey">Snapshot of the remote VS Code instance: the repository is on the Ubuntu machine leveraging the eGPU (hostname: <code>urgull</code>), but interfaced from my MacBook (hostname: <code>kasiopeia</code>).
</small>
</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>Another application I use quite extensively with the eGPU is <a href="https://ollama.com/"><strong>Ollama</strong></a>, which enables <em>local</em> Large Language Models (LLMs) for a plethora of tasks.</p>

<p>To run Ollama on the eGPU but interfaced from the MacBook, first, we need to <a href="https://github.com/mxagar/linux_nvidia_egpu?tab=readme-ov-file#ollama-server-use-ollama-llms-running-on-the-gpu-ubuntu-but-from-another-machine">install it</a> properly on both machines. Then, we can follow these easy commands to start a chatbot via the CLI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># -- Ubuntu (...or via ssh from MacBook)</span>
<span class="c"># Make sure Ollama uses GPU and is accessible in our LAN</span>
<span class="nb">export </span><span class="nv">OLLAMA_USE_GPU</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">OLLAMA_HOST</span><span class="o">=</span>0.0.0.0:11434
<span class="c"># Download a 9.1GB model (takes some minutes) and start the Ollama server</span>
ollama pull gemma3:12b
ollama serve &amp;

<span class="c"># -- MacBook</span>
<span class="c"># Change the Ollama host to the Ubuntu machine</span>
<span class="nb">export </span><span class="nv">OLLAMA_HOST</span><span class="o">=</span>urgull.local:11434
ollama run gemma3:12b
<span class="c"># ... now we can chat :)</span>

<span class="c"># To revert to use the local MacBook Ollama service</span>
<span class="nb">export </span><span class="nv">OLLAMA_HOST</span><span class="o">=</span>127.0.0.1:11434
</code></pre></div></div>

<p>The result is summarized in the following snapshot:</p>

<p align="center">
<img src="/assets/linux_nvidia_egpu/mac_ubuntu_egpu_ollama.png" alt="Ollama on eGPU" width="1000" />
<small style="color:grey"><a href="https://deepmind.google/models/gemma/gemma-3/">Gemma 3 12B</a> running on the Ubuntu eGPU via Ollama, but operated from the MacBook.
</small>
</p>

<p>The Ollama server can also be reached from our LAN using <code class="language-plaintext highlighter-rouge">cURL</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://urgull.local:11434/api/generate <span class="nt">-d</span> <span class="s1">'{
  "model": "gemma3:12b",
  "prompt": "Write a haiku about machine learning."
}'</span>
</code></pre></div></div>
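<p>The same endpoint can be called from Python using only the standard library; the sketch below builds a non-streaming request (the <code class="language-plaintext highlighter-rouge">stream: false</code> field asks the server for a single JSON response instead of a token stream; the network call is left commented out, since it requires reaching the Ollama host):</p>

```python
import json
from urllib import request

OLLAMA_URL = "http://urgull.local:11434/api/generate"  # our Ubuntu host

def build_payload(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming /api/generate request body."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

payload = build_payload("gemma3:12b", "Write a haiku about machine learning.")

# Uncomment on a machine that can reach the Ollama server:
# req = request.Request(OLLAMA_URL, data=payload,
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```
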

<p>Of course, there are many other, more interesting downstream applications:</p>

<ul>
  <li>Private chatbot with proper GUI, history, document upload and internet access</li>
  <li>Copilot-style code completion</li>
  <li>Agents: Autonomous CLI Agents, Local Operator, Ollama MCP Agent, etc.</li>
  <li>…</li>
</ul>

<p>However, those applications are beyond the scope of this post; maybe I will introduce them in another one ;)</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>In this post, I’ve shared the motivation and architecture behind my personal eGPU server setup — a compact yet powerful alternative to commercial “AI workstations”. The combination of a Linux GPU server (NVIDIA RTX 3060) and a MacBook (Pro M1) client creates a seamless environment for experimentation: you can train models efficiently, run large LLMs locally via Ollama, and enjoy the responsive UI and ecosystem of macOS for daily work.</p>

<p>Running models locally is not only cost-effective but also empowers you to work autonomously, privately, and creatively, without depending on cloud services.</p>

<p><br /></p>

<blockquote>
  <p>Would you prefer to build your own local AI workstation, or do you trust cloud services enough to rely on them entirely? What would be your ideal balance between local and cloud compute?</p>
</blockquote>

<p><br /></p>

<p>If you’re interested in a step-by-step guide, check <a href="https://github.com/mxagar/linux_nvidia_egpu"><strong>my Github repository of the project</strong></a>.</p>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="eGPU," /><category term="AI" /><category term="engineering," /><category term="LLM," /><category term="machine" /><category term="learning," /><category term="Ollama," /><category term="Remote" /><category term="VS" /><category term="Code," /><category term="Linux," /><category term="deep" /><category term="model" /><category term="training," /><category term="inference" /><summary type="html"><![CDATA[How to Run and Train LLMs Locally with NVIDIA Chips from a Mac &amp; Linux Setup]]></summary></entry><entry><title type="html">An Infinite Text Generator</title><link href="https://mikelsagardia.io/blog/text-generation-rnn.html" rel="alternate" type="text/html" title="An Infinite Text Generator" /><published>2022-10-08T07:30:00+00:00</published><updated>2022-10-08T07:30:00+00:00</updated><id>https://mikelsagardia.io/blog/text-generation-rnn-lstm</id><content type="html" xml:base="https://mikelsagardia.io/blog/text-generation-rnn.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Toy Recurrent Neural Network Based on LSTM Cells Which Generates TV Scripts
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/text_generation_rnn/Chimpanzee_seated_at_typewriter.jpg" alt="A chimpanzee seated at a typewriter" width="1000" />
<small style="color:grey">"If you give me an infinite number of bananas I'll type <em>banana</em> for you." Photo from <a href="https://commons.wikimedia.org/wiki/File:Chimpanzee_seated_at_typewriter.jpg">Wikimedia</a>.</small>
</p>

<p>The <a href="https://en.wikipedia.org/wiki/Infinite_monkey_theorem">infinite monkey theorem</a> states that a monkey writing random letters on a keyboard long enough can reproduce the complete works of Shakespeare. There is even a straightforward proof when <em>long enough</em> tends to infinity.</p>

<p>Now, I don’t plan to have monkeys in my cellar and I surely don’t have infinite time. But could neural networks perhaps aid in that enterprise? It turns out they can, and they are astonishingly effective even with little tweaking effort.</p>

<p><br /></p>

<blockquote>
  <p>Deep neural networks are amazingly good at learning patterns and one can take advantage of that to generate new and structurally coherent data.</p>
</blockquote>

<p><br /></p>

<p>Inspired by the <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">great post</a> from <a href="https://karpathy.ai/">Andrej Karpathy</a> in which he describes how <a href="https://github.com/karpathy/char-rnn">text can be generated character-wise</a>, I implemented a <em>word-wise</em> text generator which works with Recurrent Neural Networks (RNNs). My code can be found in <a href="https://github.com/mxagar/text_generator"><strong>this Github repository</strong></a>.</p>

<p>Are you interested in how this is possible? Let’s dive in!</p>

<h2 id="recursive-neural-networks-and-their-application-to-language-modeling">Recurrent Neural Networks and Their Application to Language Modeling</h2>

<p>While <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">Convolutional Neural Networks (CNNs)</a> are particularly good at capturing spatial relationships, <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks (RNNs)</a> model sequential structures very efficiently. Also, in recent years, the <a href="https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)">Transformer</a> architecture has been shown to work remarkably well with language data – but let’s keep it aside for this small toy project.</p>

<p>In many language modeling applications, and in the particular text generation case explained here, we need to undertake the following general steps:</p>

<ul>
  <li>The text needs to be <strong>processed</strong> as sequences of numerical vectors.</li>
  <li>We define <strong>recurrent layers</strong> which take those sequences of vectors and yield sequences of outputs.</li>
  <li>We take the complete or partial output sequence and we <strong>map it to the target space</strong>, e.g., words.</li>
</ul>

<p>Let’s analyze in more detail what happens in each step.</p>

<h3 id="text-preprocessing">Text Preprocessing</h3>

<p>Computers are able to work only with numbers. In the same way that an image is represented as a matrix of pixels containing <code class="language-plaintext highlighter-rouge">R-G-B</code> values, sentences need to be transformed into numerical values. One common recipe to achieve that is the following:</p>

<ol>
  <li>The text is <a href="https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization"><strong>tokenized</strong></a>: it is converted into a list of elements or tokens that have an identifiable unique meaning; these elements are usually words and related symbols, such as question marks or other punctuation elements.</li>
  <li>A <strong>vocabulary</strong> is created: we generate a dictionary with all the <code class="language-plaintext highlighter-rouge">n</code> unique tokens in the dataset which maps from the token string to an <code class="language-plaintext highlighter-rouge">id</code> and vice versa.</li>
  <li>Tokens are <strong>vectorized</strong>: tokens can be represented as <strong>one-hot encoded</strong> vectors, i.e., each of them becomes a vector of size <code class="language-plaintext highlighter-rouge">n</code> which contains all <code class="language-plaintext highlighter-rouge">0</code>-s except in the index/cell which corresponds to the token <code class="language-plaintext highlighter-rouge">id</code> in the vocabulary, where the value <code class="language-plaintext highlighter-rouge">1</code> is assigned. Then, those one-hot encoded vectors can be compressed to an <a href="https://en.wikipedia.org/wiki/Embedding"><strong>embedding space</strong></a> consisting of vectors of size <code class="language-plaintext highlighter-rouge">m</code>, with <code class="language-plaintext highlighter-rouge">m &lt;&lt; n</code>. Those embedded vectors contain floating point numbers, i.e., unlike their one-hot encoded versions, they are not <em>sparse</em>. That mapping is achieved with an embedding layer, which is akin to a linear layer, and it considerably improves the model efficiency. Typical reference sizes are <code class="language-plaintext highlighter-rouge">n = 70,000</code>, <code class="language-plaintext highlighter-rouge">m = 300</code>.</li>
</ol>

<p>Note that, in practice, one-hot encoding the tokens can be skipped. Instead, tokens are represented with their <code class="language-plaintext highlighter-rouge">id</code> or <code class="language-plaintext highlighter-rouge">index</code> values in the vocabulary and the embedding layer handles everything with that information. That is possible because each token has a unique <code class="language-plaintext highlighter-rouge">id</code> value which triggers <code class="language-plaintext highlighter-rouge">m</code> unique weights only. The following figure illustrates that idea and the overall vectorization of the tokens:</p>
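<p>The token-to-id-to-vector path can be sketched in a few lines of PyTorch (the sizes here are illustrative, not the <code class="language-plaintext highlighter-rouge">n = 70,000</code> / <code class="language-plaintext highlighter-rouge">m = 300</code> reference values):</p>

```python
import torch
import torch.nn as nn

# 1. Tokenize and build a vocabulary (token -> id)
tokens = "the dog is eating a bone".split()
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# 2. Represent tokens by their ids; no explicit one-hot vectors are needed
ids = torch.tensor([vocab[t] for t in tokens])

# 3. The embedding layer maps each id to a dense vector of size m
m = 4
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=m)
vectors = embedding(ids)  # shape: (sequence length, m)
print(vectors.shape)
```

<p>Note that the embedding weights start random; they are learned jointly with the rest of the network during training.</p>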

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/Embeddings.png" alt="Text vectorization: the word 'dog' converted into an embedding vector" width="600" />
<br />
<small style="color:grey">Text vectorization: the word "dog" converted into an embedding vector. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<h3 id="recurrent-neural-networks">Recurrent Neural Networks</h3>

<p>Once we have sequences of vectorized tokens, we can feed them to recurrent layers that learn patterns from them. For instance, in our word-wise text generator, we might input a sequence like</p>

<p><code class="language-plaintext highlighter-rouge">The</code>, <code class="language-plaintext highlighter-rouge">dog</code>, <code class="language-plaintext highlighter-rouge">is</code>, <code class="language-plaintext highlighter-rouge">eating</code>, <code class="language-plaintext highlighter-rouge">a</code></p>

<p>and make the model learn to output the target token <code class="language-plaintext highlighter-rouge">bone</code>. In other words, the network is trained to predict the likeliest vector(s) given the sequence of vectors we have shown it.</p>

<p>Recurrent layers are characterized by the following properties:</p>

<ul>
  <li>Vectors of each sequence are fed one by one to them.</li>
  <li>Neurons that compose those layers keep a <em>memory state</em>, also known as <em>hidden state</em>.</li>
  <li>The memory state from the previous step, i.e., the one produced by the previous vector in the sequence, is used in the current step to produce a new output and a new memory state.</li>
</ul>

<p>The most basic recurrent layer is the <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Simple RNN or Elman Network</a>, depicted in the following figure:</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/SimpleRNN.png" alt="The model of a Simple Recurrent Neural Network or Elman Network" width="600" />
<br />
<small style="color:grey">The model of a Simple Recurrent Neural Network or Elman Network. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<p>In the picture, we can see that we have 3 vectors for each time step \(t\): the input \(x\), the output \(y\) and the memory state \(s\). Additionally, the previous memory state is used together with the current input to generate the new memory state, and that new memory state is mapped to be the output. In that process, 3 weight matrices are used (\(W_x\), \(W_y\) and \(W_s\)), which are learned during training.</p>
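<p>Written directly from the figure, one recurrence step combines the previous memory state with the current input. A minimal sketch with illustrative sizes (biases and an output activation are omitted for brevity):</p>

```python
import torch

input_size, state_size, output_size = 3, 4, 2
W_x = torch.randn(state_size, input_size)   # input -> state
W_s = torch.randn(state_size, state_size)   # previous state -> state
W_y = torch.randn(output_size, state_size)  # state -> output

def elman_step(x_t, s_prev):
    """One recurrence step: new state from input + previous state, output from new state."""
    s_t = torch.tanh(W_x @ x_t + W_s @ s_prev)
    y_t = W_y @ s_t
    return y_t, s_t

s = torch.zeros(state_size)
for x in torch.randn(5, input_size):  # a sequence of 5 input vectors
    y, s = elman_step(x, s)
print(y.shape, s.shape)
```
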

<p>Unfortunately, simple RNNs or Elman networks suffer from the <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem"><strong>vanishing gradient</strong></a> problem; due to that, in practice, they can reuse only 8-10 previous steps. Luckily, <a href="https://en.wikipedia.org/wiki/Long_short-term_memory"><strong>Long Short-Term Memory (LSTM) units</strong></a> were introduced by Hochreiter and Schmidhuber in 1997. LSTMs efficiently alleviate the vanishing gradient issue and are able to handle more than 1,000 previous steps.</p>

<p>LSTM cells are differentiable units that perform several operations every step; those operations decide which information is removed from memory, which is kept in it, and which is used to form an output. They split the memory into two types, as shown in the next figure:</p>

<ul>
  <li>short-term memory, which captures recent inputs and outputs,</li>
  <li>and long-term memory, which captures the context.</li>
</ul>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/LSTMs.png" alt="The abstract model of Long Short-Term Memory (LSTM) unit" width="600" />
<br />
<small style="color:grey">The abstract model of Long Short-Term Memory (LSTM) unit. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<p>Therefore, we have:</p>

<ul>
  <li>Three inputs:
    <ul>
      <li>signal/event: \(x_t\)</li>
      <li>previous short-term memory: \(h_{t-1}\)</li>
      <li>previous long-term memory : \(C_{t-1}\)</li>
    </ul>
  </li>
  <li>Three outputs:
    <ul>
      <li>transformed signal or output: \(y_t = h_t\)</li>
      <li>current/updated short-term memory: \(h_t\)</li>
      <li>current/updated long-term memory: \(C_t\)</li>
    </ul>
  </li>
</ul>

<p>Note that the updated short-term memory is the signal output, too!</p>

<p>All 3 inputs are used in the cell in <strong>4 different and interconnected gates</strong> to generate the 3 outputs; these internal gates are:</p>

<ul>
  <li>The <strong>forget</strong> gate, where useless parts of the previous long-term memory are forgotten, creating a <em>lighter</em> long-term memory.</li>
  <li>The <strong>learn</strong> gate, where the previous short-term memory and the current event are learned.</li>
  <li>The <strong>remember</strong> gate, in which we mix the <em>light</em> long-term memory with forgotten parts and the learned information to form the new long-term memory.</li>
  <li>The <strong>use</strong> gate, in which, similarly, we mix the <em>light</em> long-term memory with forgotten parts and the learned information to form the new short-term memory.</li>
</ul>

<p>If you are interested in more detailed information, <a href="https://colah.github.io/">Christopher Olah</a> has a great post which explains what’s exactly happening inside an LSTM unit: <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>. Also, note that a simpler but similarly efficient alternative to LSTM cells are <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit"><strong>Gated Recurrent Units (GRUs)</strong></a>.</p>

<p>From a pragmatic point of view, it suffices to know that LSTM units have short- and long-term memory vectors which are automatically passed from the previous to the current step. Additionally, the output of the cell is the short-term memory or hidden state, and since we input a <em>sequence</em> of embedded vectors to the unit, we obtain a <em>sequence</em> of hidden vectors.</p>
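<p>In PyTorch, this pragmatic view maps directly onto <code class="language-plaintext highlighter-rouge">nn.LSTM</code>: a sequence of embedded vectors goes in, and a sequence of hidden states comes out, together with the final short- and long-term memories. A sketch with illustrative sizes:</p>

```python
import torch
import torch.nn as nn

m, hidden_size, seq_len = 3, 4, 5
lstm = nn.LSTM(input_size=m, hidden_size=hidden_size, batch_first=True)

x = torch.randn(1, seq_len, m)   # one batch of 5 embedded vectors
out, (h_n, c_n) = lstm(x)        # out: the hidden state at every step

print(out.shape)  # (1, 5, 4): one hidden vector per input vector
print(h_n.shape)  # (1, 1, 4): final short-term memory (hidden state)
print(c_n.shape)  # (1, 1, 4): final long-term memory (cell state)
```
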

<h3 id="final-mapping-and-putting-it-all-together">Final Mapping and Putting It All Together</h3>

<p>Usually, 2-3 RNN layers are stacked one after the other and the final output vector sequence can be mapped to the desired target space. For instance, in the case of the text generation example, I have used a fully connected layer which transforms the <em>last vector from the output sequence</em> to <em>one vector of the size of the vocabulary</em>; thus, given a sequence of words/tokens, the model is fit to predict the next most likely one.</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/TextGeneration.png" alt="The complete text generation pipeline" width="800" />
<br />
<small style="color:grey">A complete text generation pipeline. In the example, the vocabulary size is n = 10 and we pass a sequence of 5 tokens to the network. The embedding size is m = 3 and the hidden states have a size of 4. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<p>As already mentioned, the output of an LSTM cell is a sequence of hidden states; the length of that sequence is the same as the length of the input sequence and each vector has the size of a hidden state, which can be different than the embedding dimension <code class="language-plaintext highlighter-rouge">m</code> (that hidden dimension is a <a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)"><em>hyperparameter</em></a> we can modify). Since in our application we only take the last hidden state from that sequence, the defined RNN architecture is of the type <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/"><em>many-to-one</em></a>. However, other types of architectures can be designed thanks to the sequential nature of the RNNs; for instance, we can implement a <em>many-to-many</em> mapping, which is used to perform language translation, or <em>one-to-many</em>, employed in <a href="https://github.com/mxagar/image_captioning">image captioning</a>.</p>
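<p>The many-to-one pipeline from the figure fits in a compact module; the following is a sketch using the figure’s sizes (n = 10, m = 3, hidden size 4), not the exact implementation from the repository:</p>

```python
import torch
import torch.nn as nn

class WordGenerator(nn.Module):
    """Embedding -> LSTM -> fully connected layer over the last hidden state."""
    def __init__(self, vocab_size=10, embed_dim=3, hidden_size=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq, embed_dim)
        out, _ = self.lstm(embedded)          # (batch, seq, hidden_size)
        return self.fc(out[:, -1, :])         # last hidden state -> vocab scores

model = WordGenerator()
scores = model(torch.randint(0, 10, (2, 5)))  # 2 sequences of 5 token ids
print(scores.shape)  # (2, 10): one score per vocabulary token
```

<p>Training then amounts to minimizing a cross-entropy loss between these scores and the id of the true next token.</p>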

<p>At the end of the day, we need to gather the dataset we’d like to fit, apply the matrix mappings that relate the input features to the target values, and learn the weights within those matrices by optimization. With RNNs, we additionally need to consider that we are working with sequences.</p>

<p>After seeing a sequence of tokens, the trained model is able to infer the likelihood of each token in the vocabulary to be the next one. That functionality is wrapped in a text generation application that works as follows:</p>

<ol>
  <li>We define an initial sequence filled with the padding token and allocate in its last cell a priming word/token. The padding token is a placeholder or <em>empty</em> symbol, whereas the priming token is the seed with which the model will start to generate text.</li>
  <li>The sequence is fed to the network and it produces the probabilities for all possible tokens. We take a random token from the 5 most likely ones: that is the first generated token/word.</li>
  <li>The previous input sequence is rolled one element to the front and the last generated token is inserted in the last position.</li>
  <li>We repeat steps 2 and 3 until we generate the number of words we want.</li>
</ol>
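<p>The four steps above can be sketched as a loop; here a stub “model” stands in for the trained network, and the names <code class="language-plaintext highlighter-rouge">pad_id</code> and <code class="language-plaintext highlighter-rouge">top_k</code> are illustrative:</p>

```python
import random

def generate(model, prime_id, vocab_size, seq_len=5, n_words=10, pad_id=0, top_k=5):
    """Roll a fixed-length window over generated ids, sampling from the top-k tokens."""
    window = [pad_id] * (seq_len - 1) + [prime_id]   # step 1: padded seed sequence
    generated = [prime_id]
    for _ in range(n_words):
        probs = model(window)                        # step 2: per-token probabilities
        top = sorted(range(vocab_size), key=lambda i: probs[i], reverse=True)[:top_k]
        next_id = random.choice(top)                 # pick among the k likeliest
        generated.append(next_id)
        window = window[1:] + [next_id]              # step 3: roll the window
    return generated                                 # step 4: repeat until done

# A stub model that simply prefers higher token ids, for illustration:
stub = lambda window: [i / 10 for i in range(10)]
print(generate(stub, prime_id=3, vocab_size=10))
```

<p>Sampling among the top 5 tokens instead of always taking the likeliest one keeps the generated text from looping over the same phrases.</p>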

<h2 id="results">Results</h2>

<p>To train the network, I used the <a href="https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles">Seinfeld Chronicles Dataset from Kaggle</a>, which contains the complete scripts from the <a href="https://en.wikipedia.org/wiki/Seinfeld">Seinfeld TV Show</a>. To be honest, I’ve never watched Seinfeld, but the conversations do seem structurally fine :sweat_smile:</p>

<p>You can judge it by yourself:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jerry: you know, it's the way i can do. i don't know what the hell happened.

jerry: what?

george: what about it?

elaine: i think you could be able to get out of here.

jerry: oh, i can't do anything about the guy.

jerry: what?

george:(smiling) yeah..........

george: you know, you should do the same thing.

jerry: i think i can.

jerry: oh, no, no! no. no.

jerry: i don't know.(to the phone) what do you think?

george: what?

jerry: oh, i think you're not a good friend.

jerry: yeah.

jerry: oh, you can't.

jerry:(to the phone) hey, hey, hey!

jerry:(to jerry) hey hey hey, hey!

george: hey, i can't believe i was gonna have to do that.

george: i don't know how much this is.

kramer:(smiling to jerry) i don't know, i'm not gonna get it.

kramer:(pointing) oh!(starts maniacally pleased to himself, and exits) oh, my god, i don't know!

elaine:(pause) i can't believe i can't. i don't know how much i mean, i was just thinking about this thing! i mean, i'm gonna take it.

george: you know what you want?

elaine: oh yeah, well, i'm gonna go see the way to get it.

elaine: oh yeah, well, i am not gonna get a little uncomfortable for the.

george: what?

george: oh. i don't know what the problem is.

george:(smiling, to himself, he looks in his head.

george: i can't believe you said it was an accident.

elaine: yeah, but you should take some more
</code></pre></div></div>

<h2 id="conclusions">Conclusions</h2>

<p>In this blog post I explain how the <a href="https://github.com/mxagar/text_generator">toy word-wise text generator I implemented</a> works. The application uses Recurrent Neural Networks (RNNs) consisting of Long Short-Term Memory (LSTM) units; the parts and steps developed for it are common to many Natural Language Processing (NLP) applications, such as <a href="https://github.com/mxagar/text_sentiment">sentiment analysis</a> or <a href="https://github.com/mxagar/image_captioning">image captioning</a>, and I try to answer the central questions around them:</p>

<ul>
  <li>Text processing: what tokenization and vocabulary generation are, and why we need to vectorize words in embedding spaces.</li>
  <li>RNNs and LSTM units: what these recurrent layers do and the shape of their inputs and outputs.</li>
  <li>Final sequence mapping: how the outputs from recurrent layers can be transformed into the target space.</li>
</ul>

<p>I trained the model with the <a href="https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles">Seinfeld Chronicles Dataset from Kaggle</a> and, although the generated text doesn’t make complete sense, the dialogues seem structurally similar to the ones in the dataset; in some cases, I read 1-3 sentences and I can almost hear the sitcom laugh track in the background :joy:</p>

<p><br /></p>

<blockquote>
  <p>Which text would you like to capture and regenerate?</p>
</blockquote>

<p><br /></p>

<p>If you’re interested in more technical details related to the topic, you can have a look at <a href="https://github.com/mxagar/text_generator"><strong>Github repository of the project</strong></a>. Also, if you’d like to see how a very similar architecture as the one used here can be employed to generate text descriptions of image contents, you can have a look at my <a href="https://github.com/mxagar/image_captioning">image captioning project</a>.</p>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/text-generation-rnn.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/text-generation-rnn.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="machine" /><category term="learning," /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="neural" /><category term="networks," /><category term="deep" /><category term="natural" /><category term="language" /><category term="processing," /><category term="recurrent" /><category term="generative" /><category term="model," /><category term="RNN," /><category term="LSTM," /><category term="TV" /><category term="script" /><summary type="html"><![CDATA[A Toy Recurrent Neural Network Based on LSTM Cells Which Generates TV Scripts]]></summary></entry><entry><title type="html">From Jupyter Notebooks to Production-Level Code</title><link href="https://mikelsagardia.io/blog/machine-learning-production-level.html" rel="alternate" type="text/html" title="From Jupyter Notebooks to Production-Level Code" /><published>2022-09-23T07:30:00+00:00</published><updated>2022-09-23T07:30:00+00:00</updated><id>https://mikelsagardia.io/blog/machine-learning-production-level</id><content type="html" xml:base="https://mikelsagardia.io/blog/machine-learning-production-level.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Boilerplate Package to Transform Machine Learning Research Notebooks into Deployable Pipelines
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/machine_learning_production/notebook.jpg" alt="A notebook" width="1000" />
<small style="color:grey">A glimpse to my current notebook. Photo by the author.</small>
</p>

<p>I love writing and drawing on my notebook. In there, you’ll find not only formulas or flow charts, but also funny cartoons, interminable lists of things I’d like to do, shopping lists, or important scribbles my kids leave me every now and then. Therefore, one could say it is a unique window to what’s going on in my mind and life.</p>

<p>I think something similar happens with the <a href="https://jupyter.org">Jupyter notebooks</a> commonly used in data science: they are great because it’s very easy to try new ideas with code in them, you jot down notes beside the features you engineered or the models you tried, and everything is visually great – but the produced content often grows chaotically and it ends up being unusable in real life without proper modifications.</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/machine_learning_production/sj-YDvfndOs4IQ-unsplash.jpg" alt="Chocolate cookies: Photo by @sjcbrn on Unsplash" width="1000" />
<small style="color:grey">Photo by <a href="https://unsplash.com/@sjcbrn">SJ</a> on <a href="https://unsplash.com/photos/YDvfndOs4IQ">Unsplash</a>.</small>
</p>

<p>I have also noticed that I become sloppier and lazier when I spend too long around notebooks; it’s like leaving your vegetables unfinished and indulging in cookies for dessert. And then, you try to fit into that wedding suit and realize it somehow shrunk.</p>

<p><br /></p>

<blockquote>
  <p>Jupyter Notebooks are like chocolate cookies: You know you should eat them in moderation, but you can’t help sneaking the last one again.</p>
</blockquote>

<p><br /></p>

<h2 id="applying-software-engineering-and-devops-to-research-code">Applying Software Engineering and DevOps to Research Code</h2>

<p>Food metaphors aside, and using the jargon of the Software Engineering world, Jupyter notebooks belong to <strong>research and development environments</strong>, whereas deployed code belongs to <strong>production environments</strong>. Most data science projects never leave the research environment, because their goal is to provide useful insights. However, when the created models need to be used for online predictions with new data, we need to raise the code and infrastructure quality to production standards, characterized by a guarantee of reliability.</p>

<p>Machine learning systems have particular properties that present new challenges in production, as <a href="https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">Sculley et al.</a> pointed out in the work that laid the motivational foundations of what is becoming the field of <a href="https://en.wikipedia.org/wiki/MLOps">MLOps</a>. Many tools which target those specific needs have appeared in recent years; those tools and the applications which use them are often categorized into maturity levels:</p>

<ul>
  <li>Level 0 (research and development): data analysis and modeling is performed to answer business questions, but the models are not used to perform online inferences.</li>
  <li>Level 1 (production): the inference pipeline is deployed manually and the artifact versions are tracked (models, data, code, etc.) and pipeline outputs monitored.</li>
  <li>Level 2 (<em>very serious</em> production): deployments of training and inference pipelines are done automatically and frequently, enabling large-scale continuously updated applications.</li>
</ul>

<p>Small/medium-sized projects (teams of 1-20 people) typically require level 1 maturity, and the companies where they are implemented often don’t have the resources to go for level 2.</p>

<p>In this article, <strong>I present a standardized way of transforming research notebooks into production-level code</strong>; in MLOps maturity levels that represents the journey from level 0 to 1. To that end, I have implemented <strong>a boilerplate project with production-ready quality that can be cloned from this <a href="https://github.com/mxagar/customer_churn_production">Github repository</a></strong>.</p>

<p>The selected business case consists of analyzing <strong>customer churn</strong> using the <a href="https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers/code">Credit Card Customers</a> dataset from <a href="https://www.kaggle.com/">Kaggle</a>. Data analysis, modeling and inference pipelines are implemented in the project to end up with an interpretable model-pipeline that is also able to perform reliable predictions. However, the package is designed so that the business case and the data analysis can be easily replaced, and the focus lies on providing a template with the following properties:</p>

<ul>
  <li>Structure which reflects the typical steps in a small/medium-sized data science project</li>
  <li>Readable, simple, concise code</li>
  <li>PEP8 conventions applied, checked with <a href="https://pypi.org/project/pylint/">pylint</a> and <a href="https://pypi.org/project/autopep8/">autopep8</a></li>
  <li>Modular and efficient code, with Object-Oriented patterns</li>
  <li>Documentation provided at different stages: code, <code class="language-plaintext highlighter-rouge">README</code> files, etc.</li>
  <li>Error/exception handling</li>
  <li>Execution and data testing with <a href="https://docs.pytest.org/en/7.1.x/">pytest</a></li>
  <li>Logging implemented during production execution and testing</li>
  <li>Dependencies controlled for custom environments</li>
  <li>Installable python package</li>
  <li>Basic containerization with <a href="https://www.docker.com/">Docker</a></li>
</ul>

<p>However, a few properties are still missing to reach full level 1:</p>

<ul>
  <li>Deployment of the pipeline</li>
  <li>Tracking of the generated artifacts (model-pipelines, data, etc.)</li>
  <li>Monitoring of the model (drift)</li>
</ul>

<p>Those are fundamental attributes, but I consider them out of scope for this article/project, because they often rely on additional 3rd-party tools. My goal is to provide a template to transform notebook code into professional software using as few additional tools as possible; after that, we have a solid base on which to add more layers that take care of tracking and monitoring the different elements.</p>

<h3 id="the-boilerplate">The Boilerplate</h3>

<p>The boilerplate project from the <a href="https://github.com/mxagar/customer_churn_production">Github repository</a> has the following basic file structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
├── README.md                         # Package description, usage, etc.
├── churn_notebook.ipynb              # Research notebook
├── config.yaml                       # Configuration file for production
├── customer_churn/                   # Production library, package
│   ├── __init__.py                   # Python package file         
│   ├── churn_library.py              # Production library
│   └── transformations.py            # Utilities for the library
├── data/                             # Dataset folder
│   ├── README.md                     # Dataset details
│   └── bank_data.csv                 # Dataset file
├── main.py                           # Executable of production code
├── requirements.txt                  # Dependencies
├── setup.py                          # Python package file
└── tests/                            # Pytest testing scripts
    ├── __init__.py                   # Python package file
    ├── conftest.py                   # Pytest fixtures
    └── test_churn_library.py         # Tests for churn_library.py
</code></pre></div></div>

<p>All the research work of the project is contained in the notebook <code class="language-plaintext highlighter-rouge">churn_notebook.ipynb</code>; in particular, simplified implementations of the typical data processing and modeling tasks are performed:</p>

<ol>
  <li>Data Acquisition/Import</li>
  <li>Exploratory Data Analysis (EDA)</li>
  <li>Data Processing: Data Cleaning, Feature Engineering (FE)</li>
  <li>Data Modelling: Training, Evaluation, Interpretation</li>
  <li>Model Scoring: Inference</li>
</ol>

<p>The code from <code class="language-plaintext highlighter-rouge">churn_notebook.ipynb</code> has been transformed to create the package <code class="language-plaintext highlighter-rouge">customer_churn</code>, which contains two files:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">churn_library.py</code>: this file contains most of the refactored and modified code from the notebook.</li>
  <li><code class="language-plaintext highlighter-rouge">transformations.py</code>: definition of auxiliary transformations used in the data processing; complex operations on the data are implemented in Object-Oriented style so that they can be cleanly applied as with the <a href="https://scikit-learn.org/stable/modules/preprocessing.html"><code class="language-plaintext highlighter-rouge">sklearn.preprocessing</code></a> package.</li>
</ul>

<p>Additionally, a <code class="language-plaintext highlighter-rouge">tests</code> folder is provided, which contains <code class="language-plaintext highlighter-rouge">test_churn_library.py</code>. This script performs unit tests on the different functions of <code class="language-plaintext highlighter-rouge">churn_library.py</code> using <a href="https://docs.pytest.org/">pytest</a>.</p>
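<p>To illustrate that testing setup, here is a minimal sketch of a <a href="https://docs.pytest.org/">pytest</a> unit test in the style of <code class="language-plaintext highlighter-rouge">test_churn_library.py</code>; the function and fixture below are simplified stand-ins, not the exact API of the package:</p>

```python
# Hypothetical sketch: import_data() is a simplified stand-in for the
# real function in churn_library.py; the fixture plays the role of the
# ones defined in conftest.py.
import io

import pandas as pd
import pytest


def import_data(csv_source):
    """Load the dataset and fail early if it is empty."""
    df = pd.read_csv(csv_source)
    if df.empty:
        raise ValueError("The imported dataset is empty.")
    return df


@pytest.fixture
def sample_csv():
    # In the real project this would be the path to data/bank_data.csv
    return io.StringIO("Customer_Age,Credit_Limit\n45,12691\n49,8256\n")


def test_import_data(sample_csv):
    df = import_data(sample_csv)
    # Basic sanity checks on shape and expected columns
    assert df.shape[0] > 0
    assert "Customer_Age" in df.columns
```

<p>Such tests are run with <code class="language-plaintext highlighter-rouge">pytest tests/</code> from the repository root; in the actual project, logging calls inside each test record successes and failures.</p>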

<p>The executable or <code class="language-plaintext highlighter-rouge">main</code> function is provided in <code class="language-plaintext highlighter-rouge">main.py</code>; this script imports the package <code class="language-plaintext highlighter-rouge">customer_churn</code> and runs three functions from <code class="language-plaintext highlighter-rouge">churn_library.py</code>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">run_setup()</code>: the configuration file <code class="language-plaintext highlighter-rouge">config.yaml</code> is loaded and auxiliary folders are created, if not there yet:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">images</code>: it will contain the images of the EDA and the model evaluation.</li>
      <li><code class="language-plaintext highlighter-rouge">models</code>: it will contain the inference models/pipelines as serialized objects (pickles).</li>
      <li><code class="language-plaintext highlighter-rouge">artifacts</code>: it will contain the data processing parameters created during the training and required for the inference, serialized as pickles.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">run_training()</code>: it performs the EDA, the data checks, the data processing and modeling, and it generates the inference artifacts (the model/pipeline), which are persisted as serialized objects (pickles). In the provided example, logistic regression, support vector machines and random forests are optimized in a grid search to find the best set of hyperparameters.</li>
  <li><code class="language-plaintext highlighter-rouge">run_inference()</code>: it shows how the inference artifacts need to be used to perform a prediction; an exemplary dataset sample created during the training is used.</li>
</ul>
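<p>The flow of <code class="language-plaintext highlighter-rouge">main.py</code> can be sketched as follows; the real functions live in <code class="language-plaintext highlighter-rouge">churn_library.py</code> and read their parameters from <code class="language-plaintext highlighter-rouge">config.yaml</code>, so the bodies below are simplified stubs:</p>

```python
# Self-contained sketch of the main.py flow; function names mirror the
# package, but the implementations are illustrative stubs.
from pathlib import Path


def run_setup(config):
    # Create the auxiliary folders if they are not there yet
    for key in ("images", "models", "artifacts"):
        Path(config[key]).mkdir(parents=True, exist_ok=True)


def run_training(config):
    # EDA, data checks, data processing and modeling; persists the
    # inference artifacts (model/pipeline, processing parameters) as pickles
    pass


def run_inference(config):
    # Load the persisted artifacts and score new data
    pass


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        config = {key: f"{tmp}/{key}" for key in ("images", "models", "artifacts")}
        run_setup(config)
        run_training(config)
        run_inference(config)
```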

<p>The following diagram shows the workflow:</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/machine_learning_production/pipeline_diagram.png" alt="Diagram of the boilerplate package functions" width="600" />
<!--
<small style="color:grey">Diagram of the boilerplate package functions. Image by the author.</small>
-->
</p>

<div style="line-height:150%;">
    <br />
</div>
<p>The training and inference pipelines represented by <code class="language-plaintext highlighter-rouge">run_training()</code> and <code class="language-plaintext highlighter-rouge">run_inference()</code> run one after the other, but their symmetry is clear; in fact, a central property of the package is that <code class="language-plaintext highlighter-rouge">run_training()</code> and <code class="language-plaintext highlighter-rouge">run_inference()</code> share the function <code class="language-plaintext highlighter-rouge">perform_data_processing()</code>. When <code class="language-plaintext highlighter-rouge">perform_data_processing()</code> is executed in <code class="language-plaintext highlighter-rouge">run_training()</code>, it generates the processing parameters and stores them to disk. In contrast, when it is executed in <code class="language-plaintext highlighter-rouge">run_inference()</code>, it loads those stored parameters to perform the data processing for the inference.</p>
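<p>That shared-function idea can be sketched as follows; the signature and the fitted parameters are illustrative, not the exact API of <code class="language-plaintext highlighter-rouge">churn_library.py</code>:</p>

```python
# Hedged sketch: in training mode the processing parameters (here, just
# imputation means) are fitted and pickled; in inference mode they are
# loaded from disk and applied to the new data.
import pickle
from pathlib import Path

import pandas as pd


def perform_data_processing(df, artifact_path, train):
    artifact = Path(artifact_path)
    if train:
        # Fit the processing parameters on the training data and persist them
        params = {"means": df.mean(numeric_only=True).to_dict()}
        with artifact.open("wb") as f:
            pickle.dump(params, f)
    else:
        # Inference: re-use the parameters generated during training
        with artifact.open("rb") as f:
            params = pickle.load(f)
    return df.fillna(value=params["means"])
```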

<p>Note that the implemented <code class="language-plaintext highlighter-rouge">run_inference()</code> is an example and needs to be adapted:</p>

<ul>
  <li>Currently, it is triggered manually and it scores a sample dataset from a <code class="language-plaintext highlighter-rouge">CSV</code> file offline; instead, we should wait for external requests that feed new data to be scored.</li>
  <li>The data processing parameters and the model should be loaded once in the beginning (hence, the dashed box) and used every time new data is scored.</li>
</ul>

<p>Those intentional loose ends are to be tied when deciding how to deploy the model, which is not in the scope of this repository, as mentioned.</p>

<p>Finally, note that this boilerplate is designed for small/medium datasets, which are not that uncommon in small/medium enterprises; in my experience, its structure is easy to understand, implement and adapt. However, as the complexity increases (e.g., when we need to apply extensive feature engineering), it is recommended to apply these changes to the architecture:</p>

<ul>
  <li>All data processing steps should be written in an Object Oriented style and packed into a <a href="https://scikit-learn.org/stable/">Scikit-Learn</a> <code class="language-plaintext highlighter-rouge">Pipeline</code> (or similar), as done in <code class="language-plaintext highlighter-rouge">transformations.py</code>.</li>
  <li>Any data processing that must be applied to new data should be integrated in the inference pipeline generated in <code class="language-plaintext highlighter-rouge">train_models()</code>; that means that we should integrate most of the content in <code class="language-plaintext highlighter-rouge">perform_data_processing()</code> as a <code class="language-plaintext highlighter-rouge">Pipeline</code> in <code class="language-plaintext highlighter-rouge">train_models()</code>. Thus, <code class="language-plaintext highlighter-rouge">perform_data_processing()</code> would be reduced to basic tasks related to cleaning (e.g., duplicate removal) and checking.</li>
</ul>
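<p>As an illustration of that recommendation, here is a minimal sketch of a custom transformer packed into a Scikit-Learn <code class="language-plaintext highlighter-rouge">Pipeline</code> together with the model; the transformer is a simplified example, not the actual content of <code class="language-plaintext highlighter-rouge">transformations.py</code>:</p>

```python
# A custom transformer in the sklearn.preprocessing style: it fits column
# means on the training data and re-uses them at inference, so processing
# and model travel together as one serializable artifact.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class MeanImputer(BaseEstimator, TransformerMixin):
    """Impute missing values with the column means fitted during training."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = np.take(self.means_, cols)
        return X


pipeline = Pipeline([
    ("imputer", MeanImputer()),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])
```

<p>Calling <code class="language-plaintext highlighter-rouge">pipeline.fit(X, y)</code> fits the imputer and the model in one go, and the whole pipeline can be pickled as a single inference artifact.</p>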

<p>More details on the package can be found on the source <a href="https://github.com/mxagar/customer_churn_production">Github repository</a>.</p>

<h2 id="conclusions">Conclusions</h2>

<p>In this article I introduced my personal boilerplate to transform small/medium-sized data science projects into production-ready packages without relying on too many 3rd-party tools. The template works on the customer churn prediction problem using the <a href="https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers/code">Credit Card Customers</a> dataset from <a href="https://www.kaggle.com/">Kaggle</a>, but you are free to clone the boilerplate from its <a href="https://github.com/mxagar/customer_churn_production">Github repository</a> and modify it for your business case. Important software engineering aspects are covered, such as clean code conventions, modularity, reproducibility, logging, error and exception handling, testing, dependency handling with environments, and more.</p>

<p>Topics such as data processing techniques, pipeline deployment, artifact tracking and model monitoring are out of scope; for them, have a look at the following links:</p>

<ul>
  <li><a href="https://mikelsagardia.io/blog/data-processing-guide.html">An 80/20 Guide for Exploratory Data Analysis, Data Cleaning and Feature Engineering</a>.</li>
  <li><a href="https://github.com/mxagar/music_genre_classification">A Boilerplate for Reproducible and Tracked Machine Learning Pipelines with MLflow and Weights &amp; Biases and Its Application to Song Genre Classification</a>.</li>
  <li><a href="https://github.com/mxagar/census_model_deployment_fastapi">Deployment of a Census Salary Classification Model Using FastAPI</a>.</li>
  <li>If you are interested in more MLOps-related content, you can visit my notes on the <a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821">Udacity Machine Learning DevOps Engineering Nanodegree</a>: <a href="https://github.com/mxagar/mlops_udacity">mlops_udacity</a>.</li>
</ul>

<p><br /></p>

<blockquote>
  <p>Do you find the boilerplate helpful? What would you add or modify? Do you know similar templates to learn from?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/machine-learning-production-level.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/machine-learning-production-level.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="machine" /><category term="learning," /><category term="analysis," /><category term="exploratory" /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="regression," /><category term="classification," /><category term="random" /><category term="forests," /><category term="logistic" /><category term="support" /><category term="vector" /><category term="machine," /><category term="python" /><category term="packages," /><category term="production," /><category term="logging," /><category term="PEP8," /><category term="linting," /><category term="testing," /><category term="pytest," /><category term="docker," /><category term="MLOps," /><category term="deployment" /><summary type="html"><![CDATA[A Boilerplate Package to Transform Machine Learning Research Notebooks into Deployable Pipelines]]></summary></entry><entry><title type="html">Practical Recipes for Your Data Processing</title><link href="https://mikelsagardia.io/blog/data-processing-guide.html" rel="alternate" type="text/html" title="Practical Recipes for Your Data Processing" /><published>2022-06-28T07:30:00+00:00</published><updated>2022-06-28T07:30:00+00:00</updated><id>https://mikelsagardia.io/blog/data-processing-guide</id><content type="html" xml:base="https://mikelsagardia.io/blog/data-processing-guide.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  The 80/20 Guide that Solves Your Data Cleaning, Exploratory Data Analysis and Feature Engineering with Tabular Datasets
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/data_processing_guide/tim-gouw-1K9T5YiZ2WU-unsplash.jpg" alt="Donostia-San Sebastian: Photo by @ultrashricco on Unsplash" width="1000" />
<small style="color:grey">Don't worry, working hard often pays off. Photo by <a href="https://unsplash.com/@punttim">Tim Gouw</a> on <a href="https://unsplash.com/photos/1K9T5YiZ2WU">Unsplash</a>.</small>
</p>

<p>Thanks to the powerful packages we have available nowadays, training machine learning models is often a very tiny step in the pipeline of a regular data science project. Altogether, we need to address the following tasks:</p>

<ol>
  <li>Data Understanding &amp; Formulation of the Questions</li>
  <li>Data Cleaning</li>
  <li>Exploratory Data Analysis</li>
  <li>Feature Engineering</li>
  <li>Feature Selection</li>
  <li>Data Modelling</li>
</ol>

<p>Additionally, if online inferences are planned, several parts of steps 2-5 need to be prepared for production environments, i.e., they need to be transferred into scripts in which reproducibility and maintainability can be guaranteed for robust and trustworthy deployments.</p>

<p>Independently of that, and remaining in the research and development environment, steps 2-5 consume a large percentage of the effort. We need to apply some kind of methodical creativity to often messy datasets that almost never behave as we initially expect.</p>

<p>So, is there an easy way out? Unfortunately, I’d say there is not; at least I don’t know one yet. However, <strong>I have collected a series of guidelines and code snippets you can use systematically to ease your data processing journey in a <a href="https://github.com/mxagar/eda_fe_summary">Github repository</a></strong>. It summarizes the map I have sketched over the years.</p>

<p>In the repository, you will find two important files:</p>

<ul>
  <li>A large python script <code class="language-plaintext highlighter-rouge">data_processing.py</code> which contains many code examples; these cover 80% of the processing techniques I usually apply to <em>tabular</em> datasets.</li>
  <li>The <code class="language-plaintext highlighter-rouge">README.md</code> itself, which sums up the steps and <em>dos &amp; don’ts</em> in the standard order for data processing described above.</li>
</ul>

<p>Some caveats:</p>

<ul>
  <li>The script <code class="language-plaintext highlighter-rouge">data_processing.py</code> does not run! Instead, it’s a compilation of useful commands with comments.</li>
  <li>I assume the reader knows the topic, i.e., the repository is not for complete beginners.</li>
  <li>The guide does not cover advanced cases either: it’s a set of tools that follow the 80/20 <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>.</li>
  <li>The guide focuses on <em>tabular</em> data; images and text have their own particular pipelines, not covered here.</li>
  <li>This is my personal guide, made for me; no guarantees are given, and it will probably change organically.</li>
</ul>
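<p>To give a flavor of the kind of snippets collected in <code class="language-plaintext highlighter-rouge">data_processing.py</code> (the actual recipes there are more extensive), here are a few typical pandas commands for cleaning a tabular dataset; the data is synthetic:</p>

```python
# Illustrative cleaning recipe: quantify missing values, impute them,
# and drop duplicates; a tiny synthetic dataset stands in for real data.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 25],
    "city": ["Bilbao", "Donostia", None, "Bilbao"],
})

# Quantify missing values per column before deciding on a strategy
missing_ratio = df.isnull().mean()

# Impute: median for numerical, mode for categorical features
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Drop duplicated rows, keeping the first occurrence
df = df.drop_duplicates().reset_index(drop=True)
```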

<p><br /></p>

<blockquote>
  <p>Do you find the repository helpful? What would you add? Do you know similar summaries to learn from?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/data-processing-guide.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/data-processing-guide.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="analysis," /><category term="exploratory" /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="hypothesis" /><category term="testing," /><category term="regression," /><category term="classification," /><category term="random" /><category term="forests," /><category term="summary" /><summary type="html"><![CDATA[The 80/20 Guide that Solves Your Data Cleaning, Exploratory Data Analysis and Feature Engineering with Tabular Datasets]]></summary></entry><entry><title type="html">Planning Your Next Vacation in Spain</title><link href="https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html" rel="alternate" type="text/html" title="Planning Your Next Vacation in Spain" /><published>2022-06-23T10:30:00+00:00</published><updated>2022-06-23T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/airbnb-spain-basque-country-analysis</id><content type="html" xml:base="https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  Analysis and Modelling of the AirBnB Dataset from the Basque Country
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/airbnb_analysis/san_sebastian_ultrash-ricco-8KCquMrFEPg-unsplash.jpg" alt="Donostia-San Sebastian: Photo by @ultrashricco from Unsplash" width="1000" />
<small style="color:grey">Donostia-San Sebastian. Photo by <a href="https://unsplash.com/photos/8KCquMrFEPg">@ultrashricco from Unsplash</a>.</small>
</p>

<p>In 2020 I decided to move back to my birthplace in the <a href="https://en.wikipedia.org/wiki/Basque_Country_(autonomous_community)">Basque Country</a> (Spain) after almost 15 years in Munich (Germany). The Basque region in Spain is a popular touristic destination, as it has a beautiful seaside with a plethora of surfing spots and alluring hills that call for hiking and climbing adventures. Culture and gastronomy are also important features, both embedded in a friendly and developed society with modern infrastructure.</p>

<p>When the pandemic seemed to start fading away in spring 2022, friends and acquaintances from Europe began asking me about the best areas and trips in the region, hotels and hostels to stay at in case there was no room at my place, etc. The truth is, after so many years abroad I was not the best person to guide them with updated information; however, the <a href="http://insideairbnb.com/get-the-data/">AirBnB dataset from <em>Euskadi</em></a> (i.e., Basque Country in <a href="https://en.wikipedia.org/wiki/Basque_language">Basque language</a>) has clarified some of my questions. The dataset contains, among other files, a list of 5228 accommodations, each of them described with 74 variables.</p>

<p>Following the standard <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining">CRISP-DM process</a> for data analysis, I have cleaned, processed and modelled the dataset to answer three major <em>business</em> questions:</p>

<ol>
  <li><strong>Prices</strong>. Is it possible to build a model that predicts the price from the variables? If so, which are the most important variables that determine the price? Can we detect accommodations that, having a good review score, are a bargain?</li>
  <li><strong>Differences between accommodations with and without beach access</strong>. Surfing or simply enjoying the seaside are probably some of the most important attractions visitors seek on their vacations. However, not all accommodations are within walking distance of a beach. How does that influence the features of the housings?</li>
  <li><strong>Differences between the two most important cities: <a href="https://en.wikipedia.org/wiki/San_Sebastián">Donostia-San Sebastian</a> and <a href="https://en.wikipedia.org/wiki/Bilbao">Bilbao</a></strong>. These province capitals are the biggest and most visited cities in the Basque Country; in fact, their listings account for 50% of all offered accommodations. However, both cities are said to have a different character: Bilbao is a bigger, modern city, without beach access but probably with richer cultural offerings and nightlife; meanwhile, Donostia-San Sebastian is more aesthetic, it has three beaches and it’s perfect for day-strolling. How are those popular differences reflected on the features of the accommodations?</li>
</ol>

<h2 id="the-dataset">The Dataset</h2>

<p>AirBnB provides several CSV files for each world region: (1) a listing of properties that offer accommodation, (2) reviews related to the listings, (3) a calendar and (4) geographical data. A detailed description of the features in each file can be found in the official <a href="https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896">dataset dictionary</a>.</p>

<p>My analysis has concentrated on the listings file, which consists of a table of 5228 rows/entries (i.e., the accommodation places) and 74 columns/features (their attributes). Among the features, we find <strong>continuous variables</strong>, such as:</p>

<ul>
  <li>the price of the complete accommodation,</li>
  <li>accommodates: maximum number of persons that can be accommodated,</li>
  <li>review scores for different dimensions,</li>
  <li>reviews per month,</li>
  <li>longitude and latitude,</li>
  <li>etc.</li>
</ul>

<p>… <strong>categorical variables</strong>:</p>

<ul>
  <li>neighbourhood name,</li>
  <li>property type (apartment, room, hotel, etc.)</li>
  <li>licenses owned by the host,</li>
  <li>amenities offered in the accommodation,</li>
  <li>etc.</li>
</ul>

<p>… <strong>date-related data</strong>:</p>

<ul>
  <li>first and last review dates,</li>
  <li>date when the host joined the platform,</li>
</ul>

<p>… and <strong>image and text data</strong>:</p>

<ul>
  <li>URL of the listing,</li>
  <li>URL of the pictures,</li>
  <li>description of the listing,</li>
  <li>etc.</li>
</ul>

<p>Of course, not all features are meaningful to answer the posed questions. The explanations given on my <a href="https://github.com/mxagar/airbnb_data_analysis">Github repository</a> describe in detail how I dealt with noisy and missing values, and how some features were dropped and others engineered. After that processing, we get a new table with 3931 entries and 353 features.</p>

<p>So… Would you like to have a look at what I have learned from the data? Let’s dive in!</p>

<h2 id="question-1-prices">Question 1: Prices</h2>

<p>In order to check whether we can predict the price, I have trained several models with 90% of the processed dataset (i.e., the training split) using <a href="https://scikit-learn.org/stable/">Scikit-Learn</a>: (1) linear regression as baseline, (2) <a href="https://en.wikipedia.org/wiki/Ridge_regression">Ridge regression</a> (L2 regularized regression), (3) <a href="https://en.wikipedia.org/wiki/Lasso_(statistics)">Lasso regression</a> (L1 regularized regression) and (4) <a href="https://en.wikipedia.org/wiki/Random_forest">random forests</a>. <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation</a> was performed with all of them and their hyperparameters were tuned; additionally, the effect of polynomial features on the model performances was also studied, as thoroughly summarized on the <a href="https://github.com/mxagar/airbnb_data_analysis">Github repository</a>.</p>

<p>The modelling experiments show that the random forests model scores the best R2 value on the test split: 69% of the variance can be explained with the random decision trees. Moreover, adding polynomial terms does not improve the predictions for the present dataset. The following diagram shows the performance of the Ridge regression model and the random forests model on the test split using only the 353 linear features.</p>

<p align="center">
<img src="/assets/airbnb_analysis/regression_evaluation.png" alt="Performance of regression models" width="400" />
</p>
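<p>The cross-validated grid search described above can be sketched as follows; the parameter grid and, of course, the data are illustrative, not the ones used in the project:</p>

```python
# Hedged sketch: grid search with cross-validation over a random forest
# regressor, evaluated with R2 on a held-out test split (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# 90% of the data for training, as in the analysis
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
    scoring="r2",
)
grid.fit(X_train, y_train)
r2_test = grid.score(X_test, y_test)  # R2 on the held-out split
```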

<p>The models tend to under-predict accommodation prices; that bias clearly increases as prices grow beyond 50 USD. Such a moderate R2 does not make the model the best candidate for price predictions. However, we can deduce the most important features that determine the listing prices if we compute the <a href="https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3">Gini importances</a>, as done in the following diagram. The top-5 variables that determine the price of a listing are:</p>

<ul>
  <li>whether an accommodation is an entire home or apartment,</li>
  <li>the number of bathrooms in it,</li>
  <li>the number of accommodates,</li>
  <li>whether the bathroom(s) is/are shared,</li>
  <li>and whether the housing is located in Donostia-San Sebastian.</li>
</ul>

<p align="center">
<img src="/assets/airbnb_analysis/regression_feature_importance_rf.png" alt="Feature importance: Gini importance values of the random forests model" width="600" />
</p>

<p>Note that only the top-30 features are shown; these account for almost 89% of the accumulated Gini importance (all 353 variables would account for 100%).</p>
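<p>Extracting and ranking the Gini importances from a fitted random forest is straightforward with Scikit-Learn; the following sketch uses synthetic data and illustrative feature names:</p>

```python
# Illustrative extraction of Gini importances: feature_importances_ sums
# to 1, so sorting it yields the ranking shown in the diagram.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["entire_home", "bathrooms", "accommodates", "shared_bath", "donostia"]
X = rng.normal(size=(300, len(features)))
# Synthetic target dominated by the first feature
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Sort the normalized importances to obtain the feature ranking
importances = (
    pd.Series(model.feature_importances_, index=features)
    .sort_values(ascending=False)
)
```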

<p>But how does increasing the value of each feature affect the price: does it drive the price up or down? That can be observed in the following diagram, similar to the previous one. In contrast to the former, here the 30 regression coefficients with the largest magnitudes are plotted; red bars are associated with features that decrease the price when they increase, i.e., negative coefficients.</p>

<p align="center">
<img src="/assets/airbnb_analysis/regression_feature_importance_lm.png" alt="Feature importance according to the coefficient value in ridge regression" width="600" />
</p>

<p>Since they are different models, different features appear in each ranking; in any case, both lists are consistent and provide valuable insights. For instance, we deduce that the price decreases the most when</p>

<ul>
  <li>the number of reviews per month increases (note that review positivity is not measured),</li>
  <li>the host is estimated to have shared rooms,</li>
  <li>the accommodation is a shared room,</li>
  <li>and when the bathroom(s) is/are shared.</li>
</ul>
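<p>A minimal sketch of how such signed coefficients can be obtained and ranked with ridge regression; again, the data and the <code>feature_i</code> names are synthetic placeholders:</p>

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the listing features
X, y = make_regression(n_samples=200, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling makes coefficient magnitudes comparable

ridge = Ridge(alpha=1.0).fit(X, y)

# Rank coefficients by absolute magnitude; the sign gives the direction:
# a negative coefficient decreases the predicted price as the feature grows
order = np.argsort(np.abs(ridge.coef_))[::-1]
top = [(f"feature_{i}", ridge.coef_[i]) for i in order[:5]]
negative = [name for name, coef in top if coef < 0]
```

<p>Note that scaling the features first is what makes comparing coefficient magnitudes across features meaningful.</p>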

<p>Finally, a very practical insight to close the pricing question: we can easily select the accommodations which have a very good average review (above the 90th percentile) and a predicted price larger than the real one, as shown in the following figure. These are the likely bargains!</p>
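<p>The bargain filter itself is a couple of lines of pandas; this is a sketch on a made-up frame, with hypothetical column names (<code>price</code>, <code>review_score</code>, <code>predicted_price</code>) standing in for the real dataset columns:</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings with model predictions attached
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(30, 200, 500),
    "review_score": rng.uniform(3.0, 5.0, 500),
})
df["predicted_price"] = df["price"] * rng.normal(1.0, 0.2, 500)

# Likely bargains: top-decile reviews AND a model price above the listed price
threshold = df["review_score"].quantile(0.90)
bargains = df[(df["review_score"] >= threshold)
              & (df["predicted_price"] > df["price"])]
```
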

<p align="center">
<img src="/assets/airbnb_analysis/economical_listings_geo.jpg" alt="Economical listings with high quality" width="800" />
</p>

<p>I prefer not to post the URLs of the detected listings, but it is straightforward to obtain them with the notebooks in the linked repository :wink:.</p>


<h2 id="question-2-to-beach-or-not-to-beach">Question 2: To Beach or not to Beach</h2>

<p>Of course, you can always go to the beach to catch some waves in the Basque Country, but being able to walk there in less than 15 minutes comes at an additional cost, on average. That is one of the insights distilled from the next diagram.</p>

<p>This difference or significance plot shows the <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T and Z statistics</a> computed for each feature considering two independent groups: accommodations with and without beach access. These statistics are related to the difference of means (T statistic, for continuous variables) or proportions (Z statistic, for discrete variables or proportions). If we take the usual significance level of 5%, the critical Z or T value is roughly 2. Thus, if a value in the diagram is greater than 2, the mean or proportion of that feature differs significantly between the two groups; the probability of wrongly claiming a difference when there is none is 5%.</p>

<p>The sign of the statistic is color-coded: blue bars denote positive statistics, which are associated with larger values for accommodations that have beach access.</p>
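<p>As an illustration, here is a minimal sketch of how such T and Z statistics can be computed with SciPy; the group sizes, means, and counts below are made up for the example, not the actual dataset values:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Continuous feature (e.g., price): Welch's t-test on the two groups
price_beach = rng.normal(90, 30, 400)     # listings with beach access
price_no_beach = rng.normal(75, 30, 600)  # listings without
t_stat, t_p = stats.ttest_ind(price_beach, price_no_beach, equal_var=False)

# Binary feature (e.g., waterfront): two-proportion z-test with pooled variance
x1, n1, x2, n2 = 120, 400, 90, 600  # successes / group sizes
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z_stat = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Critical value for a two-sided test at the 5% significance level (~1.96)
z_crit = stats.norm.ppf(0.975)
```

<p>Any statistic above <code>z_crit</code> is significant at the 5% level, which is exactly the &ldquo;roughly 2&rdquo; threshold used to read the diagram.</p>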

<p align="center">
<img src="/assets/airbnb_analysis/beach_comparison.png" alt="Feature differences between accommodations with and without beach access" width="600" />
</p>

<p>Long story short, here’s the interpretation: the accommodations that have a beach within 2 km have significantly larger</p>

<ul>
  <li>proportions of accommodations located in the province of Gipuzkoa, compared to Bizkaia,</li>
  <li>proportions of accommodations with a waterfront,</li>
  <li>and prices.</li>
</ul>

<p>Note that larger statistics don’t necessarily mean larger differences; instead, they mean that the probability of wrongly stating a difference between groups is lower.</p>

<p>Instead of reading the ranking top-down, it is more interesting to compose a <em>profile</em> of listings with beach access and without by selecting features manually; for instance, the accommodations on the seaside:</p>

<ul>
  <li>have larger prices,</li>
  <li>are more often entire homes or apartments,</li>
  <li>usually have fewer shared bathrooms,</li>
  <li>more often have a description in English rather than Spanish (i.e., they target foreign tourists more),</li>
  <li>more often have a beachfront, patio, or balcony,</li>
  <li>have more bedrooms,</li>
  <li>accommodate more guests but require longer minimum stays,</li>
  <li>their hosts more often live nearby,</li>
  <li>…</li>
</ul>

<p>Going back to the price, the following figure shows the price distributions for accommodations within 2 km of a beach and for those farther away. Keep in mind that behind each of the Z/T statistics in the previous diagram there is such a distribution or a contingency table.</p>

<p align="center">
<img src="/assets/airbnb_analysis/price_distribution_beach.png" alt="Price distribution for accommodations with and without beach access in less than 2km" width="600" />
</p>

<h2 id="question-3-athletic-de-bilbao-vs-real-sociedad">Question 3: Athletic de Bilbao vs. Real Sociedad</h2>

<p>If you’re a soccer fan, maybe you’ve heard about the Basque derby: <a href="https://en.wikipedia.org/wiki/Athletic_Bilbao">Athletic de Bilbao</a> vs. <a href="https://en.wikipedia.org/wiki/Real_Sociedad">Real Sociedad</a>. Both football teams are originally from the two major cities in the Basque Country, Bilbao and Donostia-San Sebastian, and they represent the healthy rivalry between the two province capitals.</p>

<p>In order to determine the differences between the two cities in terms of listing features, I have computed the same difference or significance plot as before, shown below.</p>

<p align="center">
<img src="/assets/airbnb_analysis/donostia_bilbao_comparison.png" alt="Feature differences between accommodations in Donostia-San Sebastian and Bilbao" width="600" />
</p>

<p>Donostia-San Sebastian seems to have</p>

<ul>
  <li>larger prices,</li>
  <li>more accommodations with waterfronts,</li>
  <li>more descriptions in English,</li>
  <li>hosts who joined AirBnB longer ago and who manage more accommodations,</li>
  <li>more often patios or balconies,</li>
  <li>more often entire homes or apartments,</li>
  <li>space for more guests,</li>
  <li>…</li>
</ul>

<p>On the other hand, Bilbao has</p>

<ul>
  <li>more accommodations that consist of a single room,</li>
  <li>more shared bathrooms,</li>
  <li>more amenities, such as shampoo, hangers, first aid kits, extra pillows, or breakfast,</li>
  <li>…</li>
</ul>

<p>Finally, as before, I leave the price distributions for both cities, since price is the feature in which the difference is most significant. We can see that the distribution from Bilbao has more listings in the lowest price region and lacks listings with prices above 150 USD, compared to Donostia-San Sebastian. That is in line with several facts explained already, such as Bilbao having more shared rooms and Donostia more entire homes, two characteristics with opposite effects on the price.</p>

<p align="center">
<img src="/assets/airbnb_analysis/price_distribution_city.png" alt="Price distribution for accommodations in Donostia-San Sebastian and Bilbao" width="600" />
</p>

<h2 id="conclusions">Conclusions</h2>

<p>In this blog post, we took a look at the AirBnB accommodation properties from the Basque Country, narrowing down to these insights:</p>

<ol>
  <li>Even though the price regression models have a moderate R2, we have shown how to detect listings that are likely bargains: accommodations with high review scores and a predicted price above the true one. Additionally, we have discovered the features with the largest impact on the price: type of accommodation, bathrooms, location, etc.</li>
  <li>Listings with a beach within 2 km are significantly more often entire homes and have more balconies, waterfronts, and space for more guests; this is in line with their larger prices.</li>
  <li>The two major cities, Donostia-San Sebastian and Bilbao, nicely align with the previous synthesis, Donostia being a beach city and Bilbao a city without one. Additionally, Bilbao seems to favor other practical domestic amenities.</li>
</ol>

<p>These conclusions are quite informal, but I hope they can guide my data-savvy friends; in any case, I’m sure you can have a great vacation anywhere you go in the Basque Country :)</p>

<blockquote>
  <p>Are you planning a trip to the Basque Country? Has this blog post helped you?</p>
</blockquote>

<p>To learn more about this analysis, see my <a href="https://github.com/mxagar/airbnb_data_analysis">GitHub repository</a>. You can download the pre-processed dataset and ask the data your own specific questions!</p>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="analysis," /><category term="exploratory" /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="hypothesis" /><category term="testing," /><category term="regression," /><category term="random" /><category term="forests," /><category term="AirBnB," /><category term="Basque" /><category term="Country," /><category term="price" /><category term="prediction" /><summary type="html"><![CDATA[Analysis and Modelling of the AirBnB Dataset from the Basque Country]]></summary></entry></feed>