Richard Li

7 Lessons from building a small-scale AI application


ChatGPT heralded a seismic shift in software, and one that I felt compelled to understand. So, over the past year, I’ve been building an AI assistant for my past-CEO-self as a pedagogical exercise. It answers questions, gets status reports, and summarizes what’s going on. Reflecting on what I know now, here are my takeaways.

My biggest realization was that problems I would traditionally have characterized as “scale up” problems occurred much earlier than I would have guessed. At the same time, AI tools such as Copilot, which I also wrote about, make a real difference to productivity, just not a simple one: it’s not slower, it’s not faster, it’s just different.

AI programming is stochastic

Unlike conventional programming, getting an AI to work the way you want is a stochastic process. Generally speaking, the process of “programming” an AI involves lots of experiments. In each experiment, you adjust different inputs and parameters and see if you can improve on what you did before.

Generally speaking, adjustments fall into four major categories:

  1. Your prompt. Most likely you’ll start by adjusting your prompt, but there are a striking number of techniques you can adopt, ranging from few-shot prompting to retrieval-augmented generation to chain-of-thought prompting, all of which can substantially improve performance.
  2. Task/domain fine-tuning, using LoRA or its descendants. If you have a bunch of domain-specific data, you’ll definitely want to fine-tune the model on that dataset (see the sketch after this list).
  3. Preference tuning. By optimizing for human-preferred outputs, you can steer the model’s behavior to match specific goals or values more closely.
  4. Hyperparameter tuning. Beyond prompts and fine-tuning, adjusting hyperparameters (e.g., learning rates, batch sizes, and optimizer settings) can improve the model’s training efficiency and performance. While this is often a lower-impact lever compared to prompt engineering or fine-tuning, it’s still a critical step in optimizing results.
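
To make the fine-tuning category a bit more concrete, here is a minimal sketch of what LoRA fine-tuning looks like with 🤗 Transformers and peft. The model name, target modules, and hyperparameters are placeholders for illustration, not the exact values I used.

```python
# Minimal LoRA fine-tuning sketch using Hugging Face transformers + peft.
# Model name, target modules, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with low-rank adapters; only the adapter weights are trained.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# Dataset: JSONL of {"text": "..."} examples produced by your data pipeline.
dataset = load_dataset("json", data_files="domain_corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=10),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("lora-out")  # saves only the small adapter weights
```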

As I went through the process, I started to develop an intuition as to the types of adjustments needed to address a given issue.

Data quality is real work

The mechanics of fine-tuning and preference tuning are quite straightforward and are implemented in many different open source libraries. The hard part is creating the high-quality dataset you need for training in the first place. Building this dataset is not a one-and-done process. I built a pipeline that took unstructured data, transformed it into a format suitable for fine-tuning and preference tuning, and evaluated data quality. Building and iterating on this pipeline took a lot of time.
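
To make that concrete, here is a rough sketch of one stage of such a pipeline: turning unstructured documents into fine-tuning records and applying a couple of crude quality filters. The record schema, thresholds, and helper functions are assumptions for illustration, not my actual pipeline.

```python
# Sketch of one pipeline stage: unstructured text -> fine-tuning records + quality filter.
# The record schema, thresholds, and generate_qa_pair() helper are illustrative assumptions.
import json
from pathlib import Path

def generate_qa_pair(chunk: str) -> dict:
    """Placeholder: in practice this might call an LLM to draft an instruction/output pair."""
    return {"instruction": f"Summarize the following update:\n{chunk}",
            "output": chunk.split("\n")[0]}  # stand-in answer

def passes_quality_checks(record: dict) -> bool:
    """Crude filters: drop records that are too short, too long, or degenerate echoes."""
    out = record["output"].strip()
    if len(out) < 20 or len(out) > 2000:
        return False
    if record["instruction"].strip() == out:
        return False
    return True

def build_dataset(raw_dir: str, out_path: str) -> None:
    seen = set()
    with open(out_path, "w") as f:
        for doc in Path(raw_dir).glob("*.txt"):
            for chunk in doc.read_text().split("\n\n"):  # naive paragraph chunking
                record = generate_qa_pair(chunk)
                key = record["output"][:200]
                if key in seen or not passes_quality_checks(record):
                    continue  # skip duplicates and low-quality records
                seen.add(key)
                f.write(json.dumps(record) + "\n")

build_dataset("raw_notes/", "train.jsonl")
```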

Models are only as good as the evaluation

In mission critical software, your software is only as good as your test coverage. Similarly, an AI model is only as good as your evaluation strategy. The most common strategy takes a subset of your input data and uses it for validation. The challenge with this approach is that it assumes the validation data is representative of the real-world scenarios your model will encounter. However, this is rarely the case. Real-world data contains edge cases, ambiguities, and outliers that are not captured in a simple validation set. As a result, the model might perform well during validation yet fail to deliver the desired results in production.

I found off-the-shelf solutions for evaluation to be very immature. I explored common strategies, including LLM-as-judge, benchmarks such as SimpleQA and MMLU, and human review, all of which have significant limitations. I ended up building a bespoke evaluation system using a combination of human review and LLM-as-judge, but was never particularly satisfied with my approach.
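
For illustration, here is roughly what the LLM-as-judge half of that setup can look like. The judge prompt, the 1-5 scale, and the use of the OpenAI client are assumptions I’m making for the sketch, not a prescription.

```python
# LLM-as-judge sketch: score a generated answer against a reference answer.
# The judge prompt, 1-5 scale, and choice of judge model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Assistant answer: {answer}

Score the assistant answer from 1 (wrong or misleading) to 5 (fully correct and complete).
Reply with the score only."""

def judge(question: str, reference: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

def evaluate(eval_set: list[dict], generate) -> float:
    """eval_set: [{"question": ..., "reference": ...}]; generate: the model under test."""
    scores = [judge(ex["question"], ex["reference"], generate(ex["question"]))
              for ex in eval_set]
    return sum(scores) / len(scores)
```

Even with a sketch like this, you still need human review to catch the cases where the judge itself is wrong, which is a big part of why I was never fully satisfied with the approach.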

Trust and quality are the #1 issue

Ritendra Datta, head of Applied AI at Databricks, gave an excellent talk on lessons from building AI systems. One key point: “The most critical long-term factor for [AI product] success is quality, followed closely by performance and reliability.”

💯

Getting a great demo with an LLM is deceptively easy (I did it in a week!). Getting it to work well in real-world conditions is not. In January 2025, Apple disabled news summaries after discovering multiple hallucinations in the real world. I’m assuming that Apple did a lot of work on quality, and they still ran into issues. Addressing quality requires a continuous loop of experimentation and evaluation.

Your training pipeline is your core IP

I used to think that the AI model was “the secret sauce”. Now I realize that the model changes every week as you iterate; it’s the training pipeline that is your core intellectual property.

By training pipeline, I mean everything you need to create your AI model: from your raw data, to your data transformation workflow, to your fine-tuning process, to evaluation. This is the secret sauce.

As you find problems in your AI, you will hopefully find new strategies for fixing them, whether it’s adding more rounds of preference tuning or adding more data or something else. These are incorporated into your training pipeline. Building a training pipeline that supports rapid iteration is critical to the success of your AI.
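
A training pipeline in this sense doesn’t have to be fancy. Even a thin orchestration layer that chains the stages and versions every run makes iteration dramatically easier. Here’s a sketch of the shape I mean; the stage functions are placeholders standing in for the real steps described above.

```python
# Sketch of a training pipeline as a chain of versioned stages.
# The stage functions are placeholders for the real steps (ingest, transform, tune, evaluate).
import json
import time
from pathlib import Path

def ingest(run_dir: Path) -> Path:
    """Placeholder: export source data into run_dir/raw."""
    return run_dir / "raw"

def transform(raw: Path, run_dir: Path) -> Path:
    """Placeholder: clean, chunk, format, and quality-filter into a training file."""
    return run_dir / "train.jsonl"

def fine_tune(dataset: Path, run_dir: Path) -> Path:
    """Placeholder: run LoRA / preference tuning and save adapter weights."""
    return run_dir / "adapter"

def evaluate(model: Path, run_dir: Path) -> dict:
    """Placeholder: run the evaluation harness and return metrics."""
    return {"judge_score": None}

def run_pipeline(base_dir: str = "runs") -> dict:
    # Every run gets its own timestamped directory so experiments are comparable.
    run_dir = Path(base_dir) / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True)

    raw = ingest(run_dir)
    dataset = transform(raw, run_dir)
    model = fine_tune(dataset, run_dir)
    report = evaluate(model, run_dir)

    # Persist the evaluation report next to the artifacts produced by this run.
    (run_dir / "report.json").write_text(json.dumps(report, indent=2))
    return report
```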

Distributed systems, redux

My AI application initially consisted of three main components: a database (in my case, PostgreSQL + PGVector), a FastAPI application that ran all the business logic, and an LLM. Having spent the last decade in cloud-native software and Kubernetes, it was obvious to me that I was building a distributed system, subject to the eight fallacies of distributed computing. Moreover, even though I was running vLLM, a high performance inference engine, the reality is that an LLM is a high-latency service. In addition, latency increases as input sizes grow — a common scenario since I was leveraging retrieval-augmented generation (RAG). This slowness fundamentally changes the way distributed systems need to be designed.
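
For context, the retrieval half of that RAG setup looks roughly like the sketch below. The table schema, embedding model, and use of psycopg are assumptions for illustration.

```python
# Sketch of the retrieval half of RAG against PostgreSQL + PGVector.
# Table name, column names, and embedding model are illustrative assumptions.
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

conn = psycopg.connect("dbname=assistant")
register_vector(conn)  # lets psycopg send/receive pgvector values directly

def retrieve(question: str, k: int = 5) -> list[str]:
    """Return the k document chunks closest to the question embedding (cosine distance)."""
    query_vec = embedder.encode(question)
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
        (query_vec, k),
    ).fetchall()
    return [r[0] for r in rows]

# The retrieved chunks get stuffed into the LLM prompt, which is the slow, high-latency step.
context = "\n\n".join(retrieve("What's the status of the Q3 launch?"))
```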

In cloud-native applications, distributed systems rely on many remote procedure calls (RPCs) between services. To ensure resilience, developers commonly implement techniques like timeouts, circuit breakers, retries, and backoff strategies. These patterns protect the system from cascading failures when a service experiences temporary issues or downtime.
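
For readers who haven’t built these before, the retry-with-backoff piece of that toolkit looks something like the sketch below (illustrative only; as I explain next, this pattern alone wasn’t enough for my LLM backend).

```python
# Classic RPC resilience pattern: retry a flaky call with exponential backoff and a timeout.
# Illustrative only; this alone is not enough for a high-latency LLM backend.
import random
import time

import httpx

def call_with_retries(url: str, payload: dict, attempts: int = 3, timeout: float = 5.0) -> dict:
    for attempt in range(attempts):
        try:
            resp = httpx.post(url, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except (httpx.TimeoutException, httpx.HTTPStatusError):
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter: ~1s, ~2s, ~4s between attempts.
            time.sleep(2 ** attempt + random.random())
```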

I realized, though, that my application was different, as it relied heavily on a single high-latency service: the LLM. While traditional RPC resilience patterns can mitigate transient network issues or minor delays, they were insufficient to address the inherent latency, variability, and processing overhead introduced by LLMs.

Given the high-latency nature of LLMs, I found that a fully asynchronous architecture, paired with a reliable task management system, is better suited for AI applications than synchronous RPCs. In my past life, task queues were an architectural paradigm that could be deferred until systems reached much larger scale. I found, though, that deploying a task queue to enqueue requests to the LLM was critical: it allowed my application to remain responsive despite high LLM latency, and it provided built-in resilience through retries, traffic smoothing, and dynamic scaling of worker pools. Phil Calcado, a fellow microservices/Kubernetes refugee, wrote an excellent article about his experiences that mirrors some of mine.
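
To make the shape of that architecture concrete, here is a minimal sketch using Celery with Redis as the broker and vLLM’s OpenAI-compatible server as the LLM backend. The queue choice, endpoints, and function names are illustrative assumptions, not necessarily what I (or Phil) ran.

```python
# Sketch of an asynchronous LLM call behind a task queue (Celery + Redis as an example).
# Queue choice, broker, endpoints, and function names are illustrative assumptions.
import httpx
from celery import Celery
from fastapi import FastAPI

queue = Celery("assistant", broker="redis://localhost:6379/0",
               backend="redis://localhost:6379/1")

@queue.task(bind=True, max_retries=3, acks_late=True)
def generate_answer(self, question: str) -> str:
    """Worker-side task: call the (slow) LLM server and retry on failure."""
    try:
        resp = httpx.post("http://vllm:8000/v1/completions",
                          json={"model": "my-model", "prompt": question, "max_tokens": 512},
                          timeout=120)
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
    except httpx.HTTPError as exc:
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)

app = FastAPI()

@app.post("/ask")
def ask(question: str) -> dict:
    """The API stays responsive: enqueue the work and return a task id immediately."""
    task = generate_answer.delay(question)
    return {"task_id": task.id}

@app.get("/answers/{task_id}")
def answer(task_id: str) -> dict:
    result = queue.AsyncResult(task_id)
    return {"ready": result.ready(), "answer": result.result if result.ready() else None}
```

The FastAPI process never blocks on the LLM; clients poll (or subscribe) for the result, and the worker pool absorbs the latency and retries.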

Don’t buy the AI library hype

There’s a plethora of developer libraries that claim to make AI development faster and easier. These libraries introduce new abstractions designed to improve your productivity. I tried many of them, and I found that the abstractions worked well only if I stayed on a narrow path (the quick starts generally worked great!).

I found two major problems:

  1. Missing/incomplete implementations. For example, I tried to use LlamaIndex, a popular framework for building RAG applications. I wanted to retrieve documents from PostgreSQL using Okapi BM25 and tried to use the LlamaIndex BM25 implementation. Unfortunately, at the time it stored the index in memory, with no persistence mechanism, making it useless for a production application.
  2. Poor ecosystem integration. I wanted to use Outlines for structured text generation. Outlines integrates with several popular inference libraries, e.g., vLLM and Llama.cpp. At the time, I was using Unsloth kernels for better performance, but they didn’t work with Outlines.

While I’m sure this will all improve over time, I generally believe that you should be careful about adopting these higher-level abstractions right now (particularly based on a quick start experience). I found it simpler to implement the functionality myself directly on top of well-established, lower-level abstractions such as PyTorch or 🤗 Transformers.
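
As an example of what “directly on top of lower-level abstractions” means in practice, here is a hedged sketch of plain generation with 🤗 Transformers, with retrieved context stuffed into the prompt. The model name and prompt format are placeholders.

```python
# Generating directly with 🤗 Transformers instead of going through a framework abstraction.
# Model name and prompt template are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             device_map="auto")

def answer(question: str, context: str) -> str:
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt",
                                           add_generation_prompt=True).to(model.device)
    with torch.no_grad():
        output = model.generate(inputs, max_new_tokens=512, do_sample=False)
    # Strip the prompt tokens and decode only the newly generated answer.
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
```

There’s more typing than with a framework, but every step is visible, debuggable, and swappable when the abstraction inevitably doesn’t fit.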

Final thoughts

We’re still in the early innings of the AI era, with a lot more to learn ahead of us. I thought the early days of Kubernetes and the cloud-native ecosystem were fast-moving, but AI’s evolution puts cloud-native’s evolutionary speed to shame.

Final note: you can read as much as you like (thanks for reading this far), but if you really want to learn more about AI, just try and build something! Today, AI tools like ChatGPT and Claude make programming more accessible than ever before. You will be surprised at how far you can get with relatively little effort.