Breaking the Black Box: A Software Developer’s Deep Dive into Transformers
I was a software developer for 15 years and have now started learning AI/LLM, but my learning path is a bit different from what a typical software developer follows.
As of May 2025, most software developers see AI/LLM as:
Coding agents
HTTP APIs
So, if they want to use AI/LLM in their traditional software development, they typically try to find the best model and the best way to utilize a coding agent. When they need to embed AI/LLM in their products, they tend to treat it as just another third-party API dependency—i.e., a black box.
However, I see AI/LLM as an emerging computing model that is poised to dominate new ideas for at least the next few years. In this post, I’d like to explain my understanding of AI/LLM and recommend several learning resources that I have used.
Why start with the foundations?
We, as software engineers, learn about CPUs that have been dominant in computing for decades. Although most software developers today won’t build a CPU themselves, learning how a CPU works is critical because it’s the system we operate every day. That’s why many computer-science programs include a course to build a CPU from scratch!
In addition to the CPU, the OS is also important because the majority of software runs on an OS. So, software engineers learn about one or more modern OSs, including processes, memory management, system calls, etc. (One of my favorite UNIX commands is strace.)
Learning how compilers and interpreters work also has huge benefits for understanding how your programs run, because they ultimately become 0/1 binaries.
I learned these topics while working as an SRE, DBA, and SWE. I also built a CPU (8086) emulator and tried to run a MINIX binary to see how the CPU and OS actually work, and I built a Forth compiler and interpreter with LLVM to understand virtual machines. Of course, I’ve never built a CPU, OS, or programming language/VM professionally, but knowing these low-level details helps me see what’s happening under the hood. All the software engineers I respect are experts in these areas, so I worked to catch up, and this knowledge has been a foundational skill throughout my years as a senior engineer.
Enter the Transformer
When I decided to shift my career toward AI development, the first thing I wanted to understand wasn’t how to use coding agents or build AI agents. Instead, I knew it was more important to learn the foundational technology of AI so I could understand what’s actually going on.
One friend mentioned the word “Transformer.” I’d never heard it at the time, but I now know the Transformer is the dominant technology in the AI world today (though that may change tomorrow).
Luckily, I had a little experience with machine learning, so I recalled some basics of neural networks and backpropagation from college 20 years ago. That tiny advantage became a bridge I want to share with software engineers learning AI/LLM today:
A Transformer is a special neural-network architecture that’s incredibly powerful—not only for LLMs but also for image and voice generation, among many other tasks. I originally thought an LLM was just a large stack of perceptron layers (sometimes called an MLP—Multi-Layer Perceptron). This isn’t true. The Transformer network is a purposefully designed, complex architecture. Although it contains MLP layers, its key component—the attention mechanism—is essentially a sophisticated matrix multiplication, not an MLP.
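To make “sophisticated matrix multiplication” concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core of the attention mechanism. (The variable names and sizes are my own illustration, not from any particular codebase, and real implementations add masking, multiple heads, and batching.)

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # scores[i, j]: similarity of query i to key j, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over each row turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output row is a weighted average of the value vectors
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional head
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one mixed vector per token
```

Note that there is nothing like an MLP’s learned nonlinearity here: apart from the softmax, it is matrix products between the tokens’ own projections, which is exactly what lets every token “look at” every other token.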
At first, I wasn’t sure what the Transformer architecture was because most online courses assumed students already knew it. Transformers are that ubiquitous today, and I felt embarrassed for not knowing about this 2017 invention sooner.
Learning resources that clicked
To understand the Transformer, I watched many online video courses. For example, here is a very passionate explanation from one of the authors of the paper that started today’s AI era, “Attention Is All You Need”:
That talk is hard to follow without an NLP background, however. Before watching it, I had gone through the Stanford CS224N NLP course. I recommend the entire series, but these two lectures gave me “aha” moments:
I also recently found a video that visualizes how an LLM works in just 7 minutes. I highly recommend starting there:
After that, my instinct was: “I want to build one.” There are many prebuilt libraries and models for running Transformers locally in PyTorch, but I wanted to write the code myself. For that, Stanford CS336 was perfect. It provided the logical architecture of a Transformer plus PyTorch implementations of every component, including the most popular improvements since 2017.
The best course for catching up on LLMs: Stanford CS336 - Language Modeling from Scratch
I’m sorry I haven’t had a chance to continue the previous episodes, but recently I switched my career entirely to AI engineering: Fifteen Years of Dev, Deleted — Hello AI
A great companion blog post is “Attention Wasn’t All We Needed,” which lists key techniques that evolved the Transformer—CS336 covers almost all of them.
CS336 also covers hardware—specifically GPUs. Just like CPUs, GPUs are at the core of AI/LLM, so understanding GPU architecture is essential. (Flash Attention, for example, is a hardware-aware optimization of attention calculation.)
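One way to see why hardware-aware tricks like Flash Attention matter: naive attention materializes a seq_len × seq_len score matrix, so its memory footprint grows quadratically with context length. A back-of-the-envelope sketch (float32, a single head, ignoring batch size; the numbers are illustrative, not a claim about any specific GPU):

```python
def attention_matrix_bytes(seq_len, bytes_per_float=4):
    """Memory to materialize one full seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_float

for n in (1_024, 32_768, 1_048_576):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>9} tokens -> {gib:,.2f} GiB per head")
```

At 32K tokens a single head’s score matrix already takes 4 GiB; Flash Attention avoids ever writing that full matrix to GPU memory by computing attention in tiles that fit in fast on-chip SRAM.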
Putting it all together
I’ll probably never build foundational LLMs myself; that work happens in huge GPU farms and belongs to LLM researchers. However, just as CPU/OS/compiler knowledge helps traditional software development, learning about GPUs, Transformers, and post-training lets me understand AI/LLM discussions.
When you build AI agents, this understanding is critical to using LLMs effectively. For instance, you won’t fully grasp why context size is limited—or how some models overcome it—until you study Transformer internals. I won’t claim to fully understand everything yet, but I can now at least imagine and research these issues.
I’ve skipped post-training here, but I believe fine-tuning is becoming a key to AI applications. It doesn’t need the massive data and hardware dictated by scaling laws, but it does require techniques beyond the Transformer itself, such as reinforcement learning (which I’m currently studying). Still, understanding GPUs and Transformers is foundational.
I hope my investment in these foundations pays off. As mentioned, I also plan to build learning content for software developers who lack machine-learning experience. Let me know if you’re interested!