The best course to catch up on LLMs - Stanford CS336: Language Modeling from Scratch
I’m sorry I haven’t had a chance to continue the previous episodes, but I recently switched my career entirely to AI engineering: Fifteen Years of Dev, Deleted — Hello AI
So, in this magazine, I’ll keep posting about AI engineering, too. Let’s start with how I’m learning the foundations of AI.
When I started changing my career to AI engineering, the most mysterious part for me was “What is an LLM?” I knew it’s a neural network trained on internet text that predicts the next token, but I didn’t know how it works under the hood. At that point, I knew neither “Transformer” nor “Attention”, as my NLP knowledge hadn’t been updated since Word2Vec.
Fortunately, I encountered the best online course, which is actually running this term - Stanford CS336 | Language Modeling from Scratch. I jumped into it right away and have been watching the lecture videos and tackling the assignments at my own pace. Their YouTube playlist is here.
This course explains LLMs in detail and provides very handy assignments. Let me show you what I've learned from Assignment 1.
Assignment 1 guides you through building a modern Transformer architecture from scratch, which means you don’t use PyTorch’s built-in modules but reimplement them using only primitive operations. While the built-in modules are the better choice in most cases, it’s hard for me to understand the internals without writing the logic myself.
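To make that concrete, here is roughly what a from-scratch linear layer looks like: just a raw weight tensor and a matrix multiplication. This is a simplified sketch of the idea, not the exact assignment code; the initialization and bias-free design here are my own choices and may differ from the CS336 spec.

```python
import math
import torch
from torch import nn

class Linear(nn.Module):
    """A linear layer rebuilt from primitives: a raw Parameter plus a matmul (sketch)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # No nn.Linear: the weight is a plain Parameter we manage ourselves.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        std = math.sqrt(2.0 / (in_features + out_features))
        # Truncated normal keeps the initial values within a few standard deviations.
        nn.init.trunc_normal_(self.weight, std=std, a=-3 * std, b=3 * std)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (..., in_features) @ (in_features, out_features) -> (..., out_features)
        return x @ self.weight.T
```

Every block in the list below is built in this spirit: raw Parameters plus tensor operations, with the shapes spelled out by hand.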
I usually take this approach when I learn something new: I built a B+Tree to understand databases, Multi-Paxos to understand consensus, a Forth interpreter to understand compilers and virtual machines, and an 8086 emulator to understand CPUs. This time, I built a Transformer to understand language models.
Here is the list of building blocks I implemented (a rough sketch of a couple of them follows the list). All of them have unit tests, so they’re easy to verify:
BPE (Byte-Pair Encoding) Tokenizer
Linear layer
Embedding
RMS Norm
SwiGLU
RoPE (Rotary Positional Embeddings)
Softmax
Scaled Dot-Product Attention
Causal Multi-Head Self-Attention
Transformer (decoder-only)
Cross-entropy loss
AdamW
Learning rate scheduler with warmup and cosine decay
Gradient clipping
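To give a flavor of these blocks, here are two of them roughly as I implemented them: RMS Norm and scaled dot-product attention with a causal mask. These are simplified sketches; the real code handles dtypes, devices, and batching more carefully.

```python
import math
import torch
from torch import nn

class RMSNorm(nn.Module):
    """RMS normalization: rescale each vector by its root-mean-square, then apply a learned gain."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model)
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.gain


def scaled_dot_product_attention(q, k, v, causal: bool = True):
    """softmax(QK^T / sqrt(d)) V with an optional causal mask.

    q, k, v: (..., seq_len, d_head)
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)            # (..., seq_len, seq_len)
    if causal:
        seq_len = q.shape[-2]
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
        )
        scores = scores.masked_fill(future, float("-inf"))     # block attention to future tokens
    # Numerically stable softmax: subtract the row max before exponentiating.
    scores = scores - scores.max(dim=-1, keepdim=True).values
    weights = torch.exp(scores)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ v
```

The max-subtraction in the softmax is exactly the kind of numerical-stability detail I come back to later in this post.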
Then, I was able to train the model on an M1 MacBook Pro with the TinyStories training data (lr is the learning rate and loss is the training loss):
Here are a few token generations at several steps during training (“Once upon a time,” is the prompt; a rough sketch of my sampling loop follows these samples):
Time: 2025-05-17 22:45:33.400874 Step: 1, Loss: 9.251720428466797
Once upon a time, drives aisle shop heelsluffy fact brave poking suffer achieved Rich Ally Trust jugom guide greetedip suggest terrified wishesches N doctorunnies Gus magnets tipped animalsould darling green Joey dishwasherork later roadiens guyslex stays driveway ruffaredarrasschenulptureasis sadlyael…
Step 1 is completely random.
Time: 2025-05-17 22:52:08.402544 Step: 500, Loss: 2.468946933746338
Once upon a time, there was a big tree. The tree was very happy. The tree was very good at the tree.
One day, a little bird came to the tree. The bird wanted to find the bird, so it looked very pretty. The bird saw…
At step 500, it looks like English, but the content is nonsense.
Time: 2025-05-17 22:58:22.731600 Step: 1000, Loss: 2.165599822998047
Once upon a time, there was a little boy named Tim. Tim had a toy box. He liked to play with it every day. One day, Tim found a small toy under his bed. It was a small, green toy. Tim was very happy.
Tim…
At step 1000, it’s already pretty good. Nice progress.
Time: 2025-05-17 23:10:47.535277 Step: 2000, Loss: 1.9070618152618408
Once upon a time, there was a little girl named Lily. She had a pretty doll named Rose. Lily loved to play with Rose. They would run and jump all day long. Lily was very happy.
One day, Lily wanted to show Lily her doll to her…
At step 2000, the first paragraph looks perfect. The second paragraph looks a bit confusing.
Time: 2025-05-17 23:47:30.009432 Step: 5000, Loss: 1.763701319694519
Once upon a time, there was a little boy named Tim. Tim had a big dream. In his dream, he was a big, round ball. He loved to play with his ball every day.
One day, Tim's friend, Sue, came over to play…
Finally, it reaches a training loss of 1.7 at step 5000, though the output quality looks about the same as at step 2000.
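By the way, generating these samples is nothing special: it’s just repeated next-token prediction. My sampling loop looks roughly like this (a sketch; model and tokenizer are my own components from the list above, and the temperature and stopping condition here are illustrative, not exactly what I used):

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 100, temperature: float = 0.8):
    """Autoregressive sampling: predict one token, append it, repeat (sketch)."""
    token_ids = torch.tensor([tokenizer.encode(prompt)])     # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(token_ids)                            # (1, seq_len, vocab_size)
        next_logits = logits[:, -1, :] / temperature         # only the last position matters
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)    # sample instead of taking argmax
        token_ids = torch.cat([token_ids, next_id], dim=-1)
    return tokenizer.decode(token_ids[0].tolist())
```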
The learning curve wasn’t good initially, so I had to investigate it from multiple angles. This is the very first learning curve and its inference outputs, lol:
After stabilizing that crazy curve, I still struggled with a higher loss than expected. It came down to my own bugs, e.g. clipping gradients before the backward pass and forgetting to truncate and normalize the initial parameters of the attention matrices:
This very hands-on experience gave me a deep understanding of how a Transformer works and how to train it; for example, through the bugs above I really understand how important numerical stability is…
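To make the gradient-clipping bug concrete: gradients only exist after the backward pass, so clipping before it silently does nothing. The corrected order in a training step looks roughly like this (a simplified sketch; model, optimizer, and the batch unpacking are placeholders, and I’m using PyTorch’s built-in cross-entropy and clipping here for brevity instead of the from-scratch versions from the list above):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch, max_grad_norm: float = 1.0):
    """One training step in the corrected order: forward -> loss -> backward -> clip -> step."""
    inputs, targets = batch                                  # token ids, each (batch, seq_len)
    logits = model(inputs)                                   # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()                                          # gradients exist only after this line
    # My bug: I clipped *before* backward, so the clipping call had nothing to clip.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```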
Because I have built and trained a Transformer once, I’m starting to understand how each component works. Since the Transformer is the dominant architecture today, I can now more or less follow novel proposals, which was impossible for me one month ago.
I’ll keep learning from CS336, and I’m especially interested in the upcoming Alignment sections, because post-training alignment is my next mystery.
Remember, I didn’t know anything about Transformers one month ago. So you could build your own model in a month, even if you don’t know the Transformer today! Feel free to ask me if you have trouble running CS336 Assignment 1.