Large Language Models (LLMs) such as ChatGPT, Gemini, and Claude have been around for a while now, and I believe all of us have already used at least one of them. As of this writing, ChatGPT is already powered by the fourth generation of the GPT family, named GPT-4. But do you know what GPT actually is, and what the underlying neural network architecture looks like? In this article we are going to talk about GPT models, especially GPT-1, GPT-2, and GPT-3. I will also demonstrate how to code them from scratch with PyTorch so that you can get a better understanding of the structure of these models.
A Brief History of GPT
Before we get into GPT, we first need to understand the original Transformer architecture. Generally speaking, a Transformer consists of two main components: the Encoder and the Decoder. The former is responsible for understanding the input sequence, whereas the latter generates another sequence based on that input. For example, in a question answering task the decoder produces an answer to the input question, while in a machine translation task it generates the translation of the input sentence.
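To make the encoder-decoder split a bit more concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer module. This is not the from-scratch implementation we will build later in this article; the layer counts, dimensions, and random tensors below are arbitrary placeholders just to show how the two components fit together.

```python
import torch
import torch.nn as nn

# Arbitrary placeholder sizes for illustration only.
EMBED_DIM  = 512   # model (embedding) dimension
SRC_LEN    = 10    # length of the input (source) sequence
TGT_LEN    = 7     # length of the output (target) sequence
BATCH_SIZE = 1

# nn.Transformer bundles an encoder stack and a decoder stack together.
transformer = nn.Transformer(
    d_model=EMBED_DIM,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    batch_first=True,
)

# Pretend these are already-embedded token sequences.
src = torch.randn(BATCH_SIZE, SRC_LEN, EMBED_DIM)  # e.g. the question or source sentence
tgt = torch.randn(BATCH_SIZE, TGT_LEN, EMBED_DIM)  # e.g. the answer or translation generated so far

# The encoder reads src; the decoder attends to the encoder output while producing tgt.
out = transformer(src, tgt)
print(out.shape)  # torch.Size([1, 7, 512]) — same shape as the target sequence
```

Notice that the output always follows the shape of the target sequence: the encoder only summarizes the input, while the decoder is the part that actually produces the new sequence.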