AIMaks

Introduction to Large Language Models

30 min read · Video · LLM Foundations
Lesson 1 of 38 · Large Language Models & GenAI

Large Language Models (LLMs) are neural networks trained on massive text corpora that can understand, generate, and reason about human language. They represent the most significant leap in artificial intelligence since the invention of the neural network itself. In this lesson we will explore what LLMs are, how they evolved, and why they matter for every software engineer today.

1. What Are Large Language Models?

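At their core, LLMs are autoregressive language models: they predict the next token in a sequence. Given a prompt like "The capital of France is", the model assigns a probability to every token in its vocabulary, and a decoding strategy picks the continuation. Greedy decoding takes the most likely token (here, "Paris"), while sampling draws from the distribution to produce more varied text.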
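
The next-token step described above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the four-token vocabulary and the scores in `toy_logits` are made-up assumptions, chosen so that "Paris" wins. It shows how raw scores (logits) become probabilities via softmax, and how greedy decoding picks the top token.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the prompt "The capital of France is".
# A real LLM would produce one score per token in a vocabulary of
# tens of thousands of tokens or more.
toy_logits = {"Paris": 9.1, "Lyon": 5.3, "London": 4.8, "the": 3.2}

probs = softmax(toy_logits)
next_token = max(probs, key=probs.get)  # greedy decoding: take the argmax
print(next_token, round(probs[next_token], 3))
```

Generating a full response repeats this step: the chosen token is appended to the prompt and the model is asked for the next distribution.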

What makes them large is the scale at which this simple idea is applied: billions of parameters, trillions of training tokens, and thousands of GPUs running for weeks or months. This scale is associated with emergent abilities: capabilities that are weak or absent in smaller models, such as multi-step reasoning, code generation, and following complex instructions.

2. The Evolution: From N-grams to Transformers

Language modeling has a long history. Each generation of techniques brought dramatic improvements in quality and capability.

| Era | Technique | Key Idea | Limitation |
|---|---|---|---|
| 1990s | N-gram models | Count word co-occurrences in a fixed window | No long-range context; exponential memory |
| 2003 | Neural LMs (Bengio) | Learn word embeddings with a feed-forward net | Fixed context window |
| 2013 | Word2Vec / GloVe | Efficient word embeddings at scale | Static embeddings: one vector per word |
| 2015 | RNN / LSTM / GRU | Sequential processing with hidden state | Slow training; vanishing gradients |
| 2017 | Transformer | Self-attention over all positions in parallel | Quadratic memory in sequence length |
| 2018–now | GPT / BERT / LLMs | Scale transformers to billions of parameters | Cost, alignment, hallucination |

"Attention Is All You Need," the 2017 paper by Vaswani et al., introduced the Transformer architecture and changed the trajectory of AI research forever.
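
To make "self-attention over all positions" and its quadratic cost concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention. It is deliberately simplified: it assumes a single head and uses the inputs directly as queries, keys, and values, whereas a real Transformer applies learned projection matrices first. The toy embeddings are illustrative.

```python
import math

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x).

    x is a list of n token vectors, each of dimension d.
    Returns (output vectors, n x n attention weights).
    """
    n, d = len(x), len(x[0])
    # Every position scores every other position: an n x n matrix,
    # which is where the quadratic memory cost comes from.
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        for q in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all value vectors.
    out = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return out, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d embeddings
out, weights = self_attention(tokens)
print(len(weights), "x", len(weights[0]), "attention matrix")
```

Doubling the sequence length quadruples the number of entries in `weights`, while each row can still be computed independently, which is what makes the computation parallel.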

3. The Modern LLM Timeline

The pace of progress since the Transformer has been extraordinary. Here are the key milestones:

| Year | Model | Parameters | Significance |
|---|---|---|---|
| 2018 | GPT-1 | 117M | First large-scale decoder-only LM |
| 2018 | BERT | 340M | Bidirectional pre-training; dominated NLP benchmarks |
| 2019 | GPT-2 | 1.5B | Initially called "too dangerous to release"; coherent long-form text |
| 2020 | GPT-3 | 175B | In-context learning; few-shot prompting |
| 2022 | ChatGPT | ~175B (unconfirmed) | RLHF-aligned GPT-3.5; 100M users in 2 months |
| 2023 | GPT-4 | ~1.8T (MoE, unconfirmed) | Multimodal; near-expert performance on exams |
| 2023 | Llama 2 | 7–70B | Open-weights revolution by Meta |
| 2024 | Claude 3 | Not disclosed | 200K context; strong reasoning and safety |
| 2024 | Gemma 2 | 2–27B | Google's open-weights family, efficient inference |
| 2025 | Llama 4 | Scout / Maverick | 10M context, MoE, open-weights from Meta |
| 2025 | Gemma 4 | 1–27B | State-of-the-art open model; natively multimodal; our course model |

4. Major LLM Families Compared

The LLM landscape is rich and varied. Here is how the major families compare as of 2025:

| Family | Developer | Open Weights? | Sizes | Strengths |
|---|---|---|---|---|
| GPT-4o / o3 | OpenAI | No | Unknown (MoE) | Multimodal, reasoning, massive ecosystem |
| Claude 4 | Anthropic | No | Haiku / Sonnet / Opus | Long context (200K), safety, coding, agentic |
| Gemini 2.5 | Google | No | Flash / Pro | 1M context, multimodal, deep thinking |
| Gemma 4 | Google | Yes | 1B / 4B / 12B / 27B | Open-weights, multimodal, efficient, free API |
| Llama 4 | Meta | Yes | Scout / Maverick | 10M context (Scout), MoE, open ecosystem |
| Mistral / Mixtral | Mistral AI | Yes (some) | 7B / 8x7B / Large | Efficient MoE, strong multilingual |
| Qwen 3 | Alibaba | Yes | 0.6B – 235B | MoE, hybrid thinking, multilingual |
| DeepSeek-R1 | DeepSeek | Yes | 671B (MoE) | Reasoning, math, code, cost-efficient |

5. Key Properties of Modern LLMs

Modern LLMs exhibit several remarkable properties that earlier language models lacked:

  • In-Context Learning: LLMs can pick up new tasks from just a few examples placed in the prompt, without any weight updates. With a handful of examples this is called "few-shot" prompting; with none, "zero-shot."
  • Instruction Following: After alignment training (supervised fine-tuning, SFT, followed by reinforcement learning from human feedback, RLHF), models can follow complex, multi-step instructions expressed in natural language.
  • Emergent Abilities: Capabilities like chain-of-thought reasoning, translation between unseen language pairs, and multi-digit arithmetic have been reported to appear only above certain scale thresholds.
  • Tool Use: LLMs can learn to call external tools (APIs, calculators, search engines) by generating structured function calls that the surrounding application executes.
  • Multimodality: Recent models like Gemma 4 and GPT-4o can process images, audio, and video alongside text.
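
Two of these properties are easy to illustrate without calling a model at all. The sketch below shows what a few-shot prompt looks like (in-context learning) and how an application might execute a structured function call produced by a model (tool use). Everything here is an illustrative assumption: the prompt layout, the `TOOLS` registry, and the JSON shape with `name` and `arguments` keys are hypothetical and not tied to any specific provider's format.

```python
import json

def build_few_shot_prompt(examples, query):
    """In-context learning: the 'training data' lives in the prompt itself."""
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    blocks.append(f"Review: {query}\nSentiment:")  # the model completes this line
    return "\n".join(blocks)

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

def dispatch_tool_call(model_output):
    """Parse a JSON function call emitted by a model and run the matching tool."""
    call = json.loads(model_output)
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**args)

prompt = build_few_shot_prompt(
    [("Loved it, would buy again.", "positive"),
     ("Broke after a day.", "negative")],
    "Exactly what I needed.",
)
print(prompt)

# Pretend the model replied with this structured call instead of free text.
result = dispatch_tool_call('{"name": "multiply", "arguments": {"a": 6, "b": 7}}')
print(result)
```

In a real application the tool result would be sent back to the model as a further message so it can compose the final answer.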

6. Scale Matters: Parameters, Data, and Compute

The Scaling Laws (Kaplan et al., 2020; Hoffmann et al., 2022) showed that LLM performance improves predictably with three factors:

  1. Parameters (N) — the number of trainable weights in the model. More parameters = more capacity to store patterns.
  2. Data (D) — the number of tokens seen during training. The Chinchilla paper showed that data and parameters should scale together.
  3. Compute (C) — total FLOPs used for training. Roughly C ≈ 6 × N × D for transformer models.

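$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

This is the Chinchilla scaling law, where $L$ is the loss, $N$ is parameter count, $D$ is dataset size, $E$, $A$, and $B$ are fitted constants, and $\alpha$ and $\beta$ are exponents (~0.34 and ~0.28). The key insight: for a given compute budget, there is an optimal balance between model size and training data.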
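
A short worked example may help make these relationships tangible. The sketch below applies the approximation C ≈ 6 × N × D to GPT-3's reported figures (175B parameters, 300B tokens) and evaluates the Chinchilla parametric loss, written here as L(N, D) = E + A/N^α + B/D^β. The default constants are the fitted values as commonly quoted from Hoffmann et al. (2022); treat them, and the 70B / 750B comparison point, as assumptions for illustration rather than exact predictions.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute via C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta.

    Default constants are assumed from the Chinchilla paper's fit.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

# GPT-3-style allocation: a large model trained on relatively few tokens.
big_model = (175e9, 300e9)
# Same compute budget spent on a smaller model and more data.
small_model = (70e9, 750e9)

print(f"{training_flops(*big_model):.2e} FLOPs")    # 3.15e+23 FLOPs
print(f"{training_flops(*small_model):.2e} FLOPs")  # 3.15e+23 FLOPs
print(round(chinchilla_loss(*big_model), 3))
print(round(chinchilla_loss(*small_model), 3))
```

Under these assumed constants, the smaller model trained on more tokens reaches a lower predicted loss for the same compute, which is the balance the Chinchilla result points to.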

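| Model | Parameters | Training Tokens | Training Cost (est.) |
|---|---|---|---|
| GPT-3 | 175B | 300B | ~$4.6M |
| Llama 2 70B | 70B | 2T | ~$2M |
| Gemma 4 27B | 27B | ~14T | n/a |
| Llama 4 Maverick | 400B (17B active) | ~22T | n/a |
| GPT-4 | ~1.8T (MoE) | ~13T | ~$100M+ |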

7. The Open-Source LLM Ecosystem

One of the most exciting developments in AI is the thriving open-source ecosystem that has grown around LLMs. Key platforms and tools include:

  • Hugging Face — the "GitHub of ML." Hosts thousands of open models, datasets, and spaces. The transformers library is the de facto standard for working with LLMs locally.
  • Google AI Studio — free API access to Gemma 4 and Gemini models. This is what we will use throughout this course.
  • Ollama — run open models locally with a single command. Great for development and prototyping.
  • vLLM — high-throughput inference engine for serving LLMs in production with PagedAttention.
  • LangChain / LlamaIndex — frameworks for building LLM-powered applications (RAG, agents, chains).

bash
# Quick taste: running Gemma 4 locally with Ollama
# (we'll use the Google AI Studio API in this course instead)

# Install:
curl -fsSL https://ollama.com/install.sh | sh

# Then run a one-off prompt:
ollama run gemma4:12b "What is a transformer?"

8. Real-World Applications

LLMs are already transforming every industry. Here are the most impactful application categories:

ApplicationDescriptionExample
Code GenerationWrite, debug, and explain codeGitHub Copilot, Cursor, Claude Code
Conversational AICustomer support, tutoring, assistantsChatGPT, Claude, Gemini
SummarizationCondense long documents, meetings, papersLegal document review, meeting notes
TranslationHigh-quality machine translationGoogle Translate (LLM-backed), DeepL
Search & RAGAnswer questions grounded in documentsPerplexity, enterprise knowledge bases
Reasoning & AnalysisMulti-step problem solving, data analysisResearch assistants, financial analysis
AI AgentsAutonomous systems that use tools and take actionsComputer use agents, research agents
Content CreationWriting, editing, brainstormingMarketing copy, blog posts, reports

9. Preview: Your First Gemma 4 API Call

To give you a taste of what is coming in the next lessons, here is the simplest possible call to the Gemma 4 model via Google AI Studio. We will set this up properly in Lesson 4.

python
import os
from google import genai

# Create a client using your API key (set as environment variable)
client = genai.Client(api_key=os.environ["GOOGLE_AI_STUDIO_API_KEY"])

# Generate a response from Gemma 4
response = client.models.generate_content(
    model="gemma-4-12b-it",
    contents="Explain what a Large Language Model is in three sentences."
)

print(response.text)

Example output (the exact wording will vary from run to run):

A Large Language Model (LLM) is a type of artificial intelligence model
trained on vast amounts of text data to understand and generate human
language. These models use deep neural networks, typically based on the
Transformer architecture, with billions of parameters that capture
statistical patterns in language. LLMs can perform a wide range of tasks
including text generation, translation, summarization, question answering,
and code writing, often with remarkable fluency and coherence.