from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter
from scipy.stats import entropy
# Rich imports for better formatting
from rich.console import Console
from rich.table import Table
from rich.text import Text
from rich import print as rprint
# Initialize rich console
console = Console()
Temperature Scaling in Softmax: The Mathematics
Temperature scaling modifies the softmax function to control the “sharpness” of the probability distribution. The standard softmax function is:
\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\]
With temperature scaling, we introduce a temperature parameter \(T\):
\[\text{softmax}_T(x_i) = \frac{e^{x_i/T}}{\sum_{j=1}^{n} e^{x_j/T}}\]
Effects of Temperature:
- T = 1: Standard softmax (no scaling)
- T > 1: Higher temperature → More uniform distribution → More randomness
- T < 1: Lower temperature → Sharper distribution → More deterministic
- T → 0: Distribution becomes one-hot (argmax)
- T → ∞: Distribution becomes uniform
Entropy and Diversity:
The entropy of a probability distribution measures its randomness: \[H(p) = -\sum_{i=1}^{n} p_i \log p_i\]
Higher temperature typically leads to higher entropy and more diverse outputs.
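To connect these formulas to numbers, here is a minimal self-contained sketch (the three logit values are made up purely for illustration) that applies the temperature-scaled softmax and reports the entropy at a few temperatures:
import numpy as np
from scipy.stats import entropy

logits = np.array([2.0, 1.0, 0.2])           # toy logits, chosen only for illustration

def softmax_t(x, T):
    # temperature-scaled softmax: exp(x/T), normalized to sum to 1
    z = np.exp(x / T - np.max(x / T))        # subtract the max for numerical stability
    return z / z.sum()

for T in [0.1, 1.0, 10.0]:
    p = softmax_t(logits, T)
    print(f"T={T:>4}: probs={np.round(p, 3)}, entropy={entropy(p):.3f} nats")
At T=0.1 the mass collapses onto the largest logit (entropy near 0), while at T=10 the probabilities approach uniform and the entropy approaches ln 3 ≈ 1.10 nats.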
Imports and Setup
The imports at the top of this post cover everything we need: transformers and torch for the model, numpy, pandas, and matplotlib for analysis and plotting, scipy's entropy for measuring randomness, and rich for formatted console output.
Visual Demonstration
Let’s create a visual diagram showing how temperature affects probability distributions:
# Set up prettier matplotlib styling
plt.rcParams.update({
    'font.size': 12,
    'axes.labelsize': 14,
    'axes.titlesize': 16,
    'xtick.labelsize': 12,
    'ytick.labelsize': 12,
    'legend.fontsize': 12,
    'figure.titlesize': 18
})

# Create sample logits and temperature ranges
logits = np.array([2.0, 1.0, 0.5, 0.2, 0.1])
temperatures = [0.1, 0.5, 1.0, 2.0, 5.0]

# Create subplots with better styling
fig, axes = plt.subplots(1, len(temperatures), figsize=(20, 5))
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FECA57']

for i, temp in enumerate(temperatures):
    # Calculate probabilities for this temperature
    scaled_logits = logits / temp
    probs = np.exp(scaled_logits) / np.sum(np.exp(scaled_logits))

    # Create DataFrame for this temperature
    df_temp = pd.DataFrame({
        'Token': [f'Token_{j}' for j in range(len(logits))],
        'Probability': probs
    })

    # Plot distribution with better styling
    bars = axes[i].bar(df_temp['Token'], df_temp['Probability'],
                       color=colors[i], alpha=0.8, edgecolor='white', linewidth=1.5)

    # Bold T=1.0
    title_weight = 'bold' if temp == 1.0 else 'normal'
    title_size = 16 if temp == 1.0 else 14
    axes[i].set_title(f'Temperature = {temp}', fontweight=title_weight, fontsize=title_size)

    axes[i].set_ylabel('Probability', fontweight='bold')
    axes[i].set_ylim(0, 1.1)
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].grid(True, alpha=0.3, linestyle='--')
    axes[i].set_facecolor('#F8F9FA')

    # Add probability values on top of bars
    for bar, prob in zip(bars, probs):
        axes[i].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
                     f'{prob:.2f}', ha='center', va='bottom', fontweight='bold', fontsize=10)

plt.suptitle('Temperature Scaling Effects: Distribution Shape Changes',
             fontsize=20, fontweight='bold', y=1.02)

plt.tight_layout()
plt.show()
The diagram shows how temperature reshapes the probability distribution: low temperatures concentrate mass on the highest-logit token, while high temperatures flatten the distribution toward uniform.
Model Setup
Load a language model to explore temperature effects in practice:
= "distilgpt2"
model_name = AutoTokenizer.from_pretrained(model_name)
tokenizer = AutoModelForCausalLM.from_pretrained(model_name)
model
# Set padding token to avoid warnings
= tokenizer.eos_token tokenizer.pad_token
DistilGPT2 loaded with proper tokenizer configuration.
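Because the later sections sample stochastically, it can also help to fix a random seed so reruns are comparable. This is an optional sketch, not part of the original setup, and the seed value is arbitrary:
model.eval()             # inference only; from_pretrained typically leaves the model in eval mode already
torch.manual_seed(42)    # arbitrary seed so that sampled outputs are repeatable across runs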
Example Prompt
Let’s start with a simple prompt to analyze:
= "In the future, AI will"
prompt = tokenizer(prompt, return_tensors="pt").input_ids
input_ids print("Input IDs:", input_ids)
print("Vocabulary size:", model.config.vocab_size)
Input IDs: tensor([[ 818, 262, 2003, 11, 9552, 481]])
Vocabulary size: 50257
The model converts text to token IDs and works with a vocabulary of 50,257 tokens.
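To see which subword pieces those IDs stand for, you can map them back to tokens. A quick sketch (GPT-2's byte-level BPE marks a leading space with 'Ġ'; the commented output is the expected, not guaranteed, result):
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
# expected: ['In', 'Ġthe', 'Ġfuture', ',', 'ĠAI', 'Ġwill']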
Text Generation with Low Temperature
Generate text with very low temperature to see deterministic behavior:
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.01,
    max_new_tokens=20,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
    attention_mask=torch.ones_like(input_ids)
)

decoded = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated text:", decoded)
Generated text: In the future, AI will be able to do things like make people smarter, more intelligent, more intelligent, more intelligent, more
Low temperature produces repetitive, deterministic text.
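For contrast, the same call at a higher temperature should wander much more. This sketch reuses the setup above with temperature=1.5 (an arbitrary choice), and the output will differ on every run:
output_hot = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.5,                         # higher temperature -> more randomness
    max_new_tokens=20,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
    attention_mask=torch.ones_like(input_ids)
)
print("Generated text:", tokenizer.decode(output_hot[0], skip_special_tokens=True))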
Logits Analysis
Let’s examine the raw logits and probabilities for next token prediction:
# Get the logits for prediction
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits

# Show logits and probabilities for top 10 tokens
top_logits, top_indices = torch.topk(logits[0, -1], k=10)
top_probs = torch.softmax(top_logits, dim=0)
top_tokens = [tokenizer.decode([idx]) for idx in top_indices]

# Create a clean table
table = Table(title="Top 10 Token Predictions")
table.add_column("Rank", style="dim")
table.add_column("Token", style="cyan")
table.add_column("Logit", style="magenta")
table.add_column("Probability", style="green")

for i, (token, logit, prob) in enumerate(zip(top_tokens, top_logits, top_probs)):
    table.add_row(
        str(i + 1),
        repr(token),
        f"{logit.item():.4f}",
        f"{prob.item():.4f}"
    )

console.print(table)
           Top 10 Token Predictions
┏━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Rank ┃ Token       ┃ Logit    ┃ Probability ┃
┡━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ 1    │ ' be'       │ -63.5934 │ 0.5269      │
│ 2    │ ' have'     │ -64.7306 │ 0.1690      │
│ 3    │ ' need'     │ -65.5254 │ 0.0763      │
│ 4    │ ' become'   │ -65.8555 │ 0.0549      │
│ 5    │ ' not'      │ -66.0847 │ 0.0436      │
│ 6    │ ' also'     │ -66.4207 │ 0.0312      │
│ 7    │ ' take'     │ -66.5617 │ 0.0271      │
│ 8    │ ' continue' │ -66.5808 │ 0.0266      │
│ 9    │ ' make'     │ -66.6369 │ 0.0251      │
│ 10   │ ' only'     │ -66.8984 │ 0.0193      │
└──────┴─────────────┴──────────┴─────────────┘
The model shows clear preference for certain tokens, with “be” having the highest probability.
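One caveat: these probabilities come from a softmax over only the 10 retained logits, so they are renormalized; under the full 50,257-token softmax the same tokens get slightly smaller probabilities. A short sketch of the comparison, reusing logits and top_indices from above:
full_probs = torch.softmax(logits[0, -1], dim=0)   # softmax over the entire vocabulary
for idx in top_indices[:3]:
    print(f"{tokenizer.decode([idx])!r}: full-vocab probability = {full_probs[idx].item():.4f}")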
Token Sampling
Sample tokens from the probability distribution to see randomness in action:
# Sample tokens from the distribution
sampled_indices = torch.multinomial(top_probs, num_samples=100, replacement=True)
sampled_tokens = [top_tokens[idx] for idx in sampled_indices]

# Display sample using Rich
rprint("Sample of tokens:")
rprint(" ".join([f"[cyan]{token}[/cyan]" for token in sampled_tokens[:20]]))

# Count occurrences and display in Rich table
token_counts = Counter(sampled_tokens)

# Create Rich table for token counts
table = Table(title="Token Counts from 100 Samples")
table.add_column("Token", style="cyan")
table.add_column("Count", style="green")
table.add_column("Percentage", style="yellow")

for token, count in token_counts.most_common():
    percentage = (count / 100) * 100
    table.add_row(token, str(count), f"{percentage:.1f}%")

console.print(table)
Sample of tokens:
be make make be not have become be have have have be be need become need also be also be
 Token Counts from 100 Samples
┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Token     ┃ Count ┃ Percentage ┃
┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│  be       │ 53    │ 53.0%      │
│  have     │ 16    │ 16.0%      │
│  become   │ 9     │ 9.0%       │
│  need     │ 9     │ 9.0%       │
│  also     │ 4     │ 4.0%       │
│  make     │ 3     │ 3.0%       │
│  not      │ 2     │ 2.0%       │
│  take     │ 2     │ 2.0%       │
│  continue │ 2     │ 2.0%       │
└───────────┴───────┴────────────┘
Token counts from 100 samples reflect the underlying probability distribution.
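A simple sanity check is to compare the observed counts with the expected counts of 100 × probability; a minimal sketch reusing the variables above:
for token, prob in zip(top_tokens, top_probs):
    expected = 100 * prob.item()               # expected number of draws out of 100
    observed = token_counts.get(token, 0)
    print(f"{token!r}: expected ~{expected:.0f}, observed {observed}")
For instance, ' be' has probability 0.527, so roughly 53 of the 100 samples should be ' be', which matches the 53 observed here.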
Temperature Scaling Comparison
Compare extreme temperature values to see dramatic differences:
# Temperature scaling comparison across multiple temperatures
temperature_values = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 100.0]

# Calculate probabilities for each temperature
temp_results = {}
for temp in temperature_values:
    scaled_logits = logits / temp
    top_logits_temp, top_indices_temp = torch.topk(scaled_logits[0, -1], k=10)
    top_probs_temp = torch.softmax(top_logits_temp, dim=0)
    temp_results[temp] = {
        'probs': top_probs_temp,
        'tokens': [tokenizer.decode([idx]) for idx in top_indices_temp]
    }

# Create comprehensive comparison table
table = Table(title="Temperature Scaling: Full Spectrum Comparison")
table.add_column("Rank", style="dim")
table.add_column("Token", style="cyan")

for temp in temperature_values:
    if temp == 1.0:
        table.add_column(f"**T={temp}**", style="bold green")  # Bold T=1
    else:
        table.add_column(f"T={temp}", style="green" if temp > 1.0 else "red")

for i in range(10):
    row_data = [str(i + 1), repr(temp_results[temperature_values[0]]['tokens'][i])]
    for temp in temperature_values:
        prob = temp_results[temp]['probs'][i].item()
        row_data.append(f"{prob:.4f}")
    table.add_row(*row_data)

console.print(table)
                      Temperature Scaling: Full Spectrum Comparison
┏━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ Rank ┃ Token       ┃ T=0.01 ┃ T=0.1  ┃ T=0.5  ┃ **T=1.0** ┃ T=2.0  ┃ T=5.0  ┃ T=100.0 ┃
┡━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ 1    │ ' be'       │ 1.0000 │ 1.0000 │ 0.8667 │ 0.5269    │ 0.2731 │ 0.1550 │ 0.1023  │
│ 2    │ ' have'     │ 0.0000 │ 0.0000 │ 0.0891 │ 0.1690    │ 0.1547 │ 0.1235 │ 0.1012  │
│ 3    │ ' need'     │ 0.0000 │ 0.0000 │ 0.0182 │ 0.0763    │ 0.1039 │ 0.1053 │ 0.1004  │
│ 4    │ ' become'   │ 0.0000 │ 0.0000 │ 0.0094 │ 0.0549    │ 0.0881 │ 0.0986 │ 0.1000  │
│ 5    │ ' not'      │ 0.0000 │ 0.0000 │ 0.0059 │ 0.0436    │ 0.0786 │ 0.0942 │ 0.0998  │
│ 6    │ ' also'     │ 0.0000 │ 0.0000 │ 0.0030 │ 0.0312    │ 0.0664 │ 0.0881 │ 0.0995  │
│ 7    │ ' take'     │ 0.0000 │ 0.0000 │ 0.0023 │ 0.0271    │ 0.0619 │ 0.0856 │ 0.0993  │
│ 8    │ ' continue' │ 0.0000 │ 0.0000 │ 0.0022 │ 0.0266    │ 0.0613 │ 0.0853 │ 0.0993  │
│ 9    │ ' make'     │ 0.0000 │ 0.0000 │ 0.0020 │ 0.0251    │ 0.0596 │ 0.0843 │ 0.0992  │
│ 10   │ ' only'     │ 0.0000 │ 0.0000 │ 0.0012 │ 0.0193    │ 0.0523 │ 0.0800 │ 0.0990  │
└──────┴─────────────┴────────┴────────┴────────┴───────────┴────────┴────────┴─────────┘
Within this top-10 set, very low temperature (T=0.01) puts essentially all probability on ' be', while very high temperature (T=100) pushes every token toward 1/10 = 0.10, i.e. a nearly uniform distribution.
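A quick sanity check of the two limits, reusing top_logits from the earlier block (the divisors are just extreme example temperatures):
print(torch.softmax(top_logits / 1000.0, dim=0))   # very high T: every entry close to 1/10 = 0.1
print(torch.softmax(top_logits / 0.01, dim=0))     # very low T: essentially all mass on ' be'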
Temperature Analysis
Analyze how different temperatures affect probability distributions and entropy:
# Function to analyze temperature effects
def analyze_temperature_effects(logits, temperatures, top_k=10):
    results = []
    for temp in temperatures:
        scaled_logits = logits / temp
        probs = torch.softmax(scaled_logits, dim=-1)

        # Get top-k tokens
        top_probs, top_indices = torch.topk(probs[0, -1], k=top_k)
        top_tokens = [tokenizer.decode([idx]) for idx in top_indices]

        # Calculate entropy
        prob_dist = probs[0, -1].cpu().numpy()
        entropy_value = entropy(prob_dist)

        for i, (token, prob) in enumerate(zip(top_tokens, top_probs)):
            results.append({
                'temperature': temp,
                'token': token,
                'probability': prob.item(),
                'rank': i + 1,
                'entropy': entropy_value
            })
    return results

# Test with different temperature values
temperatures = [0.1, 0.5, 1.0, 2.0, 5.0]
results = analyze_temperature_effects(logits, temperatures)

# Create prettier visualizations
plt.rcParams.update({
    'font.size': 12,
    'axes.labelsize': 14,
    'axes.titlesize': 16,
    'xtick.labelsize': 12,
    'ytick.labelsize': 12,
    'legend.fontsize': 12,
    'figure.titlesize': 18
})

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Top-k probabilities with better styling
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FECA57']
for i, temp in enumerate(temperatures):
    temp_data = [r for r in results if r['temperature'] == temp]
    ranks = [r['rank'] for r in temp_data]
    probs = [r['probability'] for r in temp_data]

    linewidth = 3 if temp == 1.0 else 2
    axes[0].plot(ranks, probs, 'o-', label=f'T={temp}',
                 color=colors[i], linewidth=linewidth, markersize=8, alpha=0.8)

axes[0].set_xlabel('Token Rank', fontweight='bold')
axes[0].set_ylabel('Probability', fontweight='bold')
axes[0].set_title('Top-k Token Probabilities vs Temperature', fontweight='bold')
axes[0].legend(frameon=True, fancybox=True, shadow=True)
axes[0].grid(True, alpha=0.3, linestyle='--')
axes[0].set_facecolor('#F8F9FA')

# Plot 2: Entropy vs Temperature with better styling
entropies = []
for temp in temperatures:
    temp_entropy = [r['entropy'] for r in results if r['temperature'] == temp][0]
    entropies.append(temp_entropy)

axes[1].plot(temperatures, entropies, 'o-', color='#E74C3C',
             linewidth=3, markersize=10, alpha=0.8)
axes[1].set_xlabel('Temperature', fontweight='bold')
axes[1].set_ylabel('Entropy', fontweight='bold')
axes[1].set_title('Entropy vs Temperature', fontweight='bold')
axes[1].grid(True, alpha=0.3, linestyle='--')
axes[1].set_facecolor('#F8F9FA')

# Highlight T=1.0 on entropy plot
idx_1 = temperatures.index(1.0)
axes[1].scatter(1.0, entropies[idx_1], color='#2ECC71', s=150,
                zorder=5, edgecolors='white', linewidth=2)
axes[1].annotate('T=1.0\n(Standard)', xy=(1.0, entropies[idx_1]),
                 xytext=(1.5, entropies[idx_1] + 0.5),
                 arrowprops=dict(arrowstyle='->', color='#2ECC71', lw=2),
                 fontsize=11, ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
Higher temperatures lead to flatter distributions and higher entropy, confirming the mathematical relationship.
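It is useful to know the bounds of that entropy axis: a one-hot distribution (the T → 0 limit) has entropy 0, while a uniform distribution over the full vocabulary (the T → ∞ limit) has entropy ln 50257 ≈ 10.82 nats. A small sketch, reusing the imports and model from above (scipy's entropy uses the natural log by default):
vocab_size = model.config.vocab_size                 # 50257 for GPT-2-family models

uniform = np.full(vocab_size, 1.0 / vocab_size)
one_hot = np.zeros(vocab_size)
one_hot[0] = 1.0

print(f"Entropy ceiling (uniform): {entropy(uniform):.2f} nats")   # ln(50257) ≈ 10.82
print(f"Entropy floor (one-hot):   {entropy(one_hot):.2f} nats")   # 0.00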
Diverse Examples Across Domains
Test temperature effects on different types of content:
def generate_comparison_examples():
    """Generate examples across different domains to show temperature effects"""
    examples = [
        ("The scientific method involves", "Science"),
        ("To solve the equation x^2 - 4x + 3 = 0,", "Mathematics"),
        ("In Shakespeare's time, the theater", "Literature/English")
    ]
    temperatures = [0.3, 1.0, 2.0]

    for prompt, domain in examples:
        rprint(f"\n[bold]{domain.upper()}[/bold]")
        rprint(f"Prompt: '{prompt}'")
        rprint("-" * 50)

        input_ids = tokenizer(prompt, return_tensors="pt").input_ids

        for temp in temperatures:
            rprint(f"\n[cyan]Temperature {temp}:[/cyan]")
            for i in range(3):  # Generate 3 samples per temperature
                output = model.generate(
                    input_ids,
                    do_sample=True,
                    temperature=temp,
                    max_new_tokens=15,
                    top_k=50,
                    pad_token_id=tokenizer.eos_token_id,
                    attention_mask=torch.ones_like(input_ids)
                )
                generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
                new_text = generated_text[len(prompt):].strip()
                rprint(f"  Sample {i+1}: {new_text}")
        print()

generate_comparison_examples()
SCIENCE
Prompt: 'The scientific method involves'
--------------------------------------------------
Temperature 0.3:
Sample 1: the use of a single, single, single, single-digit number.
Sample 2: a series of steps to achieve a specific goal. For example, the goal
Sample 3: taking a single molecule of a molecule of a molecule of a molecule of a
Temperature 1.0:
Sample 1: two elements: a measurement of the quality of the water and a measurement of
Sample 2: measuring the concentration of a large number of nucleic acids. (See a
Sample 3: measuring all sorts of atoms using two different atomic weights, called the ‴
Temperature 2.0:
Sample 1: some complicated equations such that the equations simply must exist before applying force. In
Sample 2: determining each set of biological pathways to change each phenotype based on that pathway.
Sample 3: using samples in real and made use of pure glass; with a special lens
MATHEMATICS
Prompt: 'To solve the equation x^2 - 4x + 3 = 0,'
--------------------------------------------------
Temperature 0.3:
Sample 1: 2x + 3 = 0, 2x + 3 = 0, 2
Sample 2: and then it is the same as x^2 + 4x + 4
Sample 3: x^2 - 4x + 3 = 0, x^2 -
Temperature 1.0:
Sample 1: 3x = 1, 4x = 2, 4x = 3,
Sample 2: 0 , and 1 ) The idea is simple.
Sample 3: but the answer at 1x is the same as with m^2 -
Temperature 2.0:
Sample 1: where * x=A.add(-10, 0 / B)
Sample 2: ‰
Sample 3: 4 ( \times f(-x)). Note the error of z(5
LITERATURE/ENGLISH
Prompt: 'In Shakespeare's time, the theater'
--------------------------------------------------
Temperature 0.3:
Sample 1: was a kind of theater, a kind of theater, a kind of theater
Sample 2: was a place where people could play Shakespeare and Shakespeare. The theater was a
Sample 3: was a kind of theater where the audience could see the characters and the characters
Temperature 1.0:
Sample 1: is a work of art and creativity.
He and his wife
Sample 2: and the music are also known for using Shakespeare's "soul with him
Sample 3: 's most famous poet, Thomas Edison, drew its light on his favorite verse
Temperature 2.0:
Sample 1: troupe was founded by British writer Stephen King at around that stage. Here
Sample 2: might have had better management because of the lack of quality of production at a
Sample 3: director himself had, as Sir Francis Molloy had hinted about a �
Multiple samples reveal consistency patterns at low temperatures and diversity at high temperatures across all domains.
Practical Applications
Use Cases
- Low Temperature (0.1-0.5): Code generation, technical docs, factual content
- Medium Temperature (0.7-1.2): Creative writing, chatbots, general text
- High Temperature (1.5-3.0): Brainstorming, fiction, diverse idea generation
Quick Guidelines
- Start with T=1.0 as baseline
- Lower temperature for consistency and accuracy
- Higher temperature for creativity and diversity
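To make these guidelines concrete, here is a hedged sketch of a small helper around model.generate; the function name, the task labels, and the exact temperature values are illustrative choices made for this sketch, not part of any library:
# Suggested starting temperatures per task type (tune for your model and use case)
TASK_TEMPERATURES = {
    "factual":  0.3,   # code, technical docs, factual content
    "general":  1.0,   # chatbots, general text
    "creative": 1.8,   # brainstorming, fiction, diverse ideas
}

def generate_for_task(prompt, task="general", max_new_tokens=30):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(
        input_ids,
        do_sample=True,
        temperature=TASK_TEMPERATURES[task],
        max_new_tokens=max_new_tokens,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
        attention_mask=torch.ones_like(input_ids),
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_for_task("In the future, AI will", task="creative"))
In practice these starting values should be tuned per model and task rather than treated as fixed rules.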
Conclusion
Temperature scaling is a simple yet powerful technique for controlling language model randomness:
- T < 1: More deterministic, consistent outputs
- T = 1: Standard softmax behavior
- T > 1: More random, diverse outputs
Monitor entropy to quantify diversity. Temperature remains one of the most practical tools for controlling LLM behavior.
References
Code and Implementation
- Temperature Scaling Repository: https://github.com/gpleiss/temperature_scaling
- Twitter Discussion: https://x.com/akshay_pachaar/status/1942201076767412307
Academic Sources
- Hinton, G. et al. (2015): “Distilling the Knowledge in a Neural Network” - Original paper introducing temperature in knowledge distillation
- Guo, C. et al. (2017): “On Calibration of Modern Neural Networks” - ICML paper on temperature scaling for calibration
- Goodfellow, I. et al. (2016): “Deep Learning” - Chapter 6 covers softmax and temperature scaling
- Bishop, C. (2006): “Pattern Recognition and Machine Learning” - Chapter 4 discusses softmax temperature
Textbooks
- “Deep Learning” by Goodfellow, Bengio, and Courville - Comprehensive coverage of softmax and sampling techniques
- “Pattern Recognition and Machine Learning” by Christopher Bishop - Mathematical foundations of probability distributions
- “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman - Statistical perspective on temperature scaling