%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
A while back, I read a wonderful article called “Top 50 ggplot2 Visualizations - The Master List (With Full R Code)”. Many of the plots looked very useful. In this post, I’ll recreate the first of those plots in Python (with the help of Stack Overflow).
Here’s what the end result should look like.
What the final plot should look like
Attributes of the above plot
- X-Y scatter for area vs population
- Color by state
- Marker size by population density
I’ll first use Pandas to create the plot. Pandas’ plotting capabilities are usually the first thing I reach for when creating plots. Next, I’ll show how Seaborn reduces some of the complexity. Lastly, I’ll use Altair, ggplot and Plotnine to show how they get directly to the point, i.e. expressing the 3 required attributes!
TLDR: Declarative visualisation is super useful!
Original R code
# install.packages("ggplot2")
# load package and data
options(scipen=999) # turn-off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme.
data("midwest", package = "ggplot2")
# midwest <- read.csv("http://goo.gl/G1K41K") # bkup data source
# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state, size=popdensity)) +
  geom_smooth(method="loess", se=F) +
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(subtitle="Area Vs Population",
       y="Population",
       x="Area",
       title="Scatterplot",
       caption = "Source: midwest")
plot(gg)
Color scheme (borrowed from Randy Olson’s website)
# Tableau 20 Colors
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]

# Rescale to values between 0 and 1
for i in range(len(tableau20)):
    r, g, b = tableau20[i]
    tableau20[i] = (r / 255., g / 255., b / 255.)
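The in-place rescaling loop above could also be written as a small pure function, which leaves the original 0-255 list untouched. This is only a sketch: `rescale_rgb` is a hypothetical helper name, not something from the original post or from Matplotlib.

```python
# Hypothetical helper (not part of the original post):
# convert 0-255 RGB triples into the 0-1 floats that Matplotlib expects.
def rescale_rgb(colors):
    return [(r / 255.0, g / 255.0, b / 255.0) for r, g, b in colors]

# Two of the Tableau 20 colours, used here purely as sample input.
sample = [(31, 119, 180), (255, 127, 14)]
scaled = rescale_rgb(sample)
```

Because the function returns a new list, the original integer triples stay available for other uses.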
Getting the data
midwest = pd.read_csv("http://goo.gl/G1K41K")
# Filtering
midwest = midwest[midwest.poptotal<50000]
midwest.head().loc[:, ['area']]
|   | area  |
|---|-------|
| 1 | 0.014 |
| 2 | 0.022 |
| 3 | 0.017 |
| 4 | 0.018 |
| 5 | 0.050 |
Default Pandas scatter plot with marker size by population density
midwest.plot(kind='scatter', x='area', y='poptotal', ylim=((0, 50000)), xlim=((0., 0.1)), s=midwest['popdensity']*0.1)
If we just use the default Pandas scatter, we won’t get the colour by state. For that, we will group the dataframe by state and then scatter plot each group individually.
Complete Pandas’ solution (hand-wavy at times!)
fig, ax = plt.subplots()
groups = midwest.groupby('state')
colors = tableau20[::2]

# Plotting each group
for i, (name, group) in enumerate(groups):
    group.plot(kind='scatter', x='area', y='poptotal', ylim=((0, 50000)), xlim=((0., 0.1)),
               s=10+group['popdensity']*0.1, # hand-wavy :(
               label=name, ax=ax, color=colors[i])

# Legend for State colours
lgd = ax.legend(numpoints=1, loc=1, borderpad=1,
                frameon=True, framealpha=0.9, title="state")
# Note: newer Matplotlib versions (3.7+) rename legendHandles to legend_handles
for handle in lgd.legendHandles:
    handle.set_sizes([100.0])

# Make a legend for popdensity. Hand-wavy. Error prone!
pws = (pd.cut(midwest['popdensity'], bins=4, retbins=True)[1]).round(0)
for pw in pws:
    plt.scatter([], [], s=(pw**2)/2e4, c="k", label=str(pw))

h, l = plt.gca().get_legend_handles_labels()
# Skip the first 5 handles (one per state) so only the size markers remain
plt.legend(h[5:], l[5:], labelspacing=1.2, title="popdensity", borderpad=1,
           frameon=True, framealpha=0.9, loc=4, numpoints=1)
# Re-add the state legend, which the second plt.legend call replaced
plt.gca().add_artist(lgd)
Using Seaborn
The solution using Seaborn is slightly less complicated, as we won’t need to write the code for plotting different states in different colours. However, the legend juggling for marker size would still be required!
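sizes = [10, 40, 70, 100]
# Bin popdensity into 4 intervals and map each interval to a marker size
marker_size = pd.cut(midwest['popdensity'], range(0, 2500, 500), labels=sizes)
# x and y are passed by keyword, which newer Seaborn versions require
sns.lmplot(x='area', y='poptotal', data=midwest, hue='state', fit_reg=False, scatter_kws={'s':marker_size})
plt.ylim((0, 50000))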
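To make the `pd.cut` step above less opaque, here is a pure-Python sketch of the same binning idea: each density falls into one of the right-closed intervals (0, 500], (500, 1000], (1000, 1500], (1500, 2000] and picks up the matching marker size. The function name `size_for_density` is hypothetical and uses `bisect` rather than Pandas, so treat it as an illustration of the mapping, not as what Pandas does internally.

```python
from bisect import bisect_left

# Same edges as range(0, 2500, 500) and the same size labels as above.
EDGES = [0, 500, 1000, 1500, 2000]
SIZES = [10, 40, 70, 100]

def size_for_density(density):
    """Hypothetical helper: marker size for one density value.

    Intervals are open on the left and closed on the right, which
    matches the default behaviour of pd.cut. Values outside (0, 2000]
    get None, where pd.cut would give a missing value.
    """
    idx = bisect_left(EDGES, density) - 1
    if 0 <= idx < len(SIZES):
        return SIZES[idx]
    return None

example_sizes = [size_for_density(d) for d in (250, 500, 501, 1999, 2500)]
```

One thing this makes visible: any county whose density lies outside the bin range ends up with no size at all, so it may be worth checking the range of `popdensity` before settling on the bin edges.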
Altair (could not get simpler!)
from altair import Chart
chart = Chart(midwest)

chart.mark_circle().encode(
    x='area',
    y='poptotal',
    color='state',
    size='popdensity',
)
ggplot
from ggplot import *
ggplot(aes(x='area', y='poptotal', color='state', size='popdensity'), data=midwest) +\
    geom_point() +\
    theme_bw() +\
    xlab("Area") +\
    ylab("Population") +\
    ggtitle("Area vs Population")
It was great fun (and frustration) trying to make this plot. Some bits, such as the LOESS smoother, are still missing from the visualisations I made. The best thing about this exercise was discovering Altair! Declarative visualisation looks so natural. Way to go, declarative visualisation!