The code for this Python example has been adapted from an article by Ekta Sharma on the Palmer Penguins dataset.
import pandas as pd
import os
# access the R data frame from Python through reticulate's `r` object
penguins_df = r.penguin_df
Let's take a quick look at the data using describe(). This gives us an indication of the variables and their respective values:
penguins_df[["species", "sex", "body_mass_g", "flipper_length_mm", "bill_length_mm"]].dropna().describe(include='all')
## species sex body_mass_g flipper_length_mm bill_length_mm
## count 333 333 333.000000 333.000000 333.000000
## unique 3 2 NaN NaN NaN
## top Adelie male NaN NaN NaN
## freq 146 168 NaN NaN NaN
## mean NaN NaN 4207.057057 200.966967 43.992793
## std NaN NaN 805.215802 14.015765 5.468668
## min NaN NaN 2700.000000 172.000000 32.100000
## 25% NaN NaN 3550.000000 190.000000 39.500000
## 50% NaN NaN 4050.000000 197.000000 44.500000
## 75% NaN NaN 4775.000000 213.000000 48.600000
## max NaN NaN 6300.000000 231.000000 59.600000
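The NaN entries in the summary above come from `include='all'`: categorical statistics (unique, top, freq) only apply to text columns, and numeric statistics (mean, std, quantiles) only apply to numeric columns. A minimal sketch of that behaviour, using a small made-up data frame (`toy` and its values are invented for illustration and are not the penguin data):

```python
import pandas as pd

# Toy data, invented for illustration only (not the Palmer Penguins values).
toy = pd.DataFrame(
    {
        "species": ["A", "A", "B", None],
        "body_mass_g": [3000.0, 3500.0, 5000.0, 4000.0],
    }
)

# Drop incomplete rows first, then summarise every column at once.
summary = toy.dropna().describe(include="all")

# Categorical statistics are filled in for the text column only ...
print(summary.loc[["count", "unique", "top", "freq"], "species"])
# ... and numeric statistics are filled in for the numeric column only.
print(summary.loc[["mean", "min", "max"], "body_mass_g"])
```

Calling `describe()` without `include='all'` would summarise only the numeric columns and leave `species` out.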
Let's take a closer look at the data.
Grouping the penguins by species and sex suggests a relationship between body mass and flipper length: Adelie females appear to be the lightest and have the shortest flippers.
(penguins_df
    .dropna()
    .groupby(["species", "sex"])
    .agg({"body_mass_g": "mean", "flipper_length_mm": "mean", "sex": "count"})
    .sort_index()
)
## body_mass_g flipper_length_mm sex
## species sex
## Adelie female 3368.835616 187.794521 73
## male 4043.493151 192.410959 73
## Chinstrap female 3527.205882 191.735294 34
## male 3938.970588 199.911765 34
## Gentoo female 4679.741379 212.706897 58
## male 5484.836066 221.540984 61
It seems that the Gentoo is the largest penguin species. Let's subset the data to the Gentoo observations so that we can look at their distribution more closely:
larger = penguins_df[penguins_df.species == "Gentoo"].dropna()
larger
## species island bill_length_mm ... body_mass_g sex year
## 152 Gentoo Biscoe 46.1 ... 4500 female 2007
## 153 Gentoo Biscoe 50.0 ... 5700 male 2007
## 154 Gentoo Biscoe 48.7 ... 4450 female 2007
## 155 Gentoo Biscoe 50.0 ... 5700 male 2007
## 156 Gentoo Biscoe 47.6 ... 5400 male 2007
## .. ... ... ... ... ... ... ...
## 270 Gentoo Biscoe 47.2 ... 4925 female 2009
## 272 Gentoo Biscoe 46.8 ... 4850 female 2009
## 273 Gentoo Biscoe 50.4 ... 5750 male 2009
## 274 Gentoo Biscoe 45.2 ... 5200 female 2009
## 275 Gentoo Biscoe 49.9 ... 5400 male 2009
##
## [119 rows x 8 columns]
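Boolean-mask filtering followed by `dropna()` keeps the original row labels, which is presumably why the index in the output above starts at 152 and skips label 271. A minimal sketch on made-up data (the `toy` frame is invented for illustration), also showing `DataFrame.query` as an equivalent way to write the same filter:

```python
import pandas as pd

# Toy data, invented for illustration only (not the Palmer Penguins values).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Gentoo", "Gentoo"],
        "body_mass_g": [3700.0, 5000.0, None, 5600.0],
    }
)

# Boolean mask, as used in the text, then drop rows with missing values.
by_mask = toy[toy.species == "Gentoo"].dropna()

# The same filter written as a query string.
by_query = toy.query("species == 'Gentoo'").dropna()

# Row labels from the full data frame are preserved; label 2 is gone
# because its body mass is missing.
print(by_mask.index.tolist())
```

If consecutive row numbers are preferred, `reset_index(drop=True)` can be chained after `dropna()`.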
Let's move on to some plots, this time using ggplot to visualise the distribution of body mass for the Gentoo species:
penguin_plot <- py$larger %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(body_mass_g, fill = sex)) +
  geom_density(color = "white", alpha = 0.5) +
  scale_fill_manual(values = c("darkorange", "purple")) +
  labs(x = "Body Mass (g)") +
  theme_minimal()
penguin_plot