Diagram of penguin head with indication of bill length and bill depth.

The code for this Python example has been adapted from an article by Ekta Sharma on the Palmer Penguins dataset.

import pandas as pd
import os

# Bring the R data frame into Python; `r` is the object reticulate
# exposes inside a Python chunk of an R Markdown document
penguins_df = r.penguin_df

Let's explore the data

Let's take a quick look at the data using describe(), which summarises each selected variable and the values it takes:

penguins_df[["species", "sex", "body_mass_g", "flipper_length_mm", "bill_length_mm"]].dropna().describe(include='all')
##        species   sex  body_mass_g  flipper_length_mm  bill_length_mm
## count      333   333   333.000000         333.000000      333.000000
## unique       3     2          NaN                NaN             NaN
## top     Adelie  male          NaN                NaN             NaN
## freq       146   168          NaN                NaN             NaN
## mean       NaN   NaN  4207.057057         200.966967       43.992793
## std        NaN   NaN   805.215802          14.015765        5.468668
## min        NaN   NaN  2700.000000         172.000000       32.100000
## 25%        NaN   NaN  3550.000000         190.000000       39.500000
## 50%        NaN   NaN  4050.000000         197.000000       44.500000
## 75%        NaN   NaN  4775.000000         213.000000       48.600000
## max        NaN   NaN  6300.000000         231.000000       59.600000
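The NaN cells in the summary above come from mixing text and numeric columns: with include='all', describe() reports count, unique, top and freq for text columns and the numeric statistics for numeric ones, leaving the rest empty. A minimal sketch on a hypothetical toy frame (the `toy` data is invented for illustration and is not the penguins dataset):

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", None],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 4200.0],
})

# dropna() removes the row with the missing species before summarising
summary = toy.dropna().describe(include="all")
print(summary)

# Text column: categorical statistics are filled, numeric ones are NaN
print(summary.loc[["count", "unique", "top", "freq"], "species"])

# Numeric column: numeric statistics are filled, categorical ones are NaN
print(summary.loc[["count", "mean", "min", "max"], "body_mass_g"])
```

Selecting only numeric columns, or calling describe() without include='all', would avoid the empty cells at the cost of dropping the text columns from the summary.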

Deeper Dive

Let's take a closer look at the data.

Grouping the penguins by species and sex suggests a relationship between body mass and flipper length: female Adelie penguins are, on average, the lightest and have the shortest flippers.

(
    penguins_df
    .dropna()
    .groupby(["species", "sex"])
    .agg({"body_mass_g": "mean", "flipper_length_mm": "mean", "sex": "count"})
    .sort_index()
)
##                   body_mass_g  flipper_length_mm  sex
## species   sex                                        
## Adelie    female  3368.835616         187.794521   73
##           male    4043.493151         192.410959   73
## Chinstrap female  3527.205882         191.735294   34
##           male    3938.970588         199.911765   34
## Gentoo    female  4679.741379         212.706897   58
##           male    5484.836066         221.540984   61
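In the table above, the column headed sex holds the number of penguins in each group, which is easy to misread. One possible alternative, sketched here on hypothetical toy rows rather than the real data, is pandas named aggregation, which lets each output column carry a descriptive name (the names `mean_body_mass_g`, `mean_flipper_length_mm` and `n` are illustrative choices, not part of the original article):

```python
import pandas as pd

# Hypothetical toy rows standing in for penguins_df
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["female", "female", "male", "female", "male"],
    "body_mass_g": [3300.0, 3500.0, 4000.0, 4700.0, 5500.0],
    "flipper_length_mm": [186.0, 188.0, 192.0, 212.0, 222.0],
})

by_group = (
    toy
    .dropna()
    .groupby(["species", "sex"])
    .agg(
        # output_name=(input_column, aggregation)
        mean_body_mass_g=("body_mass_g", "mean"),
        mean_flipper_length_mm=("flipper_length_mm", "mean"),
        n=("body_mass_g", "size"),  # rows per group
    )
    .sort_index()
)
print(by_group)
```

The grouping and the means are the same as before; only the column labels change, so the group size is no longer labelled sex.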

Gentoo appears to be the largest of the three species, with the highest mean body mass for both sexes. We can also look at the Gentoo penguins on their own:

larger = penguins_df[penguins_df.species=="Gentoo"].dropna()
larger
##     species  island  bill_length_mm  ...  body_mass_g     sex  year
## 152  Gentoo  Biscoe            46.1  ...         4500  female  2007
## 153  Gentoo  Biscoe            50.0  ...         5700    male  2007
## 154  Gentoo  Biscoe            48.7  ...         4450  female  2007
## 155  Gentoo  Biscoe            50.0  ...         5700    male  2007
## 156  Gentoo  Biscoe            47.6  ...         5400    male  2007
## ..      ...     ...             ...  ...          ...     ...   ...
## 270  Gentoo  Biscoe            47.2  ...         4925  female  2009
## 272  Gentoo  Biscoe            46.8  ...         4850  female  2009
## 273  Gentoo  Biscoe            50.4  ...         5750    male  2009
## 274  Gentoo  Biscoe            45.2  ...         5200  female  2009
## 275  Gentoo  Biscoe            49.9  ...         5400    male  2009
## 
## [119 rows x 8 columns]
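To set a single species against the full dataset, one option is to place the two describe() summaries side by side. The sketch below uses hypothetical toy values, and the column names `gentoo` and `all_species` are illustrative only:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3400.0, 3900.0, 3700.0, 4600.0, 5000.0, 5600.0],
})

# Same filtering pattern as the Gentoo subset in the text
gentoo = toy[toy["species"] == "Gentoo"].dropna()

# One column of summary statistics per group, aligned on the statistic name
comparison = pd.DataFrame({
    "gentoo": gentoo["body_mass_g"].describe(),
    "all_species": toy["body_mass_g"].dropna().describe(),
})
print(comparison)
```

Reading across a row then shows how the subset differs from the whole, for example in its minimum or mean body mass.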

Plot Section

Let’s move on to some plots, this time using ggplot in R to visualise the distribution of body mass for the Gentoo species, split by sex:

penguin_plot <- py$larger %>%
  filter(!is.na(sex)) %>%
  ggplot(aes(body_mass_g, fill = sex)) +
  geom_density(color = "white", alpha = 0.5) +
  scale_fill_manual(values = c("darkorange", "purple")) +
  labs(x = "Body Mass (g)") +
  theme_minimal()

penguin_plot