Exploring the Data


In this notebook, we’ll explore the dataset included with Herculano-Houzel et al. (2015) “Mammalian Brains Are Made of These”.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Get the data from the "raw" version of the file hosted on our GitHub
!wget https://raw.githubusercontent.com/NeuralDataScience/NeuralDataScience.github.io/refs/heads/master/Data/species_brainmass_neurons.csv

# Open the csv and assign to a dataframe called "data"
data = pd.read_csv('species_brainmass_neurons.csv')
data.head()

--2025-11-25 13:01:03--  https://raw.githubusercontent.com/NeuralDataScience/NeuralDataScience.github.io/refs/heads/master/Data/species_brainmass_neurons.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3649 (3.6K) [text/plain]
Saving to: ‘species_brainmass_neurons.csv.6’

species_brainmass_n 100%[===================>]   3.56K  --.-KB/s    in 0s

2025-11-25 13:01:03 (15.5 MB/s) - ‘species_brainmass_neurons.csv.6’ saved [3649/3649]

	Species	Order	cortex_mass_g	Neurons	Other_cells	Neurons_mg	Other_cells_mg	Source
0	Sorex fumeus	Eulipotyphla	0.084	9730000	9290000	116727	111754	Sarko et al., 2009
1	Mus musculus	Glires	0.173	13688162	12061838	78672	68643	Herculano-Houzel et al., 2006
2	Blarina brevicauda	Eulipotyphla	0.197	11876000	15820000	60214	80729	Sarko et al., 2009
3	Heterocephalus glaber	Glires	0.184	6151875	8398125	33374	45894	Herculano-Houzel et al., 2011
4	Condylura cristata	Eulipotyphla	0.420	17250000	32010000	40777	76995	Sarko et al., 2009

One of the first steps in data exploration is checking the shape of the dataset:

data.shape

(38, 8)

data.columns

Index(['Species', 'Order', 'cortex_mass_g', 'Neurons', 'Other_cells',
       'Neurons_mg', 'Other_cells_mg', 'Source'],
      dtype='object')
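Beyond the shape and column names, it often helps to check the column types and look for missing values before plotting. The sketch below is a minimal example of those checks. It assumes a small stand-in dataframe named `sample` (a hypothetical name) built from four columns of the five rows shown by `data.head()`, so it runs without the download; in the notebook the same methods can be called on `data`.

```python
import pandas as pd

# Stand-in for `data`: four columns from the five rows shown by data.head().
# `sample` is a hypothetical name used only for this sketch.
sample = pd.DataFrame({
    "Species": ["Sorex fumeus", "Mus musculus", "Blarina brevicauda",
                "Heterocephalus glaber", "Condylura cristata"],
    "Order": ["Eulipotyphla", "Glires", "Eulipotyphla", "Glires", "Eulipotyphla"],
    "cortex_mass_g": [0.084, 0.173, 0.197, 0.184, 0.420],
    "Neurons": [9730000, 13688162, 11876000, 6151875, 17250000],
})

# Column types: text columns are stored as strings/objects, counts as numbers
print(sample.dtypes)

# Number of missing values in each column
missing = sample.isna().sum()
print(missing)

# Summary statistics (count, mean, quartiles, ...) for the numeric columns
summary = sample.describe()
print(summary)
```

By default `describe()` reports only the numeric columns of a mixed dataframe, so `Species` and `Order` are left out of the summary.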

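# Plot the distribution of neuron counts: histogram (left) and kernel density estimate (right)
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(data['Neurons'], ax=ax[0])
sns.kdeplot(data['Neurons'], ax=ax[1])
plt.show()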

../_images/d53c6f0335e60fdb55e44b5b624ab447d6ad4366c6c868508f8841ce90d7cd9f.png

sns.boxplot(data=data, x='Order', y='Neurons')
plt.ticklabel_format(style='plain', axis='y')  # Disable scientific notation
plt.show()

../_images/b688ec2f073cb8423d34a7b70896381766ff74183f19c75891325dfd536b1f29.png
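A boxplot by order can be paired with a numeric summary of the same groups. The following is a minimal sketch, assuming a stand-in dataframe named `sample` (a hypothetical name) that holds only the five rows shown by `data.head()`; with the full dataset, the same `groupby` call can be applied to `data`.

```python
import pandas as pd

# Stand-in for `data`: three columns from the five rows shown by data.head().
# `sample` is a hypothetical name used only for this sketch.
sample = pd.DataFrame({
    "Species": ["Sorex fumeus", "Mus musculus", "Blarina brevicauda",
                "Heterocephalus glaber", "Condylura cristata"],
    "Order": ["Eulipotyphla", "Glires", "Eulipotyphla", "Glires", "Eulipotyphla"],
    "Neurons": [9730000, 13688162, 11876000, 6151875, 17250000],
})

# For each order: number of species, plus median, min, and max neuron count,
# sorted so the order with the highest median comes first
by_order = (
    sample.groupby("Order")["Neurons"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median", ascending=False)
)
print(by_order)
```

The `count` column is worth reading alongside the boxplot, since a box drawn from only two or three species says little about the order as a whole.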

sns.scatterplot(data=data, x='cortex_mass_g', y='Neurons')
plt.ticklabel_format(style='plain', axis='y')  # Disable scientific notation
plt.xlabel('Cortex Mass (g)')
plt.show()

../_images/275b894d6dce8527c0d5f15699ab73d6ef5f76dd199614766498328ee909519b.png

# Plot every pairwise relationship between the numeric columns
sns.pairplot(data)
plt.show()

../_images/39cd1c5361678f618ae44718493aa05e6bb7fe7db57107ca568261816314f47f.png

# Compute the pairwise correlation matrix of the numeric columns
corr = data.corr(numeric_only=True)

# Create axes, colormap, and plot a heatmap
fig, ax = plt.subplots(1, 1, figsize=(4, 3))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, cmap=cmap, annot=True)

plt.show()

../_images/e303f18d6fde2ea890058bf719c2de2139d34803281ff52232e19aa0ecf98d7b.png
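A correlation matrix is only defined for numeric columns, so text columns such as `Species`, `Order`, and `Source` have to be left out. One way to do that explicitly is to select the numeric columns first with `select_dtypes`. This is a minimal sketch, assuming a stand-in dataframe named `sample` (a hypothetical name) built from the five rows shown by `data.head()`; it is an illustration, not necessarily how the original notebook handles it.

```python
import pandas as pd

# Stand-in for `data`: five columns from the five rows shown by data.head().
# `sample` is a hypothetical name used only for this sketch.
sample = pd.DataFrame({
    "Species": ["Sorex fumeus", "Mus musculus", "Blarina brevicauda",
                "Heterocephalus glaber", "Condylura cristata"],
    "Order": ["Eulipotyphla", "Glires", "Eulipotyphla", "Glires", "Eulipotyphla"],
    "cortex_mass_g": [0.084, 0.173, 0.197, 0.184, 0.420],
    "Neurons": [9730000, 13688162, 11876000, 6151875, 17250000],
    "Other_cells": [9290000, 12061838, 15820000, 8398125, 32010000],
})

# Keep only the numeric columns, dropping the text columns
numeric = sample.select_dtypes(include="number")

# Pairwise Pearson correlation (the pandas default) between numeric columns
corr = numeric.corr()
print(corr)
```

Selecting columns up front also makes it obvious which variables appear in the heatmap.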

Bonus Challenges

Can you use the NCBI esearch tool to look up information about the Herculano-Houzel et al. 2015 paper for this dataset?
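As a starting point, here is a minimal sketch of one way to build an esearch query with the NCBI E-utilities web service using only the standard library. The helper names `build_esearch_url` and `fetch_ids` are hypothetical, and the search term (author plus title field tags) is just one assumption about how to phrase the query. The network call is left commented out so the block runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the E-utilities esearch service
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=5):
    """Return an esearch URL for `term` in database `db` (hypothetical helper)."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def fetch_ids(url):
    """Run the search (needs network access) and return the matching record IDs."""
    with urlopen(url) as response:
        result = json.load(response)
    return result["esearchresult"]["idlist"]


# One possible way to phrase the query: author name plus the paper title
term = 'Herculano-Houzel[Author] AND "Mammalian Brains Are Made of These"[Title]'
url = build_esearch_url(term)
print(url)

# Uncomment to send the request to NCBI:
# ids = fetch_ids(url)
# print(ids)
```

The IDs that esearch returns can then be passed to other E-utilities, such as esummary or efetch, to retrieve details about each record.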