Exploring the Data

In this notebook, we’ll explore the dataset included with Herculano-Houzel et al. (2015) “Mammalian Brains Are Made of These”.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render inline figures at high ("retina") resolution
%config InlineBackend.figure_format = 'retina'

# Get the data from the "raw" version of the file hosted on our GitHub
!wget https://raw.githubusercontent.com/NeuralDataScience/NeuralDataScience.github.io/refs/heads/master/Data/species_brainmass_neurons.csv

# Open the csv and assign to a dataframe called "data"
data = pd.read_csv('species_brainmass_neurons.csv')
data.head()
--2025-07-30 12:21:00--  https://raw.githubusercontent.com/NeuralDataScience/NeuralDataScience.github.io/refs/heads/master/Data/species_brainmass_neurons.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3649 (3.6K) [text/plain]
Saving to: ‘species_brainmass_neurons.csv.1’

2025-07-30 12:21:00 (8.51 MB/s) - ‘species_brainmass_neurons.csv.1’ saved [3649/3649]

   Species                Order         cortex_mass_g   Neurons  Other_cells  Neurons_mg  Other_cells_mg  Source
0  Sorex fumeus           Eulipotyphla          0.084   9730000      9290000      116727          111754  Sarko et al., 2009
1  Mus musculus           Glires                0.173  13688162     12061838       78672           68643  Herculano-Houzel et al., 2006
2  Blarina brevicauda     Eulipotyphla          0.197  11876000     15820000       60214           80729  Sarko et al., 2009
3  Heterocephalus glaber  Glires                0.184   6151875      8398125       33374           45894  Herculano-Houzel et al., 2011
4  Condylura cristata     Eulipotyphla          0.420  17250000     32010000       40777           76995  Sarko et al., 2009
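
The wget log above saves the file as ‘species_brainmass_neurons.csv.1’, which wget does when a file with the original name already exists, so re-running the cell leaves extra copies behind. One way to avoid this is sketched below: download only when the file is missing. The helper name `load_brain_data` is hypothetical (it is not part of the original notebook), and it uses Python's standard `urllib` in place of wget.

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

DATA_URL = (
    "https://raw.githubusercontent.com/NeuralDataScience/"
    "NeuralDataScience.github.io/refs/heads/master/Data/"
    "species_brainmass_neurons.csv"
)


def load_brain_data(path="species_brainmass_neurons.csv", url=DATA_URL):
    """Hypothetical helper: download the CSV only if it is missing, then load it."""
    path = Path(path)
    if not path.exists():
        # Fetch once; later runs reuse the local copy instead of creating ".1", ".2", ...
        urlretrieve(url, str(path))
    return pd.read_csv(path)


# Usage in the notebook (needs network access on the first run only):
# data = load_brain_data()
```

`pd.read_csv` can also read directly from a URL, which skips the local file entirely at the cost of downloading on every run.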

One of the first steps in data exploration is checking the shape of the dataset (number of rows, number of columns) and the names of its columns:

# Number of rows and columns
data.shape
(38, 8)

# Column names
data.columns
Index(['Species', 'Order', 'cortex_mass_g', 'Neurons', 'Other_cells',
       'Neurons_mg', 'Other_cells_mg', 'Source'],
      dtype='object')
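
Beyond shape and column names, it often helps to check column types, missing values, and basic summary statistics before plotting. The sketch below is one possible set of checks, not part of the original notebook. So that it runs on its own, it rebuilds only the five rows shown by `data.head()` into a stand-in dataframe called `sample` (a name introduced here); in the notebook the same calls would be applied to `data`.

```python
import pandas as pd

# Stand-in data: the five rows shown by data.head() above
sample = pd.DataFrame({
    "Species": ["Sorex fumeus", "Mus musculus", "Blarina brevicauda",
                "Heterocephalus glaber", "Condylura cristata"],
    "Order": ["Eulipotyphla", "Glires", "Eulipotyphla", "Glires", "Eulipotyphla"],
    "cortex_mass_g": [0.084, 0.173, 0.197, 0.184, 0.420],
    "Neurons": [9730000, 13688162, 11876000, 6151875, 17250000],
    "Other_cells": [9290000, 12061838, 15820000, 8398125, 32010000],
    "Neurons_mg": [116727, 78672, 60214, 33374, 40777],
    "Other_cells_mg": [111754, 68643, 80729, 45894, 76995],
    "Source": ["Sarko et al., 2009", "Herculano-Houzel et al., 2006",
               "Sarko et al., 2009", "Herculano-Houzel et al., 2011",
               "Sarko et al., 2009"],
})

# Which columns are numeric, and which hold text?
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(sample.dtypes)

# Count missing values per column
missing = sample.isna().sum()
print(missing)

# Summary statistics for the numeric columns only
summary = sample[numeric_cols].describe()
print(summary)

# How many rows fall into each taxonomic order?
order_counts = sample["Order"].value_counts()
print(order_counts)
```

Knowing which columns are numeric matters later on, because text columns such as `Species` and `Source` cannot enter a correlation.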
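
# Plot the distribution of neuron counts: histogram (left) and kernel density estimate (right)
fig,ax = plt.subplots(1,2,figsize=(10,4))
sns.histplot(data['Neurons'],ax=ax[0])
sns.kdeplot(data['Neurons'],ax=ax[1])
plt.show()

# Plot pairwise relationships between the numeric columns
sns.pairplot(data)
plt.show()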
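
The histogram of `Neurons` in this section uses linearly spaced bins. If the counts turn out to be strongly right-skewed (plausible when species differ greatly in brain size, but an assumption to check against the plot), logarithmically spaced bins can make the distribution easier to read. The sketch below uses a hypothetical helper, `plot_log_hist`, and the five `Neurons` values from `data.head()` as stand-in input; in the notebook you would pass `data['Neurons']`.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_log_hist(values, bins=10, ax=None):
    """Hypothetical helper: histogram with logarithmically spaced bins."""
    values = np.asarray(values, dtype=float)
    # Bin edges evenly spaced in log space between the smallest and largest value
    edges = np.geomspace(values.min(), values.max(), bins + 1)
    # Pin the end points so rounding cannot push the extremes outside the bins
    edges[0], edges[-1] = values.min(), values.max()
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 4))
    counts, _, _ = ax.hist(values, bins=edges)
    ax.set_xscale("log")
    ax.set_xlabel("Neurons (log scale)")
    ax.set_ylabel("Number of species")
    return counts, edges


# Stand-in input: the Neurons values from the five rows shown by data.head()
neurons = [9730000, 13688162, 11876000, 6151875, 17250000]
counts, edges = plot_log_hist(neurons, bins=4)
```

Seaborn offers a shortcut for the same idea: `sns.histplot(data['Neurons'], log_scale=True)`.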

# Compute the pairwise (Pearson) correlation between the numeric columns
corr = data.corr(numeric_only=True)

# Create axes, colormap, and plot a heatmap
fig,ax = plt.subplots(1,1,figsize=(4,3))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr,cmap=cmap,annot=True,ax=ax)

plt.show()
[Figure: annotated heatmap of the correlation matrix]
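
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr`. To show what that number means, the sketch below computes one coefficient by hand and compares it with the pandas result. The function `pearson_r` and the dataframe `sample` are names introduced here for illustration. The input is only the five rows shown by `data.head()`, so the resulting value says nothing reliable about the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in data: two numeric columns from the five rows shown by data.head()
sample = pd.DataFrame({
    "cortex_mass_g": [0.084, 0.173, 0.197, 0.184, 0.420],
    "Neurons": [9730000, 13688162, 11876000, 6151875, 17250000],
})


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


manual = pearson_r(sample["cortex_mass_g"], sample["Neurons"])
from_pandas = sample.corr().loc["cortex_mass_g", "Neurons"]

# The two values should agree up to floating-point rounding
print(manual, from_pandas)
```

Pearson correlation measures linear association only; `data.corr(method='spearman', numeric_only=True)` gives a rank-based alternative that is less sensitive to skewed values.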

Bonus Challenges

  • Can you use the NCBI esearch tool to look up information about the Herculano-Houzel et al. 2015 paper for this dataset?
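
As a possible starting point for this challenge, the sketch below builds a query URL for the NCBI E-utilities `esearch` endpoint without sending it. The helper `build_esearch_url` is a hypothetical name, and the search term is just one guess at a useful query (author plus title words); other terms may work better.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=5):
    """Hypothetical helper: build an esearch query URL that returns JSON."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# One guess at a search term: author name plus words from the title
term = "Herculano-Houzel[Author] AND Mammalian Brains Are Made of These[Title]"
url = build_esearch_url(term)
print(url)

# To run the search (requires network access), something like:
# import json
# from urllib.request import urlopen
# with urlopen(url) as response:
#     result = json.load(response)
# print(result["esearchresult"]["idlist"])  # list of matching PubMed IDs
```

The IDs returned by `esearch` can then be passed to other E-utilities, such as `esummary` or `efetch`, to retrieve the article details.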