# Finding Data

## SDK Code Exercise
**Software development kits (SDKs)** provide a set of tools, libraries, documentation, code samples, and more that allow developers to create software applications. SDKs often include APIs, which we’ll dig into a little bit below. In research and technology development, SDKs are often provided by the organizations or companies that host data, as a way of offering examples and tools for working with that data.

One great example of an SDK in neuroscience is the [Allen Institute SDK](https://allensdk.readthedocs.io). This SDK provides researchers up-to-date access to data and code that is hot off the presses from the Allen Institute, based in Seattle, Washington.

For example, here’s how you can start working with electrophysiology data from the Allen Institute in just a few lines of code:


In [None]:
# Install the allensdk to your environment if needed (note you may need to restart session after this cell)
try:
    import allensdk
    print("allensdk is already installed.")
except ImportError:
    print("allensdk not found, installing now...")
    !pip install allensdk

In [None]:
# Import pandas and the "Cell Types Cache" from the AllenSDK core package
import pandas as pd
from allensdk.core.cell_types_cache import CellTypesCache

# Initialize the cache as 'ctc' (cell types cache)
ctc = CellTypesCache(manifest_file='cell_types/manifest.json')

# Download all electrophysiology features for all cells
ephys_features = ctc.get_ephys_features()

# Make it a dataframe & show the first 5 rows
ef_df = pd.DataFrame(ephys_features)
ef_df.head()

For now, this just shows you how easy it is to access some open neuroscience data! We'll come back to this particular dataset in a later chapter to play with it.

## API Code Exercise

**Application programming interfaces**, or APIs, allow programmatic access to many databases and tools. Many large organizations, such as the National Institutes of Health, maintain APIs that enable researchers to conduct research using publicly available datasets.

(It’s not worth worrying too much about the difference between APIs and SDKs — we’d generally encourage you to think about it as: SDKs contain much more than just APIs. An API provides the building blocks for software workflow, while an SDK is a pre-packaged collection of code and data that researchers can use to work with data easily and efficiently.)

For example, a very popular bioinformatics tool called BLAST, which finds similarities between biological sequences such as DNA, has an API that researchers can use to work with -omics datasets without downloading the BLAST database to their own computer. Likewise, a tool called Entrez allows researchers to programmatically search many National Center for Biotechnology Information (NCBI) databases.

There isn’t one standard way of interacting with an API: each one works slightly differently. In Python, however, you’ll almost always need the help of a library called `requests`, which allows you to retrieve information from a URL.
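To make the idea of sending parameters to an API concrete: most web APIs receive their parameters as a query string appended to the URL. The sketch below uses only Python's standard library to build such a URL by hand, borrowing the E-utilities search address and the `db`, `term`, and `retmode` parameters from the exercise that follows. It assumes that `requests` encodes a simple dictionary of parameters in essentially this way; treat it as an illustration, not a description of the library's internals.

```python
from urllib.parse import urlencode

# Base address of the Entrez search utility and the search parameters
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {"db": "pubmed", "term": "neural data science", "retmode": "json"}

# Turn the dictionary into a query string (spaces become '+')
query_string = urlencode(params)

# Join the base URL and the query string with '?'
full_url = f"{url}?{query_string}"
print(full_url)
```

Pasting the printed URL into a browser is a quick way to check what an API returns before writing any more code.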

In the code exercise below, we’ll use `requests` to search PubMed, one of the Entrez databases, for the term “neural data science.” The URL and parameters (`params`) come from the documentation for the API.


In [4]:
import requests

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
term = "neural data science"
params = {"db": "pubmed", "term": term, "retmode": "json"}

response = requests.get(url, params=params)

# Show results of search
response.json()

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '41129',
  'retmax': '20',
  'retstart': '0',
  'idlist': ['40590186',
   '40590026',
   '40589994',
   '40589818',
   '40588827',
   '40588700',
   '40588695',
   '40588487',
   '40588361',
   '40588165',
   '40587931',
   '40587919',
   '40586488',
   '40586430',
   '40586263',
   '40585969',
   '40585851',
   '40585236',
   '40584826',
   '40583914'],
  'translationset': [{'from': 'neural',
    'to': '"neural"[All Fields] OR "neuralization"[All Fields] OR "neuralize"[All Fields] OR "neuralized"[All Fields] OR "neuralizes"[All Fields] OR "neuralizing"[All Fields] OR "neurally"[All Fields]'},
   {'from': 'data science',
    'to': '"data science"[MeSH Terms] OR ("data"[All Fields] AND "science"[All Fields]) OR "data science"[All Fields]'}],
  'querytranslation': '("neural"[All Fields] OR "neuralization"[All Fields] OR "neuralize"[All Fields] OR "neuralized"[All Fields] OR "neuralizes"[All Fields] OR "neuraliz

In the output above, you can see the results of our search. When we published the book, there were about 41,000 papers. Can you see how many there are now? (Hint: look for `count`.)
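Because `response.json()` returns ordinary Python dictionaries and lists, you can pull out individual values by key. Here is a minimal sketch that works on a hand-trimmed copy of the output above (only the first three IDs are kept), so it runs without a network connection. With a live request you would start from `response.json()` instead.

```python
# Hand-trimmed copy of the search output shown above (first three IDs only)
result = {
    "header": {"type": "esearch", "version": "0.3"},
    "esearchresult": {
        "count": "41129",
        "retmax": "20",
        "retstart": "0",
        "idlist": ["40590186", "40590026", "40589994"],
    },
}

search = result["esearchresult"]

# The API returns numbers as strings, so convert them before doing arithmetic
total_hits = int(search["count"])
page_size = int(search["retmax"])

print(f"{total_hits} matching papers; up to {page_size} IDs returned per request")
print(f"IDs in this trimmed copy: {len(search['idlist'])}")
```

Note that `count` is the total number of matches, while `idlist` holds only the first `retmax` of them.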

## Web Scraping Code Exercise
We can actually use the `requests` module to scrape data from almost any website. For example, if you want to scrape the very informative “iscaliforniaonfire.com” and show the results, you can write the following:

In [9]:
import requests
page = requests.get('http://iscaliforniaonfire.com/')
page.content

b'<html>\n<head>\n<title>Is California On Fire?</title>\n</head>\n<body>\n<h1>Yes</h1>\nupdated: Tue Jul  1 12:39:15 2025 PDT\n</body>\n</html>\n'

The output here is in HTML format, which we’d then need to parse if we wanted to scrape it for some purpose. We can import yet another package, poetically named BeautifulSoup, to organize this HTML output, search through it for a particular HTML tag, and cleanly print the results:

In [10]:
# Import the BeautifulSoup package
from bs4 import BeautifulSoup

# Create the soup
soup = BeautifulSoup(page.content, 'html.parser')

# Find the HTML tag of interest and show results
h1_content = soup.find('h1')
h1_content.text

'Yes'

This website is very simple (and alarming, speaking as two people who live in California), so it’s easy to get to the point: yes, California is almost always on fire. Most websites aren’t so easy to scrape cleanly, and getting the exact information you need can be a bit tricky.
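If you are curious what "finding a tag" involves, here is a rough sketch that does the same job as `soup.find('h1')` using only Python's built-in `html.parser` module. The `TagTextCollector` class is a hypothetical helper written for this illustration, not part of BeautifulSoup, and the page content is copied from the output above so the example runs offline.

```python
from html.parser import HTMLParser

# Copy of the page content shown above (bytes, as requests delivers it)
page_content = b'<html>\n<head>\n<title>Is California On Fire?</title>\n</head>\n<body>\n<h1>Yes</h1>\nupdated: Tue Jul  1 12:39:15 2025 PDT\n</body>\n</html>\n'


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collect the text inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Start a new text entry each time the tag of interest opens
        if tag == self.tag:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside the tag
        if self.inside:
            self.texts[-1] += data


collector = TagTextCollector("h1")
collector.feed(page_content.decode("utf-8"))
collector.close()

h1_texts = collector.texts
print(h1_texts)
```

Even for one tag on a tiny page this takes a fair amount of bookkeeping, which is why a dedicated parsing package is usually the more practical choice.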

## PubMed Utilities Exercise

As a neural data scientist, you likely won’t do a ton of web scraping, but these HTML (or XML, or JSON) parsing skills can come in handy in many different types of informatics. For example, if you’d like to work with the PubMed utilities mentioned above to pull abstracts of scientific articles around a particular search term, PubMed will return the results to you in XML by default. So, you need to know how to parse these results in order to do fun informatics work with them.

In [13]:
# Install package that contains Entrez (you may need to restart session after doing so)
!pip install Biopython

Collecting Biopython
  Downloading biopython-1.85-cp312-cp312-macosx_11_0_arm64.whl.metadata (13 kB)
Downloading biopython-1.85-cp312-cp312-macosx_11_0_arm64.whl (2.8 MB)
Installing collected packages: Biopython
Successfully installed Biopython-1.85


In [15]:
# Import Entrez
from Bio import Entrez

# Specify email address (required by NCBI E-utilities)
Entrez.email = 'myemail@email.com'

# Fetch a particular paper
fetch_handle = Entrez.efetch(db='pubmed', id='36729258', retmax=100, rettype='abstract')

fetch_handle.read()

b'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2025//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_250101.dtd">\n<PubmedArticleSet>\n<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM" IndexingMethod="Automated"><PMID Version="1">36729258</PMID><DateCompleted><Year>2023</Year><Month>06</Month><Day>26</Day></DateCompleted><DateRevised><Year>2024</Year><Month>06</Month><Day>03</Day></DateRevised><Article PubModel="Print-Electronic"><Journal><ISSN IssnType="Electronic">1618-727X</ISSN><JournalIssue CitedMedium="Internet"><Volume>36</Volume><Issue>3</Issue><PubDate><Year>2023</Year><Month>Jun</Month></PubDate></JournalIssue><Title>Journal of digital imaging</Title><ISOAbbreviation>J Digit Imaging</ISOAbbreviation></Journal><ArticleTitle>Ultrasound Prostate Segmentation Using Adaptive Selection Principal Curve and Smooth Mathematical Model.</ArticleTitle><Pagination><StartPage>947</StartPage><EndPage>963</EndPage><MedlinePg
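XML like this can be navigated with Python's built-in `xml.etree.ElementTree` module. The sketch below parses a hand-shortened excerpt of the record above (a few fields kept, with closing tags added so that it is well-formed), so it is illustrative rather than a full PubMed record. In a live session you would likely pass the result of `fetch_handle.read()` to `ET.fromstring` instead; Biopython's own `Entrez.read` is another option.

```python
import xml.etree.ElementTree as ET

# Hand-shortened excerpt of the record above; closing tags added by hand
xml_snippet = """<PubmedArticleSet>
<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM">
<PMID Version="1">36729258</PMID>
<Article PubModel="Print-Electronic">
<Journal><Title>Journal of digital imaging</Title><ISOAbbreviation>J Digit Imaging</ISOAbbreviation></Journal>
<ArticleTitle>Ultrasound Prostate Segmentation Using Adaptive Selection Principal Curve and Smooth Mathematical Model.</ArticleTitle>
</Article>
</MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

# Parse the text into a tree of elements
root = ET.fromstring(xml_snippet)

# './/Tag' searches for the first matching element anywhere below the root
pmid = root.findtext(".//PMID")
title = root.findtext(".//ArticleTitle")
journal = root.findtext(".//Journal/Title")

print(pmid)
print(title)
print(journal)
```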

## Code Challenge

In the cell above, we hardcoded the `id` argument to the `Entrez.efetch` function. Can you figure out how to instead use one of the IDs from the list we generated in the API exercise above?