Differential gene expression analysis

This tutorial shows a basic way to perform differential gene expression analysis using the atlasapprox-disease API. You will use the metadata and differential_gene_expression functions to find datasets that contain a specific cell type, analyze gene expression changes in a disease context, and identify genes that recur as differentially expressed across datasets. The tutorial uses memory B cells as an example, but the same code applies to any cell type, disease, or tissue of interest.

Contents

Overview datasets with cell type-specific data
Perform differential gene expression analysis for memory B cells in COVID-19
Find frequently occurring differentially expressed genes
Examples for further exploration

Installation

Install the required packages using pip:

pip install atlasapprox-disease pandas

Import libraries and initialize the API

Import the necessary libraries and create an API object:

import atlasapprox_disease as aad
import pandas as pd

# Initialize the API
api = aad.API()

Overview datasets with cell type-specific data

One way to start is to use the metadata function to get an overview of all the data relevant to what you want to explore, such as a cell type, disease, or tissue. In this example, we will focus on datasets related to memory B cells as a simple starting point:

cell_metadata = api.metadata(cell_type="memory B cell")

# Display the result
cell_metadata

	unique_id	dataset_id	cell_type	tissue_general	disease	development_stage_general	sex	cell_count
0	a925cc9db06ddad8450db673a12c769c	0041b9c3-6a49-4bf7-8514-9bc7190067a7	memory B cell	skin of body	normal	adult	male	9
1	54cd76493a728a81e3f835b1b461c004	03d5794d-cde9-4769-a1a9-b3899d2b1d87	memory B cell	esophagogastric junction	normal	adult	female	84
2	5edd95b990ab65367627bd85b3005b69	03d5794d-cde9-4769-a1a9-b3899d2b1d87	memory B cell	esophagogastric junction	normal	adult	male	2
3	c3ca6e7e995ff56d7063a69997096af8	03d5794d-cde9-4769-a1a9-b3899d2b1d87	memory B cell	esophagus	Barrett esophagus	adult	female	80
4	70f50e140f68a84d87cc3853e7f08aab	03d5794d-cde9-4769-a1a9-b3899d2b1d87	memory B cell	esophagus	Barrett esophagus	adult	male	8
...	...	...	...	...	...	...	...	...
181	453bdd0f78d96b935a8fd217b4ed0cff	f01bdd17-4902-40f5-86e3-240d66dd2587	memory B cell	exocrine gland	normal	adult	male	4
182	cb49352fd25f7642e90e30b991b05df0	f6dafdd1-d746-407e-8019-4470e02d4cbd	memory B cell	lung	normal	adult	female	356
183	a43c02c3be1257db4b3636139bfcc403	f6dafdd1-d746-407e-8019-4470e02d4cbd	memory B cell	lung	normal	adult	male	316
184	0034a338d54a3f31c52fded5366c488d	f6dafdd1-d746-407e-8019-4470e02d4cbd	memory B cell	respiratory system	normal	adult	female	361
185	8d7f5cb441c725b466f7f51aa01b3025	f6dafdd1-d746-407e-8019-4470e02d4cbd	memory B cell	respiratory system	normal	adult	male	348

186 rows × 8 columns

The DataFrame contains 186 rows, each representing a unique combination of metadata attributes (e.g. tissue, disease, sex, and development stage) involving memory B cells.

To see the full list of unique diseases without truncation:

cell_metadata.disease.unique()

array(['normal', 'Barrett esophagus', 'gastric intestinal metaplasia',
       'gastritis', 'breast carcinoma',
       'invasive ductal breast carcinoma',
       'invasive lobular breast carcinoma', 'COVID-19',
       'post-COVID-19 disorder', 'common variable immunodeficiency',
       'Crohn disease', 'B-cell non-Hodgkin lymphoma', 'influenza'],
      dtype=object)

As shown, memory B cell data is available for a variety of diseases, e.g., COVID-19, post-COVID-19 disorder, breast carcinoma, and Crohn disease. You can select any of them, such as COVID-19, and perform differential gene expression analysis on memory B cells, as demonstrated in the following sections.
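Before choosing a disease, it can help to check how many datasets and cells support each condition. The following is a minimal sketch using pandas only; it rebuilds five rows from the metadata output shown above (a subset of columns) so that it runs on its own. The helper name summarize_by_disease is introduced here for illustration and is not part of the API.

```python
import pandas as pd

# Five rows copied from the metadata output shown above (subset of columns).
sample_metadata = pd.DataFrame(
    {
        "dataset_id": [
            "0041b9c3-6a49-4bf7-8514-9bc7190067a7",
            "03d5794d-cde9-4769-a1a9-b3899d2b1d87",
            "03d5794d-cde9-4769-a1a9-b3899d2b1d87",
            "03d5794d-cde9-4769-a1a9-b3899d2b1d87",
            "03d5794d-cde9-4769-a1a9-b3899d2b1d87",
        ],
        "disease": [
            "normal",
            "normal",
            "normal",
            "Barrett esophagus",
            "Barrett esophagus",
        ],
        "cell_count": [9, 84, 2, 80, 8],
    }
)


def summarize_by_disease(metadata):
    """Count distinct datasets and total cells per disease (illustrative helper)."""
    return (
        metadata.groupby("disease")
        .agg(
            n_datasets=("dataset_id", "nunique"),
            total_cells=("cell_count", "sum"),
        )
        .sort_values("total_cells", ascending=False)
    )


disease_summary = summarize_by_disease(sample_metadata)
print(disease_summary)
```

With the full table, the same helper can be called as summarize_by_disease(cell_metadata). Conditions backed by only a handful of cells are likely to give noisier comparisons.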

Perform differential gene expression analysis for memory B cells in COVID-19

To understand how memory B cells respond to COVID-19, query the top 10 up-regulated and top 10 down-regulated genes for each matching dataset and cell type, comparing diseased with normal samples. Each comparison contributes up to 20 genes, so the combined result can contain far more than 20 rows (180 in the output below). This analysis identifies the genes whose expression changes most in COVID-19 compared to healthy samples.

df_genes = api.differential_gene_expression(
    differential_axis="disease",
    disease="covid",
    cell_type="memory B cell",
    top_n=10  # Top 10 up and down-regulated genes to query
)

# Display the results
df_genes

	tissue_general	cell_type	regulation	gene	unit	baseline_expr	state_expr	baseline_fraction	state_fraction	metric	dataset_id	differential_axis	state	baseline
0	blood	IgG memory B cell	up	HLA-DRB5	cptt	2.201501	7.332582	0.205931	0.831040	0.625109	de2c780c-1747-40bd-9ccf-9588ec186cee	disease	COVID-19	normal
1	blood	memory B cell	up	HLA-DRB5	cptt	2.234262	7.035844	0.246106	0.826568	0.580462	4c4cd77c-8fee-4836-9145-16562a8782fe	disease	COVID-19	normal
2	blood	IgG-negative class switched memory B cell	up	HLA-DRB5	cptt	2.965737	7.341037	0.235019	0.811541	0.576522	de2c780c-1747-40bd-9ccf-9588ec186cee	disease	COVID-19	normal
3	nose	memory B cell	up	RPL17	cptt	2.389493	8.601742	0.409091	0.941176	0.532086	edc8d3fe-153c-4e3d-8be0-2108d30f8d70	disease	COVID-19	normal
4	nose	memory B cell	up	TRAC	cptt	0.705393	5.557863	0.181818	0.705882	0.524064	edc8d3fe-153c-4e3d-8be0-2108d30f8d70	disease	COVID-19	normal
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
175	nose	memory B cell	down	NDUFA1	cptt	3.241598	0.526369	0.477273	0.117647	-0.359626	edc8d3fe-153c-4e3d-8be0-2108d30f8d70	disease	COVID-19	normal
176	nose	memory B cell	down	BBLN	cptt	4.268116	1.021769	0.613636	0.235294	-0.378342	edc8d3fe-153c-4e3d-8be0-2108d30f8d70	disease	COVID-19	normal
177	nose	memory B cell	down	DDT	cptt	2.917153	0.000000	0.431818	0.000000	-0.431818	edc8d3fe-153c-4e3d-8be0-2108d30f8d70	disease	COVID-19	normal
178	nose	memory B cell	down	HLA-DQA2	cptt	5.280043	0.000000	0.522727	0.000000	-0.522727	edc8d3fe-153c-4e3d-8be0-2108d30f8d70	disease	COVID-19	normal
179	blood	memory B cell	down	HLA-DRB5	cptt	3.983789	0.933865	0.811111	0.159544	-0.651567	59b69042-47c2-47fd-ad03-d21beb99818f	disease	COVID-19	normal

180 rows × 14 columns

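The resulting DataFrame lists the top up- and down-regulated genes for memory B cells in COVID-19, one block of rows per dataset and cell type. The query also matched memory B cell subtypes such as IgG memory B cell. Key columns include gene, regulation, the expression columns (baseline_expr and state_expr), and metric. In the rows shown, metric equals state_fraction minus baseline_fraction, that is, the change in the fraction of cells expressing the gene, not a fold change. Up-regulated genes may indicate activation of immune memory or antibody production pathways in response to COVID-19, while down-regulated genes could suggest suppression of other functions. Since the query spans multiple datasets and tissues, variations in gene expression may reflect dataset-specific or tissue-specific differences.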
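To make the columns concrete, the sketch below rebuilds three rows from the output above and checks how metric relates to the fraction columns. It also derives a log2 fold change from baseline_expr and state_expr. The log2_fc column and the pseudocount of 1.0 are assumptions made for this example; they are not values returned by the API.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above (subset of columns).
sample_genes = pd.DataFrame(
    {
        "gene": ["HLA-DRB5", "RPL17", "DDT"],
        "regulation": ["up", "up", "down"],
        "baseline_expr": [2.201501, 2.389493, 2.917153],
        "state_expr": [7.332582, 8.601742, 0.000000],
        "baseline_fraction": [0.205931, 0.409091, 0.431818],
        "state_fraction": [0.831040, 0.941176, 0.000000],
        "metric": [0.625109, 0.532086, -0.431818],
    }
)

# Compare metric with the difference in the fraction of expressing cells.
fraction_diff = sample_genes["state_fraction"] - sample_genes["baseline_fraction"]
metric_matches = np.allclose(fraction_diff, sample_genes["metric"], atol=1e-5)
print("metric equals state_fraction - baseline_fraction:", metric_matches)

# Assumption: a pseudocount of 1.0 avoids division by zero when a gene
# is not detected in one condition (e.g. DDT has state_expr == 0).
PSEUDOCOUNT = 1.0
sample_genes["log2_fc"] = np.log2(
    (sample_genes["state_expr"] + PSEUDOCOUNT)
    / (sample_genes["baseline_expr"] + PSEUDOCOUNT)
)
print(sample_genes[["gene", "regulation", "metric", "log2_fc"]])
```

The same two steps can be applied to df_genes directly. The choice of pseudocount affects the fold change of lowly expressed genes, so treat the derived values as a rough guide.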

Find frequently occurring differentially expressed genes

Since memory B cells are present in multiple datasets, identify which genes appear most frequently as top differentially expressed genes across these datasets. This analysis highlights genes consistently affected by COVID-19 in memory B cells.

# Count the frequency of up-regulated genes across datasets
up_gene_counts = df_genes[df_genes["regulation"] == "up"]["gene"].value_counts()

# Display the results
print("Frequency of up-regulated genes across datasets:")
print(up_gene_counts)

Frequency of up-regulated genes across datasets:
gene
XAF1        4
HLA-DRB5    3
HLA-DQA2    3
LY6E        3
RPS4Y1      3
           ..
PRDX1       1
ANXA4       1
S100A10     1
S100A11     1
SNHG9       1
Name: count, Length: 64, dtype: int64

The output shows how often each gene appears among the top up-regulated genes: XAF1 appears 4 times, while HLA-DRB5, HLA-DQA2, LY6E, and RPS4Y1 appear 3 times each. Note that each count is a number of rows, and a single dataset can contribute several rows when it contains more than one memory B cell subtype. This is how you can use the API to identify genes that frequently appear as top differentially expressed genes in your analysis. You can also explore down-regulated genes or analyze other diseases to compare results across different conditions.
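Because value_counts counts rows, one dataset with several memory B cell subtypes can inflate a gene's frequency. The sketch below counts distinct dataset_id values per gene instead, and flags genes reported in both directions. It uses six rows copied from the differential expression output above; the helper name count_datasets_per_gene is introduced here for illustration.

```python
import pandas as pd

# Six rows copied from the differential expression output above (subset of columns).
sample_genes = pd.DataFrame(
    {
        "cell_type": [
            "IgG memory B cell",
            "memory B cell",
            "IgG-negative class switched memory B cell",
            "memory B cell",
            "memory B cell",
            "memory B cell",
        ],
        "regulation": ["up", "up", "up", "up", "up", "down"],
        "gene": ["HLA-DRB5", "HLA-DRB5", "HLA-DRB5", "RPL17", "TRAC", "HLA-DRB5"],
        "dataset_id": [
            "de2c780c-1747-40bd-9ccf-9588ec186cee",
            "4c4cd77c-8fee-4836-9145-16562a8782fe",
            "de2c780c-1747-40bd-9ccf-9588ec186cee",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
            "59b69042-47c2-47fd-ad03-d21beb99818f",
        ],
    }
)


def count_datasets_per_gene(df, regulation):
    """Number of distinct datasets in which each gene has the given regulation."""
    selected = df[df["regulation"] == regulation]
    return (
        selected.groupby("gene")["dataset_id"]
        .nunique()
        .sort_values(ascending=False)
    )


# Row-based counts, as in the tutorial, versus dataset-based counts.
row_counts = sample_genes[sample_genes["regulation"] == "up"]["gene"].value_counts()
dataset_counts = count_datasets_per_gene(sample_genes, "up")

# Genes that are up-regulated in some rows and down-regulated in others.
up_genes = set(sample_genes.loc[sample_genes["regulation"] == "up", "gene"])
down_genes = set(sample_genes.loc[sample_genes["regulation"] == "down", "gene"])
conflicting_genes = sorted(up_genes & down_genes)

print(row_counts)
print(dataset_counts)
print("Reported in both directions:", conflicting_genes)
```

In this sample HLA-DRB5 has three up-regulated rows but only two distinct datasets, and it is also down-regulated in a further dataset, so a high row count alone does not imply a consistent direction of change.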

Examples for further exploration

Analyze down-regulated genes: Repeat the frequency analysis for down-regulated genes to identify genes that are consistently suppressed.
```
down_gene_counts = df_genes[df_genes["regulation"] == "down"]["gene"].value_counts()
print(down_gene_counts)
```

Explore other diseases: Use the diseases from the metadata (e.g., influenza) to compare memory B cell responses across conditions.

df_influenza = api.differential_gene_expression(
    disease="influenza",
    cell_type="memory B cell",
    top_n=10
)

Explore specific tissues: Query differential gene expression for a specific tissue (e.g., kidney) to analyze expression changes across all diseases and cell types in that tissue.
```
df_kidney = api.differential_gene_expression(
    tissue="kidney",
    top_n=10
)
print(df_kidney)
```
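To compare conditions, as suggested in the "Explore other diseases" example above, one simple option is to intersect the gene sets from two result tables. The sketch below assumes both tables have the gene and regulation columns shown earlier. The helper shared_genes and the toy tables are hypothetical, and the gene names are placeholders rather than real results.

```python
import pandas as pd


def shared_genes(df_a, df_b, regulation="up"):
    """Genes reported with the given regulation in both result tables (illustrative helper)."""
    genes_a = set(df_a.loc[df_a["regulation"] == regulation, "gene"])
    genes_b = set(df_b.loc[df_b["regulation"] == regulation, "gene"])
    return sorted(genes_a & genes_b)


# Placeholder tables with made-up gene names, used only to show the mechanics.
toy_disease_a = pd.DataFrame(
    {"gene": ["GENE_A", "GENE_B", "GENE_C"], "regulation": ["up", "up", "down"]}
)
toy_disease_b = pd.DataFrame(
    {"gene": ["GENE_B", "GENE_C", "GENE_D"], "regulation": ["up", "down", "up"]}
)

shared_up = shared_genes(toy_disease_a, toy_disease_b, "up")
shared_down = shared_genes(toy_disease_a, toy_disease_b, "down")
print("Up in both:", shared_up)
print("Down in both:", shared_down)
```

With real results, the call would look like shared_genes(df_genes, df_influenza, "up"), using the two DataFrames created earlier in this tutorial.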

Next steps

This tutorial introduced differential gene expression analysis with the atlasapprox-disease API. To learn more, explore additional functions like average to retrieve gene expression levels, or dotplot for visualizing expression patterns.

Visit the official documentation at https://cell-atlas-approximations-disease-api.readthedocs.io/en/latest/python/index.html for further details.
