Differential gene expression analysis

This tutorial shows a basic way to perform differential gene expression analysis with the atlasapprox-disease API. You will use the metadata and differential_gene_expression functions to find datasets that contain a specific cell type, analyze gene expression changes in a disease context, and identify genes that recur as differentially expressed across datasets. The tutorial uses memory B cells as an example, but the same code applies to any cell type, disease, or tissue available through the API.

Contents

  • Overview datasets with cell type-specific data

  • Perform differential gene expression analysis for a specific cell type across datasets

  • Find frequently occurring differentially expressed genes

  • Examples for further exploration

Installation

Install the required packages using pip:

pip install atlasapprox-disease pandas

Import libraries and initialize the API

Import the necessary libraries and create an API instance:

import atlasapprox_disease as aad
import pandas as pd

# Initialize the API
api = aad.API()

Overview datasets with cell type-specific data

One way to start is to use the metadata function to get an overview of all the data relevant to what you want to explore, such as a cell type, disease, or tissue. In this example, we will focus on datasets related to memory B cells as a simple starting point:

cell_metadata = api.metadata(cell_type="memory B cell")

# Display the result
cell_metadata
     unique_id                         dataset_id                            cell_type      tissue_general            disease            development_stage_general  sex     cell_count
0    a925cc9db06ddad8450db673a12c769c  0041b9c3-6a49-4bf7-8514-9bc7190067a7  memory B cell  skin of body              normal             adult                      male    9
1    54cd76493a728a81e3f835b1b461c004  03d5794d-cde9-4769-a1a9-b3899d2b1d87  memory B cell  esophagogastric junction  normal             adult                      female  84
2    5edd95b990ab65367627bd85b3005b69  03d5794d-cde9-4769-a1a9-b3899d2b1d87  memory B cell  esophagogastric junction  normal             adult                      male    2
3    c3ca6e7e995ff56d7063a69997096af8  03d5794d-cde9-4769-a1a9-b3899d2b1d87  memory B cell  esophagus                 Barrett esophagus  adult                      female  80
4    70f50e140f68a84d87cc3853e7f08aab  03d5794d-cde9-4769-a1a9-b3899d2b1d87  memory B cell  esophagus                 Barrett esophagus  adult                      male    8
...  ...                               ...                                   ...            ...                       ...                ...                        ...     ...
181  453bdd0f78d96b935a8fd217b4ed0cff  f01bdd17-4902-40f5-86e3-240d66dd2587  memory B cell  exocrine gland            normal             adult                      male    4
182  cb49352fd25f7642e90e30b991b05df0  f6dafdd1-d746-407e-8019-4470e02d4cbd  memory B cell  lung                      normal             adult                      female  356
183  a43c02c3be1257db4b3636139bfcc403  f6dafdd1-d746-407e-8019-4470e02d4cbd  memory B cell  lung                      normal             adult                      male    316
184  0034a338d54a3f31c52fded5366c488d  f6dafdd1-d746-407e-8019-4470e02d4cbd  memory B cell  respiratory system        normal             adult                      female  361
185  8d7f5cb441c725b466f7f51aa01b3025  f6dafdd1-d746-407e-8019-4470e02d4cbd  memory B cell  respiratory system        normal             adult                      male    348

186 rows × 8 columns



The DataFrame contains 186 rows, each representing a unique combination of metadata attributes (e.g. tissue, disease, sex, and development stage) involving memory B cells.
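
Because each row carries a cell_count, one quick way to judge how much data backs each condition is to sum the counts per disease. The sketch below is only an illustration: it rebuilds the first five rows shown above as a small DataFrame (the name cell_metadata_sample is ours, not part of the API) so that it runs without a network call. With the real result you would apply the same groupby to cell_metadata.

```python
import pandas as pd

# First five rows of the metadata output above, copied by hand.
# This stands in for the full DataFrame returned by api.metadata(...).
cell_metadata_sample = pd.DataFrame(
    {
        "tissue_general": [
            "skin of body",
            "esophagogastric junction",
            "esophagogastric junction",
            "esophagus",
            "esophagus",
        ],
        "disease": [
            "normal",
            "normal",
            "normal",
            "Barrett esophagus",
            "Barrett esophagus",
        ],
        "sex": ["male", "female", "male", "female", "male"],
        "cell_count": [9, 84, 2, 80, 8],
    }
)

# Total number of memory B cells per disease, largest first
cells_per_disease = (
    cell_metadata_sample.groupby("disease")["cell_count"]
    .sum()
    .sort_values(ascending=False)
)
print(cells_per_disease)
```

Conditions backed by only a handful of cells are likely to give noisier differential expression results, so a summary like this can help decide which diseases to prioritize.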

To see the full list of unique diseases without truncation:

cell_metadata.disease.unique()
array(['normal', 'Barrett esophagus', 'gastric intestinal metaplasia',
       'gastritis', 'breast carcinoma',
       'invasive ductal breast carcinoma',
       'invasive lobular breast carcinoma', 'COVID-19',
       'post-COVID-19 disorder', 'common variable immunodeficiency',
       'Crohn disease', 'B-cell non-Hodgkin lymphoma', 'influenza'],
      dtype=object)

As shown, memory B cell data is available for a variety of diseases, e.g., COVID-19, post-COVID-19 disorder, breast carcinoma, and Crohn disease. You can pick any of them for differential gene expression analysis. The following sections use COVID-19 as the example.
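
Differential expression compares a disease state against a baseline, so it can be useful to check which datasets contain both normal and diseased samples for the cell type. The following is a minimal sketch on a few rows copied from the metadata output above. The helper datasets_with_contrast is a hypothetical name introduced here for illustration and is not part of the atlasapprox-disease API.

```python
import pandas as pd

# A few (dataset_id, disease) pairs copied from the metadata output above.
metadata_sample = pd.DataFrame(
    {
        "dataset_id": [
            "0041b9c3-6a49-4bf7-8514-9bc7190067a7",
            "03d5794d-cde9-4769-a1a9-b3899d2b1d87",
            "03d5794d-cde9-4769-a1a9-b3899d2b1d87",
            "f6dafdd1-d746-407e-8019-4470e02d4cbd",
        ],
        "disease": ["normal", "normal", "Barrett esophagus", "normal"],
    }
)


def datasets_with_contrast(metadata, disease, baseline="normal"):
    """Return IDs of datasets that contain both `disease` and `baseline` rows.

    Hypothetical helper for illustration; operates on any DataFrame with
    "dataset_id" and "disease" columns.
    """
    has_both = metadata.groupby("dataset_id")["disease"].apply(
        lambda s: {disease, baseline} <= set(s)
    )
    return sorted(has_both[has_both].index)


paired_datasets = datasets_with_contrast(metadata_sample, "Barrett esophagus")
print(paired_datasets)
```

In this sample only the dataset starting with 03d5794d contains both conditions. Run against the full cell_metadata, the same helper would list the datasets in which a disease-versus-normal comparison is possible.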

Perform differential gene expression analysis for memory B cells in COVID-19

To understand how memory B cells respond to COVID-19, query the top 10 up-regulated and top 10 down-regulated genes (20 in total per comparison) across all datasets that contain both diseased and normal samples. The query returns the genes whose expression differs most between COVID-19 and healthy samples.

df_genes = api.differential_gene_expression(
    differential_axis="disease",
    disease="covid",
    cell_type="memory B cell",
    top_n=10  # Top 10 up and down-regulated genes to query
)

# Display the results
df_genes
     tissue_general  cell_type                                  regulation  gene      unit  baseline_expr  state_expr  baseline_fraction  state_fraction  metric     dataset_id                            differential_axis  state     baseline
0    blood           IgG memory B cell                          up          HLA-DRB5  cptt  2.201501       7.332582    0.205931           0.831040        0.625109   de2c780c-1747-40bd-9ccf-9588ec186cee  disease            COVID-19  normal
1    blood           memory B cell                              up          HLA-DRB5  cptt  2.234262       7.035844    0.246106           0.826568        0.580462   4c4cd77c-8fee-4836-9145-16562a8782fe  disease            COVID-19  normal
2    blood           IgG-negative class switched memory B cell  up          HLA-DRB5  cptt  2.965737       7.341037    0.235019           0.811541        0.576522   de2c780c-1747-40bd-9ccf-9588ec186cee  disease            COVID-19  normal
3    nose            memory B cell                              up          RPL17     cptt  2.389493       8.601742    0.409091           0.941176        0.532086   edc8d3fe-153c-4e3d-8be0-2108d30f8d70  disease            COVID-19  normal
4    nose            memory B cell                              up          TRAC      cptt  0.705393       5.557863    0.181818           0.705882        0.524064   edc8d3fe-153c-4e3d-8be0-2108d30f8d70  disease            COVID-19  normal
...  ...             ...                                        ...         ...       ...   ...            ...         ...                ...             ...        ...                                   ...                ...       ...
175  nose            memory B cell                              down        NDUFA1    cptt  3.241598       0.526369    0.477273           0.117647        -0.359626  edc8d3fe-153c-4e3d-8be0-2108d30f8d70  disease            COVID-19  normal
176  nose            memory B cell                              down        BBLN      cptt  4.268116       1.021769    0.613636           0.235294        -0.378342  edc8d3fe-153c-4e3d-8be0-2108d30f8d70  disease            COVID-19  normal
177  nose            memory B cell                              down        DDT       cptt  2.917153       0.000000    0.431818           0.000000        -0.431818  edc8d3fe-153c-4e3d-8be0-2108d30f8d70  disease            COVID-19  normal
178  nose            memory B cell                              down        HLA-DQA2  cptt  5.280043       0.000000    0.522727           0.000000        -0.522727  edc8d3fe-153c-4e3d-8be0-2108d30f8d70  disease            COVID-19  normal
179  blood           memory B cell                              down        HLA-DRB5  cptt  3.983789       0.933865    0.811111           0.159544        -0.651567  59b69042-47c2-47fd-ad03-d21beb99818f  disease            COVID-19  normal

180 rows × 14 columns



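The resulting DataFrame lists the top up- and down-regulated genes for memory B cells in COVID-19, reported separately for each dataset, tissue, and cell type combination that matched the query, which is why it has 180 rows rather than 20. The query also matched memory B cell subtypes such as IgG memory B cell. Key columns include gene, regulation, the expression columns (baseline_expr and state_expr), the fraction columns (baseline_fraction and state_fraction), and metric. In the rows shown, metric equals state_fraction minus baseline_fraction, that is, the change in the fraction of cells expressing the gene, rather than a fold change. Up-regulated genes may indicate activation of immune memory or antibody production pathways in response to COVID-19, while down-regulated genes could suggest suppression of other functions. Because the query spans multiple datasets and tissues, variations in gene expression may reflect dataset-specific or tissue-specific differences.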
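
The relationship between the columns can be checked directly on a few rows. The sketch below copies five rows from the output above, recomputes the difference in fractions, and compares it with metric. It also derives a log2 fold change from the two expression columns. The pseudocount of 1 used there is our own assumption to avoid division by zero, not something the API documents.

```python
import numpy as np
import pandas as pd

# Three up-regulated and two down-regulated rows copied from the output above.
de_rows = pd.DataFrame(
    {
        "gene": ["HLA-DRB5", "RPL17", "TRAC", "DDT", "HLA-DQA2"],
        "baseline_expr": [2.201501, 2.389493, 0.705393, 2.917153, 5.280043],
        "state_expr": [7.332582, 8.601742, 5.557863, 0.000000, 0.000000],
        "baseline_fraction": [0.205931, 0.409091, 0.181818, 0.431818, 0.522727],
        "state_fraction": [0.831040, 0.941176, 0.705882, 0.000000, 0.000000],
        "metric": [0.625109, 0.532086, 0.524064, -0.431818, -0.522727],
    }
)

# Difference in the fraction of expressing cells (disease minus normal)
de_rows["fraction_diff"] = de_rows["state_fraction"] - de_rows["baseline_fraction"]

# Log2 fold change of average expression; the +1 pseudocount is an assumption
de_rows["log2_fc"] = np.log2(
    (de_rows["state_expr"] + 1.0) / (de_rows["baseline_expr"] + 1.0)
)

# True if metric matches the fraction difference up to rounding in the printout
metric_matches = np.allclose(de_rows["fraction_diff"], de_rows["metric"], atol=1e-5)
print(metric_matches)
print(de_rows[["gene", "metric", "fraction_diff", "log2_fc"]].round(3))
```

Ranking by the change in fraction favors genes that switch on or off in many cells, while a fold change on average expression can be dominated by a few highly expressing cells. Looking at both gives a fuller picture.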

Find frequently occurring differentially expressed genes

Since memory B cells are present in multiple datasets, identify which genes appear most frequently as top differentially expressed genes across these datasets. This analysis highlights genes consistently affected by COVID-19 in memory B cells.

# Count the frequency of up-regulated genes across datasets
up_gene_counts = df_genes[df_genes["regulation"] == "up"]["gene"].value_counts()

# Display the results
print("Frequency of up-regulated genes across datasets:")
print(up_gene_counts)
Frequency of up-regulated genes across datasets:
gene
XAF1        4
HLA-DRB5    3
HLA-DQA2    3
LY6E        3
RPS4Y1      3
           ..
PRDX1       1
ANXA4       1
S100A10     1
S100A11     1
SNHG9       1
Name: count, Length: 64, dtype: int64

The output lists how often each gene appears among the top up-regulated genes: XAF1 appears 4 times, and HLA-DRB5, HLA-DQA2, LY6E, and RPS4Y1 appear 3 times each. Genes that recur in this way are candidates for a consistent response to COVID-19 in memory B cells. You can run the same analysis on down-regulated genes or on other diseases to compare results across conditions.
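
Note that value_counts counts rows, and one dataset can contribute several rows for the same gene when it contains more than one memory B cell subtype or tissue. To count how many distinct datasets support each gene, you can group by gene and count unique dataset_id values. The sketch below uses five up-regulated rows copied from the output above. The column names n_rows and n_datasets are our own.

```python
import pandas as pd

# Five up-regulated rows copied from the differential expression output above.
up_sample = pd.DataFrame(
    {
        "cell_type": [
            "IgG memory B cell",
            "memory B cell",
            "IgG-negative class switched memory B cell",
            "memory B cell",
            "memory B cell",
        ],
        "regulation": ["up", "up", "up", "up", "up"],
        "gene": ["HLA-DRB5", "HLA-DRB5", "HLA-DRB5", "RPL17", "TRAC"],
        "dataset_id": [
            "de2c780c-1747-40bd-9ccf-9588ec186cee",
            "4c4cd77c-8fee-4836-9145-16562a8782fe",
            "de2c780c-1747-40bd-9ccf-9588ec186cee",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
        ],
    }
)

# Per gene: number of rows versus number of distinct datasets
gene_support = (
    up_sample[up_sample["regulation"] == "up"]
    .groupby("gene")
    .agg(
        n_rows=("dataset_id", "size"),
        n_datasets=("dataset_id", "nunique"),
    )
    .sort_values(["n_datasets", "n_rows"], ascending=False)
)
print(gene_support)
```

In this sample HLA-DRB5 appears in three rows but only two distinct datasets, because one dataset reports two memory B cell subtypes.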

Examples for further exploration

  1. Analyze down-regulated genes: Repeat the frequency analysis for down-regulated genes to identify genes that are consistently suppressed.

    down_gene_counts = df_genes[df_genes["regulation"] == "down"]["gene"].value_counts()
    print(down_gene_counts)
    
  2. Explore other diseases: Use the diseases from the metadata (e.g., influenza) to compare memory B cell responses across conditions.

    df_influenza = api.differential_gene_expression(
        disease="influenza",
        cell_type="memory B cell",
        top_n=10
    )
    
  3. Explore specific tissues: Query differential gene expression for a specific tissue (e.g., kidney) to analyze expression changes across all diseases and cell types in that tissue.

    df_kidney = api.differential_gene_expression(
        tissue="kidney",
        top_n=10
    )
    print(df_kidney)
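
A further check worth considering is whether a gene is reported in opposite directions by different datasets. In the COVID-19 output above, HLA-DRB5 is listed as up-regulated in some rows and down-regulated in another. The sketch below, which uses a handful of rows copied from that output, finds such genes with a set intersection. Variable names are illustrative only.

```python
import pandas as pd

# Seven rows copied from the COVID-19 differential expression output above.
direction_sample = pd.DataFrame(
    {
        "regulation": ["up", "up", "up", "up", "down", "down", "down"],
        "gene": ["HLA-DRB5", "HLA-DRB5", "RPL17", "TRAC", "DDT", "HLA-DQA2", "HLA-DRB5"],
        "dataset_id": [
            "de2c780c-1747-40bd-9ccf-9588ec186cee",
            "4c4cd77c-8fee-4836-9145-16562a8782fe",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
            "edc8d3fe-153c-4e3d-8be0-2108d30f8d70",
            "59b69042-47c2-47fd-ad03-d21beb99818f",
        ],
    }
)

up_genes = set(direction_sample.loc[direction_sample["regulation"] == "up", "gene"])
down_genes = set(direction_sample.loc[direction_sample["regulation"] == "down", "gene"])

# Genes reported in both directions across the sampled rows
mixed_genes = sorted(up_genes & down_genes)
print(mixed_genes)

# Show the supporting rows so the conflicting datasets can be inspected
print(direction_sample[direction_sample["gene"].isin(mixed_genes)])
```

Genes that change direction between datasets may reflect differences in cohort, tissue, or disease stage, so they deserve a closer look before being interpreted as a consistent disease signature.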
    

Next steps

This tutorial introduced differential gene expression analysis with the atlasapprox-disease API. To learn more, explore additional functions like average to retrieve gene expression levels, or dotplot for visualizing expression patterns.

Visit the official documentation <https://cell-atlas-approximations-disease-api.readthedocs.io/en/latest/python/index.html> for further details.
