
CellTypist#

Cell types classify cells based on public and private knowledge from studying transcription, morphology, function & other properties. Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined and better understood.

In this notebook, we register the immune cell type vocabulary from CellTypist, a computational tool used for cell type classification in scRNA-seq data.

In the following Standardize metadata on-the-fly notebook, we’ll demonstrate how to annotate datasets analyzed with CellTypist enrichment analysis and track the dataset with LaminDB.

Setup#

!lamin load use-cases-registries
💡 connected lamindb: testuser1/use-cases-registries
# filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import bionty as bt
💡 connected lamindb: testuser1/use-cases-registries

Access CellTypist records#

As a first step, we read in CellTypist’s immune cell encyclopedia:

import pandas as pd
description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"

celltypist_df = pd.read_excel(celltypist_source_v2_url)

It provides an ontology_id of the public Cell Ontology for the majority of records.

celltypist_df.head()
 | High-hierarchy cell types | Low-hierarchy cell types | Description | Cell Ontology ID | Curated markers
0 | B cells | B cells | B lymphocytes with diverse cell surface immuno... | CL:0000236 | CD79A, MS4A1, CD19
1 | B cells | Follicular B cells | resting mature B lymphocytes found in the prim... | CL:0000843 | CXCR5, TNFRSF13B, CD22
2 | B cells | Proliferative germinal center B cells | proliferating germinal center B cells | CL:0000844 | MKI67, SUGCT, AICDA
3 | B cells | Germinal center B cells | proliferating mature B cells that undergo soma... | CL:0000844 | POU2AF1, CD40, SUGCT
4 | B cells | Memory B cells | long-lived mature B lymphocytes which are form... | CL:0000787 | CR2, CD27, MS4A1

A single “Cell Ontology ID” can be associated with multiple “Low-hierarchy cell types”:

celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
Cell Ontology ID | Low-hierarchy cell types | High-hierarchy cell types | Description | Curated markers
CL:0000236 | B cells | B cells | B lymphocytes with diverse cell surface immuno... | CD79A, MS4A1, CD19
CL:0000843 | Follicular B cells | B cells | resting mature B lymphocytes found in the prim... | CXCR5, TNFRSF13B, CD22
CL:0000844 | Proliferative germinal center B cells | B cells | proliferating germinal center B cells | MKI67, SUGCT, AICDA
 | Germinal center B cells | B cells | proliferating mature B cells that undergo soma... | POU2AF1, CD40, SUGCT
CL:0000787 | Memory B cells | B cells | long-lived mature B lymphocytes which are form... | CR2, CD27, MS4A1
 | Age-associated B cells | B cells | CD11c+ T-bet+ memory B cells associated with a... | FCRL2, ITGAX, TBX21
CL:0000788 | Naive B cells | B cells | mature B lymphocytes which express cell-surfac... | IGHM, IGHD, TCL1A
CL:0000818 | Transitional B cells | B cells | immature B cell precursors in the bone marrow ... | CD24, MYO1C, MS4A1
CL:0000817 | Large pre-B cells | B-cell lineage | proliferative B lymphocyte precursors derived ... | MME, CD24, MKI67
 | Small pre-B cells | B-cell lineage | non-proliferative B lymphocyte precursors deri... | MME, CD24, IGLL5
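To make the one-to-many relationship explicit, the following minimal sketch groups CellTypist names by ontology ID using only the standard library. The `rows` list is a hand-copied subset of the table above, and `names_by_ontology_id` is a hypothetical helper, not part of LaminDB or CellTypist.

```python
from collections import defaultdict

# A few (ontology ID, CellTypist name) pairs copied from the table above.
rows = [
    ("CL:0000236", "B cells"),
    ("CL:0000843", "Follicular B cells"),
    ("CL:0000844", "Proliferative germinal center B cells"),
    ("CL:0000844", "Germinal center B cells"),
    ("CL:0000787", "Memory B cells"),
    ("CL:0000787", "Age-associated B cells"),
]


def names_by_ontology_id(pairs):
    """Group CellTypist names under the ontology ID they map to."""
    grouped = defaultdict(list)
    for ontology_id, name in pairs:
        grouped[ontology_id].append(name)
    return dict(grouped)


grouped = names_by_ontology_id(rows)
# Keep only the IDs that are shared by more than one CellTypist name.
shared = {oid: names for oid, names in grouped.items() if len(names) > 1}
print(shared)
```

IDs that appear in `shared` are the ones where CellTypist is more fine-grained than the Cell Ontology.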

Validate CellTypist records#

For any cell type record that can be validated against the public Cell Ontology, we’d like to ensure that it’s actually validated.

This prevents us from referring to the same cell type with different identifiers.

We need a Bionty object for this:

bionty = bt.CellType.public()
bionty
PublicOntology
Entity: CellType
Organism: all
Source: cl, 2023-08-24
#terms: 2894

📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object

We can now validate the "Cell Ontology ID" column:

bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);

This looks good!

But when inspecting the names, most of them don’t validate:

bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);
97 terms (99.00%) are not validated for name: B cells, Follicular B cells, Proliferative germinal center B cells, Germinal center B cells, Memory B cells, Age-associated B cells, Naive B cells, Transitional B cells, Large pre-B cells, Small pre-B cells, Pre-pro-B cells, Pro-B cells, Cycling B cells, Cycling DCs, Cycling gamma-delta T cells, Cycling monocytes, Cycling NK cells, Cycling T cells, DC, DC1, ...
   detected 6 terms with synonyms: DC1, DC2, ETP, ILC2, ILC3, pDC
→  standardize terms via .standardize()

A search shows that terms named in the plural in CellTypist appear in the singular in the Cell Ontology:

celltypist_df["Low-hierarchy cell types"][0]
'B cells'
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)
ontology_id definition synonyms parents __agg__ __ratio__
name
B cell CL:0000236 A Lymphocyte Of B Lineage That Is Capable Of B... B-lymphocyte|B lymphocyte|B-cell [CL:0000945] b cell 92.307692
B-1 B cell CL:0000819 A B Cell Of Distinct Lineage And Surface Marke... B1 B-cell|B1 B cell|B-1 B-cell|B1 cell|B1 B ly... [CL:0000785] b-1 b cell 85.714286

Let’s strip the trailing "s" and check whether more names now validate. A few more do: the number of non-validated terms drops from 97 to 93.

bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
93 terms (94.90%) are not validated for name: Follicular B cell, Proliferative germinal center B cell, Germinal center B cell, Memory B cell, Age-associated B cell, Naive B cell, Transitional B cell, Large pre-B cell, Small pre-B cell, Pre-pro-B cell, Pro-B cell, Cycling B cell, Cycling DC, Cycling gamma-delta T cell, Cycling monocyte, Cycling NK cell, Cycling T cell, DC, DC1, DC2, ...
   detected 31 terms with inconsistent casing/synonyms: Follicular B cell, Germinal center B cell, Memory B cell, Naive B cell, Transitional B cell, Small pre-B cell, Pro-B cell, DC1, DC2, Endothelial cell, Epithelial cell, Erythrocyte, ETP, Fibroblast, Granulocyte, Neutrophil, ILC2, ILC3, NK cell, Alveolar macrophage, ...
→  standardize terms via .standardize()
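The inspection output also reports terms with inconsistent casing. As a rough illustration of why lowercasing helps in addition to dropping the plural, here is a standard-library sketch. The `normalize` and `match` helpers are hypothetical, and the three ontology names are a small hand-picked subset taken from this notebook. Note that `str.rstrip("s")` removes every trailing "s", whereas the helper below removes at most one.

```python
def normalize(name: str) -> str:
    """Lowercase a name and drop at most one trailing "s"."""
    lowered = name.strip().lower()
    return lowered[:-1] if lowered.endswith("s") else lowered


# Singular names as they appear in the Cell Ontology (taken from this notebook).
ontology_names = ["B cell", "follicular B cell", "memory B cell"]
by_normalized = {normalize(n): n for n in ontology_names}


def match(celltypist_name: str):
    """Return the matching ontology name, or None if nothing matches."""
    return by_normalized.get(normalize(celltypist_name))


print(match("Follicular B cells"))
print(match("Age-associated B cells"))
```

Names such as "Age-associated B cells" still find no match, which is why the notebook relies on ontology IDs rather than names for validation.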

Every “low-hierarchy cell type” has an ontology ID, and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four of them do not, however, and therefore have no ontology ID.

high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}

Register CellTypist records#

Let’s first add the “High-hierarchy cell types” as a column "parent".

This enables LaminDB to populate the parents and children fields, which will enable you to query for hierarchical relationships.

celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "ct_name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()

# add standardized names for each ontology_id
celltypist_df["name"] = bionty.df().loc[celltypist_df["ontology_id"]].name.values
celltypist_df.head(2)
 | ct_name | description | ontology_id | parent | name
0 | B cells | B lymphocytes with diverse cell surface immuno... | CL:0000236 | None | B cell
1 | Follicular B cells | resting mature B lymphocytes found in the prim... | CL:0000843 | B cells | follicular B cell
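The parent assignment can be illustrated without pandas. The sketch below applies the same rule to a few rows copied from the CellTypist table: a term whose high-hierarchy and low-hierarchy names coincide gets no parent. The `parent_of` helper is hypothetical and exists only for this illustration.

```python
HIGH = "High-hierarchy cell types"
LOW = "Low-hierarchy cell types"

# A few rows copied from the CellTypist table shown earlier.
rows = [
    {HIGH: "B cells", LOW: "B cells"},
    {HIGH: "B cells", LOW: "Follicular B cells"},
    {HIGH: "B-cell lineage", LOW: "Large pre-B cells"},
]


def parent_of(row):
    """Return the high-hierarchy term, or None if it equals the low-hierarchy term."""
    return None if row[HIGH] == row[LOW] else row[HIGH]


# Map every low-hierarchy term to its parent (or None for top-level terms).
parents = {row[LOW]: parent_of(row) for row in rows}
print(parents)
```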

Now, let’s create records from the public ontology:

public_records = bt.CellType.from_values(
    celltypist_df.ontology_id, bt.CellType.ontology_id
)
ln.save(public_records)
❗ now recursing through parents: this only happens once, but is much slower than bulk saving

Let’s now amend the public ontology records so that they retain the names CellTypist uses, by adding each CellTypist name as a synonym.

public_records_dict = {r.ontology_id: r for r in public_records}

for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    try:
        record.add_synonym(row["ct_name"])
    except SystemExit:
        pass
❌ input synonyms ['DC2'] already associated with the following records:
id uid name ontology_id abbr synonyms description public_source_id created_at updated_at created_by_id
0 129 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None type 2 DC|pDC|interferon-producing cell|IPC|T-... A Dendritic Cell Type Of Distinct Morphology, ... 21 2024-05-06 19:31:42.046011+00:00 2024-05-06 19:31:42.046037+00:00 1
❌ input synonyms ['ILC2'] already associated with the following records:
id uid name ontology_id abbr synonyms description public_source_id created_at updated_at created_by_id
0 115 4ny4oBnr group 2 innate lymphoid cell CL:0001069 None natural helper cell|ILC2|nuocyte An Innate Lymphoid Cell That Is Capable Of Pro... 21 2024-05-06 19:31:36.746948+00:00 2024-05-06 19:31:36.746973+00:00 1
❌ input synonyms ['ILC3'] already associated with the following records:
id uid name ontology_id abbr synonyms description public_source_id created_at updated_at created_by_id
0 116 3tILnbqv group 3 innate lymphoid cell CL:0001071 None ILC3 An Innate Lymphoid Cell That Constituitively E... 21 2024-05-06 19:31:37.272331+00:00 2024-05-06 19:31:37.272355+00:00 1
❌ input synonyms ['pDC'] already associated with the following records:
id uid name ontology_id abbr synonyms description public_source_id created_at updated_at created_by_id
0 129 3JO0EdVd plasmacytoid dendritic cell CL:0000784 None type 2 DC|pDC|interferon-producing cell|IPC|T-... A Dendritic Cell Type Of Distinct Morphology, ... 21 2024-05-06 19:31:42.046011+00:00 2024-05-06 19:31:42.046037+00:00 1
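The errors above arise because a synonym may not point to two different records. The sketch below imitates that rule on plain dictionaries with pipe-separated synonym strings, as they appear in the output. It is not LaminDB's implementation: `SynonymConflict`, `split_synonyms`, and this `add_synonym` function are hypothetical, and the choice of target record for the conflicting call is an assumption made for illustration.

```python
class SynonymConflict(ValueError):
    """Raised when a synonym already belongs to a different record."""


def split_synonyms(value):
    """Split a pipe-separated synonym string into a list."""
    return value.split("|") if value else []


def add_synonym(records, ontology_id, synonym):
    """Append a synonym unless another record already claims it."""
    for other_id, record in records.items():
        if other_id != ontology_id and synonym in split_synonyms(record["synonyms"]):
            raise SynonymConflict(
                f"{synonym!r} already associated with {record['name']!r}"
            )
    target = records[ontology_id]
    current = split_synonyms(target["synonyms"])
    if synonym not in current:
        current.append(synonym)
    target["synonyms"] = "|".join(current)


# Names and synonyms copied from outputs in this notebook.
records = {
    "CL:0000236": {"name": "B cell", "synonyms": "B-lymphocyte|B lymphocyte|B-cell"},
    "CL:0001071": {"name": "group 3 innate lymphoid cell", "synonyms": "ILC3"},
    "CL:0001078": {"name": "group 3 innate lymphoid cell, human", "synonyms": "ILC3, human"},
}

add_synonym(records, "CL:0000236", "B cells")  # no conflict, synonym is added
try:
    add_synonym(records, "CL:0001078", "ILC3")  # "ILC3" already belongs to CL:0001071
    conflict = None
except SynonymConflict as error:
    conflict = str(error)
print(conflict)
```

Skipping the conflicting synonyms, as the loop above does, leaves the existing public record as the single owner of that synonym.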

Add parent-child relationships of the records from CellTypist#

We still need to add the remaining 4 high-hierarchy terms:

list(high_terms_nonval)
['B-cell lineage', 'Cycling cells', 'T cells', 'Erythroid']

Let’s get the top hits from a search:

for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(2))
Term: B-cell lineage
ontology_id definition synonyms parents __agg__ __ratio__
name
obsolete cell by lineage CL:0000220 None None [] obsolete cell by lineage 73.684211
obsolete cell line cell CL:0007014 Obsolete: A Cultured Cell That Has Been Passag... passaged cultured cell [] obsolete cell line cell 64.864865
Term: Cycling cells
ontology_id definition synonyms parents __agg__ __ratio__
name
circulating cell CL:0000080 A Cell Which Moves Among Different Tissues Of ... None [CL:0000003] circulating cell 75.862069
lining cell CL:0000213 A Cell Within An Epithelial Cell Sheet Whose M... boundary cell [CL:0000215] lining cell 75.000000
Term: T cells
ontology_id definition synonyms parents __agg__ __ratio__
name
T cell CL:0000084 A Type Of Lymphocyte Whose Defining Characteri... T-cell|T-lymphocyte|T lymphocyte [CL:0000542] t cell 92.307692
T-helper 1 cell CL:0000545 A Cd4-Positive, Alpha-Beta T Cell That Has The... T(H)-1 cell|T helper cells type 1|Th1 T-lympho... [CL:0000492] t-helper 1 cell 80.000000
Term: Erythroid
ontology_id definition synonyms parents __agg__ __ratio__
name
erythroid lineage cell CL:0000764 A Immature Or Mature Cell In The Lineage Leadi... erythropoietic cell [CL:0000763] erythroid lineage cell 90.0
primitive erythroid progenitor CL:0002361 A Progenitor Cell That Is Capable Of Forming C... EryP-CFC|inner blood island hemangioblast [CL:0002417] primitive erythroid progenitor 90.0
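To give a feel for how such similarity scores separate good hits from poor ones, here is a sketch of a ranked search using `difflib.SequenceMatcher` from the standard library. This is a stand-in technique, not the scorer Bionty uses, so the scores need not agree with the `__ratio__` column above. The candidate names are a small subset taken from the search results, and this `search` function is hypothetical.

```python
from difflib import SequenceMatcher


def search(query, names, limit=2):
    """Rank candidate names by case-insensitive similarity to the query."""
    q = query.lower()
    scored = [
        (SequenceMatcher(None, q, name.lower()).ratio() * 100, name)
        for name in names
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:limit]]


# Candidate names taken from the search results above.
names = ["B cell", "B-1 B cell", "T cell", "T-helper 1 cell", "erythroid lineage cell"]

hits = search("T cells", names)
for name, score in hits:
    print(f"{name}: {score:.1f}")
```

A plural such as "T cells" scores close to its singular form, while a term like "Cycling cells" has no similarly close counterpart, which matches the decision taken next.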

So we decide to:

  • Add “T cells” to the synonyms of the public “T cell” record

  • Add “Erythroid” to the synonyms of the public “erythroid lineage cell” record

  • Create the remaining 2 terms using only their names (we think “B-cell lineage” shouldn’t be identified with “B cell”)

for name in high_terms_nonval:
    if name == "T cells":
        record = bt.CellType.from_public(name="T cell")
        record.add_synonym(name)
        record.save()
    elif name == "Erythroid":
        record = bt.CellType.from_public(name="erythroid lineage cell")
        record.add_synonym(name)
        record.save()
    else:
        record = bt.CellType(name=name)
        record.save()
❗ records with similar names exist! did you mean to load one of them?
uid synonyms score
name
B cell ryEtgi1y B-lymphocyte|B cells|B-cell|Cycling B cells|B ... 92.9
T cell 22LvKd01 Cycling T cells|T-lymphocyte|CD8a/a|T lymphocy... 92.9
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}
bt.CellType(name="B-cell lineage").save()
❗ loaded CellType record with same name: 'B-cell lineage' (disable via `ln.settings.upon_create_search_names`)

Now let’s add the parent records:

celltypist_df["parent"] = bt.CellType.standardize(celltypist_df["parent"])
for _, row in celltypist_df.iterrows():
    record = public_records_dict[row["ontology_id"]]
    if row["parent"] is not None:
        parent_record = bt.CellType.filter(name=row["parent"]).one()
        record.parents.add(parent_record)

Access the registry#

The CellTypist cell types registered above are now available in LaminDB. To retrieve the full registry as a pandas DataFrame, we can use .df():

bt.CellType.df()
uid name ontology_id abbr synonyms description public_source_id created_at updated_at created_by_id
id
142 5gxL2SWr B-cell lineage None None None None NaN 2024-05-06 19:31:49.237035+00:00 2024-05-06 19:31:49.315264+00:00 1
24 2KfvYuU7 erythroid lineage cell CL:0000764 None Mid erythroid|Erythroid|erythropoietic cell|La... A Immature Or Mature Cell In The Lineage Leadi... 21.0 2024-05-06 19:31:18.218346+00:00 2024-05-06 19:31:49.287873+00:00 1
14 22LvKd01 T cell CL:0000084 None T cells|CD8a/a|T lymphocyte|T-cell|Cycling T c... A Type Of Lymphocyte Whose Defining Characteri... 21.0 2024-05-06 19:31:18.217516+00:00 2024-05-06 19:31:49.271549+00:00 1
143 5jshKSVL Cycling cells None None None None NaN 2024-05-06 19:31:49.254734+00:00 2024-05-06 19:31:49.254755+00:00 1
68 7j3YpGzu T-helper 17 cell CL:0000899 None Th17 CD4+ T cell|IL-17-producing CD4+ T helper... Cd4-Positive, Alpha-Beta T Cell With The Pheno... 21.0 2024-05-06 19:31:18.222058+00:00 2024-05-06 19:31:49.079719+00:00 1
... ... ... ... ... ... ... ... ... ... ...
71 2Jgr5Xx4 mononuclear cell CL:0000842 None mononuclear leukocyte A Leukocyte With A Single Non-Segmented Nucleu... 21.0 2024-05-06 19:31:19.680765+00:00 2024-05-06 19:31:19.680790+00:00 1
70 X6c7osZ5 lymphocyte CL:0000542 None None A Lymphocyte Is A Leukocyte Commonly Found In ... 21.0 2024-05-06 19:31:19.229642+00:00 2024-05-06 19:31:19.229666+00:00 1
69 7GpphKmr lymphocyte of B lineage CL:0000945 None None A Lymphocyte Of B Lineage With The Commitment ... 21.0 2024-05-06 19:31:18.730989+00:00 2024-05-06 19:31:18.731015+00:00 1
38 1aLpWgJc group 3 innate lymphoid cell, human CL:0001078 None ILC3, human A Group 3 Innate Lymphoid Cell In The Human Wi... 21.0 2024-05-06 19:31:18.219545+00:00 2024-05-06 19:31:18.219555+00:00 1
37 6NmzCwsn group 2 innate lymphoid cell, human CL:0001081 None ILC2, human A Group 2 Innate Lymphoid Cell In The Human Wi... 21.0 2024-05-06 19:31:18.219462+00:00 2024-05-06 19:31:18.219472+00:00 1

143 rows × 10 columns

This enables us to look for cell types by creating a lookup object from our new CellType registry.

db_lookup = bt.CellType.lookup()
db_lookup.memory_b_cell
CellType(uid='2cUPBtY8', name='memory B cell', ontology_id='CL:0000787', synonyms='memory B lymphocyte|Age-associated B cells|memory B-lymphocyte|Memory B cells|memory B-cell', description='A Memory B Cell Is A Mature B Cell That Is Long-Lived, Readily Activated Upon Re-Encounter Of Its Antigenic Determinant, And Has Been Selected For Expression Of Higher Affinity Immunoglobulin. This Cell Type Has The Phenotype Cd19-Positive, Cd20-Positive, Mhc Class Ii-Positive, And Cd138-Negative.', updated_at=2024-05-06 19:31:47 UTC, public_source_id=21, created_by_id=1)
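The attribute `memory_b_cell` corresponds to the record named "memory B cell". One plausible way to derive such attribute names is sketched below. This is an assumption for illustration only: `to_lookup_key` is a hypothetical helper, and LaminDB's actual key-generation rules may differ.

```python
import re
from types import SimpleNamespace


def to_lookup_key(name):
    """Turn a display name into a lowercase, underscore-separated identifier."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").lower()


# Names and ontology IDs taken from the registry output above.
names_to_ids = {
    "memory B cell": "CL:0000787",
    "T-helper 17 cell": "CL:0000899",
    "B cell": "CL:0000236",
}

# Expose each entry as an attribute, which enables tab completion in notebooks.
lookup = SimpleNamespace(
    **{to_lookup_key(name): oid for name, oid in names_to_ids.items()}
)
print(lookup.memory_b_cell)
```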

See cell type hierarchy:

db_lookup.memory_b_cell.view_parents()
[Figure: parent hierarchy of the memory B cell record]

Access parents of a record:

db_lookup.memory_b_cell.parents.list()
[CellType(uid='ryEtgi1y', name='B cell', ontology_id='CL:0000236', synonyms='B-lymphocyte|B cells|B-cell|Cycling B cells|B lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', updated_at=2024-05-06 19:31:47 UTC, public_source_id=21, created_by_id=1),
 CellType(uid='71xItrKo', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B-cell|mature B lymphocyte|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', updated_at=2024-05-06 19:31:23 UTC, public_source_id=21, created_by_id=1)]
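Because each record stores only its direct parents, collecting all ancestors means following the parent links repeatedly. The sketch below does this on a plain dictionary. The mapping is deliberately partial and uses only relationships visible in this notebook's outputs, and `ancestors` is a hypothetical helper rather than a LaminDB method.

```python
# Direct parents as seen in this notebook: memory B cell has two parents,
# and the public "B cell" record lists CL:0000945 (lymphocyte of B lineage).
parents = {
    "memory B cell": {"B cell", "mature B cell"},
    "B cell": {"lymphocyte of B lineage"},
}


def ancestors(name, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("memory B cell", parents)))
```

Tracking `seen` keeps the traversal from visiting a term twice when several paths lead to the same ancestor.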

Move on to the next registry: GO pathways