Analysis flow
Here, we'll track typical data transformations like subsetting that occur during analysis.
If exploring more generally, read this first: Project flow.
Setup
# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
❗ Couldn't retrieve user id (the `created_by` field couldn't be set correctly).
Your user is not yet part of the User registry of this instance. Run
from lamindb_setup._init_instance import register_user
register_user(ln.setup.settings.user)
💡 connected lamindb: testuser1/analysis-usecase
import lamindb as ln
import bionty as bt
from lamin_utils import logger
bt.settings.auto_save_parents = False
💡 connected lamindb: testuser1/analysis-usecase
Register an initial dataset
Here we register an initial artifact with a pipeline script register_example_file.py.
!python analysis-flow-scripts/register_example_file.py
💡 connected lamindb: testuser1/analysis-usecase
💡 saved: Transform(uid='K4wsS5DTYdFp6K79', name='register_example_file.py', key='register_example_file.py', version='0', type='script', updated_at=2024-05-06 19:35:46 UTC, created_by_id=1)
💡 saved: Run(uid='rewwu3pwjpo45HEM7M6C', transform_id=1, created_by_id=1)
✅ added 3 records with Feature.name for columns: ['cell_type', 'tissue', 'disease']
❗ 1 non-validated categories are not saved in Feature.name: ['cell_type_id']!
→ to lookup categories, use lookup().columns
→ to save, run add_new_from_columns
✅ added 99 records from public with Gene.ensembl_gene_id for var_index: ['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079', 'ENSG00000002330', 'ENSG00000002549', 'ENSG00000002586', 'ENSG00000002587', 'ENSG00000002726', 'ENSG00000002745', 'ENSG00000002746', 'ENSG00000002822', 'ENSG00000002834', 'ENSG00000002919', 'ENSG00000002933', 'ENSG00000003056', 'ENSG00000003096', 'ENSG00000003137', 'ENSG00000003147', 'ENSG00000003249', 'ENSG00000003393', 'ENSG00000003400', 'ENSG00000003402', 'ENSG00000003436', 'ENSG00000003509', 'ENSG00000003756', 'ENSG00000003987', 'ENSG00000003989', 'ENSG00000004059', 'ENSG00000004139', 'ENSG00000004142', 'ENSG00000004399', 'ENSG00000004455', 'ENSG00000004468', 'ENSG00000004478', 'ENSG00000004487', 'ENSG00000004534', 'ENSG00000004660', 'ENSG00000004700', 'ENSG00000004766', 'ENSG00000004776', 'ENSG00000004777', 'ENSG00000004779', 'ENSG00000004799', 'ENSG00000004809', 'ENSG00000004838', 'ENSG00000004846', 'ENSG00000004848', 'ENSG00000004864', 'ENSG00000004866', 'ENSG00000004897', 'ENSG00000004939', 'ENSG00000004948', 'ENSG00000004961', 'ENSG00000004975', 'ENSG00000005001', 'ENSG00000005007', 'ENSG00000005020', 'ENSG00000005022', 'ENSG00000005059', 'ENSG00000005073', 'ENSG00000005075', 'ENSG00000005100', 'ENSG00000005102', 'ENSG00000005108', 'ENSG00000005156', 'ENSG00000005175', 'ENSG00000005187', 'ENSG00000005189', 'ENSG00000005194', 'ENSG00000005206', 'ENSG00000005238', 'ENSG00000005243', 'ENSG00000005249', 'ENSG00000005302', 'ENSG00000005339', 'ENSG00000005379', 'ENSG00000005381', 'ENSG00000005421', 'ENSG00000005436', 'ENSG00000005448', 'ENSG00000005469']
💡 saving labels for 'cell_type'
✅ added 3 records from public with CellType.name for cell_type: ['T cell', 'hematopoietic stem cell', 'hepatocyte']
❗ 1 non-validated categories are not saved in CellType.name: ['my new cell type']!
→ to lookup categories, use lookup().cell_type
→ to save, run .add_new_from('cell_type')
💡 saving labels for 'tissue'
✅ added 4 records from public with Tissue.name for tissue: ['kidney', 'liver', 'heart', 'brain']
💡 saving labels for 'disease'
✅ added 4 records from public with Disease.name for disease: ['chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease']
✅ added 1 record with CellType.name for cell_type: ['my new cell type']
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name
✅ tissue is validated against Tissue.name
✅ disease is validated against Disease.name
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/HMKqtGOvwRyPJ8ZkEg6v.h5ad')
✅ storing artifact 'HMKqtGOvwRyPJ8ZkEg6v' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/HMKqtGOvwRyPJ8ZkEg6v.h5ad'
💡 parsing feature names of X stored in slot 'var'
✅ 99 terms (100.00%) are validated for ensembl_gene_id
✅ linked: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅ 3 terms (75.00%) are validated for name
❗ 1 term (25.00%) is not validated for name: cell_type_id
✅ linked: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature', hash='K_R4HdQEmipIW7Qlst9P', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
✅ linked feature 'cell_type' to registry 'bionty.CellType'
✅ linked feature 'tissue' to registry 'bionty.Tissue'
✅ linked feature 'disease' to registry 'bionty.Disease'
✅ saved transform.source_code: Artifact(uid='q52JEHDIjMX7rmp0g2mi', suffix='.py', description='Source of transform K4wsS5DTYdFp6K79', version='0', size=699, hash='p9yfNIWwDKdfz8URlTmM0Q', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2024-05-06 19:36:23 UTC, storage_id=1, created_by_id=1)
✅ saved run.environment: Artifact(uid='CHcL2Nf5ZY6iNieXqvnp', suffix='.txt', description='requirements.txt', size=3428, hash='5BDBXeBEKrLUBHIkJCZ8hg', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2024-05-06 19:36:23 UTC, storage_id=1, created_by_id=1)
Pull the registered dataset, apply a transformation, and register the result
Track the current notebook:
ln.settings.transform.stem_uid = "eNef4Arw8nNM"
ln.settings.transform.version = "0"
ln.track()
💡 notebook imports: bionty==0.42.9 lamin_utils==0.13.2 lamindb==0.71.0
💡 saved: Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook', updated_at=2024-05-06 19:36:24 UTC, created_by_id=1)
💡 saved: Run(uid='MBIkS3z6a0o6AwSc6fJK', transform_id=2, created_by_id=1)
artifact = ln.Artifact.filter(description="anndata with obs").one()
artifact.describe()
Artifact(uid='HMKqtGOvwRyPJ8ZkEg6v', suffix='.h5ad', accessor='AnnData', description='anndata with obs', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', n_observations=40, visibility=1, key_is_virtual=True, updated_at=2024-05-06 19:36:23 UTC)
Provenance:
storage: Storage(uid='oBe2joeXvS7u', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', instance_uid='C46ryhMWM0LV')
transform: Transform(uid='K4wsS5DTYdFp6K79', name='register_example_file.py', key='register_example_file.py', version='0', type='script')
run: Run(uid='rewwu3pwjpo45HEM7M6C', started_at=2024-05-06 19:35:46 UTC, finished_at=2024-05-06 19:36:23 UTC, is_consecutive=True)
created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
var: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene')
'SKAP2', 'CD99', 'AK2', 'STPG1', 'SLC22A16', 'WDR54', 'KRIT1', 'CFTR', 'SLC4A1', 'DBNDD1', 'POLDIP2', 'MEOX1', 'M6PR', 'ANKIB1', 'POLR2J', 'GCFC2', 'PRSS22', 'PLXND1', 'THSD7A', 'RAD52', ...
obs: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature')
cell_type (4, bionty.CellType): 'hepatocyte', 'hematopoietic stem cell', 'T cell', 'my new cell type'
tissue (4, bionty.Tissue): 'kidney', 'brain', 'heart', 'liver'
disease (4, bionty.Disease): 'Alzheimer disease', 'liver lymphoma', 'chronic kidney disease', 'cardiac ventricle disorder'
Labels:
tissues (4, bionty.Tissue): 'kidney', 'brain', 'heart', 'liver'
cell_types (4, bionty.CellType): 'hepatocyte', 'hematopoietic stem cell', 'T cell', 'my new cell type'
diseases (4, bionty.Disease): 'Alzheimer disease', 'liver lymphoma', 'chronic kidney disease', 'cardiac ventricle disorder'
Get a backed AnnData object
adata = artifact.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
constructed for the AnnData object HMKqtGOvwRyPJ8ZkEg6v.h5ad
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
Subset dataset to specific cell types and diseases
cell_types = artifact.cell_types.all().lookup(return_field="name")
diseases = artifact.diseases.all().lookup(return_field="name")
Create the subset:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
dtype: int64
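The mask logic above can be sketched without AnnData. This is only an illustration: the `obs` rows below are made up and mirror nothing but the column names used in this notebook. A row is kept when its cell type is in one list and its disease is in the other, and a `Counter` plays the role of `value_counts()`.

```python
from collections import Counter

# Hypothetical observation table: one dict per cell (not the real dataset).
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# Boolean mask: both conditions must hold for a row to be kept.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count (cell_type, disease) combinations, similar in spirit to value_counts().
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for combo, n in counts.items():
    print(combo, n)
```

Because the two conditions are combined with AND, a T cell annotated with a disease outside the list is dropped even though its cell type matches.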
Register the subsetted AnnData:
annotate = ln.Annotate.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
annotate.validate()
❗ 1 non-validated categories are not saved in Feature.name: ['cell_type_id']!
→ to lookup categories, use lookup().columns
→ to save, run add_new_from_columns
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name
✅ disease is validated against Disease.name
✅ tissue is validated against Tissue.name
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
True
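The `cell_type_id` warning comes down to a membership check against a registry. The sketch below is not the `Annotate` implementation. It assumes a hypothetical helper `validate_terms` and an in-memory set standing in for the `Feature.name` registry, and it checks the four `obs` column names of this dataset against the three registered features.

```python
def validate_terms(values, registry):
    """Split values into those found in the registry and those that are not."""
    validated = [v for v in values if v in registry]
    non_validated = [v for v in values if v not in registry]
    return validated, non_validated


# Column names of `obs` in this notebook; the set stands in for Feature.name.
obs_columns = ["cell_type", "cell_type_id", "disease", "tissue"]
feature_names = {"cell_type", "tissue", "disease"}

validated, non_validated = validate_terms(obs_columns, feature_names)
share = 100 * len(validated) / len(obs_columns)

print(f"{len(validated)} terms ({share:.2f}%) are validated")
print(f"not validated: {non_validated}")
```

Three of the four columns are found, which matches the 75.00% share reported in the logs when the artifact is saved.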
artifact = annotate.save_artifact(description="anndata with obs subset")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/3Yc57o26VsGRx9lO961U.h5ad')
✅ storing artifact '3Yc57o26VsGRx9lO961U' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/3Yc57o26VsGRx9lO961U.h5ad'
💡 parsing feature names of X stored in slot 'var'
✅ 99 terms (100.00%) are validated for ensembl_gene_id
✅ loaded: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
✅ linked: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
💡 parsing feature names of slot 'obs'
✅ 3 terms (75.00%) are validated for name
❗ 1 term (25.00%) is not validated for name: cell_type_id
✅ loaded: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature', hash='K_R4HdQEmipIW7Qlst9P', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
✅ linked: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature', hash='K_R4HdQEmipIW7Qlst9P', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
artifact.describe()
Artifact(uid='3Yc57o26VsGRx9lO961U', suffix='.h5ad', accessor='AnnData', description='anndata with obs subset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', n_observations=20, visibility=1, key_is_virtual=True, updated_at=2024-05-06 19:36:25 UTC)
Provenance:
storage: Storage(uid='oBe2joeXvS7u', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', instance_uid='C46ryhMWM0LV')
transform: Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook')
run: Run(uid='MBIkS3z6a0o6AwSc6fJK', started_at=2024-05-06 19:36:24 UTC, is_consecutive=True)
created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
var: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene')
'SKAP2', 'CD99', 'AK2', 'STPG1', 'SLC22A16', 'WDR54', 'KRIT1', 'CFTR', 'SLC4A1', 'DBNDD1', 'POLDIP2', 'MEOX1', 'M6PR', 'ANKIB1', 'POLR2J', 'GCFC2', 'PRSS22', 'PLXND1', 'THSD7A', 'RAD52', ...
obs: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature')
cell_type (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
tissue (2, bionty.Tissue): 'kidney', 'liver'
disease (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'
Labels:
tissues (2, bionty.Tissue): 'kidney', 'liver'
cell_types (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
diseases (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'
Examine data flow
Query a subsetted .h5ad artifact containing "hematopoietic stem cell" and "T cell":
cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='3Yc57o26VsGRx9lO961U', suffix='.h5ad', accessor='AnnData', description='anndata with obs subset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', n_observations=20, visibility=1, key_is_virtual=True, updated_at=2024-05-06 19:36:25 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)
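The keyword arguments in the filter combine with AND. The `__in` suffix is a Django-style lookup, and on a many-to-many relation such as `cell_types` it matches an artifact as soon as one of its linked labels is in the list. A rough in-memory equivalent, assuming toy records and a hypothetical `matches` helper (neither is part of LaminDB), could look like this:

```python
# Toy stand-ins for the two registered artifacts (values mirror this notebook).
artifacts = [
    {
        "suffix": ".h5ad",
        "description": "anndata with obs",
        "cell_types": {
            "T cell",
            "hematopoietic stem cell",
            "hepatocyte",
            "my new cell type",
        },
    },
    {
        "suffix": ".h5ad",
        "description": "anndata with obs subset",
        "cell_types": {"T cell", "hematopoietic stem cell"},
    },
]


def matches(artifact, suffix, description_endswith, cell_types_in):
    """Return True if the artifact satisfies all three conditions."""
    return (
        artifact["suffix"] == suffix
        and artifact["description"].endswith(description_endswith)
        # "__in" semantics: any overlap with the requested labels is enough
        and bool(artifact["cell_types"] & set(cell_types_in))
    )


hits = [
    a
    for a in artifacts
    if matches(a, ".h5ad", "subset", ["hematopoietic stem cell", "T cell"])
]
print([a["description"] for a in hits])
```

Both toy artifacts carry the two requested cell types, so in this sketch it is the description condition that separates the subset from its parent.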
Common questions that might arise are:
What is the history of this artifact?
Which features and labels are associated with it?
Which notebook analyzed and registered this artifact?
By whom?
And which artifact is its parent?
Let's answer this using LaminDB:
print("--> What is the history of this artifact?\n")
artifact.view_lineage()
print("\n\n--> Which features and labels are associated with it?\n")
logger.print(artifact.features)
logger.print(artifact.labels)
print("\n\n--> Which notebook analyzed and registered this artifact\n")
logger.print(artifact.transform)
print("\n\n--> By whom\n")
logger.print(artifact.created_by)
print("\n\n--> And which artifact is its parent\n")
display(artifact.run.input_artifacts.df())
--> What is the history of this artifact?
--> Which features and labels are associated with it?
Features:
var: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene')
'SKAP2', 'CD99', 'AK2', 'STPG1', 'SLC22A16', 'WDR54', 'KRIT1', 'CFTR', 'SLC4A1', 'DBNDD1', 'POLDIP2', 'MEOX1', 'M6PR', 'ANKIB1', 'POLR2J', 'GCFC2', 'PRSS22', 'PLXND1', 'THSD7A', 'RAD52', ...
obs: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature')
cell_type (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
tissue (2, bionty.Tissue): 'kidney', 'liver'
disease (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'
Labels:
tissues (2, bionty.Tissue): 'kidney', 'liver'
cell_types (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
diseases (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'
--> Which notebook analyzed and registered this artifact
Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook', updated_at=2024-05-06 19:36:24 UTC, created_by_id=1)
--> By whom
User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-05-06 19:35:43 UTC)
--> And which artifact is its parent
id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | created_at | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | HMKqtGOvwRyPJ8ZkEg6v | 1 | None | .h5ad | AnnData | anndata with obs | None | 46992 | IJORtcQUSS11QBqD-nTD0A | md5 | None | 40 | 1 | 1 | 1 | True | 2024-05-06 19:36:23.640710+00:00 | 2024-05-06 19:36:23.702575+00:00 | 1 |
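The parent lookup works because each artifact points to the run that created it, and each run records the artifacts it read. Below is a minimal sketch of that relationship. The `Run` and `Artifact` dataclasses and the `ancestors` helper are hypothetical and are not LaminDB's actual models; only the uids and descriptions are taken from the output above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare records by identity, like distinct database rows
class Run:
    uid: str
    input_artifacts: List["Artifact"] = field(default_factory=list)


@dataclass(eq=False)
class Artifact:
    uid: str
    description: str
    run: Optional[Run] = None  # the run that produced this artifact


def ancestors(artifact):
    """Walk back through runs and collect all upstream artifacts."""
    found = []
    stack = [artifact]
    while stack:
        current = stack.pop()
        if current.run is None:
            continue
        for parent in current.run.input_artifacts:
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


# The wiring between records is done by hand here for illustration.
script_run = Run(uid="rewwu3pwjpo45HEM7M6C")
parent = Artifact("HMKqtGOvwRyPJ8ZkEg6v", "anndata with obs", run=script_run)
notebook_run = Run(uid="MBIkS3z6a0o6AwSc6fJK", input_artifacts=[parent])
child = Artifact("3Yc57o26VsGRx9lO961U", "anndata with obs subset", run=notebook_run)

print([a.description for a in ancestors(child)])
```

The script run has no inputs, so the walk stops at the initially registered dataset.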
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.10.14/x64/bin/lamin", line 8, in <module>
sys.exit(main())
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 360, in __call__
return super().__call__(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 152, in main
rv = self.invoke(ctx)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamin_cli/__main__.py", line 103, in delete
return delete(instance, force=force)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/_delete.py", line 137, in delete
n_objects = check_storage_is_empty(
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/core/upath.py", line 824, in check_storage_is_empty
raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb contains 4 objects ('_is_initialized' ignored) - delete them prior to deleting the instance
['/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/3Yc57o26VsGRx9lO961U.h5ad', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/CHcL2Nf5ZY6iNieXqvnp.txt', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/HMKqtGOvwRyPJ8ZkEg6v.h5ad', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/_is_initialized', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/q52JEHDIjMX7rmp0g2mi.py']