Analysis flow

Here, we’ll track typical data transformations like subsetting that occur during analysis.

If exploring more generally, read this first: Project flow.

Setup

# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
❗ Couldn't retrieve user id (the `created_by` field couldn't be set correctly).
Your user is not yet part of the User registry of this instance. Run
from lamindb_setup._init_instance import register_user
register_user(ln.setup.settings.user)
💡 connected lamindb: testuser1/analysis-usecase
import lamindb as ln
import bionty as bt
from lamin_utils import logger

bt.settings.auto_save_parents = False
💡 connected lamindb: testuser1/analysis-usecase

Register an initial dataset

Here we register an initial artifact with a pipeline script register_example_file.py.

!python analysis-flow-scripts/register_example_file.py
💡 connected lamindb: testuser1/analysis-usecase
💡 saved: Transform(uid='K4wsS5DTYdFp6K79', name='register_example_file.py', key='register_example_file.py', version='0', type='script', updated_at=2024-05-06 19:35:46 UTC, created_by_id=1)
💡 saved: Run(uid='rewwu3pwjpo45HEM7M6C', transform_id=1, created_by_id=1)
✅ added 3 records with Feature.name for columns: ['cell_type', 'tissue', 'disease']
❗ 1 non-validated categories are not saved in Feature.name: ['cell_type_id']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 99 records from public with Gene.ensembl_gene_id for var_index: ['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', 'ENSG00000002079', 'ENSG00000002330', 'ENSG00000002549', 'ENSG00000002586', 'ENSG00000002587', 'ENSG00000002726', 'ENSG00000002745', 'ENSG00000002746', 'ENSG00000002822', 'ENSG00000002834', 'ENSG00000002919', 'ENSG00000002933', 'ENSG00000003056', 'ENSG00000003096', 'ENSG00000003137', 'ENSG00000003147', 'ENSG00000003249', 'ENSG00000003393', 'ENSG00000003400', 'ENSG00000003402', 'ENSG00000003436', 'ENSG00000003509', 'ENSG00000003756', 'ENSG00000003987', 'ENSG00000003989', 'ENSG00000004059', 'ENSG00000004139', 'ENSG00000004142', 'ENSG00000004399', 'ENSG00000004455', 'ENSG00000004468', 'ENSG00000004478', 'ENSG00000004487', 'ENSG00000004534', 'ENSG00000004660', 'ENSG00000004700', 'ENSG00000004766', 'ENSG00000004776', 'ENSG00000004777', 'ENSG00000004779', 'ENSG00000004799', 'ENSG00000004809', 'ENSG00000004838', 'ENSG00000004846', 'ENSG00000004848', 'ENSG00000004864', 'ENSG00000004866', 'ENSG00000004897', 'ENSG00000004939', 'ENSG00000004948', 'ENSG00000004961', 'ENSG00000004975', 'ENSG00000005001', 'ENSG00000005007', 'ENSG00000005020', 'ENSG00000005022', 'ENSG00000005059', 'ENSG00000005073', 'ENSG00000005075', 'ENSG00000005100', 'ENSG00000005102', 'ENSG00000005108', 'ENSG00000005156', 'ENSG00000005175', 'ENSG00000005187', 'ENSG00000005189', 'ENSG00000005194', 'ENSG00000005206', 'ENSG00000005238', 'ENSG00000005243', 'ENSG00000005249', 'ENSG00000005302', 'ENSG00000005339', 'ENSG00000005379', 'ENSG00000005381', 'ENSG00000005421', 'ENSG00000005436', 'ENSG00000005448', 'ENSG00000005469']
💡 saving labels for 'cell_type'
✅ added 3 records from public with CellType.name for cell_type: ['T cell', 'hematopoietic stem cell', 'hepatocyte']
❗ 1 non-validated categories are not saved in CellType.name: ['my new cell type']!
      → to lookup categories, use lookup().cell_type
      → to save, run .add_new_from('cell_type')
💡 saving labels for 'tissue'
✅ added 4 records from public with Tissue.name for tissue: ['kidney', 'liver', 'heart', 'brain']
💡 saving labels for 'disease'
✅ added 4 records from public with Disease.name for disease: ['chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease']
✅ added 1 record with CellType.name for cell_type: ['my new cell type']
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name
✅ tissue is validated against Tissue.name
✅ disease is validated against Disease.name
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/HMKqtGOvwRyPJ8ZkEg6v.h5ad')
✅ storing artifact 'HMKqtGOvwRyPJ8ZkEg6v' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/HMKqtGOvwRyPJ8ZkEg6v.h5ad'
💡 parsing feature names of X stored in slot 'var'
✅    99 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅    3 terms (75.00%) are validated for name
❗    1 term (25.00%) is not validated for name: cell_type_id
✅    linked: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature', hash='K_R4HdQEmipIW7Qlst9P', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
✅ linked feature 'cell_type' to registry 'bionty.CellType'
✅ linked feature 'tissue' to registry 'bionty.Tissue'
✅ linked feature 'disease' to registry 'bionty.Disease'
✅ saved transform.source_code: Artifact(uid='q52JEHDIjMX7rmp0g2mi', suffix='.py', description='Source of transform K4wsS5DTYdFp6K79', version='0', size=699, hash='p9yfNIWwDKdfz8URlTmM0Q', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2024-05-06 19:36:23 UTC, storage_id=1, created_by_id=1)
✅ saved run.environment: Artifact(uid='CHcL2Nf5ZY6iNieXqvnp', suffix='.txt', description='requirements.txt', size=3428, hash='5BDBXeBEKrLUBHIkJCZ8hg', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2024-05-06 19:36:23 UTC, storage_id=1, created_by_id=1)

Pull the registered dataset, apply a transformation, and register the result

Track the current notebook:

ln.settings.transform.stem_uid = "eNef4Arw8nNM"
ln.settings.transform.version = "0"
ln.track()
💡 notebook imports: bionty==0.42.9 lamin_utils==0.13.2 lamindb==0.71.0
💡 saved: Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook', updated_at=2024-05-06 19:36:24 UTC, created_by_id=1)
💡 saved: Run(uid='MBIkS3z6a0o6AwSc6fJK', transform_id=2, created_by_id=1)
artifact = ln.Artifact.filter(description="anndata with obs").one()
artifact.describe()
Artifact(uid='HMKqtGOvwRyPJ8ZkEg6v', suffix='.h5ad', accessor='AnnData', description='anndata with obs', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', n_observations=40, visibility=1, key_is_virtual=True, updated_at=2024-05-06 19:36:23 UTC)

Provenance:
  📎 storage: Storage(uid='oBe2joeXvS7u', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', instance_uid='C46ryhMWM0LV')
  📎 transform: Transform(uid='K4wsS5DTYdFp6K79', name='register_example_file.py', key='register_example_file.py', version='0', type='script')
  📎 run: Run(uid='rewwu3pwjpo45HEM7M6C', started_at=2024-05-06 19:35:46 UTC, finished_at=2024-05-06 19:36:23 UTC, is_consecutive=True)
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
  var: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene')
    'SKAP2', 'CD99', 'AK2', 'STPG1', 'SLC22A16', 'WDR54', 'KRIT1', 'CFTR', 'SLC4A1', 'DBNDD1', 'POLDIP2', 'MEOX1', 'M6PR', 'ANKIB1', 'POLR2J', 'GCFC2', 'PRSS22', 'PLXND1', 'THSD7A', 'RAD52', ...
  obs: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature')
    🔗 cell_type (4, bionty.CellType): 'hepatocyte', 'hematopoietic stem cell', 'T cell', 'my new cell type'
    🔗 tissue (4, bionty.Tissue): 'kidney', 'brain', 'heart', 'liver'
    🔗 disease (4, bionty.Disease): 'Alzheimer disease', 'liver lymphoma', 'chronic kidney disease', 'cardiac ventricle disorder'
Labels:
  📎 tissues (4, bionty.Tissue): 'kidney', 'brain', 'heart', 'liver'
  📎 cell_types (4, bionty.CellType): 'hepatocyte', 'hematopoietic stem cell', 'T cell', 'my new cell type'
  📎 diseases (4, bionty.Disease): 'Alzheimer disease', 'liver lymphoma', 'chronic kidney disease', 'cardiac ventricle disorder'

Get a backed AnnData object

adata = artifact.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object HMKqtGOvwRyPJ8ZkEg6v.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases

cell_types = artifact.cell_types.all().lookup(return_field="name")
diseases = artifact.diseases.all().lookup(return_field="name")

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
dtype: int64
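
The mask above combines two membership tests with a logical AND, so a row is kept only if both its cell type and its disease are in the chosen sets. Here is a minimal sketch of the same logic on plain Python records. The rows are made up for illustration and stand in for `adata.obs`; they are not the dataset above.

```python
from collections import Counter

# Hypothetical stand-in for a few rows of `adata.obs` (illustrative only).
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: True only if BOTH conditions hold.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Plain-Python analogue of `.value_counts()` over the two columns.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(mask)
print(counts)
```

The second and fourth rows each satisfy only one of the two conditions, so they are dropped.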

Register the subsetted AnnData:

annotate = ln.Annotate.from_anndata(
    adata_subset.to_memory(), 
    var_index=bt.Gene.ensembl_gene_id, 
    categoricals={
        "cell_type": bt.CellType.name, 
        "disease": bt.Disease.name, 
        "tissue": bt.Tissue.name,
    },
    organism="human"
)

annotate.validate()
❗ 1 non-validated categories are not saved in Feature.name: ['cell_type_id']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name
✅ disease is validated against Disease.name
✅ tissue is validated against Tissue.name
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/anndata/_core/anndata.py:1820: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
True
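
Conceptually, validation checks each column name and each categorical value against the names a registry already knows, and reports whatever is missing. The following is a minimal sketch of that idea in plain Python. The `validate_values` helper and the in-memory `registry` sets are hypothetical and are not LaminDB's implementation.

```python
def validate_values(values, known):
    """Split values into validated and non-validated lists, preserving order."""
    validated = [v for v in values if v in known]
    non_validated = [v for v in values if v not in known]
    return validated, non_validated


# Hypothetical in-memory registries; a real instance stores these in a database.
registry = {
    "Feature.name": {"cell_type", "tissue", "disease"},
    "CellType.name": {"T cell", "hematopoietic stem cell", "hepatocyte"},
}

columns = ["cell_type", "cell_type_id", "disease", "tissue"]
ok, missing = validate_values(columns, registry["Feature.name"])
pct = 100 * len(ok) / len(columns)

print(ok)
print(missing)
print(f"{len(ok)} terms ({pct:.2f}%) are validated")
```

With these toy registries, three of the four column names are known and `cell_type_id` is reported as non-validated, which mirrors the shape of the log messages above.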
artifact = annotate.save_artifact(description="anndata with obs subset")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/3Yc57o26VsGRx9lO961U.h5ad')
✅ storing artifact '3Yc57o26VsGRx9lO961U' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/3Yc57o26VsGRx9lO961U.h5ad'
💡 parsing feature names of X stored in slot 'var'
✅    99 terms (100.00%) are validated for ensembl_gene_id
✅    loaded: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
✅    linked: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
💡 parsing feature names of slot 'obs'
✅    3 terms (75.00%) are validated for name
❗    1 term (25.00%) is not validated for name: cell_type_id
✅    loaded: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature', hash='K_R4HdQEmipIW7Qlst9P', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
✅    linked: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature', hash='K_R4HdQEmipIW7Qlst9P', updated_at=2024-05-06 19:36:23 UTC, created_by_id=1)
artifact.describe()
Artifact(uid='3Yc57o26VsGRx9lO961U', suffix='.h5ad', accessor='AnnData', description='anndata with obs subset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', n_observations=20, visibility=1, key_is_virtual=True, updated_at=2024-05-06 19:36:25 UTC)

Provenance:
  📎 storage: Storage(uid='oBe2joeXvS7u', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', instance_uid='C46ryhMWM0LV')
  📎 transform: Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook')
  📎 run: Run(uid='MBIkS3z6a0o6AwSc6fJK', started_at=2024-05-06 19:36:24 UTC, is_consecutive=True)
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
  var: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene')
    'SKAP2', 'CD99', 'AK2', 'STPG1', 'SLC22A16', 'WDR54', 'KRIT1', 'CFTR', 'SLC4A1', 'DBNDD1', 'POLDIP2', 'MEOX1', 'M6PR', 'ANKIB1', 'POLR2J', 'GCFC2', 'PRSS22', 'PLXND1', 'THSD7A', 'RAD52', ...
  obs: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature')
    🔗 cell_type (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
    🔗 tissue (2, bionty.Tissue): 'kidney', 'liver'
    🔗 disease (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'
Labels:
  📎 tissues (2, bionty.Tissue): 'kidney', 'liver'
  📎 cell_types (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
  📎 diseases (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'

Examine data flow

Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:

cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='3Yc57o26VsGRx9lO961U', suffix='.h5ad', accessor='AnnData', description='anndata with obs subset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', n_observations=20, visibility=1, key_is_virtual=True, updated_at=2024-05-06 19:36:25 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)
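
The keyword arguments in this filter follow the `field__lookup` naming convention of Django's query API, on which LaminDB's registries are built: `description__endswith` matches on a suffix, and `cell_types__in` matches artifacts linked to any of the listed records. The sketch below imitates that convention on plain dictionaries to show how the keys are read. The `matches` helper and the toy records are hypothetical; a real query is translated to SQL rather than evaluated in Python.

```python
def matches(record, **lookups):
    """Return True if a dict record satisfies all `field__lookup=value` filters."""
    for key, expected in lookups.items():
        field, _, op = key.partition("__")
        value = record[field]
        if op == "":
            ok = value == expected
        elif op == "endswith":
            ok = value.endswith(expected)
        elif op == "in":
            # For a multi-valued field, any overlap with `expected` counts.
            values = value if isinstance(value, (list, set, tuple)) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False
    return True


# Toy records standing in for rows of the Artifact registry.
artifacts = [
    {"suffix": ".h5ad", "description": "anndata with obs",
     "cell_types": ["T cell", "hepatocyte"]},
    {"suffix": ".h5ad", "description": "anndata with obs subset",
     "cell_types": ["T cell", "hematopoietic stem cell"]},
]

hits = [
    a for a in artifacts
    if matches(
        a,
        suffix=".h5ad",
        description__endswith="subset",
        cell_types__in=["hematopoietic stem cell", "T cell"],
    )
]
print([a["description"] for a in hits])
```

Both toy records pass the `suffix` and `cell_types__in` conditions; only the second also ends in "subset", so it is the single hit.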

Common questions that might arise are:

  • What is the history of this artifact?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this artifact?

  • By whom?

  • And which artifact is its parent?

Let’s answer these using LaminDB:

print("--> What is the history of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
logger.print(artifact.features)
logger.print(artifact.labels)

print("\n\n--> Which notebook analyzed and registered this artifact\n")
logger.print(artifact.transform)

print("\n\n--> By whom\n")
logger.print(artifact.created_by)

print("\n\n--> And which artifact is its parent\n")
display(artifact.run.input_artifacts.df())
--> What is the history of this artifact?
[Figure: lineage graph of the artifact, rendered by view_lineage()]
--> Which features and labels are associated with it?

Features:
  var: FeatureSet(uid='gI0yydMuA0IFvftbrhEE', n=99, type='number', registry='bionty.Gene')
    'SKAP2', 'CD99', 'AK2', 'STPG1', 'SLC22A16', 'WDR54', 'KRIT1', 'CFTR', 'SLC4A1', 'DBNDD1', 'POLDIP2', 'MEOX1', 'M6PR', 'ANKIB1', 'POLR2J', 'GCFC2', 'PRSS22', 'PLXND1', 'THSD7A', 'RAD52', ...
  obs: FeatureSet(uid='5rcmhr35g88rJHIDUXm9', n=3, registry='core.Feature')
    🔗 cell_type (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
    🔗 tissue (2, bionty.Tissue): 'kidney', 'liver'
    🔗 disease (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'
Labels:
  📎 tissues (2, bionty.Tissue): 'kidney', 'liver'
  📎 cell_types (2, bionty.CellType): 'hematopoietic stem cell', 'T cell'
  📎 diseases (2, bionty.Disease): 'liver lymphoma', 'chronic kidney disease'
--> Which notebook analyzed and registered this artifact

Transform(uid='eNef4Arw8nNM6K79', name='Analysis flow', key='analysis-flow', version='0', type='notebook', updated_at=2024-05-06 19:36:24 UTC, created_by_id=1)
--> By whom

User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-05-06 19:35:43 UTC)
--> And which artifact is its parent
field            value (row with id = 1)
uid              HMKqtGOvwRyPJ8ZkEg6v
storage_id       1
key              None
suffix           .h5ad
accessor         AnnData
description      anndata with obs
version          None
size             46992
hash             IJORtcQUSS11QBqD-nTD0A
hash_type        md5
n_objects        None
n_observations   40
transform_id     1
run_id           1
visibility       1
key_is_virtual   True
created_at       2024-05-06 19:36:23.640710+00:00
updated_at       2024-05-06 19:36:23.702575+00:00
created_by_id    1
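
The last answer works because each artifact records the run that created it, and each run records its input artifacts, so a parent is found by following artifact, then run, then inputs. Here is a minimal sketch of that linkage with plain dataclasses. The class and field names echo the ones used above, but they are simplified, hypothetical stand-ins rather than LaminDB's models.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Transform:
    name: str
    type: str


@dataclass
class Run:
    transform: Transform
    input_artifacts: list[Artifact] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Run


def ancestors(artifact):
    """Walk artifact -> run -> input artifacts until no inputs remain."""
    found = []
    stack = list(artifact.run.input_artifacts)
    while stack:
        parent = stack.pop()
        found.append(parent)
        stack.extend(parent.run.input_artifacts)
    return found


script = Transform("register_example_file.py", "script")
notebook = Transform("Analysis flow", "notebook")

# The script run has no inputs; the notebook run consumed the full dataset.
full = Artifact("anndata with obs", Run(script))
subset = Artifact("anndata with obs subset", Run(notebook, input_artifacts=[full]))

print(subset.run.transform.name)
print([a.description for a in ancestors(subset)])
```

In this toy graph the subset's only ancestor is the full dataset, and the transform that produced it is the notebook.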
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.14/x64/bin/lamin", line 8, in <module>
    sys.exit(main())
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 360, in __call__
    return super().__call__(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamin_cli/__main__.py", line 103, in delete
    return delete(instance, force=force)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/_delete.py", line 137, in delete
    n_objects = check_storage_is_empty(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/core/upath.py", line 824, in check_storage_is_empty
    raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb contains 4 objects ('_is_initialized' ignored) - delete them prior to deleting the instance
['/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/3Yc57o26VsGRx9lO961U.h5ad', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/CHcL2Nf5ZY6iNieXqvnp.txt', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/HMKqtGOvwRyPJ8ZkEg6v.h5ad', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/_is_initialized', '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/q52JEHDIjMX7rmp0g2mi.py']