Standardize and append a batch of data

Here, we’ll learn how to standardize a less well-curated collection and how to append it to the growing versioned collection.

import lamindb as ln
import bionty as bt

ln.context.uid = "ManDYgmftZ8C0000"
ln.context.track()

Let’s now consider a less well-curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata

We are still working with human data, and can globally set an organism:

bt.settings.organism = "human"

curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.cell_type.name: bt.CellType.name},
)

Standardize & validate genes

Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is not unique: a symbol can match more than one Ensembl ID, and the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
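
To make the "first match is kept" behavior and the subsequent filtering concrete, here is a minimal plain-Python sketch that needs no database. The lookup table, the IDs, and the helper names (`map_keep_first`, `is_mapped`) are all hypothetical, and the sketch assumes that symbols without a match are passed through unchanged, which is why a validation step is still needed afterwards. It is a toy analogue, not how bionty implements standardize() or validate().

```python
# Toy illustration only: a stand-in for a symbol -> Ensembl ID lookup.
# The table below is hypothetical (made-up IDs), not real Ensembl data.
reference = [
    ("GENE_A", "ENSG_TOY_001"),
    ("GENE_A", "ENSG_TOY_002"),  # second match for the same symbol
    ("GENE_B", "ENSG_TOY_003"),
]


def map_keep_first(symbols, table):
    """Map each symbol to its first matching ID; unmatched symbols pass through."""
    first_match = {}
    for symbol, ensembl_id in table:
        # setdefault keeps the first ID seen and ignores later duplicates
        first_match.setdefault(symbol, ensembl_id)
    return [first_match.get(symbol, symbol) for symbol in symbols]


def is_mapped(values, table):
    """Return a boolean mask marking values that are known IDs in the table."""
    known_ids = {ensembl_id for _, ensembl_id in table}
    return [value in known_ids for value in values]


symbols = ["GENE_A", "GENE_B", "GENE_X"]  # GENE_X has no entry in the table
mapped = map_keep_first(symbols, reference)
mask = is_mapped(mapped, reference)
# keep only the entries that were successfully mapped
kept = [value for value, ok in zip(mapped, mask) if ok]
```

In this toy run, `GENE_A` resolves to the first of its two candidate IDs, and `GENE_X` survives the mapping unchanged but is dropped by the mask, mirroring the subsetting with `validated` above.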

Here, we’ll also use .raw, subset to the same validated genes and indexed by the same Ensembl IDs:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

curate = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name},
)

curate.validate()

curate.add_validated_from_var_index()

Standardize & validate cell types

Since none of the cell types are validated, let’s search the public ontology for each cell type name and add the name found in the AnnData object as a synonym to the top match.

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
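
The search-then-map pattern above can be tried out offline with a small sketch. Everything here is an assumption for illustration: the mini-ontology is a hypothetical list of names, and `top_match` uses `difflib` string similarity as a crude stand-in for a real ontology search, which may rank results differently.

```python
import difflib

# Hypothetical mini-ontology of standardized names (not the real Cell Ontology).
ontology_names = ["dendritic cell", "monocyte", "B cell", "T cell"]


def top_match(query, names):
    """Return the name most similar to the query, ignoring case."""
    return max(
        names,
        key=lambda name: difflib.SequenceMatcher(
            None, query.lower(), name.lower()
        ).ratio(),
    )


# Names as they might appear in a less well-curated dataset (made up here).
observed = ["Dendritic cells", "Monocytes", "B cells"]
name_mapper = {name: top_match(name, ontology_names) for name in observed}

# Apply the mapper to a column of labels, as .map(name_mapper) does for a Series.
labels = ["Monocytes", "B cells", "Monocytes", "Dendritic cells"]
standardized = [name_mapper[label] for label in labels]
```

A top match is only the best-scoring candidate, not a guaranteed correct one, so it is worth inspecting `name_mapper` before applying it to the data.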

Now, all cell types are validated:

curate.validate()

Register

artifact = curate.save_artifact(description="10x reference adata")

artifact.view_lineage()


Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.get(name="My versioned scRNA-seq collection")

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.artifacts.all()[0]],
    revises=collection_v1,
).save()

If you want, you can label the collection’s version by setting .version.

collection_v2.version = "2"
collection_v2.save()
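
As a mental model for the steps above, the following sketch shows a new version that holds the new artifact plus the previous version's artifact and points back to the version it revises. `ToyCollection` and its `history` method are hypothetical names invented for this illustration; this is not lamindb's data model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCollection:
    """Hypothetical stand-in for a versioned collection; not lamindb's model."""

    artifacts: list
    revises: Optional["ToyCollection"] = None
    version: Optional[str] = None

    def history(self):
        """Return this collection and every version it revises, newest first."""
        chain, current = [], self
        while current is not None:
            chain.append(current)
            current = current.revises
        return chain


v1 = ToyCollection(artifacts=["artifact_1"], version="1")
# The new version spans the new artifact and the one underlying version 1.
v2 = ToyCollection(artifacts=["artifact_2", *v1.artifacts], revises=v1)
v2.version = "2"  # optional label, set after creation as in the text above

versions = [collection.version for collection in v2.history()]
```

The earlier version is left untouched: `v1` still lists only its own artifact, while `v2` references both.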

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()

View data lineage:

collection_v2.view_lineage()
