The long-term goal of the Human Cell Atlas, for instance, is to profile about 10 billion cells. Each cell yields a large amount of data on RNA expression, which can provide insight into cell behavior and disease progression.
With enough computing power, biologists can analyze full datasets, but doing so takes hours or days. Without those resources, it’s impractical. Sampling methods can extract small subsets of the cells for faster, more efficient analysis, but they don’t scale well to large datasets and often miss less abundant cell types.
MIT researchers use a method that captures a comprehensive "sketch" of an entire dataset that can be shared and merged easily with other datasets. Instead of sampling cells with equal probability, it samples cells evenly from across the diverse cell types present in the dataset.
In experiments, the method generated sketches from datasets of millions of cells in a few minutes, as opposed to a few hours, and those sketches had far more equal representation of rare cells from across the datasets. In one instance, the sketches even captured a rare subset of inflammatory macrophages that other methods missed.
Brian Hie, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and a researcher in the Computation and Biology group, says: "Sketching gives a compact summary of a very large dataset that tries to preserve as much biological information as possible … so people don’t need to use so much computational power."
Humans have hundreds of categories and subcategories of cells, and each cell expresses a diverse set of genes. Techniques such as RNA sequencing capture all cell information in massive tables, where each row represents a cell and each column represents some measurement of gene expression.
As it happens, cell types with similar gene diversity, whether common or rare, form similar-sized clusters that take up roughly the same space. But the density of cells within those clusters varies greatly: 1,000 cells may reside in a common cluster, while an equally diverse rare cluster may contain only 10 cells. That’s a problem for traditional sampling methods that extract a target-size sample of single cells.
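The scale of that problem can be checked with a short calculation. The sketch below is only an illustration. It assumes the dataset consists of just the two clusters described above (1,010 cells in total) and that a 10-percent sample is drawn uniformly without replacement. The sampling rate, the function name and the variable names are assumptions made for this example.

```python
from math import comb

RARE_CELLS = 10        # cells in the rare cluster (figure from the text)
COMMON_CELLS = 1_000   # cells in the common cluster (figure from the text)
SAMPLE_FRACTION = 0.10 # assumed sampling rate for this illustration


def uniform_sampling_odds(rare, common, fraction):
    """Return (sample size, expected rare cells, chance of zero rare cells)
    for a uniform sample drawn without replacement."""
    total = rare + common
    sample_size = round(total * fraction)
    # Each cell is equally likely to be drawn, so the rare cluster's
    # expected share of the sample equals its share of the dataset.
    expected_rare = sample_size * rare / total
    # Hypergeometric probability that every sampled cell is a common cell.
    p_miss_all = comb(common, sample_size) / comb(total, sample_size)
    return sample_size, expected_rare, p_miss_all


sample_size, expected_rare, p_miss_all = uniform_sampling_odds(
    RARE_CELLS, COMMON_CELLS, SAMPLE_FRACTION
)
print(f"sample size: {sample_size}")
print(f"expected rare cells in sample: {expected_rare:.2f}")
print(f"chance of missing every rare cell: {p_miss_all:.2f}")
```

Under these assumptions the sample is expected to contain about one rare cell, and there is roughly a one-in-three chance that it contains none at all.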
"If you take a 10-percent sample, and there are 10 cells in a rare cluster and 1,000 cells in a common cluster, you’re more likely to grab tons of common cells, but miss all rare cells," Hie says. "But rare cells can lead to important biological discoveries."
The researchers modified a class of algorithm that lays shapes over datasets. Their algorithm covers the entire computational space with what they call a "plaid covering," which is like a grid of equal-sized squares but in many dimensions. It only lays these multidimensional squares where there’s at least one cell, and skips over any empty regions. In the end, the grid’s empty columns will be much wider or skinnier than occupied columns, hence the "plaid" description. That technique saves a great deal of computation, which helps the covering scale to massive datasets.
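The general idea of sampling from occupied regions rather than from individual cells can be sketched in a few lines. The code below is a simplified illustration, not the researchers' algorithm: it uses a plain fixed-width grid instead of a plaid covering, and the function name `grid_sketch`, the `side` parameter and the toy two-dimensional data are all assumptions made for this example.

```python
import random
from collections import defaultdict


def grid_sketch(points, side, sample_size, seed=0):
    """Return indices of up to sample_size points, drawn by cycling over
    occupied grid boxes instead of sampling points uniformly."""
    rng = random.Random(seed)

    # Group point indices by the grid box they fall in. Only boxes that
    # contain at least one point are ever created; empty regions cost nothing.
    boxes = defaultdict(list)
    for idx, point in enumerate(points):
        key = tuple(int(coord // side) for coord in point)
        boxes[key].append(idx)

    # Shuffle each box so that pop() removes a random member.
    pools = [members[:] for members in boxes.values()]
    for pool in pools:
        rng.shuffle(pool)

    chosen = []
    while pools and len(chosen) < sample_size:
        rng.shuffle(pools)  # visit the occupied boxes in random order
        for pool in pools:
            chosen.append(pool.pop())  # at most one point per box per round
            if len(chosen) == sample_size:
                break
        pools = [pool for pool in pools if pool]  # drop exhausted boxes
    return chosen


# Toy data (hypothetical): a dense cluster of 1,000 points and a sparse
# cluster of 10 points, each spread over a unit square.
data_rng = random.Random(1)
common = [(data_rng.random(), data_rng.random()) for _ in range(1000)]
rare = [(5 + data_rng.random(), 5 + data_rng.random()) for _ in range(10)]
cells = common + rare

sketch = grid_sketch(cells, side=0.25, sample_size=20)
rare_in_sketch = sum(1 for idx in sketch if idx >= len(common))
print(f"rare cells in a 20-cell sketch: {rare_in_sketch}")
```

Because each round draws at most one cell per occupied box, a sparse cluster contributes to the sketch in proportion to the space it occupies rather than to the number of cells it holds.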
They applied their sketching method to a dataset of around 250,000 umbilical cord cells that contained two subsets of rare macrophages: inflammatory and anti-inflammatory. All of the traditional sampling methods clustered both subsets together, while the sketching method separated them. Additional in-depth studies of these macrophage subpopulations could yield insight into inflammation and how to modulate inflammatory processes in response to disease, the researchers say.
MEDICA-tradefair.com; Source: Massachusetts Institute of Technology (MIT)