too-many-cells (à la Python)

It's Scanpy friendly!

Please remember to cite us

A Python package for spectral clustering based on the powerful suite of tools named too-many-cells. Note, this is not a wrapper. In essence, you can use toomanycells to partition a data set in the form of a matrix of integers or floating point numbers into clusters, where members of a cluster are similar to each other under a given similarity function. The rows represent observations and the columns are the features. However, sometimes just knowing the clusters is not sufficient. Often, we are interested on the relationships between the clusters, and this tool can help you visualize the clusters as leaf nodes of a tree, where the branches illustrate the trajectories that have to be followed to reach a particular cluster. Initially, this tool will partition your data set into two subsets (each subset is a node of the tree), trying to maximize the differences between the two. Subsequently, it will reapply that same criterion to each subset (node) and will continue bifurcating until the modularity of the node that is about to be partitioned becomes less than a given threshold value ($10^{-9}$ by default), implying that the elements belonging to the current node are fairly homogeneous, and consequently suggesting that further partitioning is not warranted. Thus, when the process finishes, you end up with a tree structure of your data set, where the leaves represent the clusters. As mentioned earlier, you can use this tool with any kind of data. However, a common application is to classify cells and therefore you can provide an AnnData object. You can read about this application in this Nature Methods paper.

Free software: GNU AFFERO GENERAL PUBLIC LICENSE
Documentation: https://JRR3.github.io/toomanycells

Dependencies

Version 1.0.40 no longer requires Graphviz. Thus, no need to install a separate C library!

Virtual environments

To have control of your working environment you can use a python virtual environment, which can help you keep only the packages you need in one location. In bash or zsh you can simply type

python -m venv /path/to/new/virtual/environment

To activate it you simply need

source pathToTheVirtualEnvironment/bin/activate

To deactivate the environment use the intuitive

deactivate

Installation

Caveat: I have tested the following steps in Python 3.9.18. For other versions, things might be different.

In theory, just typing

pip install toomanycells

in your home or custom environment should work. However, for reproducibility, here is the list of packages I had in my virtual environment in 2024:

anndata==0.10.9
array_api_compat==1.8
celltypist==1.6.3
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.3.0
cycler==0.12.1
et-xmlfile==1.1.0
exceptiongroup==1.2.2
fonttools==4.54.1
get-annotations==0.1.2
h5py==3.11.0
idna==3.10
igraph==0.11.6
importlib_resources==6.4.5
joblib==1.4.2
kiwisolver==1.4.7
legacy-api-wrap==1.4
leidenalg==0.10.2
llvmlite==0.43.0
matplotlib==3.9.2
natsort==8.4.0
networkx==3.2.1
numba==0.60.0
numpy==2.0.2
openpyxl==3.1.5
packaging==24.1
pandas==2.2.3
patsy==0.5.6
pillow==10.4.0
plotly==5.24.1
pynndescent==0.5.13
pyparsing==3.1.4
python-dateutil==2.9.0.post0
pytz==2024.2
requests==2.32.3
scanpy==1.10.3
scikit-learn==1.5.2
scipy==1.13.1
seaborn==0.13.2
session-info==1.0.0
six==1.16.0
statsmodels==0.14.3
stdlib-list==0.10.0
tenacity==9.0.0
texttable==1.7.0
threadpoolctl==3.5.0
toomanycells==1.0.52
tqdm==4.66.5
tzdata==2024.2
umap-learn==0.5.6
urllib3==2.2.3
zipp==3.20.2

If you want to install an updated version, then please use the following approach.

pip install -U --no-deps toomanycells

Note that we are requiring to keep all the dependencies as they are. Otherwise they would get upgraded and that could potentially break the installation.

To install packages based on a list of requirements, i.e., the packages you want installed with the specific version, then use

pip install -r requirements.txt

where requirements.txt is a list like the one shown the above block of code.

Make sure you have the latest version. If not, run the previous command again.

Loading a TMC tree

If you have already compute the TMC tree and stored the output in a folder, then you can rebuild the original TMC object using these instructions.

First you need a path to the graph.json file and (only if you have it but not mandatory) the clusters.csv file.

path_to_graph = os.path.join(
    path_to_original_tmc,
    "graph.json",
)

path_to_clusters = os.path.join(
    path_to_original_tmc,
    "clusters.csv",
)

Then you will need a TooManyCells object to load the desired files:

obj = toomanycells.TooManyCells(
    path_to_adata,
    output_folder,
)

Lastly, you need to call the load_graph function to populate the corresponding data structures.

obj.load_graph(
    json_file_path = path_to_graph,
    clusters_file_path = path_to_clusters,
)

Pruning a TMC tree

To prune a TMC tree you can use the following instructions. First, imagine you have the following tree

Now imagine that you want to prune it in a way such that clusters that originate from branches whose modularity $Q$ is less than $-1.44$ MADs (median absolute deviations) from the median are to be removed. This is how you would achieve that.

obj.prune_tree_by_feature(
    feature = "Q",
    mad_multiplier = -1.44,
    modify_adata = True,
    cell_ann_col = "cell_labels",
)

The resulting is shown below.

Now imagine that you want to further prune the tree, but now you want to remove clusters whose size is below 10 MADs from the median. This is how you would do that.

obj.prune_tree_by_feature(
    feature = "size",
    mad_multiplier = 10,
    modify_adata = True,
    cell_ann_col = "cell_labels",
)

And this is the resulting tree.

Quick run (needs to be updated)

If you want to see a concrete example of how to use toomanycells, check out the jupyter notebook demo.

Quick plotting

If you are already familiar with toomanycells and want to generate a quick plot (an SVG) of your tree after calling

tmc_obj.run_spectral_clustering()

then use the following call

tmc_obj.store_outputs(
    cell_ann_col="name_of_the_column",
    plot_tree=True,
)

or

tmc_obj.store_outputs(
    cell_ann_col="name_of_the_column",
    plot_tree=True,
    draw_modularity=False,
    draw_node_numbers=False,
)

with the appropriate flags. If you already have the outputs and you just want to plot, then simply call

tmc_obj.easy_plot(
   cell_ann_col="name_of_the_column",
)

or

tmc_obj.easy_plot(
   cell_ann_col="name_of_the_column",
   draw_modularity=False,
   draw_node_numbers=False,
)

with the appropriate flags, where name_of_the_column is the name of the AnnData.obs column that contains the cell annotations. This function will look for the outputs in the folder that you defined for the output_directory as shown in step 2. Note that this function relies on too-many-cells (à la Haskell). So you need to have it installed. If you work within the cluster of your organization, maybe it has already been installed, and you could load it as follows.

module add too-many-cells

Otherwise, I recommend you to install it with Nix.

Generating a matrix market file

Sometimes you want to generate a matrix market file from a set of genes so that you can visualize them with other tools. The function that will help you with this is called create_data_for_tmci. Just for context, imagine you are interested in two genes, COL1A1, and TCF21. Moreover, imagine that you also want to include in your matrix another feature located in the obs data frame called total_counts. Finally, assume that the column that contains the labels for your cells is called cell_annotations. Please use this a template for your specific needs.

from toomanycells import TooManyCells as tmc
tmc_obj = tmc(A)
tmc_obj.run_spectral_clustering()                                 
tmc_obj.store_outputs(cell_ann_col="cell_annotations")            
list_of_genes = []                                            
list_of_genes.append("COL1A1")                                
list_of_genes.append("TCF21")                                 
list_of_genes.append("total_counts")                          
tmc_obj.create_data_for_tmci(list_of_genes=list_of_genes)

These lines of code will produce for you a folder named tmci_mtx_data with the expected outputs.

Starting from scratch

First import the module as follows

from toomanycells import TooManyCells as tmc

If you already have an AnnData object A loaded into memory, then you can create a TooManyCells object with
```
tmc_obj = tmc(A)
```
In this case the output folder will be called tmc_outputs. However, if you want the output folder to be a particular directory, then you can specify the path as follows.
```
tmc_obj = tmc(A, output_directory)
```
If instead of providing an AnnData object you want to provide the directory where your data is located, you can use the syntax
```
tmc_obj = tmc(input_directory, output_directory)
```
If your input directory has a file in the matrix market format, then you have to specify this information by using the following flag
```
tmc_obj = tmc(input_directory,
              output_directory,
              input_is_matrix_market=True)
```

Under this scenario, the input_directory must contain a .mtx file, a barcodes.tsv file (the observations), and a genes.tsv (the features).

Once your data has been loaded successfully, you can start the clustering process with the following command
```
tmc_obj.run_spectral_clustering()
```

In my desktop computer processing a data set with ~90K cells (observations) and ~30K genes (features) took a little less than 6 minutes in 1809 iterations. For a larger data set like the Tabula Sapiens with 483,152 cells and 58,870 genes (14.51 GB in zip format) the total time was about 50 minutes in the same computer.

At the end of the clustering process the .obs data frame of the AnnData object should have two columns named ['sp_cluster', 'sp_path'] which contain the cluster labels and the path from the root node to the leaf node, respectively.
```
tmc_obj.A.obs[['sp_cluster', 'sp_path']]
```

To generate the outputs, just call the function

tmc_obj.store_outputs()

or

tmc_obj.store_outputs(
    cell_ann_col="name_of_the_column",
    plot_tree=True,
    )

to include a plot of the graph.

This call will generate JSON file containing the nodes and edges of the graph (graph.json), one CSV file that describes the cluster information (clusters.csv), another CSV file containing the information of each node (node_info.csv), and two JSON files. One relates cells to clusters (cluster_list.json), and the other has the full tree structure (cluster_tree.json). You need this last file for too-many-cells interactive (TMCI).

If you already have the graph.json file you can load it with
```
tmc_obj.load_graph(json_fname="some_path")
```

Visualization with TMCI

If you want to visualize your results in a dynamic platform, I strongly recommend the tool too-many-cells-interactive. To use it, first make sure that you have Docker Compose and Docker. One simple way of getting the two is by installing Docker Desktop. Note that with MacOS the instructions are slightly different. If you use Nix, simply add the packages pkgs.docker and pkgs.docker-compose to your configuration or home.nix file and run

home-manager switch

If you installed Docker Desktop you probably don't need to follow this step. However, under some distributions the following two commands have proven to be essential. Use

sudo dockerd

to start the daemon service for docker containers and

sudo chmod 666 /var/run/docker.sock

to let Docker read and write to that location.

Now clone the repository

git clone https://github.com/schwartzlab-methods/too-many-cells-interactive.git

and store the path to the too-many-cells-interactive folder in a variable, for example path_to_tmc_interactive. Also, you will need to identify a column in your AnnData.obs data frame that has the labels for the cells. Let's assume that the column name is stored in the variable cell_annotations. Lastly, you can provide a port number to host your visualization, for instance port_id=1234. Then, you can call the function

tmc_obj.visualize_with_tmc_interactive(
         path_to_tmc_interactive,
         cell_annotations,
         port_id)

The following visualization corresponds to the data set with ~90K cells (observations).

And this is the visualization for the Tabula Sapiens data set with ~480K cells.

Running TMCI independently

In case you already have the outputs for TMCI, but you want to visualize a specific set of genes on top of your tree, you are going to need the expression matrix corresponding to those genes in the matrix marker format. You will also need a list of genes and the barcodes. All of that can be easily achieved with toomanycells (à la Python) after loading your matrix or AnnData object. If you are interested in only a few genes, you can call

tmc_obj.create_data_for_tmci(
   list_of_genes = ["G1","G2",...,"Gn"]
)

where G1,G2,...,Gn, are the labels of the genes of interest. If instead you have a table of genes stored as a text file, then use the call

tmc_obj.create_data_for_tmci(
   path_to_genes = "path/to/genes.csv"
)

Lastly, if you want to write all the available genes to a matrix, then simply call

tmc_obj.create_data_for_tmci()

but note that this could take a considerable amount of time, depending on how many genes are in your matrix. After calling this function, you will have a new folder called tmci_mtx_data which will contain the aforementioned files. It is also important to mention that you need a file wiht the labels

./start-and-load.sh \
 --matrix-dir /path_to/tmci_mtx_data \
 --tree-path /path_to/cluster_tree.json \
 --label-path /path_to/cell_annotations.csv \
 --port 2025 \
 --debug

What is the time complexity of toomanycells (à la Python)?

To answer that question we have created the following benchmark. We tested the performance of toomanycells in 20 data sets having the following number of cells: 6,360, 10,479, 12,751, 16,363, 23,973, 32,735, 35,442, 40,784, 48,410, 53,046, 57,621, 62,941, 68,885, 76,019, 81,449, 87,833, 94,543, 101,234, 107,809, and 483,152. The range goes from thousands of cells to almost half a million cells. These are the results. As you can see, the program behaves linearly with respect to the size of the input. In other words, the observations fit the model $T = k\cdot N^p$, where $T$ is the time to process the data set, $N$ is the number of cells, $k$ is a constant, and $p$ is the exponent. In our case $p\approx 1$. Nice!

Cell annotation

CellTypist

When visualizing the tree, we often are interested on observing how different cell types distribute across the branches of the tree. In case your AnnData object lacks a cell annotation column in the obs data frame, or if you already have one but you want to try a different method, we have created a wrapper function that calls CellTypist. Simply write

   tmc_obj.annotate_with_celltypist(
           column_label_for_cell_annotations,
   )

and the obs data frame of your AnnData object will have a column named like the string stored under the column_label_for_cell_annotations variable. By default we use the Immune_All_High celltypist model that contains 32 cell types. If you want to use another model, simply write

   tmc_obj.annotate_with_celltypist(
           column_label_for_cell_annotations,
           celltypist_model,
   )

where celltypist_model describes the type of model to use by the library. For example, if this variable is equal to Immune_All_Low, then the number of possible cell types increases to 98. For a complete list of all the models, see the following list. Lastly, if you want to use the fact that transcriptionally similar cells are likely to cluster together, you can assign the cell type labels on a cluster-by-cluster basis rather than a cell-by-cell basis. To activate this feature, use the call

   tmc_obj.annotate_with_celltypist(
           column_label_for_cell_annotations,
           celltypist_model,
           use_majority_voting = True,
   )

Filtering cells

If you want to select cells that belong to a class defined within a specific column of the .obs dataframe, you can use the following call.

 A = tmc_obj.filter_for_cells_with_property(
   "cell_type", "Neuro-2a")

In this case all cells that have the label Neuro-2a within the column cell_type in the .obs dataframe will be selected, and the resulting AnnData object A will only have these cells.

Graph operations

Using MEDUSA

If you use MEDUSA, please refer to this YAML file to have an idea of how you can use some of the following functions from TooManyCells à la Python.

Selecting cells through branches

Imagine you have a tree structure of your data like the one shown below. If you want to isolate all the cells that belong to branches 261 and 2, and produce an AnnData object with those cells, simply use the following call

   adata = tmc_obj.isolate_cells_from_branches(
      list_of_branches=[261,2])

If you have a CSV file that specifies the branches, then use the following call

   adata = tmc_obj.isolate_cells_from_branches(
    path_to_csv_file="list_of_branches.csv",
    branch_column="node",
   )

The name of the column that contains the branches or nodes is specified through the keyword branch_column. Lastly, if you want to store a copy of the indices, use the following call

   adata = tmc_obj.isolate_cells_from_branches(
    path_to_csv_file="list_of_branches.csv",
    branch_column="node",
    generate_cell_id_file=True,
   )

Mean expression of a branch

Imagine we have the following tree. If you want to quantify the mean expression of the marker CD9 on branch 261, you can use the following call

   m_exp = tmc_obj.compute_cluster_mean_expression(
        node=261, genes=["CD9"])

and you would obtain 12.791.

Looking at the above plot, this suggests that Neuro-2a cells highly express this marker. If instead we were interested in a different marker, like SDC1, this would be the corresponding color map expression across the nodes.

The above plot also illustrates that some Neuro-2a cells are rich in SDC1.

Median absolute deviation classification

First we introduce the concept of median absolute deviation. Imagine you have a list of $n$ observations $Z = [z_0,z_1,\ldots,z_{n-1}]$. Let $\mathcal{M}:\mathbb{R}^n \to \mathbb{R}$ be the function that computes the median of a list. Consider a new list $K=[k_0,k_1,\ldots,k_{n-1}]$, where $k_i = \left| z_i - \mathcal{M}(Z) \right|$. Then, the median absolute deviation of $Z$ is the median of the absolute differences between the original value and the median. Mathematically, $\text{MAD}(Z) = \mathcal{M}(K)$. For this section we will be indicating the expression of a gene in terms of MADs. The reason is that we want to classify cells, and using quantities that capture the dispersion of the data is a convenient approach for that purpose. An important point to mention is that for each gene, instead of considering the raw expression values across all cells as the elements of the list $Z$, we use the mean expression for each node of the tree. In other words, for a given gene, the element $z_k$ represents the mean expression of that gene for node $k$. Thus, $n$ indicates the number of nodes in the tree.

Based on the previous example, now imagine you want to find cells whose expression of two markers, CD9 and SDC1, is 1 MAD above the median. First, you need a CSV file containing the following information.

     Marker      Cell  Threshold Direction
       CD9  Neuro-2a        1.0     Above
      SDC1  Neuro-2a        1.0     Above

Let's call it marker_and_cell_info.csv. Note: For this discussion the cell types indicated in the Cell column are not relevant and will not be used. We quantify the mean expression of those markers for every node of the tree and store that information within each node. We can do that using the following call.

tmc_obj.populate_tree_with_mean_expression_for_all_markers(
    cell_marker_path="marker_and_cell_info.csv")

Then we compute basic statistics for each marker using the following function

tmc_obj.compute_node_expression_metadata()

These are the statistics associated to those markers.

           median        mad       min         max   min_mad      max_mad       delta
CD9      3.080538   1.918258  0.000890   22.944445 -1.605441    10.355182    0.797375
SDC1     2.989691   1.165005  0.001669    6.639456 -2.564814     3.132832    0.379843

Note that the maximum expression of CD9 is about 10 MADs above the median, while that of SDC1 is only about 3 MADs above the median. The plot corresponding to the distribution of those markers across all nodes can be generated through this call

tmc_obj.plot_marker_distributions()

The plots will be all contained in a dynamic html file. Here are some examples.

This is the distribution for CD9:

and with TooManyCellsInteractive

The distribution for SDC1 looks as follows.

If we want to isolate the cells that satisfy the conditions

     Marker      Cell  Threshold Direction
       CD9  Neuro-2a        1.0     Above
      SDC1  Neuro-2a        1.0     Above

We can use the call

tmc_obj.select_cells_based_on_inequalities(
    cell_ann_col="cell_type")

where the cell annotation column in the .obs dataframe is specified through the cell_ann_col keyword. This function will return an AnnData object with all the cells satisfying all the constraints.

This function will also produce multiple CSV files. One for each inequality specified through the file of constraints. For example, one for all cells whose expression of CD9 was above 1 MAD of the median expression of CD9, one for all cells whose expression of SDC1 was above 1 MAD of the median expression of SDC1, and one corresponding to the intersection of all of the above. The above function will modify the original AnnData object by adding to the .obs dataframe a column named Intersection indicating with a boolean value if a cell satisfies all the constraints.

Lastly, if the number of markers is less than or equal to three, then the .obs dataframe will include a column classifying the cells based on whether they express highly or not each of the markers. For instance, in this example we obtained the following outputs.

Class
CD9-Low-SDC1-Low      28729
CD9-High-SDC1-Low      6072
CD9-High-SDC1-High     4223
CD9-Low-SDC1-High      2058
Name: count, dtype: int64
Class
CD9-Low-SDC1-Low      0.699309
CD9-High-SDC1-Low     0.147802
CD9-High-SDC1-High    0.102794
CD9-Low-SDC1-High     0.050095
Name: proportion, dtype: float64

This indicates that the majority of the cells, i.e., about 70% of cells, are low in CD9 and low in SDC1, and about 10% of cells are high in both. Note that in this particular example when we say high it means that the expression is above 1 MAD from the median, and low is the complement of that.

Heterogeneity quantification

Imagine you want to compare the heterogeneity of cell populations belonging to different branches of the toomanycells tree. By branch we mean all the nodes that derive from a particular node, including the node that defines the branch in question. For example, we want to compare branch 1183 against branch 2. One way to do this is by comparing the modularity distribution and the cumulative modularity for all the nodes that belong to each branch. We can do that using the following calls. First for branch 1183

   tmc_obj.quantify_heterogeneity(
      list_of_branches=[1183],
      use_log_y=true,
      tag="branch_A",
      show_column_totals=true,
      color="blue",
      file_format="svg")

And then for branch 2

   tmc_obj.quantify_heterogeneity(
      list_of_branches=[2],
      use_log_y=true,
      tag="branch_B",
      show_column_totals=true,
      color="red",
      file_format="svg")

Note that you can include multiple nodes in the list of branches. From these figures we observe that the higher cumulative modularity of branch 1183 with respect to branch 2 suggests that the former has a higher degree of heterogeneity. However, just relying in modularity could provide a misleading interpretation. For example, consider the following scenario where the numbers within the nodes indicate the modularity at that node.

In this case, scenario A has a larger cumulative modularity, but we note that scenario B is more heterogeneous. For that reason we recommend also computing additional diversity measures. First, we need some notation. For all the branches belonging to the list of branches in the above function

quantify_heterogeneity, let $C$ be the set of leaf nodes that belong to those branches. We consider each leaf node as a separate species, and we call the whole collection of cells an ecosystem. For $c_i \in C$, let $|c_i|$ be the number of cells in $c_i$ and $|C| = \sum_i |c_i|$ the total number of cells contained in the given branches. If we let

$$p_i = \dfrac{|c_i|}{|C|},$$

then we define the following diversity measure

$$D(q) = \left(\sum_{i=1}^{n} p_i^q \right)^{\frac{1}{1-q}}. $$

In general, the larger the value of $D(q)$, the more diverse is the collection of species. Note that $D(q=0)$ describes the total number of species. We call this quantity the richness of the ecosystem. When $q=1$, $D$ is the exponential of the Shannon entropy

$$H = -\sum_{i=1}^{n}p_i \ln(p_i).$$

When $q=2$, $D$ is the inverse of the Simpson's index

$$S = \sum_{i=1}^{n} (p_i)^2,$$

which represents the probability that two cells picked at random belong to the same species. Hence, the higher the Simpson's index, the less diverse is the ecosystem. Lastly, when $q=\infty$, $D$ is the inverse of the largest proportion $\max_i(p_i)$.

In the above example, for branch 1183 we obtain

               value
Richness  460.000000
Shannon     5.887544
Simpson     0.003361
MaxProp     0.010369
q = 0     460.000000
q = 1     360.518784
q = 2     297.562094
q = inf    96.442786

and for branch 2 we obtain

               value
Richness  280.000000
Shannon     5.500414
Simpson     0.004519
MaxProp     0.010750
q = 0     280.000000
q = 1     244.793371
q = 2     221.270778
q = inf    93.021531

After comparing the results using two different measures, namely, modularity and diversity, we conclude that branch 1183 is more heterogeneous than branch 2.

Similarity functions

So far we have assumed that the similarity matrix $S$ is computed by calculating the cosine of the angle between each observation. Concretely, if the matrix of observations is $B$ ($m\times n$), the $i$-th row of $B$ is $x = B(i,:)$, and the $j$-th row of $B$ is $y=B(j,:)$, then the similarity between $x$ and $y$ is

$$S(x,y)=\frac{x\cdot y}{||x||_2\cdot ||y||_2}.$$

However, this is not the only way to compute a similarity matrix. We will list all the available similarity functions and how to call them.

Cosine (sparse)

If your matrix is sparse, i.e., the number of nonzero entries is proportional to the number of samples ($m$), and you want to use the cosine similarity, then use the following instruction.

tmc_obj.run_spectral_clustering(
   similarity_function="cosine_sparse")

By default we use the ARPACK library (written in Fortran) to compute the truncated singular value decomposition. The Halko-Martinsson-Tropp algorithm is also available. However, this one is not deterministic.

tmc_obj.run_spectral_clustering(
   similarity_function="cosine_sparse",
   svd_algorithm="arpack")

If $B$ has negative entries, it is possible to get negative entries for $S$. This could in turn produce negative row sums for $S$. If that is the case, the convergence to a solution could be extremely slow. However, if you use the non-sparse version of this function, we provide a reasonable solution to this problem.

Dimension-adaptive Euclidean Norm (DaEN)

If your data consists of points whose Euclidean norm varies across multiple length scales, then one option is to use a similarity function that can adapt to those changes in magnitude. Before I explain it in detail, here is how you can call it

tmc_obj.run_spectral_clustering(
    similarity_function="norm_sparse")

Cosine

If your matrix is dense, and you want to use the cosine similarity, then use the following instruction.

tmc_obj.run_spectral_clustering(
   similarity_function="cosine")

The same comment about negative entries applies here. However, there is a simple solution. While shifting the matrix of observations can drastically change the interpretation of the data because each column lives in a different (gene) space, shifting the similarity matrix is actually a reasonable method to remove negative entries. The reason is that similarities live in an ordered space and shifting by a constant is an order-preserving transformation. Equivalently, if the similarity between $x$ and $y$ is less than the similarity between $u$ and $w$, i.e., $S(x,y) < S(u,w)$, then $S(x,y)+s < S(u,w)+s$ for any constant $s$. The raw data have no natural order, but similarities do. To shift the (dense) similarity matrix by $s=1$, use the following instruction.

tmc_obj.run_spectral_clustering(
   similarity_function="cosine",
   shift_similarity_matrix=1)

Note that since the range of the cosine similarity is $[-1,1]$, the shifted range for $s=1$ becomes $[0,2]$. The shift transformation can also be applied to any of the subsequent similarity matrices.

Laplacian

The similarity matrix is given by

$$ S(x,y)=\exp(-\gamma\cdot \left\lVert x-y \right\rVert _1). $$

This is an example:

tmc_obj.run_spectral_clustering(
   similarity_function="laplacian",
   similarity_gamma=0.01)

This function is very sensitive to $\gamma$. Hence, an inadequate choice can result in poor results or no convergence. If you obtain poor results, try using
a smaller value for $\gamma$.

Gaussian

The similarity matrix is given by

$$ S(x,y)=\exp(-\gamma\cdot \left\lVert x-y\right\rVert _2^2). $$

This is an example:

tmc_obj.run_spectral_clustering(
   similarity_function="gaussian",
   similarity_gamma=0.001)

As before, this function is very sensitive to $\gamma$. Note that the norm is squared. Thus, it transforms big differences between $x$ and $y$ into very small quantities.

Divide by the sum

The similarity matrix is given by

$$ S(x,y)=1-\frac{ \left\lVert x-y \right\rVert_p }{ \left\lVert x \right\rVert_p + \left\lVert y \right\rVert_p }, $$

where $p =1$ or $p=2$. The rows of the matrix are normalized (unit norm) before computing the similarity. This is an example:

tmc_obj.run_spectral_clustering(
   similarity_function="div_by_sum")

Normalization

TF-IDF

If you want to use the inverse document frequency (IDF) normalization, then use

tmc_obj.run_spectral_clustering(
   similarity_function="some_sim_function",
   use_tf_idf=True)

If you also want to normalize the frequencies to unit norm with the $2$-norm, then use

tmc_obj.run_spectral_clustering(
   similarity_function="some_sim_function",
   use_tf_idf=True,
   tf_idf_norm="l2")

If instead you want to use the $1$-norm, then replace "l2" with "l1".

Simple normalization

Sometimes normalizing your matrix of observations can improve the performance of some routines. To normalize the rows, use the following instruction.

tmc_obj.run_spectral_clustering(
   similarity_function="some_sim_function",
   normalize_rows=True)

Be default, the $2$-norm is used. To use any other $p$-norm, use

tmc_obj.run_spectral_clustering(
   similarity_function="some_sim_function",
   normalize_rows=True,
   similarity_norm=p)

Gene expression along a path

Introduction

Imagine you have the following tree structure after running toomanycells. Further, assume that the colors denote different classes satisfying specific properties. We want to know how the expression of two genes, for instance, Gene S and Gene T, fluctuates as we move from node $X$ (lower left side of the tree), which is rich in Class B, to node $Y$ (upper left side of the tree), which is rich in Class C. To compute such quantities, we first need to define the distance between nodes.

Distance between nodes

Assume we have a (parent) node $P$ with two children nodes $C_1$ and $C_2$. Recall that the modularity of $P$ indicates the strength of separation between the cell populations of $C_1$ and $C_2$. A large the modularity indicates strong connections, i.e., high similarity, within each cluster $C_i$, and also implies weak connections, i.e., low similarity, between $C_1$ and $C_2$. If the modularity at $P$ is $Q(P)$, we define the distance between $C_1$ and $C_2$ as

$$d(C_1,C_2) = Q(P).$$

We also define $d(C_i,P) = Q(P)/2$. Note that with those definitions we have that

$$d(C_1,C_2)=d(C_1,P) +d(P,C_2)=Q(P)/2+Q(P)/2=Q(P),$$

as expected. Now that we know how to calculate the distance between a node and its parent or child, let $X$ and $Y$ be two distinct nodes, and let ${(N_{i})}_{i=0}^{n}$ be the sequence of nodes that describes the unique path between them satisfying:

$N_0 = X$,
$N_n=Y$,
$N_i$ is a direct relative of $N_{i+1}$, i.e., $N_i$ is either a child or parent of $N_{i+1}$,
$N_i \neq N_j$ for $i\neq j$.

Then, the distance between $X$ and $Y$ is given by

$$d(X,Y) = \sum_{i=0}^{n-1} d(N_{i},N_{i+1}).$$

Gene expression

We define the expression of Gene G at a node $N$, $Exp(G,N)$, as the mean expression of Gene G considering all the cells that belong to node $N$. Hence, given the sequence of nodes

$$(N_i)_{i=0}^{n}$$

we can compute the corresponding gene expression sequence

$$(E_{i})_{i=0}^{n}, \quad E_i = Exp(G,N_i).$$

Cumulative distance

Lastly, since we are interested in plotting the gene expression as a function of the distance with respect to the node $X$, we define the sequence of real numbers

$$(D_{i})_{i=0}^{n}, \quad D_{i} = d(X,N_{i}).$$

Summary

The sequence of nodes between $X$ and $Y$ $${(N_{i})}_{i=0}^{n}$$
The sequence of gene expression levels between $X$ and $Y$ $${(E_{i})}_{i=0}^{n}$$
And the sequence of distances with respect to node $X$ $${(D_{i})}_{i=0}^{n}$$

The final plot is simply $E_{i}$ versus $D_{i}$. An example is given in the following figure.

Example

Note how the expression of Gene A is high relative to that of Gene B at node $X$, and as we move farther towards node $Y$ the trend is inverted and now Gene B is highly expressed relative to Gene A at node $Y$.

ATAC-seq matrix construction and clustering workflow

This tool implements a preprocessing and clustering pipeline for single-cell ATAC-seq data inspired by the Too Many Peaks framework. The goal is to construct a cell-level representation of chromatin accessibility that supports hierarchical, cluster-aware peak discovery while remaining scalable to large datasets.

Input data

The pipeline starts from a standard scATAC-seq fragments file ,e.g., fragments.tsv.gz, where each row represents a sequenced DNA fragment associated with a single cell barcode. Each fragment corresponds to two Tn5 transposase insertion events in the accessible chromatin.

Fragment preprocessing

Fragments are first standardized and filtered to remove technical artifacts
Only standard chromosomes (based on the selected genome build) are retained.
Fragments are sorted by genomic coordinate.
Genome blacklist regions are downloaded (e.g. ENCODE blacklist) and fragments overlapping blacklist intervals are removed.

These steps ensure consistent genomic coordinates and reduce spurious signal arising from low-mapping regions.

Genome binning and fragment assignment

The genome is partitioned into fixed-width bins (e.g. 5 kb). Chromatin accessibility is quantified at the level of bins rather than predefined peaks to avoid bias toward globally dominant regulatory elements.

Fragments are assigned to bins using a cut-site-based strategy
Each fragment contributes two insertion events corresponding to its start and end coordinates.
A bin is considered accessible in a given cell if it contains at least one insertion event.

The resulting cell x bin matrix is binarized (0/1), indicating presence or absence of accessibility. This representation reflects what ATAC-seq directly measures (Tn5 insertions) and avoids over-weighting bins with high fragment counts.

AnnData construction

The binarized accessibility matrix is stored in an AnnData object:

Rows (adata.obs): cells (barcodes)
Columns (adata.var): genomic bins with coordinate metadata
Matrix (adata.X): sparse binary cell × bin accessibility matrix

This format enables efficient downstream processing and interoperability with standard single-cell analysis tools.

TF-IDF normalization and LSI

To account for variability in sequencing depth and bin prevalence, the accessibility matrix is transformed using TF–IDF normalization:

Term Frequency (TF): per-cell depth normalization
Inverse Document Frequency (IDF): down-weighting bins accessible in many cells

Dimensionality reduction is then performed using truncated singular value decomposition (SVD), yielding a low-dimensional latent semantic indexing (LSI) embedding (e.g. 50 components), which is stored in adata.obsm["X_lsa"]. This representation emphasizes cell-type–specific accessibility patterns and is suitable for graph-based clustering.

Hierarchical clustering with TooManyCells

Cells are clustered using recursive spectral clustering via the TooManyCells framework. Clustering is performed in LSI space using diameter-normalized Euclidean similarity (DNES):

A similarity graph is constructed for the current set of cells.
The graph Laplacian is computed and partitioned using the Fiedler vector.
After each split, similarities are recomputed locally within each subset. Local renormalization at each branching point allows the algorithm to adapt to the intrinsic geometry of each subpopulation and improves sensitivity to fine-grained structure during recursive partitioning.

Tree pruning and cluster refinement

Recursive spectral clustering intentionally over-partitions the data. The resulting hierarchy is pruned to obtain a smaller set of stable, interpretable clusters by:

Enforcing minimum cluster size and balance constraints during recursion
Removing unstable splits based on resampling or separability criteria
Merging clusters with highly similar pseudo-bulk accessibility profiles

This strategy favors mild over-partitioning followed by principled merging, ensuring that biologically meaningful structure is preserved.

Output

The final output is an AnnData object containing:

A binarized genome-wide accessibility matrix
TF-IDF LSI embeddings
Hierarchical and pruned cluster assignments

This representation is designed to support downstream cluster-aware peak calling, motif analysis, and regulatory interpretation in line with the Too Many Peaks philosophy.

Acknowledgments

I would like to thank the Schwartz lab (GW) for letting me explore different directions and also Christie Lau for providing multiple test cases to improve this implementation.

Name		Name	Last commit message	Last commit date
Latest commit History 347 Commits
.github		.github
docs		docs
tests		tests
toomanycells		toomanycells
.editorconfig		.editorconfig
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
requirements_dev.txt		requirements_dev.txt
toomanycells_demo.ipynb		toomanycells_demo.ipynb

Folders and files

Latest commit

History

Repository files navigation

too-many-cells (à la Python)

It's Scanpy friendly!

Please remember to cite us

Dependencies

Virtual environments

Installation

Loading a TMC tree

Pruning a TMC tree

Quick run (needs to be updated)

Quick plotting

Generating a matrix market file

Starting from scratch

Visualization with TMCI

Running TMCI independently

What is the time complexity of toomanycells (à la Python)?

Cell annotation

CellTypist

Filtering cells

Graph operations

Using MEDUSA

Selecting cells through branches

Mean expression of a branch

Median absolute deviation classification

Heterogeneity quantification

Similarity functions

Cosine (sparse)

Dimension-adaptive Euclidean Norm (DaEN)

Cosine

Laplacian

Gaussian

Divide by the sum

Normalization

TF-IDF

Simple normalization

Gene expression along a path

Introduction

Distance between nodes

Gene expression

Cumulative distance

Summary

Example

ATAC-seq matrix construction and clustering workflow

Input data

Fragment preprocessing

Genome binning and fragment assignment

AnnData construction

TF-IDF normalization and LSI

Hierarchical clustering with TooManyCells

Tree pruning and cluster refinement

Output

Acknowledgments

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 65

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages