Using allCools generate-dataset to predict the accessibility of human cell types using a Genomics Kit for single-nucleus experiments
The manufacturer recommended that the 10X Genomics kit be used for all single-nucleus experiments. Library preparation was performed according to the manufacturer’s recommendation. Libraries were pooled and sequenced on NovaSeq S2.
The data from Epi-retro-seq was mapped to the mm10 genome in the previous study. The whole genome was parsed into 100 kb non-overlapping genomic bins (Chr. 1:0–100,000; Chr. 1: 100,000–200,000; and so on) using bedtools make-window, and for each single cell, we counted the methylated and total basecalls for all 100-kb bins using ALLCools generate-dataset. We counted the expansion of all the genes in both directions. The data is in Zarr format, so chunk loading can be done with it. To avoid the methylation differences being driven by the active and inactive X chromosomes, we used only the autosomal bins and genes in our analyses. The input for all downstream analyses is the cell-by-bin and cell-by-gene methylation levels that were computed.
The model’s accuracy was evaluated when applied to human cell types. We first identified matched human cell types from a previous study23. For each subclass in human and mouse cCREs, we performed spearman correlation across orthologous cCREs (Extended Data Fig. 16). We next selected pairs based on correlation and annotation matching. We used the model to predict the accessibility of the human cell types. We looked at the prediction accuracy within and across different cell types.
Deep Learning in the Brain with a Novel Loss Function based on CRNA-Seq and snATAC-seq Data
We trained the subclass-level deep-learning model on four NVIDIA A100 80 GB GPUs using the script basenji_train.py from Basenji65. For training, we set the parameter batch size to 32, epochs to 150 and patience to 30.
I and j represent the type of cell. The ytrue is a matrix of bin-by-type calculated from true signals. The ypred represents the predicted genomic bin-by-type matrix. The pairwise covariance wi,i was calculated between cell types. We use the scores across the rows and the number of cell types as weights. Last, the weights w was dot multiplied by the original poisson loss.
The model architecture and parameters are adapted from the mouse model with modifications only in the last output head layer. To encourage the model to predict cCREs in under-represented cell types, we created one novel loss function:
We adapted the recently published Python package CellOracle52 on our data to infer GRNs for each cell subclass across the whole mouse brain based on our integration analysis between our snATAC–seq data and the scRNA-seq data5. Three steps were followed. First, we identified the co-accessibility distal-to-proximal pairs, which was described previously for each subclass. Second, we mapped the distal cCREs to TFs. Lastly, we identified the regulatory relationships between TFs and the potential target genes by fitting a regularized linear regression model using scRNA-seq data. For the second step, according to the CellOracle tutorial, we used the Python package gimmemotifs53 for the TF-binding-motif scan with the mouse genome mm10 and the default motif database provided by CellOracle. The proximal cCREs were mapped to the genes based on GENCODE mm10 (v.23, the same as above). We used Seurat32 to randomly sample 1,000 cells per subclass (all of the cells of a cell subclass were used if it had <1,000 cells). To select the variable features, we performed the FindVariableFeatures function of Seurat to select the top 3,000 genes, and then we manually added the 499 TFs (if any of them were missed in the previous 3,000 genes) that were reported in the scRNA-seq data of ref. 5. For each subclass, we performed CellOracle on the scRNA-seq data with the default parameters. We used the weights and the edges to filter out the predicted interactions between genes and TFs, as suggested by CellOracle. Finally, 267 out of 275 subclasses successfully had the predicted GRNs.
We used clusterProfiler 100,101 to perform GO enrichment analysis. The background genes were selected on the basis of the enrichment analysis. The Benjamini–Hochberg method was used to adjust the P value for multiple comparisons.
To analyse the TE-accessible variability with decreased noise, the TE signal was aggregated from the TE-cCREs. We normalized the correlation between the two, as well as the correlation between the two for all subclasses. mCG signal for each TE in matched subclasses from the companion paper41. To calculate the correlation between chromatin accessibility and RNA expression, we aggregated RNA signals at TE-cCREs of each TE in matched subclasses from a previous study99.
The TE annotation of cCREs was annotated using Homer45 and UCSC mm10 refGene and RepeatMasker annotation. To define the high TE-cCREs fraction of subclasses, we fitted a mixture model for the TE-cCRE fraction across all subclasses using the R package mixtools97 (v.2.0.0). The P value was calculated based on the null distribution.
The two peak modules that show global accessibility are based on the NMF analysis. We ranked the proximal–distal connections based on their Cicero scores and then took all of the cCREs into account. We treated them as global proximal–distal connections and performed the Hi-C signals by aggregating all of the Hi-C data. We observed the strong enrichment signals for the global connections from the heat maps.
Using permutation analysis to determine the enumeration of frequently interacting regions in the mouse brain: A phenomenological study on cCREs
We next tested whether cCREs are enriched at FIREs through permutation analysis. In brief, we shuffled the mouse genome 1,000 times, each time generating 1,053,811 random regions with equivalent sizes as the cCREs. Each shuffle we calculated how many overlaps there were between the randomly generated regions and the FIREs. It was found that cCREs are enriched at FIREs with more overlaps than expected.
We called frequently interacting regions (FIREs) in the mouse cortex38 by applying the criteria in our group’s FIRE paper95. In the mouse brain, most FIREs (3,158 out of 3, 169) overlap with cCREs, and a fraction of cCREs overlap with FIREs.
The coefficients were used to associate the modules with the cell clusters. Each row is a module while each column is a cell cluster. The values in the matrix indicate the weights of the clusters in their corresponding module. The coefficient matrix was then scaled by column (cluster) from 0 to 1. Subsequently, we used a coefficient > 0.1 (~95th percentile of the whole matrix) as a threshold to associate a cluster with a module.
The basis and coefficient matrix are used to define cCREs and cell cluster components. The rank R is the number of modules and is the key issue in determining a reasonable value for the matrix. Several criteria have been proposed to decide whether a given rank R decomposes the occupancy profile matrix into meaningful clusters. We applied two measures, Sparseness. And. Entropy42, to evaluate the clustering result. The average values were calculated using five NMF runs at each rank, with a random seed, to ensure that the measurements are stable.
We then retained only reproducible peaks on chromosome 1–19 and both sex chromosomes and filtered ENCODE mm10 blacklist regions. A union peak list for the whole dataset was obtained by merging peak sets from all of the cell clusters using BEDtools91.
For each cell cluster, we did a combination of all properly pairs reads to create a pseudobulk data set for individual biological replicates. Half of the reads from each biological replicate were created with two pseudoreplicates. We called peaks for each of the four datasets and a pool of both replicates independently. Peak calling was done on the Tn5-corrected single- base inserts using the following parameters. –shift -75 –extsize 150 –nomodel –call-summits –SPMR -q 0.01. Finally, we extended peak summits by 250 bp on either side to a final width of 501 bp for merging and downstream analysis. If the number of cells in the pseudobulk ATAC–seq from one biological replicate or individual pseudoreplicates is less than 200, we didn’t run MACS2 for it. We reduced the potential false negatives when the next step was made possible by a limited number of cells.
In order to generate a list of peaks that were detected in the pooled dataset, we retained peaks that were detected in individual replicates or that were found in both pseudoreplicates.
The same procedure was used to pick peaks at the bulk and single-cell levels. 9a) as in our previous study19. Before calling peaks, we merged the clusters if they shared the same cell cluster annotation and were in the same L3-level cluster. Next, 1,463 subtypes (including merged ones) were used.
The scRNA-seq data in ref. 8 were downloaded from the Gene Expression Omnibus with the identifier GSE133912. The retro-seq data were integrated to the unbiased scRNA-seq data for thalamic neurons, using the 5,404 CEGs of 1,128 L3 scRNA-seq clusters. PCA was fitted with scRNA-seq data and retro-seq data were transformed. The CCA framework was used to find anchors between the two datasets, and the transformed PCs of retro-seq were aligned to the scRNA-seq PCs. To compare the distribution of retro-seq cells and epi-retro-seq cells across thalamic cell clusters, we used the label transfer method described in the previous section to transfer the mC–RNA co-cluster label and the joint t-SNE coordinates from scRNA-seq cells to the retro-seq cells, considering five neighbouring scRNA-seq cells for each retro-seq cell.
We then applied the function knn from SnapATAC2 to construct the k-nearest neighbour graph using the parameter n_neighbors = 50 and the parameter method was set to ‘kdtree’. The function leiden is used in clustering with the object function set as modularity. The resolution which affects the number of clusters was used for the step size in the python package. We checked the UMAP 81 for each result to make sure the resolution was in line with the top silhouette coefficients. UMAP projections were calculated using the Python package umap with the parameters a as 1.8956, b as 0.8005 and init as spectral. All of the resolution parameters during clustering are provided in Supplementary Table 3. The final clusters from L 3-level clustering and L4-level clustering were represented using the term’subsequent’.
The high-dimension sparse500 bp genomic bin features per cell can be converted to low-dimensional representations with the use of the function of spectral from SnapATAC2. For L1-level and L2-level clustering, we chose 50 as the dimension of the low-dimensional representation space as usually a large number of cells and potentially diverse cell types was involved in the two levels. We used ‘elbow plot’ to rank all of the principal components to make sure that the top 50 components were sufficient for our analysis. We decided to use 30 for later analysis. The parameter ‘weighted_by_sd’ in the function spectral was set to be true for all dimensional reduction. The sample_size was not used in the functionspectral, so no approximation method was used. We have a computing system that has about 300 GB of memory.
AAV-retro-Cre-injected brains scanned with Fluorescence Microscopy and scanned on NovaSeq 6000
Brain dissections were done previously. The brains were taken from the 50 to 70 day old mice and immediately submerged in the ice-cold slicing buffer. The cation2, the magnesium and the NaH2PO4 were cut into 0.6-mm sections by the side of the pole. From each AAV-retro-Cre-injected brain, the slices were kept in the ice-cold dissection buffer from which selected brain regions (Fig. 1b) were manually dissected under a fluorescence dissecting microscope (Olympus SZX16), following the Allen Mouse Common Coordinate Framework, Reference Atlas, Version 3 (2015; Extended Data Fig. 1). The dissected brain tissues were transferred to prelabelled microcentrifuge tubes, immediately frozen in dry ice, and subsequently stored at −80° C. During the dissection, the injection site was visually inspected to verify the accuracy of the injection. Only brains with accurate injections were dissected for further analysis. The dimensions of the cellSens were used for image acquisition.
The library preparation and conversion were done with the help of the snmC-seq protocol. DNA samples from single nuclei were barcoded with random primers after the conversion, pooled through two rounds of cleaning up, then added with and amplified to generate libraries. Libraries were pooled, cleaned and normalized using the S4 flow cell 2 150 base-pair mode, and then analyzed on Illumina Novaseq 6000. The library preparation version of FreedomEVOware was used, as were the NovaSeq 6000 control software and Real-Time Analysis. Technicians doing nucleus preparations and snmC-seq analyses were blind to the injection sites used for each sample.
Without the need for a defined peak set, the data quality was quantified using the enrichment of ATAC–seq accessibility. We followed a previously described procedure86, and used the function filter_cells in SnapATAC2 to calculate TSS enrichment (TSSe). TSS positions were obtained from the GENCODE87 database v.16. Tn5-corrected inserts were aggregated 2000 bp to each distinct TSS genome, and the negative strand reads were shifted as a result. This profile was then normalized to the mean accessibility ±1,900–2,000 bp from the TSS and smoothed every 11 bp. The maximum of the smoothed profile was taken as the TSSe.
Nuclei with ≥1,000 uniquely mapped fragments and TSSe ≥ 10 were filtered for each of 234 samples according to the ENCODE ATAC–seq data standards and process pipeline (https://www.encodeproject.org/atac-seq/). The function of the filter_cells was used to achieve this.
We used a different version of Scrublet28 to remove doublets. First, we used the add_tile_matrix function to add the 500 bp genomic bin features, then used the select_features function to filter out the features with frequencies along the samples of lower than 0.5% or higher than 99.5%. The scrublet function ofSnapATAC2 was used to get the doublet scores. The parameter expected_doublet_rate was set to 0.08, which is based on our previous experiment on the snATAC–seq pipeline19. The potential doublets were removed from the analysis because of the barcode’s scrublet scores.
We compared Scrublet with another recently published method named AMULET30, which is used for doublet detection and removal in snATAC–seq data. In order to evaluate the performance of the two methods, we used the precision-recall curve (PRC) and the area under the PRC.
A Cluster-Buster46 Analysis of the Mesoscopic Expression Levels of Projection and Ventrolateral HY Nervous Cells (VMHvl)
Note that because the GRNs were constructed based only on correlations between data modalities and the binding motifs, they inevitably capture indirect and false-positive relationships. It would be necessary to conduct perturbation experiments to verify the connections between cis and trans regulators.
49,504 position weight matrices were collected from 29 sources and contained in the ensemble motif database from SCENIC+. TOMTOM45 calculated the distances of the thousands of motifs that were combined into 8,045 clusters. Each motif cluster was annotated with one or more mouse TF genes. We used a method called Cluster-Buster46 to calculate the occurrence of motifs, which was done by scanning them together with hidden models.
pseudobulk methylation profiles were created with the merged snmC-seq cells. The epi-retro-seq cells were not used owing to the different genome backgrounds of the mice to avoid confounding results. AllCools were used to identify DMRs within each region group. The calculation was done between the two fractions, the DMR mCG and the mCH fraction. The genes and DMRs within each sample were shuffled to calculate the null PCC. We looked at the target edges with FDR.
The normalized expression level of each single cell was determined by both the UMI count and log transformed. We compared clusters that were associated with projection neurons. For each cluster pair, the P values were derived with two-sided Wilcoxon rank-sum tests, and the fold change is computed as the ratio between the average expression level across cells in the two clusters. The genes with the highest absolute value of log2 fold change more than 1 and the smallest FDR values are differentially expressed. The heatmaps were generated from the merged DEGs. Only the top 100 DEGs ranked by FDRs were used if there were more than 100 DEGs identified between a pair of clusters.
Only the 23,345 Act-seq cells from 16 ventromedial HY (VMH) neuron clusters were considered as ventrolateral VMH cells (VMHvl) and were used for the behaviour-association studies in ref. 7. However, almost none of them correspond to projection-associated clusters in our data (Extended Data Fig. 7a–c. All of our clusters have corresponding clusters in the act-seq data, and we further compared them with the other clusters. 7d, in red). Among them, clusters 4 and 64 showed weak but significant increases in proportions of Fos+ cells labelled during certain behaviours (Extended Data Fig. 7f).
Finally, the neurons from each region group were selected and integrated with the scRNA-seq cells from the same region group, using the same procedure as described above. The mC–RNA co-cluster labels were transferred from the scRNA-seq cell to the MERFISH cells. The MERFISH cells assigned to the MM and DCO subclasses in the last step were also excluded as those clusters were not included in the co-cluster analysis as described in the previous section.
The common coordinate framework could also be achieved through registration of the MERFISH DAPI images. It is important to use cell types with known locations as landmarks in order to refine the registration because the procedure is relatively challenging.
The MERFISH experiments were conducted as described in ref. Gene panel design, tissue preparation, image, data processing, and annotations are included. The dataset includes two sagittal slices (S1 and S2, with S1 being more lateral and S2 being more medial) and 14 coronal slices (C2, C4, C6, C8, C10, C12 and C14, roughly corresponding to slices 2, 4, 6, 8, 10, 12 and 14 in Extended Data Fig. 1, with two replicates for each slice, represented as R1 and R2). The same naming of slices was used throughout this manuscript (Figs. 3e, 4d and 5c and Extended Data Fig. 6).
In this manuscript, we used the absolute proportions but not the relative ones to the unbiased data owing to the different profiling strategies between the two datasets. We pooled the different dissections into 32 different sources so that the proportion of cells from different sources was used for FANS. However, the unbiased snmC-seq profiled all of the dissection regions separately and sequenced the same number of cells in each dissection, which manually amplified the proportion of cells from the smaller or sparser dissection regions relative to the larger or denser ones, and limited the power to estimate the real proportion of neurons in each cluster from the sources.
In general, there are two intuitive ways to quantify the enrichment of projection neurons in a cluster. One is to directly find the clusters with a high absolute proportion of epi-retro-seq neurons projecting to a target. One method is to find clusters that are captured at a higher rate in the projection enriched data. The two methods have flaws and advantages. For example, the contaminated cells from inaccurate labelling or gating are likely to have a similar distribution across clusters to unbiased profiling. The contaminated clusters might be better excluded from a comparison using unbiased data. However, if most of the neurons from a projection type are in the clusters that are originally abundant cell types in the source, by comparing with unbiased data, we would miss the predominant clusters making the projection.
The model we used to find the five nearest snmC- was derived from snmC-seq data and was used to fit a PCA model. We transform the snmC-seq cells into epi-retro-seq cells by finding five nearest Epi-retro-seq cells for each cell. The mutual nearest neighbours between the two datasets were used as anchors and scored using the same method as Seurat v3.
We adopted a similar framework to that of Seurat v3 (ref. 41) for data integration by first identifying the mutual nearest neighbours as anchors between datasets, and then aligning the datasets through the anchors.
Improper labelling of Isocortical Neurons with FANS/AAV-retro-Cre: Step Four – Identification of the contaminant neurons
The rationale for step four is to eliminate the potential forContamination in the dataset that could have resulted from incorrect gating of GFP+N+ cells and AAV-retro-Cre injections that passed through overlying source brain regions. Inaccurate gating of GFP+NeuN+ cells could be more common in the experiments of some weak projections, in which very few neurons were retrogradely labelled, resulting in small proportions of cells passing FANS gating criteria and subsequent inclusion of high proportions of cells accepted from the edges of FANS gates. When targeting a deep structure in the brain, it is more common for inaccurate labelling to happen when getting cells from near the target. The on-target and off- target clusters were clear in this section of the experiment and it was only done for isocortical neurons. Since the prior knowledge of cell types associated with projection isn’t complete, it’s more challenging to estimate purity using this method. The projected data in the subcortical structures are usually strong and don’t involve adding sources or targets, which can lead to a lower noise level. After these steps, there are still expected to be some cells that have been contaminated.
There are a number of reasons that could contribute to a low prediction performance. The following are biological reasons. First, some neurons make projections to several targets simultaneously. These could result in the neurons being captured by several retrograde labelling experiments of different targets. The models we have for this type of neuron wouldn’t work for a single label. Second, some neurons project to different target regions but have tiny epigenetic differences. There’s still more that needs to be done to distinguish the first and second reasons.
We first created inhibitory and excitatory cell type groups based on their neurotransmitter expression as above. We classified cell types expressing GABA or GLY neurotransmitters as inhibitory and those expressing VGLUT neurotransmitters as excitatory. The cell type that was assigned to an excitatory and inhibitory identity was classified as the inhibitory type. For each region on the Slide-seq array, we labelled its beads as excitatory or inhibitory by whether they confidently mapped into members of the corresponding cell type groups, with additional filtering to ensure that these mappings were one of top two ranked cell types per bead. We defined the number of excitatory mapped beads to be (#I) and (#E)
To identify variable genes, we used a binomial model of homogenous expression and looked for deviations from that expectation, similar to a recently described approach74. The relative bulk expression was computed by summing up the counts across cells and dividing by the total UMIs of the population. This is the proportion of all counts that are assigned to that gene. This value is equivalent to in a Poisson model for observing a gene in a cell with n counts. The proportion with non-zero counts will be expected.
A Co-Embedding Space for Datasets with and without Labels based on One-Hot Encoding of Nearest Neighbours
Two datasets with and without labels are in a co-embedding space. Each cell in B is considered a query cell when it is found its k-nearest neighbours in A with an equivalent distance to the neighbours. After the transformation, the closer neighbours have higher weights, and the weights of all neighbours sum up to 1. To transfer a categorical label from A to B, we used one-hot encoding to represent the label and the label vectors corresponding to the k neighbours in A of the query cell (k-by-#categories, denoted as ({L}_{{\rm{ref}}})) were averaged with the weights w. The probabilities of each query cell are represented by the (bf L_rrrmermf) The category with the maximum probability is used as the final assignment.
In QC step 3, the experiments with fewer than 20 neurons were excluded to ensure the statistical power of projection analysis, resulting in 39,461 cells from 519 experiments left. After 34,643 neurons remained, the non-neuronal cells were also removed. The next section describes the cell-type classification method.
In quality control (QC) step 1, the cells included in the analysis are required to have a median mCCC level of the experiment < 0.025; 500,000 < nonclonal reads < 10,000,000; and mCCC level < 0.05. In total, 56,843 cells from 703 experiments satisfied these requirements (Extended Data Fig. 2a,b).
Source: Brain-wide correspondence of neuronal epigenomics and distant projections
rAAV-retro-Cre injections for retrograde labelling of neurons in the Salk Institute Animal Facility. Detailed data and conditions for a laboratory animal study
As described previously2, to label neurons projecting to regions of interest, injections of rAAV-retro-Cre (produced by Salk Vector Core or Vigene, 2 × 1012 to 1 × 1013 viral genomes per millilitre, produced with capsid from Addgene plasmid No. 81070 packaging pAAV-EF1a-Cre from Addgene plasmid No. 55636) were made into both hemispheres of the INTACT mice. The animals were placed in a stereotaxic frame after being anezoned with either of the two drugs. The glass micropipettes were used to make pressure injections of 0.05 to 0.25 l of AAV per injection site. Iontophoresis was used to ensure confined viral infections for the PFC, ENT, PIR, AMY, HY and CBN. The injections were made with a glass pipette with a tip around 10 m. For most of the desired target areas, injections were made at different depths, and/or at different anterior–posterior or medial–lateral coordinates to label neurons throughout the target area. More detailed injection coordinates and conditions are listed in Supplementary Table 1. At most two males and two females were injected. No sample size calculation was carried out. To get minimum reproducibility we used two mice of the same sex for each injection. The animals that were used for the injections were randomly selected.
All live procedures using animals were approved by the animal care and use committee. The mouse line used in epi-retro-sq2 was maintained on a C57BL/6J background. Adult male and female INTACT mice were used for the retrograde labelling experiments. Animals were housed in an accredited facility at the Salk Institute. The light was turned off on the dark cycle. Temperature was monitored and adjusted in accordance with Guide for the Care and Use of Laboratory Animals. The humidity was monitored and not controlled. As all air coming in is 100% fresh air (not recirculated), humidity in the animal facilities is approximately the same as the outside ambient air. San Diego averages 40–60% humidity year-round. Animals were 35–54 days old at the time of surgery for viral vector injections, were killed 13–17 days later, and were 50–70 days old on the day of dissection. The C57BL/6J mice were used for the experiments.
Using scDRS to identify candidate telencephalic RNA-seq cells enriched for specific GWAS traits
To determine which of our snRNA-seq cell types was enriched for specific GWAS traits, we used scDRS (v.1.0.2)50 with default settings. scDRS operates at the single-cell level to compute disease association scores, while considering the distribution of control gene scores to identify significantly associated cells.
We used the older version of MAGMA. To map single-nucleotide polymorphisms to genes using a 10 kb window. We used annotations and GWAS summary statistics to calculate the MAGMAz score, which is associated with a certain trait. Human genes were converted to their mouse orthologs using a homology database from Mouse Genome Informatics (MGI). The top MAGMA z scores were weighted to pick the 1,000 disease genes used for scDRS. MAGMA scores were used instead of results from the tests, because many of them had previously been computed.
To identify the transcription factors selectively enriched for telencephalic excitatory or inhibitory populations, we performed gene set enrichment analysis (GSEA) on a ranked gene list against a curated transcription factor gene set. We ranked genes by their average correlation with Fos compared to the average telencephalic inhibitory Fos correlation. The Genes from our telencephalic cells were ranked according to order. We were able to build a gene set of transcription factors by using enrichR99 databases. Every database we subsetted for had at least ten unique targets for human transcription factors. The final gene set consisted of these human transcription factors with targets composed of the intersection between any two databases. The R package FGsea was used to run the Gene Set enrichment analysis and compare the normalized pseudocell data with a background of all protein-coding genes. P values were computed using a positive one-tailed test.
To further characterize our candidate ARGs, we performed ward. D2 hierarchical clustering based on their Fos correlations across brain regions. We cut the dendrogram at a height that divided our ARGs into seven clusters. To assess the overlap between our ARG clusters and the ARGs reported in Tyssowski et al.42, we computed a Fisher’s exact test between two given gene sets using the R package GeneOverlap v.1.30.0 (ref. 98). P values were corrected for multiple hypothesis testing.
log2 transformed count per million and quantile normalized the pseudocell level expression of genes. We normalized the expression of the genes to have a mean of zero and a standard deviation of one. Normalized pseudocell counts were used for downstream analysis.
To generate our pseudocells, we first performed dimensionality reduction at the single-cell level. Single cells were divided into 27 groups, consisting of glial cell classes and neuronal populations further divided by neurotransmitter usage. We chose genes that were highly variable in a specific number of mouse donors so that a maximum of 5000 genes would be used for subsequent scaling. We then ran principal component analysis on the scaled expression data (50 principal components for glia and 250 principal components for neurons). Next, we constructed pseudocells by grouping single cells within each cell type. Within a cell type of size n, cells were assigned to pseudocells of size s such that the pseudocell size correlated with cell type size:
Using scOnline, we aggregated our snRNA-seq expression data into pseudocells: aggregations of cells with similar gene expression profiles. Working at the pseudocell resolution (rather than with individual cells) eliminates the technical variation issues of single-cell transcriptomic data, such as low capture rate from dropouts and pseudoreplication through averaging expression of similar cells93,94, while avoiding issues of pseudobulk approaches, such as low statistical power and high variation in sample sizes95.
To estimate the uncertainty, we used the exact method of binconf function in Hmisc R package 92. For plotting clarity, regions with fewer than five total inhibitory and excitatory cells were excluded.
If the percentage of cells with non-zero expression of at least one NPR was greater than or equal to 0.2, and average expression of at least one NPR was greater than or equal to 0.5 count per cell, each cell type was assigned aNPR identity.
Each cell type was assigned to an NP ligand identity if (1) the percentage of its cells with non-zero expression of the NP was greater than or equal to 0.3 and (2) the average expression of the NP was greater than or equal to 0.5 counts per cell. We observed that the expression of four NPs showed greater contamination across other cell types: OXT, AVP, PMCH and AGRP. For NPs, the percentage of cells that did not have zero expression was necessary to be greater than or equal to 0.8 and average expression to be greater than five counts per cell.
166 cell types of neuronal cells did not match the above conditions, and we looked at their top expressing transporters and assigned neurotransmitters accordingly.
Source: The molecular cytoarchitecture of the adult mouse brain
Generating Universe Set Cover Genes using Mixed Integer Linear Programming (JULI) Solvers and CPLEX Solvers
We asked the question because we were interested in the genes that allowed us to distinguish one cell type from the others. In the set cover problem, we find the smallest subfamily of a family of sets that can still cover all the elements in the universe set. We can define this as a mixed integer linear programming model programmatically using the JuMP domain-specific modelling language in Julia (refs. 33,89). We are using the HiGHS open-sourced and the IBM ILOG CPLEX commercial solvers. Supplementary Methods uses a linear programming model derivation and CPLEX solvers.
To generate the neighbourhood ridge plots in Fig. 2c, we first identified the interneuron and projection metaclusters for the isocortex, striatum and cerebellum, detailed in Supplementary Table 13. Supplementary Table 5 shows the cell types within each metacluster.
To generate the force-directed graph of regional cell type similarity, as in Fig. We weighted each pair of the DeepCCF regions using a metric called the Jaccard similarity metric. We used the qgraph package to generate a force-directed graph.
We looked at the minimum number of cell types required to cover a majority of the beads to assess cluster heterogeneity. For each region, we computed the number of confidently mapped beads for each cell type sorted in descending order by the number of beads. Next, we determined the number of cell types necessary for the running sum of beads to reach 95% of the total mapped beads.
Given an index neuronal cell type, to find its proximate neighbourhood within the dendrogram, we consecutively aggregated descendants from successively more distant ancestors. We continued to add cells until there was at least 100 cell types in the neighbourhood and at least 60 non-neuronal.
The genes were first rearranged using theR package slanter’s permuting method. The cell types were then reordered to comply with the cell type dendrogram structure using a dynamic programming tree-crossing minimization optimization88.
Given this tree structure, we optimized the leaf node sequence in the tree by selectively swapping the order of the children of internal nodes. The elements are grouped around the diagonal by iteratively perming the columns and rows of a normalized cell type. The genes Tbr1, Fezf2, Dlx1, Lhx6, Foxg1, Neurod6, Lhx8, Sim1, Lmx1a, Lhx9, Tal1, Pax7, Hoxc4, Gata3, Hoxb5 and Phox2b were chosen to be discrete, biologically interpretable markers—mostly transcription factors that relate to overall neuronal cell lineage.
We assigned the CCF regions into at least one of the eight large-scale regions from above. Then, for each Slide-seq puck, we grouped the beads on the puck into at least one of the large-scale regions using our CCF alignment. For each large-scale region on each puck, we ran RCTD using the corresponding tailored reference cell types and tailored gene list. We additionally considered only beads that had at least 150 UMIs across all genes and at least 20 UMIs within the tailored gene list.
We modified the way RCTD scores explicit pairs of cell types to help in mapping large references with similar clusters. We did not use the result of single-cell type pair to determine the best pair, we used the pairs that scored similar to the best- scoring pair, with a likelihood score of less than 30. Then, we collated the frequency of each cell type occurring in these well-fitting pairs and divided by the total occurrences of all the cell types to make a confidence score. Throughout the paper, we use 0.3 (of a maximum score of 0.5) as the threshold for a ‘confident’ mapping.
In accordance with the explicit cell type pairs used within RCTD’s doublet mode, we subdivided this filtered list, pulling out the 10 cell types deemed most likely to be associated with the given bead. When modelling how well these cell types mapped to a given bead, we exhaustively used one cell type from the top 10 list and one cell type from the rest of the prefiltered list. For the cerebellum and striatum, the number of cell types considered was sufficiently low that we were able to run the algorithm using all pairs.
The OSQP 81 is a programming tool that scales better for larger matrices due to the larger set of cell types to be mapped. We rewrote the inner loops of functions with RCPP, which is more efficient. We used the cloud computing services to allow for large-scale parallelization with the use of Hail Batch and GNU Parallel 85.
Source: The molecular cytoarchitecture of the adult mouse brain
Systematic design of quality networks for immune cell clusters identified in snRNA-seq data and its application to neurons and glial cells
We identified 16 cell classes in our snRNA-seq data, 6 of which were excluded from the majority of our analyses (dendritic cell, granulocyte, lymphocyte, myeloid, olfactory ensheathing and pituitary). The statistics show that the majority of these excluded clusters are classified as immune cell types. 1a,d,h and Supplementary Tables 2 and 3. In addition, we mapped many immune cell populations.
We used the R package SCOPIT v.1.1.4 (ref. 80) to estimate the sequencing saturation of our dataset. Under the prospective sequencing model, SCOPIT calculates the multinomial probability of sequencing enough cells, n, above some success probability, p, in a population containing k rare cell types of size N cells, from which we want to sample at least c cells in each cell type:
Next, we constructed a cell ‘quality network’ to systematically identify and remove remaining low-quality cells and artefacts from the dataset. By considering multiple quality metrics, we can increase power to identify low-quality cells and circumvent the issues related to setting hard thresholds on multiple quality metrics. We considered the following metrics when constructing the quality network: proportion of genes involved in oxidative phosphorylation, proportion of genes involved in mitochondrial genes, proportion of genes coding ribosomal genes, and proportion of genes IEG expression. Given their inherently distinct distributions of quality metrics, we separately constructed quality networks for neurons and glial cells. The quality network was made using Seurat v.4.2.0 and shared nearest neighbours. Our strategy was to remove any cluster from the quality network with ‘outlier’ distribution of quality metric profiles. A distribution of quality metric was considered as an outlier if its median was above 85% of cells in three features of the quality network: oxidative phosphorylation, mitochondrial and ribosomal protein expression. The remaining clusters were all removed with fewer than 15 cells.
For high-dimensional visualization, as in Fig. Each of the clusters had to be subanalyzed with a maximum of 2,000 nuclei. Using the Scanpy package, we calculated the first 250 principal components of our subsampled cells. To create a t-S NE we used a perplexity of 350 and an exaggeration factor of four, which were the same as the OpenTSNE v.1.0.0.
If the next level of clustering resulted in a set of differential clusters that passed this test, this was the exception to the above rule. We retained the clusters because they may contain additional structure.
To test for differential markers, we considered each leaf versus its sibling leaves. We used a Mann–Whitney U-test to assess whether any genes are differentially expressed. We required that a Gene be observed in less than 10% of the lower population and more than 20% of the higher population to ensure that there is a difference in expression between the two populations. We required every cluster to have at least three marker genes distinguishing it from its neighbours as well as three marker genes in the other direction. The leaves were merged if the cluster failed the test.
If the resolution sweep concluded at the highest resolution without finding any clusters, this is also indicative of a homogenous population and it would be considered complete.
If the shared neighbour graph was not a single connected component, there is no resolution low enough to form a single cluster, and so, the resolution sweep was not possible. This would occur if there were very few variable genes, which is indicative of a cell population.
Once we computed the shared nearest neighbour graph, we used the Leiden algorithm to identify cell clusters using the Constant Potts Model for modularity77. This method is sensitive to a resolution parameter, which can be interpreted as a density threshold that separates intercluster and intracluster connections. A sweep strategy was implemented to find resolution parameters. We started with a low-resolution value and created a single cluster of cells. We gradually increased the resolution until there were at least two clusters and the size ratio between the largest and second-largest cluster was at most 20, meaning that at least 5% of the cells are not in the largest cluster. Any cluster of fewer than (\sqrt{N}) cells was discarded, where N was the number being clustered in that round. This set made up roughly 1.6% of the total cells.
After selecting variable genes, we constructed a shared nearest neighbour graph75,76. First, we transformed the counts with the square-root function and then computed the k-nearest neighbour (kNN) graph using cosine distance and (k=50) (not including self-edges). From the kNN graph, we compute the shared neighbour graph, where the weight between a pair of cells is the Jaccard similarity over their neighbours:
We compared this expected value with the observed percentage of non-zero counts and selected all genes that are observed at least 5% less than expected in a given population.
We analysed three genes with highly stereotyped and regional expression, Dsp, Ccn2 and Tmem212, which correspond to the CCF regions detailed in Supplementary Table 12.
The hierarchy was categorized into 12 main areas: isocortex, olfactory areas, Hippocampal formation, striatum, pallidum, hypothalamus and thalamus. In Supplementary Table 4, we can see that we grouped ourselves into the ‘DeepCCF’ regions.
Our series of Nissl sections, downsampled to 50 μm resolution by local averaging, were aligned to the 50 μm CCF by jointly estimating three transformations. The three-dimensional diffeomorphism modeled any shape differences between the sample and the brain. This transformation is modeled using the Large Deformation diffeomorphic Metric mapping framework. Second is a three-dimensional affine transformation that shows the pose and scale of differences between the sample and the atlas. A two-dimensional rigid transformation on each slice was used to model the positioning of samples on the slides.
We used an objective function to determine the dissimilarity between the transformed atlas and our data, after transforming the contrast of the atlas to match the Nissl data at each slice. To match the red, green and blue channels of our Nissl dataset, the third-order polynomial was estimated on each slice of the transformed atlas. The data of missing tissue, artifacts and outliers are used to estimate their weights in the weighted sum of square error.
As a pre-processing step for the alignment of Slide-seq arrays to Nissl images, for each puck we generated a greyscale intensity image from the Slide-seq data by summing the UMI counts (across all genes) at each bead location on the puck and normalizing by the maximum UMI count value across the entire puck. We then performed the alignment of these images to the adjacent Nissl images in two steps. Each Nissl image was transformed using a manual rigid transformation. The purpose of this first transformation is to bring all the Nissl images to an approximately equivalent upright orientation, which made the second step of alignment easier. In the second part, we used the Slicer3D tool to identify the fiducial markers in the Nissl and Slide-seq images. Through thin-plate spline interpolation, we calculated the bead positions using the parameters found on the fiducial markers.
The sequenced reads were aligned to GRCm39.103 reference and processed using the Slide-seq tools pipeline (https://github.com/MacoskoLab/slideseq-tools; v.0.2) to generate the gene count matrix and match the bead barcode between array and sequenced reads.
Source: The molecular cytoarchitecture of the adult mouse brain
Histological library preparation and serial alignment of an OCT embedded P56 mouse brain with a slide-slide prior to electrode equilibration
An OCT embedded P56 wild-type mouse brain was mounted precisely so that it would maintain an accurate alignment, after being equilibrated in the cryostat for 30 min. The anterior slide was set at the end of the olfactory bulb region. This starting slide was marked, and the following adjacent 10 μm section was used for Slide-seq library preparation. For each tissue slice used for Slide-seq, a 10 μm pre-slide and a 10 μm post-slide were collected for histology. These histology slides were Nissl stained according to our previously released protocol64. The slide interval was made 100 m apart after an 80 m gap had been trimmed. A total of 114 sets of three consecutive slides were collected. All pre- and post-slides for histology registration were stored at −80 °C until the slides were Nissl stained. Optimizations were performed to be able to hold the Slide-seq tissue slices frozen onto their respective pucks at −80 °C during the 2 days required to complete serial sectioning.
Slide-seq arrays were generated as previously described2 with slight modifications. Larger-diameter gaskets were used to generate 5.5 × 5.5 mm2, 6.0 × 6.2 mm2 and 6.5 × 7.5 mm2 bead arrays. These sizes were chosen to facilitate different anterior to posterior coronal section sizes. To facilitate image processing, we utilized 2 × 2 digital binning on the collected data, resulting in 1.3 μm per pixel.
Sequencing reads were demultiplexed and aligned to a GRCm39.103 reference using CellRanger v.5.0.1 using default settings (except for an additional parameter to include introns). We used CellBender v.3-alpha63 to remove cells contaminated with ambient RNA.
Source: The molecular cytoarchitecture of the adult mouse brain
Brain anaesthesia and dissection of C57BL/6J mice using isoflurane and plastic freezing cassettes for the care and use of laboratory animals
At 56 days of age, C57BL/6J mice were anaesthetized by administration of isoflurane in a gas chamber flowing 3% isoflurane for 1 min. Anaesthesia was confirmed by checking for a negative tail pinch response. Animals were placed into a dissection tray and then anaesthesia was continued with a nose cone for the duration of the procedure. Transcardial perfusions were performed with ice-cold pH 7.4 HEPES buffer containing 110 mM 10 mM HePES, 25 mM glucose, 75 mM sucrose, 7.5 mM rgt and 2.5 mM kCl are what is used to remove the blood from brain and other organs. For use in regional tissue dissections, the brain was removed immediately; the meninges was peeled away from the entire brain surface, then frozen for 3 min in liquid nitrogen vapour and moved to −80 °C for long-term storage. For use in generation of the Slide-seq dataset through serial sectioning, the brains were removed immediately, blotted free of residual liquid, rinsed twice with OCT to assure good surface adhesion and then oriented carefully in plastic freezing cassettes filled with OCT. These cassettes were vibrated in a bath for five minutes at a temperature of 55C to remove air bubbles and make them adhere to the brain. The orientation of the brain was reset just before the liquid nitrogen reached the boiling point. Blocks were kept at 80 C.
Animals were group housed with a 12-h light–dark schedule and allowed to acclimate to their housing environment (20–22.2 °C, 30–50% humidity) for 2 weeks post-arrival. The Massachusetts Institute of Technology Committee on animal care approved the procedures involving animals that were done in accordance with the US National Institute of Health Guide for the Care and Use of Laboratory Animals. All procedures involving animals at the Broad Institute were done in a way that followed the US National Institute of Health Guide for the Care and Use of Laboratory Animals.
It was to use single cell and spatial-genomics technologies to define brain cell types in humans, non-human primates and mice. The ultimate aim is to improve understanding of the cellular mechanisms behind brain disorders, a leading cause of death and disability worldwide. In October, researchers from BICCN created the largest atlas of human brain cells yet.
Funded by the National Institutes of Health to the tune of US$375 million, the BICCN was launched in 2017 as a five-year collaboration between 250 scientists at 45 institutions across three continents.
These include Japan’s Brain/MINDS project to map the marmoset (Callithrix jacchus) brain4, China’s Brain Project, focusing on macaque monkeys (Macaca spp.)5 and the Korea Brain Initiative, which aims to develop specialized brain maps for both mice and humans6.
The Human Brain Project was based in Switzerland, and has now created the Human Brain Atlas. The HBP has an open platform called EBRAINS. The atlas uses post-mortem brain-imaging data and depicts the multilevel organization of the brain, from its cellular and molecular architecture to its functional modules and neural connections.
The Swiss Blue Brain Project, which will wrap up in 2024 after nearly 20 years, will also release a digital model of the mouse brain, based on imaging data, with detailed mapping of neurons and connectivity circuits.
Connecting Cells, Models and Code: A unified Framework for Collectivity, Analysis and Multi-Sequence Data (Extended Abstract)
There is a need for better communication between the projects as they progress, but it is important in their own right. Several of the projects are using the same technologies. It makes sense for the teams to liaise more closely, at the very least to begin a discusson on how to establish shared data standards, which they have not yet done.
The data, models and code needs to be open. There’s still a challenge to reproducibility even though there are large data sets. That is why a standardized framework for data collection and analysis, including definitions for types of cell clusters, as well as unified cell classifications and names across species, will eventually be needed.