The CIRM Genomics Data Coordination and Management Center has made steady progress on its milestones over the last six months. We continued development of a database that stores files and descriptive tags for stem cell genomics projects, and developed a web site that allows authorized users controlled access to these data. The site includes a file browser that displays quality statistics, labels, and tags for each file. For many file types, the file browser provides a link to the UCSC Genome Browser, where the data inside the file appears as a track. We imported test data sets from the labs of Stephen Quake (CIP2, Stanford) and Michael Snyder (CIP1, Stanford) into a test version of the database. We imported our first CIRM-funded dataset, from the Kristin Baldwin lab (Scripps), into a firewall-protected production version of the database. We interviewed several additional labs, some of which may have data ready by the next reporting period, and have started building software in anticipation of their needs.
Reporting Period:
Year 2
In Year 2, the Data Center Management Group (DCM) wrangled in production or pilot data from CIP1, CIP2, and all but one of the first-round CRP labs, and initiated contact with labs funded in the second CRP round. We created an infrastructure for higher-level whole-dataset views in addition to our existing file-by-file views. We implemented a simple, informative visualization of neighbor-joining-type RNA-seq clustering that can display multiple types of metadata in the same display. We developed interactive figures for displaying the results of principal components analysis and t-SNE clustering that are particularly useful for single-cell data, and tested these on datasets of up to 6000 cells. We extended the SCHub capabilities to handle mouse as well as human data, and extended the website to allow authorized users to download both data and metadata without the need for a Unix account.
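The principal components step that feeds such interactive figures can be sketched as follows. This is a minimal numpy-only illustration on synthetic data, not the SCHub implementation; in practice, t-SNE would then be applied to the leading components using a standard library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a cells x genes expression matrix (real datasets
# in this period ran up to 6000 cells; fewer are used here for speed).
expr = np.log1p(rng.poisson(1.0, size=(500, 200)).astype(float))

# Principal components via SVD of the mean-centered matrix.
centered = expr - expr.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = U[:, :20] * S[:20]   # 500 cells x 20 components

# Fraction of total variance captured by the leading components.
var_explained = (S[:20] ** 2) / (S ** 2).sum()
print(pcs.shape)
```

A 2-D t-SNE embedding for plotting would typically be computed from `pcs` rather than the raw matrix, which both speeds up the embedding and suppresses noise.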
We elected to defer one milestone, a public-facing portal of CIRM production data, until Y3 Q1-Q2 so we could focus on the larger-than-expected number of new CRP awardees. This also allowed us to wrangle in 4 additional pilot data sets from existing labs. The delay has minimal impact on the overall project.
The secondary work of the DCM is to provide visualizations of the data. There is growing demand for the DCM to produce new data visualizations, as well as to run both the standard analysis and a new, CIP4-invented analysis on the data. Given these increased requests for visualization and analysis, which often go hand in hand, as well as the additional wrangling work from the large number of newly CRP-funded labs, we would like to discuss adding an FTE to the DCM budget.
In the next six months we hope to produce web displays showing the results of CIP4 analysis. We plan to help the Data Curation Group (DCG) roll out their metadata standards, help train the labs in these protocols, and perform QA to ensure adherence to the standards. We will continue to import production data as it becomes available. We will work with the labs that were awarded second-round CRP funding, using pilot data if production data sets are not yet available. We plan to release a set of new features and usability enhancements on the web site, and to develop a public-facing portal with a wider range of published data than is currently available. We will assist CIP4 in cross-lab and cross-dataset analysis.
The Data Curation Group (DCG) has worked to establish procedures to richly annotate the data sets collected through the genomics center. Because the CIRM collection will include results from many different labs and projects, it is critical to establish methods that describe the experiments and results in a coherent fashion. Uniform terminology will help downstream computational analysis maximally shed light on any known and novel cell types identified through these experiments. The effort is particularly challenging given the constantly evolving nature of both the genomics technologies employed in these investigations and our understanding of the fundamental cellular and molecular entities that exist in, or contribute to, the various assays.
In light of the dynamic nature of these investigations, the DCG has established a communication mechanism with the DCM to provide standard terminology for describing experiments wherever possible. At the same time, the DCG will be able to respond by expanding and adapting the ontology specifications as needed by the consortium.
During the last six months, the DCG developed a minimum information standard (MISCE) and worked with several of the collaborating labs on a pilot project to complete a full cycle in which metadata from labs was obtained, mapped to the standard, and returned to the labs for incorporation into subsequent experiment submissions. Because MISCE will be used throughout, future collaborating labs will also be able to take advantage of the controlled terms. To support labs in adopting these standards, the DCG is creating a portal, based on JCVI’s previously developed O-META system, that will allow investigators to annotate their own experimental samples using the standards developed through the DCG and DCM procedures. In the next year, the DCG will work to further develop and test this system, as well as on ways to assist labs in annotating their own experiments.
To provide a backdrop of cell state information against which samples collected within the consortium can be compared, the DCG is collecting a set of publicly available datasets from GEO and the SRA. These datasets include tens of thousands of samples for which gene expression data has been collected on microarrays and through RNA sequencing. The DCG has processed over 80,000 RNA-Seq samples at this point and is in the process of using semi-automated methods to assign cell states to each. Once completed, these samples can be used by the center to develop machine-learning predictors to connect new experiments with those established in the literature.
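As one hypothetical illustration of such a predictor (the data and scheme below are synthetic inventions, not the center's actual classifier), a simple nearest-centroid approach assigns a new sample to the annotated cell state whose mean expression profile it most resembles:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic reference compendium: 100 profiles with known cell states.
states = np.array([0] * 50 + [1] * 50)
ref = rng.normal(0.0, 1.0, size=(100, 30))
ref[states == 1, :5] += 3.0   # state 1 over-expresses the first 5 genes

# Nearest-centroid predictor: a deliberately simple stand-in for the
# machine-learning predictors described above.
centroids = np.vstack([ref[states == s].mean(axis=0) for s in (0, 1)])

def predict(sample):
    """Return the index of the closest cell-state centroid."""
    dists = np.linalg.norm(centroids - sample, axis=1)
    return int(np.argmin(dists))
```

Real predictors trained on tens of thousands of compendium samples would of course use richer models and proper cross-validation, but the interface is the same: an expression profile in, a cell-state label out.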
Reporting Period:
Year 3
The Data Center Management group (DCM) has been very productive during the past two quarters. The number of labs now producing genomics data has continued to grow. During this period we integrated files and metadata for eight new datasets, and added new analysis files or primary data for eight additional datasets. We released analysis pages for eight datasets, and created summary pages for fifteen datasets, resulting in a complete set of summaries for all the datasets that have been submitted to date. We developed a new public-facing portal for the publicly released data that includes a landing page suitable for a less technical audience, and are well into the development phase of a new interactive viewer for single cell data. The DCM collaborated with the Data Curation Group (DCG) in the production of the Minimum Information about a Stem Cell Experiment (MISCE) v1.1 reference, and started collaboration with the CIP4 group to integrate their work into the dataset summary and analysis pages.
The DCG made significant progress these past six months. The metadata standard, MISCE, has now had a second version released (ver 1.2) and has been incorporated into the annotation pipeline of SCHub. Over the last six months, we have worked with the DCM to enact a new protocol for annotating newly contributed samples whereby the SCHub contacts work directly with labs and the ontology designers (at JCVI) work with SCHub (rather than also directly with the labs). This has greatly streamlined the interaction and avoids the confusion of having collaborating labs interfacing with two different informatics groups. All decisions can then be made in a centralized and controlled fashion. Further development of the MISCE standards will be ongoing for the next reporting period and we will publish the descriptions in the coming months.
We have also formed two new working groups to organize collaborative work on thematic projects -- the Brain of Cells (BoC) and Cardiac Analysis Working Groups (AWGs). We are collecting a large set of single-cell RNA-Seq data for the BoC project to rally the analysis around identifying new cell types in this fetal brain collection. The CIP4 methods will be tested and evaluated on the BoC. A “report card” that summarizes how a newly contributed sample relates to samples seen before in the BoC will be implemented in the SCHub data summaries in the next few months. Details on each milestone are enumerated below.
Reporting Period:
Year 4
The Data Coordination and Management Core (DCM) has continued to bring in a large amount of data over the past six months. We've brought in data for a new lab (Chi), brought in 10x data (Bruneau, Chi and Corn), and extended data sets for thirteen (13) more labs (Bruneau, Chi, Corn, Crooks, Fan, Frazer, Geschwind, Jones, Kriegstein, Loring, Snyder, Weissman, Yeo). We are in the process of bringing in new data for four additional labs (Kriegstein, Quake, Clarke, Sanford).
The Data Curation Group (DCG) has now completely shifted its attention from establishing metadata standards to developing supportive informatics tools for the analysis of collaborator and compendium datasets. In the next year, the DCG will work to solidify the methodologies it has experimented with so they can be incorporated into SCHub pipelines. Much of the code and models will also be coordinated with a parallel effort that is part of a newly launched Chan Zuckerberg Initiative program, on which the Stuart, Kent, and Schauermann labs participate as funded contributing projects.
Reporting Period:
Year 5
DCM Summary of Scientific Progress
The Data Coordination and Management Core (DCM) has continued to bring in a large amount of data over the past six months. We’ve brought in two new data sets (Quake and Yeo) and extended data for several more data sets (Chi, Crooks, Fan, Frazer, Sanford, Yeo, Jones, Loring, Bruneau, Belmonte). We’ve also added analysis files, and coordinated data going through the Uniform Processing Pipeline as data sets conclude data generation. We are in the process of bringing in additional data for the Yeo, Sanford, Kriegstein, Loring and Corn labs.
DCG Summary of overall progress
The Data Curation Group (DCG) continues its work during the no-cost-extension period as described in the previous report. An agreement on uniform pipelines (from quantified CellRanger expression through batch correction and normalization to clustering and trajectory inference) was established soon after the April face-to-face meeting at Stanford. The group has incorporated additional alternatives into these pipelines to suit the needs of the Heart-of-Cells (HoC) collaborators, for example replacing t-SNE with UMAP for dimensionality reduction, or using URD in place of Monocle2 for trajectory inference.
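The front end of such a pipeline can be sketched as follows; this is a toy numpy example on synthetic counts standing in for CellRanger output, and the downstream stages (batch correction, UMAP, clustering, trajectory inference with URD or Monocle2) are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)
# Raw UMI counts as produced by CellRanger: cells x genes (synthetic here).
counts = rng.poisson(0.5, size=(300, 100)).astype(float)

# Library-size normalization: scale each cell to the same total count,
# then log-transform -- the standard first steps before batch correction,
# clustering, and trajectory inference.
totals = counts.sum(axis=1, keepdims=True)
totals[totals == 0] = 1.0      # guard against all-zero cells
norm = counts / totals * 1e4   # counts per 10,000
lognorm = np.log1p(norm)

print(lognorm.shape)
```

Agreeing on even these simple choices (the scale factor, the log transform) across labs is part of what makes downstream results comparable between datasets.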
In addition to solidifying these processing pipelines, interfaces in SCHub for searching for marker genes of clusters and for predicting cell types with SamplePsychic will be finalized over the next reporting period. Anticipating that a number of new datasets will become available at the end of this reporting period, we are now automating the identification and comparison of dataset clusters and the RDF descriptions of the gene markers found to distinguish these clusters. Improvements in marker gene identification (NS-Forest) and in feature transformations of scRNA-Seq datasets have progressed. Further annotation of progenitor cells has been established (using StemID). Both the JCVI and UCSC groups are establishing methods for detecting cell types from gene expression data, for making the predictors available in a portal (SamplePsychic), and for making the marker gene lists and cell ontology terms available to index SCHub datasets. JCVI continues to improve its NS-Forest algorithm and is applying the updated version to the BoC and HoC datasets; this work is still in progress. Controlled representations of the marker genes and cell types will be developed using RDF so that they can be retrieved by user queries.
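NS-Forest itself is a random-forest method; the following synthetic sketch shows only the underlying idea of ranking genes by how well they separate one cluster from the rest (the gene indices, values, and scoring rule here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes = 200, 50
expr = rng.normal(0.0, 1.0, size=(n_cells, n_genes))
labels = rng.integers(0, 2, size=n_cells)
# Plant gene 7 as a true marker of cluster 1 (synthetic ground truth).
expr[labels == 1, 7] += 3.0

# Rank genes by the difference in mean expression between the target
# cluster and all other cells -- a crude stand-in for NS-Forest's
# random-forest feature importances.
in_cluster = expr[labels == 1].mean(axis=0)
out_cluster = expr[labels == 0].mean(axis=0)
score = in_cluster - out_cluster
top_marker = int(np.argmax(score))
print(top_marker)
```

The planted gene should rank first; NS-Forest improves on this kind of ranking by measuring how much each gene actually contributes to classifying cells into the cluster, which is more robust to correlated genes.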
Reporting Period:
Year 6 NCE
The Data Coordination and Management Core (DCM) has continued to bring in a large amount of data over the past six months. We’ve brought in two new datasets (quakeFetalPancreas and pyleSkeletalMuscle) and extended primary data for several more datasets (belmonteMouseDnmt3a and sanfordRnaRegulation1). We’ve also added analysis files for eleven datasets and coordinated data going through the Uniform Processing Pipeline as datasets conclude data generation. We are finalizing the ingestion of all remaining analysis files as we receive them from the Stanford CESCG Core.
Over the past six months, our team worked closely with CESCG labs to update summary pages and identify data that is ready to be shared on the public SCHub. We pushed five datasets to the public site (kriegsteinBrainOrganoids2, fanIcf1, jonesYap, pyleSkeletalMuscle, and belmonteMouseDnmt3a). Additionally, we have made updates to the Cell Browser by adding a new user interface and command-line software features. The Cell Browser has also expanded with twelve new datasets, three of which were added to aid researchers working on COVID-19 therapeutics and vaccines.
Data ingestion and annotation into SCHub are now complete. Pipelines for analyzing scRNA-seq and scATAC-seq have also been completed. In the latest development, methods for linking scRNA-seq and scATAC-seq data were also installed during this past NCE period (see CIP4 report). Several dimensionality reduction techniques are part of the pipelines, including t-SNE, PCA, and UMAP. Clustering approaches have been installed, including DBSCAN, Louvain, and graph-based spectral methods. Trajectory methods for identifying trends have been installed, including Scimitar, Monocle2, Slingshot, and URD. The DCG has created various pipelines to ingest and analyze SCHub datasets. We developed analytical approaches for trajectory analysis, machine-learning methods to annotate datasets (such as JCVI’s NS-Forest algorithm), and new data-sharing platforms for investigators to compare the results of their experiments. A legacy of datasets and methodologies is now publicly available to the research community and will have lasting value in follow-on projects and as open-source, community-contributed efforts.
The DCG worked most closely with the Heart-of-Cells (HoC) investigators to test methodology. The collaboration involved weekly, and later bi-weekly, telephone calls to advance the analysis of the two main HoC manuscripts. During the NCE, we finalized JCVI’s NS-Forest method for finding marker gene sets, created an ontology-aware annotation tool called TreeMAP, and built a signature collection called scBeacon that collects, shares, and displays cell types found in RNA-seq data through their cluster signatures using a peer-to-peer system. If widely adopted, this system will allow investigators to share the results of their experimental findings in a decentralized yet fully verifiable setting. As a side point, but with relevance to the current pandemic, the same scBeacon system is also being deployed to share COVID-19 diagnostic test results, starting with the UCSC campus and soon the county, allowing patients to be in complete control of the access to, and privacy of, their health information.
Grant Application Details
Application Title:
Center of Excellence for Stem Cell Genomics
Public Abstract:
The Center of Excellence in Stem Cell Genomics will bring together investigators from seven major California research institutions to bridge two fields – genomics and pluripotent stem cell research. The projects will combine the strengths of the center team members, each of whom is a leader in one or both fields. The program directors have significant prior experience managing large-scale federally-funded genomics research programs, and have published many high impact papers on human stem cell genomics. The lead investigators for the center-initiated projects are expert in genomics, hESC and iPSC derivation and differentiation, and bioinformatics. They will be joined by leaders in stem cell biology, cancer, epigenetics and computational systems analysis. Projects 1-3 will use multi-level genomics approaches to study stem cell derivation and differentiation in heart, tumors and the nervous system, with implications for understanding disease processes in cancer, diabetes, and cardiac and mental health. Project 4 will develop novel tools for computational systems and network analysis of stem cell genome function. A state-of-the-art data management program is also proposed. This research program will lead the way toward development of the safe use of stem cells in regenerative medicine. Finally, Center resources will be made available to researchers throughout the State of California through a peer-reviewed collaborative research program.
Statement of Benefit to California:
Our Center of Excellence for Stem Cell Genomics will help California maintain its position at the cutting edge of stem cell research and greatly benefit California in many ways. First, diseases such as cardiovascular disease, cancer, and neurological disease pose a great financial burden to the State. Using advanced genomic technologies, we will learn how stem cells change with growth and differentiation in culture and how they can best be handled for safe use in human therapy. Second, through the collaborative research program, the center will provide genomics services to investigators throughout the State who are studying stem cells with a goal of understanding and treating specific diseases, thereby advancing treatments. Third, it will employ a large number of “high tech” individuals, thereby bringing high-quality jobs to the state. Fourth, since many investigators in this center have experience in founding successful biotech companies, it is likely to “spin off” new companies in this rapidly growing high-tech field. Fifth, we believe that the iPS and information resources generated by this project will have significant value to science and industry and be valuable for the development of new therapies. Overall, the center activities will create a game-changing network effect for the state, propelling technology development, biological discovery, and disease treatment in the field.