"Statistical Approaches for Meeting Challenges in the Analysis and Interpretation of mRNA-Seq"
For the past decade, microarrays have been the assays of choice for high-throughput studies of gene expression. Recent improvements in the efficiency, quality, and cost of genome-wide sequencing have prompted biologists to rapidly abandon microarrays in favor of ultra high-throughput sequencing, also known as second-generation or next-generation sequencing: e.g., Applied Biosystems' SOLiD, Helicos BioSciences' HeliScope, Illumina's Genome Analyzer, and Roche's 454 Life Sciences sequencing systems. These high-throughput sequencing technologies have already been applied to monitor genome-wide transcription levels (mRNA-Seq), DNA-protein interactions (ChIP-Seq), DNA copy number (DNA-Seq), chromatin structure, and DNA methylation. While sequencing-based gene expression studies have been touted as overcoming longstanding limitations of microarray-based studies, these new biotechnologies raise similar as well as novel statistical and computational challenges, in areas such as image analysis, base-calling, read-mapping, and (differential) expression inference.
This talk concerns statistical methods and software for the analysis of high-throughput transcriptome sequencing (mRNA-Seq) data, with emphasis on mapped reads from the Illumina Genome Analyzer. We address the following main questions, which trace the process of deriving accurate measures of (differential) expression for genomic regions of interest (ROI) such as individual exons or multiple isoforms of a given gene.
1. Experimental design: Guidelines for the effective allocation of input mRNA samples (e.g., in terms of library preparation, flow-cells, lanes) and the use of control sequences.
2. Exploratory data analysis: Toolbox of numerical and graphical summaries for mapped reads to detect both the main and the aberrant features of mRNA-Seq data.
3. Normalization and expression quantitation: Methods for inferring ROI-level expression from base-level mapped read counts, while adjusting for experimental/technical effects (e.g., library preparation/flow-cell/lane) and sequence-specific effects (e.g., GC-content).
4. Differential expression: Methods for inferring differential expression between ROI and/or input samples.
5. Software: Open-source statistical software implementing the methodology discussed above.
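As a rough illustration of item 3, the following Python sketch aggregates base-level mapped read counts into ROI-level counts and rescales them by ROI length and by the total count in the lane. This simple global scaling is an assumption chosen for illustration, not the methodology presented in the talk, and it does not adjust for sequence-specific effects such as GC-content. The function names, ROI coordinates, and counts are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical ROI table: name -> half-open interval [start, end) on one reference sequence.
Rois = Dict[str, Tuple[int, int]]


def roi_counts(base_counts: List[int], rois: Rois) -> Dict[str, int]:
    """Sum base-level read counts over each region of interest."""
    return {name: sum(base_counts[start:end]) for name, (start, end) in rois.items()}


def scaled_expression(base_counts: List[int], rois: Rois) -> Dict[str, float]:
    """ROI count per kilobase of ROI, per million counts in the lane.

    Assumption: the lane total is taken as the sum of all base-level counts,
    a crude proxy for sequencing depth.
    """
    lane_total = sum(base_counts)
    counts = roi_counts(base_counts, rois)
    return {
        name: counts[name] / ((end - start) / 1e3) / (lane_total / 1e6)
        for name, (start, end) in rois.items()
    }


# Toy data: lane_b is the same library sequenced at twice the depth of lane_a.
rois = {"exon1": (1, 5), "exon2": (6, 10)}
lane_a = [0, 2, 4, 4, 2, 0, 1, 1, 1, 1]
lane_b = [2 * c for c in lane_a]

expr_a = scaled_expression(lane_a, rois)
expr_b = scaled_expression(lane_b, rois)

# After scaling, the depth difference between the two lanes is removed.
for name in rois:
    print(name, expr_a[name], expr_b[name])
```

In this toy setting the raw ROI counts differ by a factor of two between lanes, while the scaled values agree, which is the kind of lane effect item 3 refers to.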
We report on our investigation of several mRNA-Seq datasets, in organisms from yeast to human: inference of (differential) gene expression in reference samples from the MicroArray Quality Control (MAQC) Project; genome annotation and the discovery of novel transcripts in Saccharomyces cerevisiae; evolutionary genetics based on allele-specific expression in a Saccharomyces diploid hybrid; regulation of alternative splicing in Drosophila melanogaster.
References (manuscripts and presentations) are posted on the website: www.stat.berkeley.edu/~sandrine.
Sandrine Dudoit is Associate Professor of Biostatistics and Statistics and Chair of the Graduate Group in Biostatistics at the University of California, Berkeley. Professor Dudoit's research and teaching activities concern the development and application of statistical and computational methods for the analysis of biomedical and genomic data. Her methodological research interests center on high-dimensional inference and include loss-based estimation with cross-validation (classification and regression, density estimation, model selection) and multiple hypothesis testing. Much of her methodological work is motivated by statistical inference questions arising in biological research, including: the design and analysis of high-throughput microarray and sequencing gene expression experiments; nucleotide and protein sequence analysis; the genetic mapping of complex human traits; biological annotation metadata analysis. Professor Dudoit is also interested in statistical computing and is a founding core developer of the Bioconductor Project, an open-source software project for the analysis of biological data (www.bioconductor.org). She is a co-author of the book "Multiple Testing Procedures with Applications to Genomics" and a co-editor of the book "Bioinformatics and Computational Biology Solutions Using R and Bioconductor". She is Associate Editor of six journals, including "The Annals of Applied Statistics", "BMC Bioinformatics", "Statistical Applications in Genetics and Molecular Biology", and "IEEE/ACM Transactions on Computational Biology and Bioinformatics".
Professor Dudoit obtained Bachelor's (1992) and Master's (1994) degrees in Mathematics from Carleton University, Ottawa, Canada. She first came to UC Berkeley as a graduate student and earned a PhD degree in 1999 from the Department of Statistics. Her doctoral research, under the supervision of Professor Terence P. Speed, concerned the linkage analysis of complex human traits. From 1999 to 2000, she was a postdoctoral fellow at the Mathematical Sciences Research Institute, Berkeley. Before joining the faculty at UC Berkeley in July 2001, she completed a year of postdoctoral training in genomics in the laboratory of Professor Patrick O. Brown, Department of Biochemistry, Stanford University. Her work in the Brown Lab involved the development and application of statistical methods and software for the analysis of microarray gene expression data.