more details on NB distribution

Commit 31743197 authored 5 years ago by altuna akalin
--- a/08-rna-seq-analysis.Rmd
+++ b/08-rna-seq-analysis.Rmd
@@ -168,7 +168,7 @@ None of these metrics (CPM, RPKM/FPKM, TPM) account for the other important conf

 ### Exploratory analysis of the read count table

-A typical quality control, in this case interrogating the RNA-seq experiment design, is to measure the similariy of the samples with each other in terms of the quantified expression level profiles accross a set of genes. One important observation to make is to see, whether the most similar samples to any given sample are the biological replicates of that sample. This can be computed using unsupervised clustering techniques such as hierarchical clustering and visualized as a heatmap with dendrograms. Another most commonly applied technique is a dimensionality reduction technique called Principal Component Analysis (PCA) and visualized as a two-dimensional (or in some cases three-dimensional) scatter plot. In order to find out more about the clustering methods and PCA, please refer to the Chapter \@ref(unsupervisedLearning). 
+A typical quality control step, in this case interrogating the RNA-seq experiment design, is to measure the similarity of the samples with each other in terms of the quantified expression level profiles across a set of genes. One important check is whether the samples most similar to any given sample are the biological replicates of that sample. This can be computed using unsupervised clustering techniques such as hierarchical clustering and visualized as a heatmap with dendrograms. Another commonly applied technique is Principal Component Analysis (PCA), a dimensionality reduction method whose results are visualized as a two-dimensional (or in some cases three-dimensional) scatter plot. To find out more about the clustering methods and PCA, please refer to Chapter \@ref(unsupervisedLearning). 
  
 #### Clustering

@@ -208,7 +208,7 @@ pheatmap(tpm[selectedGenes,], scale = 'row',

 #### PCA

-Let's make a PCA plot to see the clustering of replicates as a scatter plot in two dimensions (Figure \@ref(fig:pca1)). 
+Let's make a PCA plot \index{Principal Component Analysis (PCA)} to see the clustering of replicates as a scatter plot in two dimensions (Figure \@ref(fig:pca1)). 

 ```{r pca1, fig.cap='PCA plot of samples using TPM counts'}
 library(stats)
@@ -235,7 +235,7 @@ summary(pcaResults)

 #### Correlation plots

-Another complementary approach to see the reproducibility of the experiments is to compute the correlation scores between each pair of samples and draw a correlation plot. 
+Another complementary approach to see the reproducibility of the experiments is to compute the correlation scores between each pair of samples and draw a correlation plot. \index{correlation}

 Let's first compute pairwise correlation scores between every pair of samples.

@@ -283,14 +283,16 @@ pheatmap(correlationMatrix,

 ### Differential expression analysis

-Differential expression analysis allows to test tens of thousands of hypotheses (one test for each gene) against the null hypothesis that the activity of the gene stays the same in two different conditions. There are multiple limiting factors that influence the power of detecting genes that have real changes between two biological conditions. Among these are the limited number of biological replicates, non-normality of the distribution of the read counts, and higher uncertainty of measurements for lowly expressed genes than highly expressed genes [@love_moderated_2014]. Tools such as `edgeR` and `DESeq2` address these limitations using sophisticated statistical models in order to maximize the amount of knowledge that can be extracted from such noisy datasets.   
-`DESeq2`workflow: 
+Differential expression analysis allows us to test tens of thousands of hypotheses (one test for each gene) against the null hypothesis that the activity of the gene stays the same in two different conditions. There are multiple limiting factors that influence the power to detect genes that have real changes between two biological conditions. Among these are the limited number of biological replicates, non-normality of the distribution of the read counts, and higher uncertainty of measurements for lowly expressed genes than for highly expressed genes [@love_moderated_2014]. Tools such as `edgeR` and `DESeq2` address these limitations using sophisticated statistical models in order to maximize the amount of knowledge that can be extracted from such noisy datasets. In essence, these models assume that for each gene the read counts are generated by a negative binomial distribution. This distribution can be specified with two parameters: the mean, $m$, and the dispersion, $\alpha$. Estimating these parameters is crucial for differential expression tests. The methods used in `edgeR` and `DESeq2` use dispersion estimates from other genes with similar counts to estimate the per-gene dispersion values more precisely. With an accurate estimate of the dispersion parameter, one can estimate the variance more precisely, which in turn
+improves the result of the differential expression test. Although the statistical models are different, the process here is similar to the moderated t-test we introduced in Chapter \@ref(stats).

-1. Normalizes the read counts by computing size factors, which addresses the differences not only in the library sizes, but also the library compositions. 
-2. For each gene, estimates the dispersion value of the gene among the biological replicates. The dispersion value computed by `DESeq2` is equal to the squared coefficient of variation (variation divided by the mean).  
-3. A line is fitted across the dispersion estimates of all genes computed in 2) versus the mean normalized counts of the genes. 
+Now let us take a closer look at the `DESeq2` workflow and how it calculates differential expression: 
+
+1. The read counts are normalized by computing size factors, which account for differences not only in the library sizes but also in the library compositions. 
+2. For each gene, a dispersion estimate is calculated. For genes with high counts, where sampling noise is negligible, the dispersion value computed by `DESeq2` is approximately equal to the squared coefficient of variation (standard deviation divided by the mean).  
+3. A line is fitted across the dispersion estimates of all genes computed in 2) versus the mean normalized counts of the genes. 
 4. Dispersion values of each gene is shrunken towards the fitted line in 3). 
-5. A Generalized Linear Model is fitted which considers additional confounding variables related to the experimental design such as sequencing batches, treatment, temperature, patient's age, sequencing technology etc. 
+5. A Generalized Linear Model (GLM) is fitted, which uses the negative binomial distribution to model the count data and can account for additional confounding variables related to the experimental design, such as sequencing batches, treatment, temperature, patient's age, sequencing technology, etc.
 6. For a given contrast (e.g. treatment type: drug-A versus untreated), a test for differential expression is carried out against the null hypothesis that the log fold change of the normalized counts of the gene in the given pair of groups is exactly zero. 
 7. Adjusts p-values for multiple-testing.
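
Reviewer note: to make steps 1, 2 and 7 of the list above more concrete, here is a minimal base-R sketch. It is not how `DESeq2` is implemented internally. The toy count matrix, the p-values, and the names `counts`, `sizeFactors`, `normCounts`, `nbVariance` and `pvals` are all made up for illustration. The sketch assumes the usual negative binomial parameterization in which the variance equals $m + \alpha m^2$, and it uses a plain median-of-ratios calculation for the size factors.

```r
# Toy count matrix: 4 genes x 4 samples (made-up numbers, all counts > 0)
counts <- matrix(
  c(100, 120, 210, 190,
     50,  45,  48,  55,
    500, 520, 480, 510,
     10,  12,  30,  28),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4))
)

# Step 1 (sketch): median-of-ratios size factors.
# Each sample is compared with a pseudo-reference sample built from the
# per-gene geometric means; the median ratio is the sample's size factor.
# (Genes with zero counts would need to be excluded before taking logs.)
logGeoMeans <- rowMeans(log(counts))
sizeFactors <- apply(counts, 2, function(x) exp(median(log(x) - logGeoMeans)))

# Divide every column (sample) by its size factor
normCounts <- sweep(counts, 2, sizeFactors, "/")

# Step 2 (sketch): negative binomial mean-variance relationship,
# assuming Var = m + alpha * m^2
nbVariance <- function(m, alpha) m + alpha * m^2

# Squared coefficient of variation is 1/m + alpha, so it approaches the
# dispersion alpha as the mean m grows
alpha <- 0.1
cv2LowCount  <- nbVariance(10, alpha) / 10^2
cv2HighCount <- nbVariance(1e6, alpha) / (1e6)^2

# Step 7 (sketch): Benjamini-Hochberg adjustment of hypothetical p-values
pvals <- c(0.001, 0.02, 0.03, 0.8)
padj <- p.adjust(pvals, method = "BH")
```

In an actual analysis these steps are carried out by `DESeq2` itself, which also shrinks the per-gene dispersion estimates as described in steps 3 and 4; the sketch only shows the arithmetic behind the individual ideas.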