Skip to content
Snippets Groups Projects
Commit 3d8dc108 authored by unknown's avatar unknown
Browse files

Final changes

Added missing figure captions

Typos
parent 186993ff
No related branches found
No related tags found
No related merge requests found
...@@ -6,6 +6,7 @@ knitr::opts_chunk$set(echo = TRUE, ...@@ -6,6 +6,7 @@ knitr::opts_chunk$set(echo = TRUE,
message = FALSE, message = FALSE,
error = FALSE, error = FALSE,
cache = TRUE, cache = TRUE,
warning = FALSE,
out.width = "60%", out.width = "60%",
fig.width = 5, fig.width = 5,
fig.align = 'center') fig.align = 'center')
...@@ -379,7 +380,7 @@ ChIP-seq pipeline. ...@@ -379,7 +380,7 @@ ChIP-seq pipeline.
In this tutorial, for performance considerations, we have taken a subset of the In this tutorial, for performance considerations, we have taken a subset of the
data which corresponds to the human chromosome 21 (chr21). data which corresponds to the human chromosome 21 (chr21).
The data sets are located in the **compGenomRData** package. The data sets are located in the **compGenomRData**\index{R Packages!\texttt{compGenomRData}} package.
The location of the data sets can be accessed using the **system.file** command, The location of the data sets can be accessed using the **system.file** command,
in the following way: in the following way:
...@@ -1005,13 +1006,13 @@ cpm = t(t(counts)*(1000000/colSums(counts))) ...@@ -1005,13 +1006,13 @@ cpm = t(t(counts)*(1000000/colSums(counts)))
cpm_log = log10(cpm+1) cpm_log = log10(cpm+1)
``` ```
Combine the cpm values with the GC content, and plot the results. Combine the cpm values with the GC content,
```{r gc-combine} ```{r gc-combine}
gc = cbind(data.frame(cpm_log), GC = nuc['GC']) gc = cbind(data.frame(cpm_log), GC = nuc['GC'])
``` ```
and plot the results.
```{r gc-plot, fig.cap='GC content abundance in a ChIP-seq experiment'} ```{r gc-plot, fig.cap='GC content abundance in a ChIP-seq experiment'}
ggplot( ggplot(
...@@ -1043,16 +1044,21 @@ highly diverging enrichment of different ChIP regions, then ...@@ -1043,16 +1044,21 @@ highly diverging enrichment of different ChIP regions, then
some of the samples were affected by unknown batch affects. Such effects some of the samples were affected by unknown batch affects. Such effects
need to be taken into account in downstream analysis. need to be taken into account in downstream analysis.
Firstly, we will reorder the columns of the data.frame using the gather Firstly, we will reorder the columns of the data.frame using the **pivot_longer**
function from the tidyr package. function from the tidyr package.
```{r gc-tidy} ```{r gc-tidy}
# load the tidyr package # load the tidyr package
library(tidyr) library(tidyr)
# gather converts a fat data.frame into a tall data.frame, # pivot_longer converts a fat data.frame into a tall data.frame,
# which is the format used by the ggplot package # which is the format used by the ggplot package
gcd = pivot_longer(data = gc, experiment, cpm, -GC) gcd = pivot_longer(
data = gc,
cols = -GC,
names_to = 'experiment',
values_to = 'cpm'
)
# we select the ChIP files corresponding to the ctcf experiment # we select the ChIP files corresponding to the ctcf experiment
gcd = subset(gcd, grepl('CTCF', experiment)) gcd = subset(gcd, grepl('CTCF', experiment))
...@@ -1065,7 +1071,7 @@ Now we can visualize the relationship using a scatterplot. ...@@ -1065,7 +1071,7 @@ Now we can visualize the relationship using a scatterplot.
Figure \@ref(fig:gc-tidy-plot) compares the GC content dependency on the CPM between Figure \@ref(fig:gc-tidy-plot) compares the GC content dependency on the CPM between
the first and the second CTCF replicate. the first and the second CTCF replicate.
```{r gc-tidy-plot} ```{r gc-tidy-plot, warning = FALSE, fig.cap='Comparison of GC content and signal abundance between two CTCF biological replicates'}
ggplot(data = gcd, aes(GC, log10(cpm+1))) + ggplot(data = gcd, aes(GC, log10(cpm+1))) +
geom_point(size=2, alpha=.05) + geom_point(size=2, alpha=.05) +
theme_bw() + theme_bw() +
...@@ -1259,7 +1265,7 @@ annotateReads = function(bam_file, annotation_list){ ...@@ -1259,7 +1265,7 @@ annotateReads = function(bam_file, annotation_list){
# find overlaps between reads and annotation # find overlaps between reads and annotation
result = as.data.frame( result = as.data.frame(
findOverlaps(bam, annotation_list)) findOverlaps(bam, annotation_list)
) )
# appends to the annotation index the corresponding # appends to the annotation index the corresponding
...@@ -1302,7 +1308,7 @@ annotateReads = function(bam_file, annotation_list){ ...@@ -1302,7 +1308,7 @@ annotateReads = function(bam_file, annotation_list){
We execute the annotation function on all files. We execute the annotation function on all files.
```{r read-annot.loop} ```{r read-annot.loop, warning = FALSE}
# list all bam files in the folder # list all bam files in the folder
bam_files = list.files(data_path, full.names=TRUE, pattern='bam$') bam_files = list.files(data_path, full.names=TRUE, pattern='bam$')
...@@ -1331,8 +1337,13 @@ annot_reads_df$experiment = experiment_name ...@@ -1331,8 +1337,13 @@ annot_reads_df$experiment = experiment_name
And plot the results. And plot the results.
```{r read-annot-plot} ```{r read-annotation-plot, eval=TRUE, warning = FALSE, fig.cap='Read distribution in genomice functional annotation categories'}
ggplot(data = annot_reads_df, aes(experiment, frequency, fill=annotation)) + ggplot(data = annot_reads_df,
aes(
x = experiment,
y = frequency,
fill = annotation
)) +
geom_bar(stat='identity') + geom_bar(stat='identity') +
theme_bw() + theme_bw() +
scale_fill_brewer(palette='Set2') + scale_fill_brewer(palette='Set2') +
...@@ -1346,10 +1357,11 @@ ggplot(data = annot_reads_df, aes(experiment, frequency, fill=annotation)) + ...@@ -1346,10 +1357,11 @@ ggplot(data = annot_reads_df, aes(experiment, frequency, fill=annotation)) +
ggtitle('Percentage of reads in annotation') ggtitle('Percentage of reads in annotation')
``` ```
Figure \@ref(fig:read-annot-plot) shows a slight increase of **H3K36me3** on the exons Figure \@ref(fig:read-annotation-plot) shows a slight increase of **H3K36me3** on the exons
and introns, and **H3K4me3** on the **TSS**. Interestingly, both replicates of the **ZNF143** and introns, and **H3K4me3** on the **TSS**. Interestingly, both replicates of the **ZNF143**
transcription factor show increased read abundance around the TSS. transcription factor show increased read abundance around the TSS.
## Peak calling\index{Peak calling} ## Peak calling\index{Peak calling}
After we are convinced that the data is of sufficient quality, we can After we are convinced that the data is of sufficient quality, we can
...@@ -1862,14 +1874,13 @@ cont_track = DataTrack( ...@@ -1862,14 +1874,13 @@ cont_track = DataTrack(
```{r peak-calling-signal-profile-plot, fig.cap='ChIP and Input signal profile in around the peak centers.'} ```{r peak-calling-signal-profile-plot, fig.cap='ChIP and Input signal profile in around the peak centers.'}
plotTracks(list(chr_track, plotTracks(
axis, trackList = list(chr_track, axis, peaks_track, chip_track, cont_track),
peaks_track, sizes = c(.2,.5,.5,1,1),
chip_track, background.title = "black",
cont_track), from = start(ctcf_peaks)[1] - 1000,
sizes=c(.2,.5,.5,1,1), background.title = "black", to = end(ctcf_peaks)[1] + 1000
from = start(ctcf_peaks)[1] - 1000, )
to = end(ctcf_peaks)[1] + 1000)
``` ```
Figure \@ref(fig:peak-calling-signal-profile-plot) ChIP sample looks as expected. Figure \@ref(fig:peak-calling-signal-profile-plot) ChIP sample looks as expected.
...@@ -1970,8 +1981,8 @@ peak_track = AnnotationTrack(reduce(h3k36_peaks), name='H3K36me3') ...@@ -1970,8 +1981,8 @@ peak_track = AnnotationTrack(reduce(h3k36_peaks), name='H3K36me3')
# plots the enriched region # plots the enriched region
plotTracks( plotTracks(
plotTracks = c(chr_track, axis, gene_track, peak_track, data_tracks), trackList = c(chr_track, axis, gene_track, peak_track, data_tracks),
sizes = c(.5,.5,.5,.1,1,1), sizes = c(.5,.5,.5,.1,1,1),
background.title = "black", background.title = "black",
collapseTranscripts = "longest", collapseTranscripts = "longest",
transcriptAnnotation = "symbol", transcriptAnnotation = "symbol",
...@@ -2054,7 +2065,11 @@ h3k36_counts$qvalue = NULL ...@@ -2054,7 +2065,11 @@ h3k36_counts$qvalue = NULL
# reshape the data.frame into a long format # reshape the data.frame into a long format
h3k36_counts_df = tidyr::pivot_longer( h3k36_counts_df = tidyr::pivot_longer(
data = h3k36_counts, experiment, counts, -enriched) data = h3k36_counts,
cols = -enriched,
names_to = 'experiment',
values_to = 'counts'
)
# sum the number of reads in the Peak and Not Peak regions # sum the number of reads in the Peak and Not Peak regions
h3k36_counts_df = group_by(.data = h3k36_counts_df, experiment, enriched) h3k36_counts_df = group_by(.data = h3k36_counts_df, experiment, enriched)
...@@ -2135,6 +2150,9 @@ motifs ...@@ -2135,6 +2150,9 @@ motifs
We will extract the CTCF from the [@khan_2018] database. We will extract the CTCF from the [@khan_2018] database.
```{r, peak-quality.jaspar} ```{r, peak-quality.jaspar}
# based on the MotifDB version, the location of the CTCF motif
# might change, if you do not get the expected results please try
# to subset with different indices
ctcf_motif = motifs[[1]] ctcf_motif = motifs[[1]]
``` ```
...@@ -2614,12 +2632,12 @@ motif visualization. ...@@ -2614,12 +2632,12 @@ motif visualization.
We will use the plot function to visualize the most enriched DNA motif: We will use the plot function to visualize the most enriched DNA motif:
```{r motif-discovery.seqlogo, fig.cap='Motif with highest enrichment in top 500 CTCF peaks', width = 6, height = 3} ```{r motif-discovery-logo, fig.cap='Motif with highest enrichment in top 500 CTCF peaks', width = 6, height = 3}
# visualize the resulting motif # visualize the resulting motif
plot(novel_motifs[1]) plot(novel_motifs[1])
``` ```
The motif show in Figure \@ref(fig:motif-discovery.seqlogo) corresponds to the The motif show in Figure \@ref(fig:motif-discovery-logo) corresponds to the
previously visualized CTCF motif. Nevertheless, we will now computationally previously visualized CTCF motif. Nevertheless, we will now computationally
annotate our motif by querying the JASPAR [@khan_2018] database. annotate our motif by querying the JASPAR [@khan_2018] database.
...@@ -2721,7 +2739,7 @@ expression, or assays of 3D genome structure. ...@@ -2721,7 +2739,7 @@ expression, or assays of 3D genome structure.
The choice of downstream analysis is guided by the biological question of interest. The choice of downstream analysis is guided by the biological question of interest.
Often we want to compare our samples to other available ChIP-seq experiments. Often we want to compare our samples to other available ChIP-seq experiments.
It is possible to look at the pairwise differences between samples using It is possible to look at the pairwise differences between samples using
differential peak calling [@zhang_2014,@lun_2014,@allhoff_2014,@allhoff_2016]. differential peak calling [@zhang_2014; @lun_2014; @allhoff_2014; @allhoff_2016].
It is a procedure analogous to the differential expression analysis, except it It is a procedure analogous to the differential expression analysis, except it
results in sets of coordinates that are differentially bound in two biological results in sets of coordinates that are differentially bound in two biological
conditions. We can then search for specific DNA binding motif in such regions, conditions. We can then search for specific DNA binding motif in such regions,
...@@ -2730,7 +2748,7 @@ With an increase in number of ChIP experiment, pairwise comparisons becomes ...@@ -2730,7 +2748,7 @@ With an increase in number of ChIP experiment, pairwise comparisons becomes
combinatorially complex. In this case we can segment the genome into multiple classes, where combinatorially complex. In this case we can segment the genome into multiple classes, where
each class corresponds to a combination of bound transcription factors. each class corresponds to a combination of bound transcription factors.
Genome segmentation is usually done using probabilistic models (such as hidden Genome segmentation is usually done using probabilistic models (such as hidden
Markov models [@ernst_2012,@hoffman_2012]), or machine learning algorithms [@mortazavi_2013]. Markov models [@ernst_2012; @hoffman_2012]), or machine learning algorithms [@mortazavi_2013].
## Exercises: ## Exercises:
...@@ -2797,7 +2815,8 @@ How many motifs do you observe? How do the motifs look like (visualize the motif ...@@ -2797,7 +2815,8 @@ How many motifs do you observe? How do the motifs look like (visualize the motif
Where are the motifs located? [Difficulty: Advanced] Where are the motifs located? [Difficulty: Advanced]
3. Scan the CTCF peaks with top motifs identified in **ZNF143** peaks. 3. Scan the CTCF peaks with top motifs identified in **ZNF143** peaks.
Where are the motifs located? Where are the motifs located? What can you conclude from the previous exercises?
[Difficulty: Advanced]
```{r include=FALSE, eval=TRUE, echo=FALSE} ```{r include=FALSE, eval=TRUE, echo=FALSE}
rm(list=ls()) rm(list=ls())
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment