R Tutorial: ChIP-seq Workflow

  • 🎬 Video
  • ℹ️ Published 3 years ago

Now that you have learned why ChIP-seq analyses are carried out and have had a chance to take a first look at the data, it is time to talk about what a ChIP-seq analysis workflow looks like. This video will give you an overview of the workflow before we dive into the details in the following chapters.

The first step in a ChIP-seq analysis is to take the collection of reads obtained from the sequencing machine and to locate their position in the genome. This step, known as "read mapping", involves identifying the best match for each read sequence in a standardised version of the genome, the reference genome.

Once the reads are mapped to the reference genome, they are combined into a coverage profile: for each position in the genome, the total number of reads overlapping that position is determined. Specialised algorithms are then used to identify peaks in this coverage profile. These peaks correspond to the likely locations of binding sites for the protein of interest.
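To make the idea of a coverage profile concrete, here is a minimal base R sketch. The read coordinates, the chromosome length and the threshold `min_cov` are made-up values for illustration, and the thresholding step is only a naive stand-in for a real peak caller, which would use a statistical model and control samples.

```r
# Toy mapped reads on a single 30 bp chromosome (1-based, inclusive coordinates).
# All values here are invented for illustration.
reads <- data.frame(
  start = c(3, 5, 6, 8, 20, 21),
  end   = c(10, 12, 13, 15, 24, 25)
)
chrom_length <- 30

# Coverage via a difference array: +1 where a read starts, -1 just past its end.
delta <- numeric(chrom_length + 1)
for (i in seq_len(nrow(reads))) {
  delta[reads$start[i]] <- delta[reads$start[i]] + 1
  delta[reads$end[i] + 1] <- delta[reads$end[i] + 1] - 1
}
coverage <- cumsum(delta)[seq_len(chrom_length)]

# Naive "peak calling": runs of positions with coverage at or above a threshold.
min_cov <- 3  # hypothetical threshold, not a recommended value
runs <- rle(coverage >= min_cov)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1
peaks <- data.frame(start = run_start[runs$values], end = run_end[runs$values])
peaks$max_coverage <- mapply(
  function(s, e) max(coverage[s:e]),
  peaks$start, peaks$end
)

print(coverage)
print(peaks)
```

The first cluster of overlapping reads produces a single peak, while the second cluster never reaches the threshold and is ignored.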

While it is possible to perform read mapping and peak calling in R, typical ChIP-seq pipelines use dedicated tools for these steps. Following this practice, you will start the R workflow in Chapter 2 by importing mapped reads and peak calls into R.
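As a rough illustration of what importing peak calls involves, the sketch below reads a few BED-style records with base R. The records and column names are invented for this example; in practice you would more likely use a Bioconductor importer such as `rtracklayer::import()`, which takes care of the coordinate conventions for you.

```r
# Three made-up peak calls in BED-like format:
# chromosome, start (0-based), end (exclusive), name, score.
bed_lines <- c(
  "chr1\t100\t250\tpeak_1\t37",
  "chr1\t900\t1100\tpeak_2\t82",
  "chr2\t40\t300\tpeak_3\t55"
)

peaks <- read.table(
  text = bed_lines,
  sep = "\t",
  col.names = c("chrom", "start", "end", "name", "score"),
  stringsAsFactors = FALSE
)

# BED intervals are 0-based and half-open, so the width is simply end - start.
peaks$width <- peaks$end - peaks$start

# Convert to the 1-based, inclusive coordinates commonly used in R.
peaks$start <- peaks$start + 1

print(peaks)
```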

Before starting the main analysis, it is important to ensure that the data are of good quality and to deal with any apparent problems. You will learn about quality control procedures in the second part of Chapter 2.

Once the data have been cleaned of artifacts, the main analysis can begin. The first goal is to identify interesting peaks. For a peak to be of interest, it has to be more than just a protein binding site: it needs to play a direct role in the difference between samples. For our example, this means that you will identify AR binding sites that are preferentially used in either primary or treatment-resistant tumors. You will learn how to do this in Chapter 3.
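The following toy example hints at what "preferentially used" means in terms of numbers. The read counts, sample names and the cut-off of one log2 unit are all assumptions made for illustration; this is not the method used in Chapter 3, and a real analysis would normalise the counts and apply a proper statistical test.

```r
# Made-up read counts per peak (rows) and sample (columns).
counts <- matrix(
  c(120, 110,  30,  25,
     40,  45,  42,  38,
     10,  12,  95, 105),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("peak_a", "peak_b", "peak_c"),
    c("primary_1", "primary_2", "resistant_1", "resistant_2")
  )
)
group <- c("primary", "primary", "resistant", "resistant")

# Average count per group, then a log2 fold change with a pseudocount of 1.
mean_primary   <- rowMeans(counts[, group == "primary"])
mean_resistant <- rowMeans(counts[, group == "resistant"])
log2_fc <- log2((mean_resistant + 1) / (mean_primary + 1))

# Label peaks using an arbitrary cut-off of one log2 unit (a two-fold change).
status <- ifelse(
  log2_fc >= 1, "resistant-enriched",
  ifelse(log2_fc <= -1, "primary-enriched", "unchanged")
)

print(data.frame(log2_fc = round(log2_fc, 2), status = status))
```

Here `peak_b` is a binding site in every sample but does not differ between groups, so it would not count as interesting in the sense described above.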

Once you have a list of interesting peaks, the challenge is to understand what it all means. Simply having a list with the locations of binding sites is not very helpful in understanding the biological mechanisms that are responsible for treatment resistance in prostate cancer. In Chapter 4 you will learn how to use a variety of data sources to attach meaning to the observed differences between samples.

Before moving on, let's look at some ways in which we can visualize these data at a high level. Heat maps can be useful to highlight broad similarities and differences between samples. A heat map arranges samples and genomic regions by similarity. This creates groupings of samples that can be compared to known experimental conditions. The grouping of genomic regions helps to emphasize common patterns across samples. The `heatmap()` function facilitates the creation of these plots.
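A small sketch of such a heat map, using the base R `heatmap()` function on an invented matrix of peak signal values (the sample names and numbers are assumptions, not the course data):

```r
# Made-up signal values: six peaks (rows) in four samples (columns).
mat <- matrix(
  c(20, 22,  5,  4,
    18, 19,  6,  5,
    21, 20,  4,  6,
     5,  4, 19, 21,
     6,  5, 22, 20,
     4,  6, 20, 18),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    paste0("peak", 1:6),
    c("primary_1", "primary_2", "resistant_1", "resistant_2")
  )
)

# heatmap() clusters rows and columns hierarchically and reorders them,
# so that similar samples and similar peaks end up next to each other.
hm <- heatmap(mat, scale = "row", margins = c(8, 6))

# The returned list records the order in which rows and columns were drawn.
print(colnames(mat)[hm$colInd])
print(rownames(mat)[hm$rowInd])
```

In this toy data set the two primary samples end up side by side, as do the two resistant samples, which is the kind of grouping you would compare against the known experimental conditions.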

Another helpful way to compare samples is to consider the number of peaks shared between samples as well as those unique to a given sample. The *UpSetR* package provides useful plots for this purpose. This makes it easy to quickly assess the degree of similarity between different sample sets.
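The numbers behind such a plot can be sketched in base R. The peak identifiers and sample names below are made up, and the example assumes that peaks from different samples have already been matched to a common set of identifiers. The final step draws the plot with `upset()` and `fromList()` from *UpSetR*, but only if the package is installed.

```r
# Made-up sets of peak identifiers found in each sample.
peak_sets <- list(
  primary_1   = c("peak_01", "peak_02", "peak_03", "peak_04"),
  primary_2   = c("peak_01", "peak_02", "peak_03", "peak_05"),
  resistant_1 = c("peak_01", "peak_06", "peak_07")
)

# Membership matrix: one row per peak, one column per sample, 1 = peak present.
all_peaks <- sort(unique(unlist(peak_sets)))
membership <- sapply(peak_sets, function(ids) as.integer(all_peaks %in% ids))
rownames(membership) <- all_peaks

# Peaks found in every sample, and peaks found in exactly one sample.
shared_by_all <- all_peaks[rowSums(membership) == length(peak_sets)]
unique_to_one <- all_peaks[rowSums(membership) == 1]

# Number of peaks shared by each pair of samples (diagonal = set sizes).
pairwise_shared <- crossprod(membership)

print(shared_by_all)
print(unique_to_one)
print(pairwise_shared)

# Optional: draw the UpSet plot if the UpSetR package is available.
if (requireNamespace("UpSetR", quietly = TRUE)) {
  print(UpSetR::upset(UpSetR::fromList(peak_sets), order.by = "freq"))
}
```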

Before you start working through the steps of the workflow in Chapter 2, let's take a look at where all this is going. In the following exercises you have a chance to take a first look at some of the results.

#R #RTutorial #DataCamp #ChIPseq #Bioconductor