Table of Contents

ChIP-seq analysis

ChIP-seq analysis

Alignment

Basically all ChIP-seq experiments start with alignment of the reads to the reference genome. There are many aligners that you can use, in the case of ChIP-seq, it doesn't really make a big difference which one you choose. Two of the most commonly used ones are bowtie and BWA. We will use BWA in this tutorial. To use the aligner you will first have to create an index of the reference genome. This index allows the aligner to map the reads with acceptable speeds, however, alignment of millions of reads to the full reference genome still takes quite a while. Therefore we will limit the alignment in this workshop to one chromosome and a subset of the reads.

The X.tropicalis genome, JGI version 7.1, is located in the /data/genomes/ directory. The first 10 scaffolds (scaffold_1, scaffold_2, ...) correspond to the 10 tropicalis chromosomes. Pick one of these scaffolds, and copy the FASTA file to your own directory.

However, let's first start with creating a specific directoy for today's exercises, to keep things tidy.

$ mkdir chip-seq

$ cp /data/genomes/xt_7.1/scaffolds/scaffold_7.fa /home/simon/chip-seq/

Now we can create the index with the bwa command.

$ cd /home/simon/chip-seq
$ bwa index scaffold_7.fa

OK, this will take a couple of minutes. While bwa is running, let's prepare our dataset. Switch to another terminal, or open a new Putty session to the server. We will use the T-box factor ChIP-seq data from the Smith lab: In Vivo T-Box Transcription Factor Profiling Reveals Joint Regulation of Embryonic Neuromesodermal Bipotency. This dataset is present in /data/devcom/raw/. As alignment of all data would take to long for this workshop, we will randomly sample a limited number of reads from one of the FASTQ files. This can be done using seqtk.

$ seqtk sample /data/devcom/raw/SRR926401.GSM1180932.XBRA12.1.fq.gz 1000000 > xbra_sample.fq

How many lines will the output file xbra_sample.fq contain?

Once bwa index and seqtk are finished, let's start the alignment!

$ bwa aln -t 4 scaffold_7.fa xbra_sample.fq | bwa samse scaffold_7.fa - xbra_sample.fq > xbra_sample.sam

Essentially, we are now running two commands, chained together by a pipe. The first command, bwa aln, performs the alignment and produces a sai file. This is a format that we can't really use, so we use the bwa samse command to convert this file to a SAM file. We could also run these two commands seperately, but here we don't save the intermediate sai file to disk. It is kept in memory, which saves disk space.

Convert the aligned SAM file to a sorted, indexed BAM file.
What is the fraction of mapped and unmapped reads?
How many reads are uniquely mapped?
Can you rewrite the bwa alignment step so that it directly produces a sorted bam file, without writing a SAM file to disk? Hint: use pipelines!

There is a SAM flag that tells you if a read is a duplicate or not. Check which flag this is at http://broadinstitute.github.io/picard/explain-flags.html and check how many duplicates are present in your BAM file.

There are none. Surprised?

Many (all?) aligners don't mark duplicates by default. This is something that you will have to do yourself, as it's very useful, if not essential, to have this information. You can use Picard Tools, which is a set of command line tools that do many useful things with high-throughput sequencing data. One of them is marking duplicates.

$ MarkDuplicates INPUT=in.bam OUTPUT=out.bam METRICS_FILE=metrics.txt ASSUME_SORTED=true

There are several required arguments. The input and output files, here in.bam and out.bam respectively, and a file to write metrics to. The last parameter ASSUME_SORTED=true is also required in our case, as our file is sorted, with samtools sort. This tool does not set a flag in the header that specifies that the file is correctly sorted, but Picard Tools requires this.

Create an indexed BAM file with duplicates marked. How many duplicates are in your BAM file?

Visualization

For all next-generation sequence experiments, including ChIP-seq, it is essential to be able to visualize your data. This will allow you to check your findings, spot artefacts and generate hypotheses. We will use the UCSC Genome Browser. Alternatives are the Ensembl Genome Browser or IGV.

The current version of the X.tropicalis genome is not yet on UCSC, so we will use the mirror kindly provided by Mike: http://genomes.nimr.mrc.ac.uk/.

Upload your BAM file to the UCSC Genome Browser at NIMR. This will allow you to inspect the reads. However for ChIP-seq this view is limited. We will create a summarized view of the signal and make a bigWig track. This can be done using bedtoools, the "swiss-army knife of tools for a wide-range of genomics analysis tasks".

First, convert the BAM file to a BED file.

$ bedtools bamtobed -i XBRA.bam > XBRA.bed

Now, we can extend the reads to the estimated fragment length. There are ways to calculate this, but in this case we will just use 200 bp. This bedtools command, like many other commands, requires a genome file. This is just a tab-separated text file containing the chromosome names and sizes. One of the handy ways to create this file is from the BAM header, that contains this information. The samtools view -H command shows only the header of a BAM file:

$ samtools view -H XBRA.bam | head -n 5

@CO Duplicates marked (bamUtil Version: 1.0.2)
@SQ SN:scaffold_1   LN:215906545    AS:xenTro3beta
@SQ SN:scaffold_2   LN:150913983    AS:xenTro3beta
@SQ SN:scaffold_3   LN:117219152    AS:xenTro3beta
@SQ SN:scaffold_3b  LN:4137994  AS:xenTro3beta

Now, let's grep out the @SQ lines and retrieve the columns of interest:

$ samtools view -H XBRA.bam | grep "@SQ" |cut -f2,3 | head -n 5

SN:scaffold_1   LN:215906545
SN:scaffold_2   LN:150913983
SN:scaffold_3   LN:117219152
SN:scaffold_3b  LN:4137994
SN:scaffold_4   LN:140444050

Almost there! Remember the sed command?

$ samtools view -H XBRA.bam | grep "@SQ" |cut -f2,3 | sed 's/SN://' | sed 's/LN://' | head -n 5

scaffold_1  215906545
scaffold_2  150913983
scaffold_3  117219152
scaffold_3b 4137994
scaffold_4  1404440

Perfect! Let's redirect this to a file (remember to remove the head command).

$ samtools view -H XBRA.bam | grep "@SQ" |cut -f2,3 | sed 's/SN://' | sed 's/LN://' > genome.txt

Now we can extend the reads using bedtools slop.

$ bedtools slop -i XBRA12.bed -r 160 -l 0 -s -g genome.txt > XBRA12.extend.bed

The bedtools genomecov command creates a bedGraph file with the read coverage at each position.

$ bedtools genomecov -i XBRA12.extend.bed -bg -g genome.txt  > XBRA12.extend.bg

Almost done. The last step is to convert the bedGraph to bigWig file.

$ wigToBigWig XBRA12.extend.bg genome.txt XBRA12.extend.bw

This bigWig file needs to be present at a web-accessible location so that the UCSC browser can access it. For this purpose we have created a shared location that you can use to put your files. You can transfer your files (with scp or WinSCP) to mb06.science.ru.nl with the account devcom. Password will be shared during the workshop. Make a directory with your own name and put your files there, as the general directory is shared between all workshop particpants.

Let's say that I have uploaded XBRA12.extend.bw to this location in the simon directory. This will now be accessible at the following location:

http://mbdata.science.ru.nl/share/devcom/simon/XBRA12.extend.bw

We now have to create a custom track Wiggle file that described the bigWig file (details here. In my example this will look something like this:

track type=bigWig name="xbra" descriptions="xbra stage 12" bigDataUrl=http://mbdata.science.ru.nl/share/devcom/simon/XBRA12.extend.bw

This the minimum, you can add a lot of option values to control scale, color, etc.

Peak calling

For ChIP-seq, the most common scenario is that you want to identify enriched regions, also known as peaks. As usual there is a multitude of programs available to do this. We will use MACS, one of the more common peak callers, which usually works well and gives reasonably good results.

To keep the running time short, we will run MACS on only one chromosome (or scaffold, in the current tropicalis assembly). The full datasets ara available in /data/devcom/bam/. You can use samtools to filter out all reads for one specific chromosome. Let's take XBRA as an example.

$ samtools view -b /data/devcom/bam/SRR926401.GSM1180932.XBRA12.1.bam scaffold_1  > XBRA.scaffold_1.bam
$ samtools index XBRA.scaffold_1.bam

And also take the input track along:

$ samtools view -b /data/devcom/bam/SRR926402.GSM1180933.INPUT_XBRA12.1.bam scaffold_1  > XBRA.input.scaffold_1.bam
$ samtools index XBRA.input.scaffold_1.bam

Now we can run MACS.

$ macs2 callpeak -t XBRA.scaffold_1.bam -c XBRA.input.scaffold_1.bam -n XBRA -f BAM -g 215906545

Notice the -g parameter? This the genome size. However, in our case we're only using reads mapped to scaffold_1, so it's the size of scaffold_1. Normally you would supply the complete genome size for your species.

Once MACS is finished, you'll see several different files:

XBRA_model.r
XBRA_peaks.xls
XBRA_peaks.narrowPeak
XBRA_summits.bed

Let's create a BED file with peaks you can upload the Genome Browser.

$ cut -f1-3 XBRA_peaks.narrowPeak > XBRA_peaks.bed

Add the following line to the top of the file as the first line (for instance using nano)

track name=XBRA_peaks

Now you can upload this file to the UCSC Genome Browser at NIMR. How do the peaks look? Do you trust the peak calling? Do all the peaks look good to you? Did MACS miss any peaks? If you want to change the threshold of the peak calling, you can use the -q parameter. This increases or decreases the FDR (False Discovery Rate) cutoff. Try the peak calling again with -q 1e-5 and have a look at the differences.

Call peaks on all datasets (VEGT, XBRA, EOMES, H3K4me3, H3K4me1, p300) in the /data/devcom/bam directory. The last three datasets have a shared input track as control (Input_stage9_12.bam). Remember to filter out one specific scaffold, and have a look at the results, tweak the -q parameter if necessary.

Intersections

You can use bedtools intersect to create intersections between BED files. See this page for an overview.

How many peaks are shared between the different T-box factors? Are there specific peaks? Do the profiles in the Genome Browser support your hypothesis?
Use the peak files you created to determine a set of promoters and a set of enhancers. Determine the number of fraction of T-box factor peaks in promoters and enhancers.