In this tutorial we are going to analyze ChIP-seq datasets for transcription factor motifs. We will have a detailed look at the motifs in the peaks of three T-Box transcription factors, vegt, eomes and t (brachyury), in gastrula-stage X.tropicalis embryos. The data is decribed in Gentsch et al., 2013.
If you have followed the ChIP-seq tutorial you might have already created some of these files
As an analysis of all genome-wide peaks would take too long, we are going to focus on scaffold_1. Let's first create filtered BAM files.
export d_dir=/data/devcom/bam/
samtools view -b $d_dir/SRR926401.GSM1180932.XBRA12.1.bam scaffold_1 > t.bam
samtools view -b $d_dir/SRR926402.GSM1180933.INPUT_XBRA12.1.bam scaffold_1 > t_input.bam
samtools view -b $d_dir/SRR926403.GSM1180934.XEOMES12.1.bam scaffold_1 > eomes.bam
samtools view -b $d_dir/SRR926404.GSM1180935.INPUT_XEOMES12.1.bam scaffold_1 > eomes_input.bam
samtools view -b $d_dir/SRR926405.GSM1180936.XVEGT12.1.bam scaffold_1 > vegt.bam
samtools view -b $d_dir/SRR926406.GSM1180937.INPUT_VEGT12.1.bam > vegt_input.bam
Note that I'm simplifying the names (using the official gene symbols), and that I name the bam files consistently. This will allow for easier downstream processing.
Now, create the peaks. As we have consistent names, we can do this in a loop.
for f in eomes t vegt; do
macs2 callpeak -t ${f}.bam -c ${f}_input.bam -f BAM -g 2.15e8 -q 1e-5 -n ${f}
done
I'm using a strict threshold of -q 1e-5 and -g 2.15e8 is the size of scaffold_1.
As noticed in Gentch et al., these three factors share a lot of peaks. How can they differentially regulate their target genes? One explanation can be different co-factors. Let's see if we can find evidence for this.
We will create peaks of the same size, centered at the peak summit. This allows for comparisons between peaks. In addition I will create a FASTA file of the peaks.
bedtools slop -i eomes_summits.bed -b 100 -g genome.txt > eomes_summits.200.bed
bedtools getfasta -fi /data/genomes/xt_7.1/xt_7.1.fa -bed eomes_summits.200.bed -fo eomes_summits.200.fa
We will use a database of vertebrate motifs in combination with GimmeMotifs for motif analysis. The motif database is here:
/usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm
Now, let's scan our FASTA file with the motifs.
gimme scan eomes_summits.200.fa /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm > eomes.motifs.gff
This will scan the sequences in eomes_summits.200.fa for all the motifs present in /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm. While this process is running (it will take a while), try to answer the following question:
How many motifs are in
/usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm?
Now, gimme scan should be finished. Let's have a look at the results:
$ head -n 1 eomes.motifs.gff; tail -n1 eomes.motifs.gff
scaffold_1:70431351-70431552 pwmscan misc_feature 95 109 11.2618421899 + . motif_name "ZSCAN4_C2H2.1" ; motif_instance "TGCACACCCTCTAAT"
scaffold_1:189910768-189910969 pwmscan misc_feature 145 152 7.5823179192 - . motif_name "GMEB1/2_C2H2" ; motif_instance "GTGCGTAA"
We see the file is in GFF format, with the original sequence name (location) in the first column and the motif information in the 9th column. How many motifs did gimme scan identify?
$ wc -l eomes.motifs.gff
79335 eomes.motifs.gff
That's a lot.
cut -f1 eomes.motifs.gff | sort | uniq -c | head
32 scaffold_1:100143253-100143454
30 scaffold_1:100409349-100409550
37 scaffold_1:100411740-100411941
32 scaffold_1:100462444-100462645
29 scaffold_1:100563619-100563820
36 scaffold_1:100565461-100565662
42 scaffold_1:100627751-100627952
30 scaffold_1:100857546-100857747
31 scaffold_1:100868530-100868731
31 scaffold_1:100909190-100909391
Around 30-40 motifs per peak. Let's see which motifs are identified. We will use a trick with cut to get the motif name out.
$ cut -f2 -d\" eomes.motifs.gff | sort -u | wc -l
230
So, the big majority of motifs is found in these peaks. Is that really useful information? Does it make sense?
The problem with transcription factor motif scanning is that you will always find matches. Furthermore, we need to control for random occurrence and motif length. First, we will define a motif-specific threshold, based on a False Discovery Rate (FDR) in random sequences. We will select 10,000 random sequences of the same length as our peaks (200) from the X.tropicalis genome.
$ gimme background random_10k.200.fa random_genomic -l 200 -n 10000 -g xt_7.1
Next we can use this to calculate a threshold based on a specified FDR, for instance 1%.
$ gimme threshold /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm random_10k.200.fa 0.01 > threshold_fdr1.txt
Now let's create a new set of background sequences, and scan both this background set and our peaks for motifs. You can use the threshold file with the -c parameter of gimme scan.
Now you can compare the motif frequencies in the eomes peak file to the motif frequencies in the background file.
Let's check the eomes motif.
grep -i eomes /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm
>EOMES_T-box
It's called EOMES_T-box.
grep -w EOMES_T-box eomes.motifs.fdr1.gff | wc -l
271
grep -w EOMES_T-box random.motifs.fdr1.gff | wc -l
98
This means the enrichment of the eomes motif in peaks compared to background sequences is (271 / 2558) / (98 / 10000) = 10.81. A clear enrichment, so that's comforting. Of course it's not so useful to calculate this enrichment for each motif seperately. Let's do a little shell work. You can get a list of motifs in the eomes peaks with the following command:
$ cut -f2 -d\" eomes.motifs.fdr1.gff | sort -u > eomes.motifs.list
Note what we're doing by supplying the
"as delimiter tocut?
You can also use tools that try to predict TF motifs from the data. The GimmeMotifs pipeline combines several of these tools and provides a comprehensive output report that you can use to evaluate these motifs.
You can run GimmeMotifs as follows:
$ gimme motifs <input.fa> -n <name> -a xl -g xt_7.1
Let's see if we can use GimmeMotifs to find out the differences between the three T-box datasets.
bedtools, of t-specific peaks, peaks that are not shared with eomes or vegt. Run gimme motifs on that peak set. Do you get any good motifs?What would be the most likely candidate for a t-specific motif? What is the closest match in de database?**
gimme motifs in a similar analysis as before (gimme scan, gimme threshold) to compare the motif occurrences in vegt, eomes and t peaks. Is there indeed a motif that is specific for t?