Motif Analysis

In this tutorial we are going to analyze ChIP-seq datasets for transcription factor motifs. We will have a detailed look at the motifs in the peaks of three T-Box transcription factors, vegt, eomes and t (brachyury), in gastrula-stage X.tropicalis embryos. The data is decribed in Gentsch et al., 2013.

Create the datasets

If you have followed the ChIP-seq tutorial you might have already created some of these files

As an analysis of all genome-wide peaks would take too long, we are going to focus on scaffold_1. Let's first create filtered BAM files.

export d_dir=/data/devcom/bam/

samtools view -b $d_dir/SRR926401.GSM1180932.XBRA12.1.bam scaffold_1 > t.bam
samtools view -b $d_dir/SRR926402.GSM1180933.INPUT_XBRA12.1.bam scaffold_1 > t_input.bam

samtools view -b $d_dir/SRR926403.GSM1180934.XEOMES12.1.bam scaffold_1 > eomes.bam
samtools view -b $d_dir/SRR926404.GSM1180935.INPUT_XEOMES12.1.bam scaffold_1 > eomes_input.bam

samtools view -b $d_dir/SRR926405.GSM1180936.XVEGT12.1.bam scaffold_1 > vegt.bam
samtools view -b $d_dir/SRR926406.GSM1180937.INPUT_VEGT12.1.bam > vegt_input.bam

Note that I'm simplifying the names (using the official gene symbols), and that I name the bam files consistently. This will allow for easier downstream processing.

Now, create the peaks. As we have consistent names, we can do this in a loop.

for f in eomes t vegt; do
  macs2 callpeak -t ${f}.bam -c ${f}_input.bam -f BAM -g 2.15e8 -q 1e-5 -n ${f}
done

I'm using a strict threshold of -q 1e-5 and -g 2.15e8 is the size of scaffold_1.

As noticed in Gentch et al., these three factors share a lot of peaks. How can they differentially regulate their target genes? One explanation can be different co-factors. Let's see if we can find evidence for this.

We will create peaks of the same size, centered at the peak summit. This allows for comparisons between peaks. In addition I will create a FASTA file of the peaks.

bedtools slop -i eomes_summits.bed -b 100 -g genome.txt > eomes_summits.200.bed
bedtools getfasta -fi /data/genomes/xt_7.1/xt_7.1.fa -bed eomes_summits.200.bed -fo eomes_summits.200.fa

Known motifs

We will use a database of vertebrate motifs in combination with GimmeMotifs for motif analysis. The motif database is here:

/usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm

Now, let's scan our FASTA file with the motifs.

gimme scan eomes_summits.200.fa /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm > eomes.motifs.gff

This will scan the sequences in eomes_summits.200.fa for all the motifs present in /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm. While this process is running (it will take a while), try to answer the following question:

How many motifs are in /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm?

Now, gimme scan should be finished. Let's have a look at the results:

$ head -n 1 eomes.motifs.gff; tail -n1 eomes.motifs.gff 
scaffold_1:70431351-70431552    pwmscan misc_feature    95  109 11.2618421899   +   .   motif_name "ZSCAN4_C2H2.1" ; motif_instance "TGCACACCCTCTAAT"
 scaffold_1:189910768-189910969 pwmscan misc_feature    145 152 7.5823179192    -   .   motif_name "GMEB1/2_C2H2" ; motif_instance "GTGCGTAA"

We see the file is in GFF format, with the original sequence name (location) in the first column and the motif information in the 9th column. How many motifs did gimme scan identify?

$ wc -l eomes.motifs.gff
79335 eomes.motifs.gff

That's a lot.

cut -f1 eomes.motifs.gff | sort | uniq -c | head
     32 scaffold_1:100143253-100143454
     30 scaffold_1:100409349-100409550
     37 scaffold_1:100411740-100411941
     32 scaffold_1:100462444-100462645
     29 scaffold_1:100563619-100563820
     36 scaffold_1:100565461-100565662
     42 scaffold_1:100627751-100627952
     30 scaffold_1:100857546-100857747
     31 scaffold_1:100868530-100868731
     31 scaffold_1:100909190-100909391

Around 30-40 motifs per peak. Let's see which motifs are identified. We will use a trick with cut to get the motif name out.

$ cut -f2 -d\" eomes.motifs.gff  | sort -u | wc -l
230

So, the big majority of motifs is found in these peaks. Is that really useful information? Does it make sense?

The problem with transcription factor motif scanning is that you will always find matches. Furthermore, we need to control for random occurrence and motif length. First, we will define a motif-specific threshold, based on a False Discovery Rate (FDR) in random sequences. We will select 10,000 random sequences of the same length as our peaks (200) from the X.tropicalis genome.

$ gimme background random_10k.200.fa random_genomic -l 200 -n 10000 -g xt_7.1

Next we can use this to calculate a threshold based on a specified FDR, for instance 1%.

$ gimme threshold /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm random_10k.200.fa 0.01 > threshold_fdr1.txt

Now let's create a new set of background sequences, and scan both this background set and our peaks for motifs. You can use the threshold file with the -c parameter of gimme scan.

Now you can compare the motif frequencies in the eomes peak file to the motif frequencies in the background file.

Let's check the eomes motif.

grep -i eomes /usr/share/gimmemotifs/motif_databases/vertebrate_clusters.pwm
>EOMES_T-box

It's called EOMES_T-box.

grep -w EOMES_T-box eomes.motifs.fdr1.gff | wc -l
271
grep -w EOMES_T-box random.motifs.fdr1.gff | wc -l
98

This means the enrichment of the eomes motif in peaks compared to background sequences is (271 / 2558) / (98 / 10000) = 10.81. A clear enrichment, so that's comforting. Of course it's not so useful to calculate this enrichment for each motif seperately. Let's do a little shell work. You can get a list of motifs in the eomes peaks with the following command:

$ cut -f2 -d\" eomes.motifs.fdr1.gff | sort -u > eomes.motifs.list

Note what we're doing by supplying the " as delimiter to cut?

De novo motifs

You can also use tools that try to predict TF motifs from the data. The GimmeMotifs pipeline combines several of these tools and provides a comprehensive output report that you can use to evaluate these motifs.

You can run GimmeMotifs as follows:

$ gimme motifs <input.fa> -n <name> -a xl -g xt_7.1 

Let's see if we can use GimmeMotifs to find out the differences between the three T-box datasets.

What would be the most likely candidate for a t-specific motif? What is the closest match in de database?**