FASTQ basics and quality control

The /data/devcom/raw/ directory contains the FASTQ files used in this practical. These are the raw data of the ChIP-seq practical, all produced by an Illumina sequencer.

How many FASTQ files are there? What is the total size of these FASTQ files?

The .gz extension shows that these files are compressed with gzip. We could gunzip them, but the uncompressed files would take up much more space. Instead, we can (most of the time) work directly with the gzipped files. One handy utility is zcat, which does the same as cat but works on gzipped files. For instance, to view the first 10 lines of a gzipped FASTQ file:

$ zcat my_fastq_file.fq.gz | head

View the first 20 lines of a FASTQ file. How many sequences do these 20 lines contain?
Can you deduce on which lane this experiment was run?
How many sequences are present in total in this FASTQ file?

The command wc -L prints the length of the longest line in a file (note: -L instead of -l). For these FASTQ files, which come from an Illumina sequencer, you can assume that all sequences are equal in length. This is not always true: other sequencing technologies can produce variable-length reads.
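As a minimal sketch of how wc -L can be combined with zcat, the following builds a tiny, made-up gzipped FASTQ file and reports its longest line. The file name example.fq.gz and the two reads are hypothetical and only serve as an illustration. It assumes GNU wc, as -L is not available in every implementation.

```shell
# Build a tiny, made-up FASTQ file (two reads of 8 bases each) in a temporary directory.
tmpdir=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\nIIIIIIII\n' | gzip > "$tmpdir/example.fq.gz"

# Decompress to standard output and print the length of the longest line.
# Note that wc -L looks at every line, not only at the sequence lines.
longest=$(zcat "$tmpdir/example.fq.gz" | wc -L)
echo "$longest"
```

The same pipeline should work on one of the files in /data/devcom/raw/ by replacing the temporary file with its path.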

What is the sequence length of the reads in your FASTQ file? Would this command work on every FASTQ file? If not, can you think of a command that would?

These were all very basic commands to inspect FASTQ files. The FastQC program allows you to get a much more comprehensive view of the quality of your sequences.

You can run FastQC with the fastqc command. Use the --help option to have a look at all the options.

$ fastqc --help

The basic usage is very simple:

$ fastqc my_fastq_file.fq.gz -o output_dir

The FASTQ file can be gzipped (but doesn't have to be). The output_dir is important in our case: without it, FastQC tries to write its report to the same location as the FASTQ file, which here is the shared /data/devcom/raw/ directory.

So, for example:

$ pwd

/home/simon

$ mkdir fastqc_report

$ fastqc /data/devcom/raw/SRR926409.GSM1180940.XBRA20.1.fq.gz -o /home/simon/fastqc_report

This produces an HTML file in the /home/simon/fastqc_report directory with a lot of quality metrics.
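FastQC accepts several input files in one call, so a run over more than one file might look like the sketch below. It assumes that fastqc is on your PATH and that all files of interest end in .fq.gz (an assumption based on the example file name above). The output directory is created first, because FastQC does not create it for you.

```shell
# Output directory in your home directory; it has to exist before FastQC runs.
outdir="$HOME/fastqc_report"
mkdir -p "$outdir"

# Run FastQC on all gzipped FASTQ files (assumed to end in .fq.gz), if the tool is available.
if command -v fastqc >/dev/null 2>&1; then
    fastqc /data/devcom/raw/*.fq.gz -o "$outdir"
else
    echo "fastqc not found on PATH" >&2
fi

# List whatever reports were written.
ls -l "$outdir"
```

Using mkdir -p means the command does not fail if the directory already exists from an earlier run.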

Run FastQC on one (or more) FASTQ file(s) in the /data/devcom/raw/ directory. Transfer the resulting HTML file to your computer using WinSCP and view it in a web browser.
Have a look at the 'Per base sequence quality'. What do you notice with regard to read quality and position in the read?
How do the other quality metrics look? Any specifics that jump out?

Bonus exercise: trimming reads

There are various tools to trim reads and remove adapter and/or low quality sequences. Examples are (in no particular order):

cutadapt - http://code.google.com/p/cutadapt/
PRINSEQ - http://prinseq.sourceforge.net/
TrimGalore - http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
FASTX - http://hannonlab.cshl.edu/fastx_toolkit/
Trimmomatic - http://www.usadellab.org/cms/index.php?page=trimmomatic

Try to figure out how to run one of these tools on a FASTQ file of interest.

What is the effect?
How many sequences are completely removed?
What is the average size of the trimmed reads?
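One possible way to check the last two questions is to count the reads and compute their mean length with awk. The sketch below is an illustration only: the file trimmed.fq.gz and its reads are made up, and it assumes the common four-line FASTQ layout in which sequences are not wrapped over several lines.

```shell
# Build a tiny, made-up "trimmed" FASTQ file with reads of 8 and 4 bases.
tmpdir=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGG\n+\nIIII\n' | gzip > "$tmpdir/trimmed.fq.gz"

# In a four-line FASTQ record the sequence is the second line,
# so only lines with NR % 4 == 2 are counted and measured.
summary=$(zcat "$tmpdir/trimmed.fq.gz" | awk '
    NR % 4 == 2 { n++; total += length($0) }
    END { if (n > 0) printf "%d reads, mean length %.1f\n", n, total / n }')
echo "$summary"
```

Running the same pipeline on the FASTQ file before and after trimming gives both read counts; the difference is the number of reads that were removed completely, provided the trimming tool discards such reads instead of writing them as empty records.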