The /data/devcom/raw/ directory contains the FASTQ files for use in this practical. These are the raw data of the ChIP-seq practical, all produced by an Illumina sequencer.
You can see from the .gz extension that these files are compressed with gzip. We could gunzip them, but then they would take up much more space. Instead, we can (most of the time) work directly with the gzipped files. One handy utility is zcat, which does the same as cat but works on gzipped files. For instance, to view the first 10 lines of a gzipped FASTQ file:
$ zcat my_fastq_file.fq.gz | head
View the first 20 lines of a FASTQ file. How many sequences do these 20 lines contain?
Can you deduce on which lane this experiment was run?
How many sequences are present in total in this FASTQ file?
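As a hint for the last question: each FASTQ record spans four lines (header, sequence, separator, qualities), so the number of sequences is the line count divided by four. The sketch below shows this on a tiny, made-up two-record file; the file name example.fq.gz and its contents are hypothetical and only there so the commands can be tried anywhere.

```shell
# Create a tiny gzipped FASTQ file with two made-up records (hypothetical data).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > example.fq.gz

# Count the lines of the decompressed stream, then divide by 4 (lines per record).
n_lines=$(zcat example.fq.gz | wc -l)
n_reads=$((n_lines / 4))
echo "$n_reads"
```

For the example file this prints 2. The same two steps should work on a real file from the practical by replacing example.fq.gz with its path.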
The command wc -L prints the length of the longest line in a file (note: -L instead of -l). For these FASTQ files, which come from an Illumina sequencer, you can assume that all sequences are equal in length. This is not always true: other sequencing technologies can produce variable-length reads.
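If you would rather check the equal-length assumption than rely on it, one possible approach is to tabulate the length of every sequence line. This is a minimal sketch, assuming the standard four-line FASTQ layout in which the sequence is the second line of each record; example_len.fq.gz is a hypothetical file with made-up reads.

```shell
# Create a tiny gzipped FASTQ file with two made-up records (hypothetical data).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > example_len.fq.gz

# Print the length of every sequence line (line 2 of each 4-line record),
# then count how often each length occurs.
length_counts=$(zcat example_len.fq.gz \
    | awk 'NR % 4 == 2 { print length($0) }' \
    | sort -n \
    | uniq -c)
echo "$length_counts"
```

Each output line shows a count followed by a read length, so a file with equal-length reads produces a single line. For the example file, that line reports two reads of length 4.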
These were all very basic commands to inspect FASTQ files. The FastQC program allows you to get a much more comprehensive view of the quality of your sequences.
You can run FastQC with the fastqc command. Use the --help option to have a look at all the options.
$ fastqc --help
The basic usage is very simple:
$ fastqc my_fastq_file.fq.gz -o output_dir
The FASTQ file can be gzipped (but doesn't have to be). The output_dir argument is important in our case, as FastQC will otherwise try to write in the same location as the FASTQ file. The output directory must already exist; FastQC will not create it for you.
So, for example:
$ pwd
/home/simon
$ mkdir fastqc_report
$ fastqc /data/devcom/raw/SRR926409.GSM1180940.XBRA20.1.fq.gz -o /home/simon/fastqc_report
This produces an HTML file in the /home/simon/fastqc_report directory with a lot of quality metrics.
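To process several files in one go, the same command can be wrapped in a loop. The following is only a sketch: it assumes that all FASTQ files in /data/devcom/raw/ end in .fq.gz (as the example file above does) and that you want the reports in a fastqc_report directory in your home directory.

```shell
# Output directory for the reports (an assumption; pick any writable location).
outdir="$HOME/fastqc_report"
mkdir -p "$outdir"

# Run FastQC on every gzipped FASTQ file in the raw data directory.
for fq in /data/devcom/raw/*.fq.gz; do
    # Skip the literal pattern if no file matches.
    [ -e "$fq" ] || continue
    fastqc "$fq" -o "$outdir"
done
```

Using mkdir -p means the loop can be rerun without an error when the directory already exists.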
Run FastQC on one (or more) FASTQ file(s) in the /data/devcom/raw/ directory. Transfer the resulting HTML file to your computer using WinSCP and view it in a web browser.
Have a look at the 'Per base sequence quality'. What do you notice with regard to read quality and position in the read?
How do the other quality metrics look? Any specifics that jump out?
There are various tools to trim reads and remove adapter and/or low quality sequences. Examples are (in no particular order):
Try to figure out how to run one of these tools on a FASTQ file of interest.