Shell exercises 1
=================

Login and file transfer
-----------------------

1.  Copy the file
    <http://mbdata.science.ru.nl/courses/ghe/ghe_files_practical.tgz> to
    your home directory on the server.

Files and directories
---------------------

1.  Unpack `ghe_files_practical.tgz`. The `.tgz` file is a *compressed
    tarball*, a file packaged with `tar` and compressed with `gzip`. You
    unpack such a file using the following command:

``` {.bash}
$ tar xzfv ghe_files_practical.tgz
```

The `tar` command is followed by four single-letter options: `x` to
extract, `z` to decompress with `gzip`, `f` to indicate that the name of
the archive file follows, and `v` for verbose output, which lists each
file as it is unpacked.
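
It can be useful to inspect an archive before unpacking it. The sketch
below is only an illustration: it builds a throwaway archive called
`demo.tgz` (a made-up name, not part of the course files) in a temporary
directory, lists its contents with the `t` option and then extracts it.

```bash
# Work in a temporary directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small example archive (demo.tgz is a hypothetical name).
mkdir demo
echo "hello" > demo/a.txt
echo "world" > demo/b.txt
tar czf demo.tgz demo
rm -r demo

# t = list contents, z = gzip, f = the archive file name follows.
listing=$(tar tzf demo.tgz)
echo "$listing"

# x = extract; the files reappear under demo/.
tar xzf demo.tgz
extracted=$(ls demo | wc -l | tr -d ' ')
echo "$extracted files extracted"
```

Listing first shows whether the archive unpacks into its own directory
or directly into the current one.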

How many files do you now have in your home directory?

2.  Let's clean up a bit. Delete `ghe_files_practical.tgz`. We don't
    need it anymore, as it is unpacked.

3.  Create a directory called `experiments` and a `data` directory.

4.  Move all `.txt` files to `experiments` and all `.bam` files to
    `data`. How many files are now in each directory?

There's one file, `exercises.txt`, that doesn't belong in the
`experiments` directory. Let's move it back to your home directory.
We're now going to use this file to record the answers to these
questions.

5.  Move the file `exercises.txt` back to your home directory.

6.  Make a copy of this file, called `answers.txt`.

7.  Edit the file `answers.txt` using nano and include the answers and
    all the commands you used. *Hint: use the \<up\> arrow to see
    previous commands.*

Redirection and pipes
---------------------

Go to the `experiments` directory, located in your home directory, and
answer the following questions. Include the command you used with each
answer.

1.  What are the first three lines of `E-GEUV-2.idf.txt`?

2.  What is the last line of `E-GEUV-2.idf.txt`?

3.  How many samples are present in `E-GEUV-1.sdrf.txt`?

4.  The Geuvadis project sequenced RNA from different populations. Which
    column contains the population information in `E-GEUV-1.sdrf.txt`?
    How many different populations were sequenced? How many samples from
    each population are present in `E-GEUV-1.sdrf.txt`? How many samples
    are present from each population in all `*.sdrf.txt` files? (*Hint:
    use the `cat` command to concatenate multiple files.*)

5.  How long were the reads that were sequenced in this project?

6.  Create a file called `source_names.txt` with all unique 'Source
    Names' from the `*.sdrf.txt` files.
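
The questions in this section all come down to chaining small tools
with pipes and redirection. As a rough sketch on two made-up
tab-separated files (`part1.tsv` and `part2.tsv`, with a hypothetical
`Colour` column; this is not the course data), `cat` concatenates the
files, `cut` selects a column, `sort | uniq -c` counts how often each
value occurs, and `>` writes the result to a file.

```bash
# Work in a temporary directory with toy data.
workdir=$(mktemp -d)
cd "$workdir"

# Two small tables with a header line and tab-separated columns (made-up data).
printf 'Name\tColour\ns1\tred\ns2\tblue\n' > part1.tsv
printf 'Name\tColour\ns3\tred\ns4\tgreen\n' > part2.tsv

# Concatenate both files, drop the header lines, keep column 2,
# then count how often each value occurs.
counts=$(cat part1.tsv part2.tsv | grep -v '^Name' | cut -f 2 | sort | uniq -c)
echo "$counts"

# Redirect the unique values to a new file with '>'.
cat part1.tsv part2.tsv | grep -v '^Name' | cut -f 2 | sort -u > colours.txt
cat colours.txt
```

Note that `uniq` only collapses adjacent identical lines, which is why
the input is sorted first.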

BAM files and the `samtools` commands
-------------------------------------

In the `data` directory are several `.bam` files. These files are in BAM
format, a binary form of the [SAM
format](http://samtools.github.io/hts-specs/SAMv1.pdf). This format is
suitable for storage, as it can be compressed, which saves disk space.
However, it is not easy for a human to work with. That is why we will use
`samtools` to convert a BAM file to the human-readable, tab-separated
[SAM format](http://samtools.github.io/hts-specs/SAMv1.pdf).

You can view a BAM file with `samtools view <file.bam>`, where
`<file.bam>` represents your input file.

1.  Convert the file `HG00096.1.M_111124_6.bam` to a SAM file called
    `HG00096.1.M_111124_6.sam` using `samtools`. How big are these files
    in MB? How many reads are present in the SAM file?

As SAM files can get quite big, we usually work with the BAM files
directly whenever we can, using `samtools` in combination with Unix
pipes. Answer the following questions without creating a SAM file.

2.  How many reads are present in `HG00099.5.M_120131_3.bam` and
    `HG00097.7.M_120219_2.bam`?

Columns 3 and 4 in the SAM format contain the chromosome and position to
which a read is mapped; column 5 contains the mapping quality.
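
To make the column layout concrete, here is a minimal sketch on a
hand-written, three-line text file in SAM layout (`toy.sam`, with made-up
reads; no `samtools` needed). It assumes the file has no header lines, so
every line is an alignment record.

```bash
# Work in a temporary directory with toy data.
workdir=$(mktemp -d)
cd "$workdir"

# Three made-up alignment records with the 11 mandatory SAM fields:
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  > toy.sam
printf 'r2\t0\tchr2\t250\t30\t4M\t*\t0\t0\tACGT\tIIII\n' >> toy.sam
printf 'r3\t16\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> toy.sam

# Column 3 (RNAME): number of reads per chromosome.
per_chrom=$(cut -f 3 toy.sam | sort | uniq -c)
echo "$per_chrom"

# Column 5 (MAPQ): lowest and highest mapping quality.
min_mapq=$(cut -f 5 toy.sam | sort -n | head -n 1)
max_mapq=$(cut -f 5 toy.sam | sort -n | tail -n 1)
echo "MAPQ range: $min_mapq to $max_mapq"
```

With a real BAM file, the same kind of pipeline would read from
`samtools view <file.bam>` instead of from a text file.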

3.  Create a list of chromosomes present in `HG00099.5.M_120131_3.bam`,
    with the number of reads mapped to each. How many chromosomes have
    mapped reads? Which chromosome has the fewest reads? How
    many?

4.  What is the minimum mapping quality of `HG00097.7.M_120219_2.bam`?
    What is the maximum mapping quality? What are the three most common
    mapping qualities and how often do they occur in this file?

BAM has the advantage that it can be indexed. This means that we, or the
computer programs that we use and create, don't have to read the whole
BAM file at once, but that we can retrieve specific reads from an
indexed BAM file. First, the index has to be created with the
`samtools index` command. For instance, the following command will
create the index for `HG00096.1.M_111124_6.bam`:

``` {.bash}
$ samtools index HG00096.1.M_111124_6.bam
```

Now we can query this BAM file.

``` {.bash}
$ samtools view HG00096.1.M_111124_6.bam chr1 > reads_on_chr1.sam
```

This will create a SAM file with all the reads on chromosome 1.

5.  Create an index for every BAM file in the `data` directory. Which
    file is created by the `samtools index` command?

6.  How many reads are mapped to chromosome 6 in
    `HG00099.5.M_120131_3.bam`? What is the first position on chromosome
    6 that has a read mapped to it? How many reads mapped to chromosome
    6 have the highest mapping quality?

The `samtools view` command can also do some filtering with additional
command line options. For instance, the `-q` option can be used to keep
only reads with a mapping quality of at least a given value.

7.  What percentage of the mapped reads in each BAM file has a mapping
    quality greater than 30?

