Question 1

Read the following file into a tibble: https://mbdata.science.ru.nl/share/heeringen/gbd_exam/genome_size.XdPo.csv. This file contains information on the genome size and chromosome number of various species.

  1. How many observations (organisms) does this data set contain?

  2. Create a new tibble (or data.frame) with the column Organism_Name renamed to Name and the column Organism_Group renamed to Group.

  3. How many eukaryotes have a known number of chromosomes (higher than 0)?

  4. Which organism has the largest number of assemblies?

  5. Is there a significant correlation between the number of chromosomes and the genome size?

  6. Create a histogram of genome sizes. Facet by organism group and make sure the x-axis is in log10 scale.
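The renaming, filtering, and plotting steps above can be sketched on a small toy tibble. This is only an illustration under assumptions: the values are made up, and the column names Size_Mb and Chromosomes are hypothetical stand-ins, since the actual column names in the exam file are not given here. Only Organism_Name and Organism_Group come from the question text.

```r
library(tidyverse)

# Toy stand-in for the exam file; Size_Mb and Chromosomes are assumed
# (hypothetical) column names, and all values are invented.
genomes <- tibble(
  Organism_Name  = c("A", "B", "C", "D"),
  Organism_Group = c("Eukaryota", "Eukaryota", "Bacteria", "Bacteria"),
  Size_Mb        = c(3000, 120, 4.6, 2.1),
  Chromosomes    = c(23, 0, 1, 1)
)

# rename() takes new_name = old_name pairs and returns a new tibble.
renamed <- genomes %>%
  rename(Name = Organism_Name, Group = Organism_Group)

# Count eukaryotes with a known chromosome number (higher than 0).
n_euk <- renamed %>%
  filter(Group == "Eukaryota", Chromosomes > 0) %>%
  nrow()

# Histogram with one panel per group and a log10 x-axis.
p <- ggplot(renamed, aes(x = Size_Mb)) +
  geom_histogram(bins = 10) +
  scale_x_log10() +
  facet_wrap(~Group)
```

Printing p draws the plot. With the real data, the same pipeline would start from read_csv() on the URL above instead of the toy tibble.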

Question 2

For this question you are going to work with the curated database of single-cell studies described by Svensson et al. You can use the following two commands to read in this data (you don’t need to save or download the file first). Ignore any warnings.

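library(tidyverse)
sc <- read_tsv('http://www.nxn.se/single-cell-studies/data.tsv')
# Replace whitespace in column names with underscores; the backslash must be doubled in an R string
sc <- sc %>% rename_all(~str_replace_all(., '\\s+', '_'))

  1. How many studies had 5 or more cell types or clusters? Don’t include studies where this information is missing.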

  2. The columns Tissue and Cell_source contain information about the cell type on which the single-cell experiment was performed. What is the most common cell source for the tissue "Culture" (cultured cells)?

  3. Create a scatter plot of the number of cells against the date. Facet by organism, and only use the four organisms with the highest number of experiments (Human, Mouse, Drosophila and Zebrafish). Make sure the y-axis is on a log10 scale.

  4. We have the hypothesis that publications in Science will have a larger number of cells than publications in Cell Reports. Use everything that you have learned to test this hypothesis. What is your conclusion?
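One possible way to approach a directional hypothesis like the one in item 4 is a one-sided Wilcoxon rank-sum test, which does not assume normally distributed counts. The question does not prescribe a test, so this is only one reasonable choice; a t-test on log10-transformed counts would be another. The sketch below uses invented values, and Journal and Cells are hypothetical column names, not necessarily those in the real table.

```r
library(tidyverse)

# Toy data: Journal and Cells are hypothetical column names,
# and the cell counts are made up for illustration.
studies <- tibble(
  Journal = rep(c("Science", "Cell Reports"), each = 5),
  Cells   = c(12000, 50000, 8000, 150000, 30000,
              900, 2500, 400, 6000, 1500)
)

science      <- studies %>% filter(Journal == "Science") %>% pull(Cells)
cell_reports <- studies %>% filter(Journal == "Cell Reports") %>% pull(Cells)

# One-sided test: are cell counts in Science shifted towards higher
# values than those in Cell Reports?
result <- wilcox.test(science, cell_reports, alternative = "greater")
result$p.value
```

A small p-value would support the hypothesis. Plotting the two distributions first, for example with a boxplot on a log10 y-axis, helps to check that the test result matches what the data look like.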