Overview of shared files


I added four tables (as CSV) with raw data to the shared files. Note that these tables still contain the ERCC spike-in genes!

  • RNA raw, straight from kallisto

    2021-12-21_ST-99_raw_RNA_counts.csv

  • RNA with wells assigned (i.e. the position in the 384-well plate)

    2021-12-21_ST-99_raw_RNA_counts_annotated.csv

  • ADT raw, straight from CITE-seq-Count

    2021-12-21_ST-99_raw_ADT_counts.csv

  • ADT with wells assigned (i.e. the position in the 384-well plate)

    2021-12-21_ST-99_raw_ADT_counts_annotated.csv
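
As a minimal sketch of how such a table could be loaded in R and the ERCC spike-ins dropped: the layout (gene names in the first column, one column per cell) and the "ERCC-" name prefix are assumptions, and the toy file below only stands in for the real CSV.

```r
# Toy stand-in for a count table such as 2021-12-21_ST-99_raw_RNA_counts.csv.
# Assumed layout: first column = gene names, remaining columns = cells.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  ",AAACCCGT_1,AAAGGGCT_1,AACCTTGA_1",
  "GAPDH,12,0,7",
  "ERCC-00002,30,25,41",
  "KRT14,3,9,0",
  "ERCC-00130,5,2,8"
), toy_csv)

# check.names = FALSE keeps the cell names exactly as written in the file
counts <- as.matrix(read.csv(toy_csv, row.names = 1, check.names = FALSE))

# Assumption: ERCC spike-ins are named with an "ERCC-" prefix
is_ercc <- grepl("^ERCC-", rownames(counts))
counts_no_ercc <- counts[!is_ercc, , drop = FALSE]

print(dim(counts))
print(rownames(counts_no_ercc))
```

If the real tables use a different naming scheme for the spike-ins, only the pattern passed to grepl() would need to change.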


I also added text files which contain the metadata for the cells:

  • Cells in RNA set

    2021-12-21_ST-99_metadata_cells_RNA.txt

  • Cells in ADT set

    2021-12-21_ST-99_metadata_cells_ADT.txt


In addition, I added a Seurat object (.rds) that already contains all this data with the cells matched (you can also extract the data from there). ERCC genes are removed.

2121-12-21_ST-99_matched_raw_seurat-object.rds

The Seurat object with the filtering that I have shown before. ERCC genes are removed.

2121-12-21_ST-99_matched_filtered_seurat-object.rds

And lastly, the Seurat object with filtering based on raw counts and TMM-normalised ADT counts. This is the filtering I am currently using for my analysis.

2121-12-21_ST-99_matched_filtered-TMM-raw-counts_seurat-object.rds

Seurat.filtered <- subset(
  Seurat.matched,
  subset = nCount_RNA > 300 & nCount_RNA < 7500 &
    nFeature_RNA > 200 & nFeature_RNA < 4000 &
    percent.mito > -Inf & percent.mito < 0.4 &
    nFeature_ADT > 60 & nCount_ADT < 40000
)

TMM filtering: TMM>0.38
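
To make the thresholds concrete without needing Seurat, here is a small base-R sketch that applies the same cut-offs to a made-up metadata table. The cell names, the values, the per-cell "TMM" column and the way the two filters are combined are all assumptions for illustration only.

```r
# Hypothetical per-cell metadata; the column names mirror the Seurat metadata
# used in the subset() call. "TMM" is an assumed column name.
meta <- data.frame(
  row.names    = c("A1_1", "A2_1", "B1_1", "B2_1"),
  nCount_RNA   = c(2500, 150, 5200, 3000),
  nFeature_RNA = c(1200, 90, 2500, 1500),
  percent.mito = c(0.05, 0.10, 0.55, 0.08),
  nFeature_ADT = c(70, 65, 72, 68),
  nCount_ADT   = c(9000, 3000, 12000, 8000),
  TMM          = c(0.90, 0.70, 0.80, 0.20)
)

# Same thresholds as in the subset() call
keep_qc <- with(meta,
  nCount_RNA > 300 & nCount_RNA < 7500 &
  nFeature_RNA > 200 & nFeature_RNA < 4000 &
  percent.mito < 0.4 &
  nFeature_ADT > 60 & nCount_ADT < 40000
)

# Assumption: the TMM cut-off is applied on top of the QC thresholds
keep_tmm <- meta$TMM > 0.38

print(rownames(meta)[keep_qc])
print(rownames(meta)[keep_qc & keep_tmm])
```

In this toy table one cell fails on RNA counts, one on the mitochondrial fraction and one on the TMM cut-off, so each kind of filter is exercised once.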

Below in this HTML document, I describe how I generated the count tables for the RNA and ADT data. The R code for this processing can be found in “211221_loading_files_generating_tables.Rmd”.

Count tables RNA

The count tables were generated using kallisto (pseudo-alignment, run through kb count) with the following command:

kb count -i /scratch/etanis/refs/erccrep_index/GRCh38_transcriptome.idx -g /scratch/etanis/refs/erccrep_index/GRCh38_transcripts_to_genes.txt -w /scratch/etanis/refs/barcode_384_1_column.tab --overwrite --verbose -t 40 -o "kb_output/${output_name}" -x 0,8,16:0,0,8:1,0,0 ${r1} ${r2} >> log.out 2>&1;

Reads were mapped against an index built from GRCh38(.99) and the counts were written to .mtx files. These are loaded into the R environment using an R script that loops over the (plate) folders and appends the cells of each plate to a growing matrix (I can provide this script if necessary).
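
The looping step could look roughly like the sketch below. This is not the actual script: the folder layout, the file names (cells_x_genes.mtx, barcodes.txt, genes.txt) and the cells-by-genes orientation are assumptions, and the toy plates are generated on the fly so that the example runs on its own.

```r
library(Matrix)  # recommended package, bundled with standard R installations

# Build two toy plate folders in a temporary directory (placeholder layout)
root  <- file.path(tempdir(), "kb_output_demo")
genes <- c("GAPDH", "KRT14", "ERCC-00002")
for (plate in c("plate1", "plate2")) {
  dir.create(file.path(root, plate), recursive = TRUE, showWarnings = FALSE)
  m <- Matrix(c(1, 0, 2, 0, 3, 4), nrow = 2, sparse = TRUE)  # 2 cells x 3 genes
  writeMM(m, file.path(root, plate, "cells_x_genes.mtx"))
  writeLines(c("AAACCCGT", "AAAGGGCT"), file.path(root, plate, "barcodes.txt"))
  writeLines(genes, file.path(root, plate, "genes.txt"))
}

# Read one plate and return a genes x cells matrix named cell-BC_plate#
read_plate <- function(dir, plate_id) {
  m <- readMM(file.path(dir, "cells_x_genes.mtx"))
  m <- t(as(m, "CsparseMatrix"))  # transpose to genes x cells
  rownames(m) <- readLines(file.path(dir, "genes.txt"))
  colnames(m) <- paste0(readLines(file.path(dir, "barcodes.txt")), "_", plate_id)
  m
}

# Loop over the plate folders and append each plate to the growing matrix
plate_dirs <- sort(list.dirs(root, recursive = FALSE))
counts <- NULL
for (i in seq_along(plate_dirs)) {
  plate_counts <- read_plate(plate_dirs[i], i)
  counts <- if (is.null(counts)) plate_counts else cbind(counts, plate_counts)
}

print(dim(counts))
print(colnames(counts))
```

Appending with cbind() assumes every plate was counted against the same index, so the gene order is identical across plates.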

The matrix now has gene symbols as row names and cell-BC_plate# as column (cell) names. The cell-BC sequences in the column names are converted to human-readable well assignments.
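
The conversion from cell-BC to well can be sketched as follows. The barcodes are invented, and the mapping itself is an assumption: that the order of the barcodes in the whitelist follows the plate row by row (A1, A2, ..., A24, B1, ...). The actual plate layout may differ.

```r
# Hypothetical whitelist of cell barcodes, in plate order. In practice this
# would be read from the barcode file, e.g. with readLines().
whitelist <- c("AAACCCGT", "AAAGGGCT", "AACCTTGA", "AAGTCCAT")  # toy list

# Assumption: barcode i corresponds to well i when counting row by row
# across a 16 x 24 (384-well) plate.
index_to_well <- function(i, n_cols = 24) {
  row <- LETTERS[(i - 1) %/% n_cols + 1]
  col <- (i - 1) %% n_cols + 1
  paste0(row, col)
}

# Turn "cell-BC_plate#" names into "well_plate#" names
barcode_to_well <- function(cell_names, whitelist) {
  parts   <- strsplit(cell_names, "_", fixed = TRUE)
  barcode <- vapply(parts, `[`, character(1), 1)
  plate   <- vapply(parts, `[`, character(1), 2)
  paste0(index_to_well(match(barcode, whitelist)), "_", plate)
}

print(index_to_well(c(1, 24, 25, 384)))
print(barcode_to_well(c("AAACCCGT_1", "AAGTCCAT_2"), whitelist))
```

With a column-wise plate layout, only index_to_well() would need to change.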

Count tables ADT

The count tables were generated using CITE-seq-Count with the following command:

CITE-seq-Count -R1 ${r1} -R2 ${r2} -t tags.csv -cbf 9 -cbl 16 -umif 1 -umil 8 -cells 372 -wl /ceph/rimlsfnwi/data/moldevbio/mulder/etanis/1_RAID/refs/barcode_384_1_column.tab --max-error 1 --bc_collapsing_dist 1 --umi_collapsing_dist 1 -o ./${output_name}_tag_counts -T 10 >> log.citeseq.strict.txt

In short, this means that I allow at most 1 mismatch in the cell-BC and in the UMI (and, through --max-error 1, in the antibody tag). These are the strictest settings.
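
To illustrate what "at most 1 mismatch" means for a cell-BC, here is a small sketch that corrects an observed barcode to the whitelist when exactly one whitelist entry lies within a Hamming distance of 1. It is a simplification for illustration and not how CITE-seq-Count implements its collapsing; the barcodes are made up.

```r
# Hamming distance between two equal-length sequences
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Return the whitelist barcode within max_dist of the observed one,
# or NA if there is no match or the match is ambiguous.
correct_barcode <- function(observed, whitelist, max_dist = 1) {
  d <- vapply(whitelist, hamming, integer(1), b = observed)
  hits <- whitelist[d <= max_dist]
  if (length(hits) == 1) hits else NA_character_
}

whitelist <- c("AAACCCGT", "AAAGGGCT", "AACCTTGA")
print(correct_barcode("AAACCCGT", whitelist))  # exact match
print(correct_barcode("AAACCCGA", whitelist))  # one mismatch
print(correct_barcode("AAACGGGA", whitelist))  # three mismatches to every entry
```

Reads whose barcode cannot be assigned unambiguously are returned as NA here, which corresponds to discarding them.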

The resulting UMI count matrices are loaded into the R environment using a script that loops over the (plate) folders and appends the cells of each plate to a growing matrix (I can provide this script if necessary).

The matrix contains tag-BCs and antibody names as row names and cell-BC_plate# as column (cell) names. The cell-BC sequences in the column names are converted to human-readable well assignments.
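
Because the RNA and ADT matrices share the same cell naming, cells present in both can be matched by column name. The sketch below shows one way to do this in base R; the gene, antibody and well names are invented, and this is not necessarily how the matched Seurat object was built.

```r
# Toy genes x cells (RNA) and antibodies x cells (ADT) matrices
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("GAPDH", "KRT14"), c("A1_1", "A2_1", "A3_1")))
adt <- matrix(1:6, nrow = 2,
              dimnames = list(c("CD44", "ITGB1"), c("A3_1", "A1_1", "B7_1")))

# Keep only cells found in both data sets, in the same order
shared <- intersect(colnames(rna), colnames(adt))
rna_matched <- rna[, shared, drop = FALSE]
adt_matched <- adt[, shared, drop = FALSE]

print(shared)
print(dim(rna_matched))
print(dim(adt_matched))
```

Indexing both matrices with the same character vector guarantees that column i refers to the same cell in each.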