I added four tables (as CSV) with raw data to the shared files (these tables still contain ERCC genes!):
RNA raw, straight from kallisto
2021-12-21_ST-99_raw_RNA_counts.csv
RNA with wells assigned (as in position in 384 well plate)
2021-12-21_ST-99_raw_RNA_counts_annotated.csv
ADT raw, straight from CITE-seq-Count
2021-12-21_ST-99_raw_ADT_counts.csv
ADT with wells assigned (as in position in 384 well plate)
2021-12-21_ST-99_raw_ADT_counts_annotated.csv
I also added text files that contain the metadata for the cells.
Cells in RNA set
2021-12-21_ST-99_metadata_cells_RNA.txt
Cells in ADT set
2021-12-21_ST-99_metadata_cells_ADT.txt
In addition, I added a Seurat object (.rds) that already contains all of this data with the cells matched between RNA and ADT (you can also extract the data from there). ERCC genes are removed.
2121-12-21_ST-99_matched_raw_seurat-object.rds
The Seurat object with the filtering that I have shown before. ERCC genes are removed.
2121-12-21_ST-99_matched_filtered_seurat-object.rds
And lastly, the Seurat object with filtering based on raw counts and TMM-normalised ADT counts. This is the filtering I am currently using for my analysis.
2121-12-21_ST-99_matched_filtered-TMM-raw-counts_seurat-object.rds
Seurat.filtered <- subset(Seurat.matched, subset = nCount_RNA > 300 & nCount_RNA < 7500 & nFeature_RNA > 200 & nFeature_RNA < 4000 & percent.mito > -Inf & percent.mito < 0.4 & nFeature_ADT > 60 & nCount_ADT < 40000)
TMM filtering: TMM>0.38
Below in this HTML, I describe how I generated the different count tables for the RNA and ADT data. The R code for this processing is in “211221_loading_files_generating_tables.Rmd”.
The RNA count tables were generated using kallisto (pseudoalignment-based counting, via kb count) with the following command:
kb count -i /scratch/etanis/refs/erccrep_index/GRCh38_transcriptome.idx -g /scratch/etanis/refs/erccrep_index/GRCh38_transcripts_to_genes.txt -w /scratch/etanis/refs/barcode_384_1_column.tab --overwrite --verbose -t 40 -o "kb_output/${output_name}" -x 0,8,16:0,0,8:1,0,0 ${r1} ${r2} >> log.out 2>&1;
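For reference, the -x string in this command describes the read layout as barcode:UMI:cDNA, each given as file,start,stop with 0-based, end-exclusive offsets. Here that means the UMI is bases 1-8 and the cell barcode is bases 9-16 of read 1, with the cDNA in read 2. A minimal R sketch of that layout (the helper name and the example read are made up for illustration):

```r
# Split read 1 into UMI and cell barcode, assuming the layout implied by
# -x 0,8,16:0,0,8:1,0,0 (0-based, end-exclusive offsets).
split_read1 <- function(r1_seq) {
  list(
    umi     = substr(r1_seq, 1, 8),   # bases 1-8  (0-based 0..8)
    barcode = substr(r1_seq, 9, 16)   # bases 9-16 (0-based 8..16)
  )
}

parts <- split_read1("AAAACCCCGGGGTTTTACGT")  # made-up read
parts$umi      # "AAAACCCC"
parts$barcode  # "GGGGTTTT"
```

This is the same layout that the CITE-seq-Count flags -umif 1 -umil 8 -cbf 9 -cbl 16 describe with 1-based positions.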
Reads were mapped against an index built from GRCh38(.99), and the counts were written to a .mtx file per plate. These files are loaded into R by a script that loops over the (plate) folders and appends each plate's cells to the growing matrix (I can provide this script if necessary).
The matrix now has gene symbols as row names and cell-BC_plate# as column (cell) names. The cell-BC sequences in the column names are then converted to human-readable well assignments.
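The loading loop could look roughly like the sketch below. This is not the actual script: the function names are hypothetical, and the file names (counts_unfiltered/cells_x_genes.mtx plus the matching barcodes and genes text files) are an assumption about the kb count output layout. Converting gene IDs to gene symbols is left out.

```r
library(Matrix)

# Hypothetical helper: read one plate folder and return a genes x cells
# sparse matrix with "<barcode>_<plate>" column names.
read_plate <- function(plate_dir, plate_id) {
  counts_dir <- file.path(plate_dir, "counts_unfiltered")  # assumed layout
  m <- readMM(file.path(counts_dir, "cells_x_genes.mtx"))  # cells x genes
  barcodes <- readLines(file.path(counts_dir, "cells_x_genes.barcodes.txt"))
  genes <- readLines(file.path(counts_dir, "cells_x_genes.genes.txt"))
  m <- t(as(m, "CsparseMatrix"))                            # genes x cells
  dimnames(m) <- list(genes, paste(barcodes, plate_id, sep = "_"))
  m
}

# Loop over the plate folders and column-bind the per-plate matrices.
combine_plates <- function(plate_dirs) {
  mats <- unname(Map(read_plate, plate_dirs, basename(plate_dirs)))
  # Assumes all plates were counted against the same index, so the gene
  # rows are identical and in the same order.
  same_rows <- vapply(
    mats,
    function(x) identical(rownames(x), rownames(mats[[1]])),
    logical(1)
  )
  stopifnot(all(same_rows))
  do.call(cbind, mats)
}
```

Calling combine_plates() on a character vector of plate folders returns a single sparse matrix in which the folder name is used as the plate suffix.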
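The barcode-to-well conversion can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that the 384 barcodes in the whitelist file are listed in row-wise well order (A1 to A24, then B1 to B24, and so on up to P24), which may not match the real plate layout.

```r
# Build the 384 well labels A1..A24, B1..B24, ..., P24 (row-wise order).
well_labels <- function() {
  paste0(rep(LETTERS[1:16], each = 24), rep(1:24, times = 16))
}

# Rename "<barcode>_<plate>" cell names to "<well>_<plate>".
# `whitelist` is the character vector of the 384 barcodes, ASSUMED to be in
# the same order as the well labels above.
barcodes_to_wells <- function(cell_names, whitelist) {
  stopifnot(length(whitelist) == 384)
  bc <- sub("_.*$", "", cell_names)        # part before the first "_"
  plate <- sub("^[^_]*_", "", cell_names)  # part after the first "_"
  idx <- match(bc, whitelist)
  if (anyNA(idx)) stop("barcode not found in whitelist")
  paste(well_labels()[idx], plate, sep = "_")
}
```

With the combined matrix in a variable such as counts, the renaming would be colnames(counts) <- barcodes_to_wells(colnames(counts), whitelist).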
The ADT count tables were generated using CITE-seq-Count with the following command:
CITE-seq-Count -R1 ${r1} -R2 ${r2} -t tags.csv -cbf 9 -cbl 16 -umif 1 -umil 8 -cells 372 -wl /ceph/rimlsfnwi/data/moldevbio/mulder/etanis/1_RAID/refs/barcode_384_1_column.tab --max-error 1 --bc_collapsing_dist 1 --umi_collapsing_dist 1 -o ./${output_name}_tag_counts -T 10 >> log.citeseq.strict.txt
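In short, this means that I allow at most 1 mismatch in the cell-BC and in the UMI (--bc_collapsing_dist 1, --umi_collapsing_dist 1), and at most 1 error in the antibody tag (--max-error 1). These are the strictest settings.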
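To illustrate what a 1-mismatch tolerance means for a cell barcode, here is a small R sketch that assigns an observed barcode to a whitelist barcode when exactly one whitelist entry lies within one substitution. This is not CITE-seq-Count's own algorithm, only a simplified Hamming-distance illustration; the function names are hypothetical and all barcodes are assumed to have the same length.

```r
# Number of mismatching positions between two equal-length strings.
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Return the whitelist barcode that `bc` maps to, or NA when there is no
# match within `max_dist` mismatches or the match is ambiguous.
correct_barcode <- function(bc, whitelist, max_dist = 1) {
  if (bc %in% whitelist) return(bc)  # exact matches win
  d <- vapply(whitelist, hamming, integer(1), b = bc)
  hits <- whitelist[d <= max_dist]
  if (length(hits) == 1) hits else NA_character_
}

wl <- c("AAAAAAAA", "CCCCCCCC", "GGGGGGGG")  # made-up whitelist
correct_barcode("AAAAAAAT", wl)  # one mismatch: "AAAAAAAA"
correct_barcode("AAAAAATT", wl)  # two mismatches: NA
```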
The resulting UMI count matrices are loaded into R by a script that loops over the (plate) folders and appends each plate's cells to the growing matrix (I can provide this script if necessary).
The matrix contains tag-BCs and antibody names as row names and cell-BC_plate# as column (cell) names. The cell-BC sequences in the column names are then converted to human-readable well assignments.