The latest ANANSE output were generated using the ANANSE development branch, and the wrapper script "anansnake". The wrapper takes care of DE analysis, motif scanning and of course ANANSE itself. Anansnake, including a short setup and example data, can be found on https://github.com/vanheeringen-lab/anansnake Anansnake runs on 4 sets of input: - an ATAC-seq based table with reads under (normalized) peaks - a gimmemotifs database (mapped to Smed factors) - RNA-seq counts and TPM tables - a genomepy-installed assembly The rest of this document explains how we generated these, what choices were made, and finally which choices you could potentially improve upon. # ANANSE input files The ATAC-seq based table with reads under (normalized) peaks were generated with peaks/atac_counts.ipynb. ATAC peaks and deduplicated reads were obtained from the raw data. Using samtools mpileup, all deduplicated reads were used to obtain the summit and read under each peak (included in peaks/raw is the merged.bam file we used for the pileups). Peak regions were normalized to 200bp centered on the summits, and the peak set was filtered for peaks with 4 or more total reads. The effect of peak filtering was visualized on the trackhub (see below for the links) and in the notebook. To fine-tune the peak filtering, you could extend the raw peaks (example in the notebook), and tinker around with the minimum number of reads. Two ortholog tables were generated for Schmidtea to human orthologs. The notebook and output are included in orthologs/. First, we connected Schmidtea genes to orthologs as given in the curated table. Next, we added unmatched Schmidtea genes to orthogroups in the curated table. Third, we linked unmatched Schmidtea genes to orthologs found be Neiro et al. Fourth, we partnered unmatched Schmidtea genes to orthologs found be Neiro et al. using their old symbols. Finally, we used orthofinder to match as many remaining unmatched Schmidtea genes to orthogroups with 4 or fewer members. The difference between the tables is the application of the final step. This directory also contains the assembly analysis notebook, which details my orthofinder run. The important orthofinder output file is stored in orthologs/raw/Orthogroups.tsv. The gimmemotifs databases and notebook are included in m2f/. To adapt the database to Schmidtea genes we modified the JASPAR 2022 database: for each human gene, we connected the motif to any Schmidtea gene matched in the orthologue table. In m2f/m2f_orthofinder you can find the gimmemotifs database extended with the orthofinder orthologs. To fine-tune the motif database, you could try to use the gimme.vertebrates database (example in the notebook), and compare the orthofinder-extended orthologs to the default ortholog table. The RNA-seq tables and notebook are in counts/. Single-cell read counts were aggregated per experiment and UMAP cluster. The resulting speudo-bulk samples were annotated using notes in the color_matching file. This annotation file was saved in counts/samples_per_replicate_cluster.tsv. (This all was a piece of cake with the help of Nathan's file, and the seurat notebook! Thanks!) Read counts were converted to TPMs using longest transcript lengths obtained with genomepy 0.12. To fine-tune the RNA-seq data, you could tinker with with the number of replicates per cluster: more is better for determining DE genes, but which clusters can be mixed requires expert insight. # Other odds and ends The directory Smed/ contains the genomepy-installed Smed assembly. I manually added the taxonomy id to the README.txt, the rest is reproducible with genomepy 0.12. ANANSE plots with human ortholog labels were generated with grn_plots_w_ortholog_labels.ipynb. Finally, a quick-and-dirt network analysis of genes affected by HNF4 knockout can be found in hnf4_knockdown.ipynb The results: in phagocytes, affected genes show a higher binding affinity for HNF4 in nearby enhancer regions. In parenchymal cells, less genes are affected, and these do not seem much enriched for nearby HNF4 binding. However, both cell types indicate a high interaction probability of affected genes with HNF4, suggesting indirect regulation of the affected genes found in parenchymal cells. # ANANSE input and output This directory contains two analyses: one in directory ananse/, the other in ananse_orthofinder. I ran the same analyses twice, only once extended with the orthofinder orthologs. Orthofinder adds ~8.000 orthologs, on top of the ~600 from the curated list. As a result, more Schmidtea transcription factors could be matched to motifs, and more densly connected networks could be generated. To me, that would suggest that any genes of interest found in the output are more likely to be real, and less likely to be an artifact of small TF sample size. To generate the output we used these files: - atac_counts: peaks/atac_counts_min4.tsv - rna_counts: counts/counts_per_replicate_cluster.tsv - rna_tpms: counts/tpms_per_replicate_cluster.tsv - motif2factors: - for the directory "ananse/": m2f/Smed.JASPAR2022_vertebrates.motif2factors.txt (and the .pfm) - for the directory "ananse_orthoginder/": m2f/m2f_orthofinder/Smed.JASPAR2022_vertebrates.motif2factors.txt (and the .pfm) The directories ananse/ and ananse_orthoginder/ each have a copy of these files (for simplicity). These input file are connected in the config.yaml. To connect RNA- and ATAC-seq pseudo-bulk samples to experimental conditions, we use the samples.tsv files. Note that column names in the files can be important: briefly, "samples" and "assembly" are required, and I use column "celltype" here to indicate ANANSE influence comparissons (so the name is unimportant, it just has to be consistent in the samples.tsv and config.yaml files). Also note that the gimme database is located in ananse/gimme/, as anansnake would otherwise attempt to create its own database (and fail). After each anansnake analysis, I ran the grn_plots_w_ortholog_labels.ipynb notebook to add the labelled plots. # Trackhub Copy this url: https://mbdata.science.ru.nl/siebrenf/jordi/trackhub/trackhub.hub.txt in the box on this website: https://genome-euro.ucsc.edu/cgi-bin/hgHubConnect?redirect=manual&source=genome.ucsc.edu#unlistedHubs If anything goes wrong trying to run anansnake/ANANSE, or if anything is unclear, be sure to message me on Slack!