Project Selection Guide

Overview

For Homework 1, each student will create an R package derived from a raw analysis script. The package must provide the same core analysis and outputs as the raw script, but in a more modular, reusable form.

Students will offer up to three interfaces to the same core functionality from the package:

R API — exported functions with roxygen2 documentation
CLI — command-line interface via Rapp
OPTIONAL - Shiny app — interactive analysis/visualization

All projects share the same rubric (25 points) and structural requirements. What differs is the core analysis and the dataset. Choose one of the 12 projects below, or propose your own (see Custom Projects at the end).

Each project uses a different dataset chosen to be at least somewhat appropriate for the analysis.

The analysis is secondary here, though you might be interested in reading more about them. Some of the single cell projects may take a few minutes to run, as they may contain thousands of cells.

Depending on the project, your primary data structure will be either a SummarizedExperiment (bulk experiments) or a SingleCellExperiment (single-cell experiments).

Generally, your package will at minimum contain:

A valid R package, installable from Github, with properly documented functions, example data, and tests
A CLI that runs the analysis end-to-end with adjustable parameters and writes results to disk
A documentation website with a brief vignette describing how to use the package

See the grading rubric for exact requirements.

The course contains material related to packaging a Shiny application with your package, but this is optional for this assignment.

How to Read Each Project Description

Section	What it tells you
Rationale	Why this analysis matters in genomics
Dataset	The Bioconductor dataset you will use, with a `data-raw/` script
Data structure	`SummarizedExperiment` or `SingleCellExperiment`
What users should be able to do	The capabilities your package must provide
Key parameters	User-facing arguments adjustable in all three interfaces
Potential CLI outputs	The TSV files your CLI subcommand must produce
Potential Shiny inputs	What the user should be able to change interactively (if you choose to implement a Shiny app)

If you see function calls you aren’t familiar with in the raw code blocks, looking them up in R is your friend, e.g. ?prcomcp.

Project 0: PCA Explorer (Reference Implementation)

This is the course example. You may NOT choose this project — it is provided as a reference for structure and style. The lectures will demonstrate the process of this project being converted into a full R package with a Shiny app and CLI, documentation, and tests.

Rationale: PCA is a very common first step in exploratory analysis of high-dimensional data. It reduces thousands of features to a few principal components that capture the dominant sources of variation for the dataset.

Dataset: Simulated RNA-seq

The example_se dataset contains a simulated RNA-seq experiment with 8 samples, 4 treated and 4 untreated, with 10,000 genes. The first 200 genes have a strong treatment effect, while the rest are noise. This structure should be clearly visible in the PCA plot, with treated and untreated samples separating along PC1.

Data preparation code

set.seed(42)

# 10000 genes, 8 samples
n_genes <- 10000
n_samples <- 8

# Simulate counts (negative binomial-ish)
counts <- matrix(
  rpois(n_genes * n_samples, lambda = 100),
  nrow = n_genes,
  ncol = n_samples
)
rownames(counts) <- paste0("gene", seq_len(n_genes))
colnames(counts) <- paste0("sample", seq_len(n_samples))

# Add some structure: first 500 genes differ by treatment
treatment <- rep(c("control", "treated"), each = 4)
counts[1:500, treatment == "treated"] <- counts[1:500, treatment == "treated"] * 2

# Sample metadata
sample_data <- data.frame(
  sample_id = colnames(counts),
  treatment = treatment,
  batch = rep(c("A", "B"), times = 4),
  row.names = colnames(counts)
)

# Gene metadata
gene_data <- data.frame(
  gene_id = rownames(counts),
  gene_symbol = paste0("SYM", seq_len(n_genes)),
  row.names = rownames(counts)
)

# Create SummarizedExperiment
example_se <- SummarizedExperiment(
  assays = list(counts = counts),
  colData = sample_data,
  rowData = gene_data
)

usethis::use_data(example_se, overwrite = TRUE)

Data structure: SummarizedExperiment

What users should be able to do:

Select the X most variable genes from the dataset
Run PCA on filtered, optionally transformed expression data
Summarize variance explained per principal component
Visualize samples in PC space, colored and shaped by metadata columns
Export PCA scores and variance summaries to TSV files

Key parameters: n_top, log_transform, scale, color_by, shape_by

Potential CLI outputs: pca_scores.tsv, pca_variance.tsv

PotentialShiny inputs: Number of top genes, log-transform toggle, scale toggle, PC axes, color/shape by metadata column, point size

Raw analysis code

library(SummarizedExperiment)
library(airway)
library(ggplot2)

set.seed(42)

# 10000 genes, 8 samples
n_genes <- 10000
n_samples <- 8

# Simulate counts (negative binomial-ish)
counts <- matrix(
  rpois(n_genes * n_samples, lambda = 100),
  nrow = n_genes,
  ncol = n_samples
)
rownames(counts) <- paste0("gene", seq_len(n_genes))
colnames(counts) <- paste0("sample", seq_len(n_samples))

# Add some structure: first 500 genes differ by treatment
treatment <- rep(c("control", "treated"), each = 4)
counts[1:500, treatment == "treated"] <- counts[1:500, treatment == "treated"] * 2

# Sample metadata
sample_data <- data.frame(
  sample_id = colnames(counts),
  treatment = treatment,
  batch = rep(c("A", "B"), times = 4),
  row.names = colnames(counts)
)

# Gene metadata
gene_data <- data.frame(
  gene_id = rownames(counts),
  gene_symbol = paste0("SYM", seq_len(n_genes)),
  row.names = rownames(counts)
)

# Create SummarizedExperiment
example_se <- SummarizedExperiment(
  assays = list(counts = counts),
  colData = sample_data,
  rowData = gene_data
)

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Feature selection: top 200 most variable genes ---
mat <- assay(example_se, "counts")
vars <- apply(mat, 1, stats::var)
top_idx <- order(vars, decreasing = TRUE)[seq_len(200)]
se_top <- example_se[top_idx, ]

# --- PCA ---
mat <- assay(se_top, "counts")
mat <- log2(mat + 1)                 # log-transform with pseudocount
mat_t <- t(mat)                      # prcomp expects samples as rows
pca_result <- prcomp(mat_t, scale. = TRUE, center = TRUE)

# Build scores data.frame merged with sample metadata
scores <- as.data.frame(pca_result$x)
scores$sample_id <- rownames(scores)
col_data <- as.data.frame(colData(example_se))
col_data$sample_id <- rownames(col_data)
scores <- merge(scores, col_data, by = "sample_id")
scores <- scores[order(scores$sample_id), ]
rownames(scores) <- NULL

# --- Variance explained ---
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2) * 100
var_df <- data.frame(
    PC = paste0("PC", seq_along(var_explained)),
    variance_percent = var_explained
)

# --- PCA scatter plot ---
var_x <- round(var_df$variance_percent[1], 1)
var_y <- round(var_df$variance_percent[2], 1)

p_pca <- ggplot(scores, aes(x = .data[["PC1"]], y = .data[["PC2"]])) +
    geom_point(aes(color = .data[["treatment"]]), size = 4) +
    theme_bw(base_size = 14) +
    labs(x = paste0("PC1 (", var_x, "% variance)"),
         y = paste0("PC2 (", var_y, "% variance)"),
         title = "PCA Plot")
print(p_pca)

# --- Variance explained bar chart ---
var_top <- var_df[1:8, ]
var_top$PC <- factor(var_top$PC, levels = var_top$PC)

p_var <- ggplot(var_top, aes(x = .data$PC, y = .data$variance_percent)) +
    geom_col(fill = "steelblue") +
    geom_text(aes(label = sprintf("%.1f%%", .data$variance_percent)),
              vjust = -0.5, size = 4) +
    theme_bw(base_size = 14) +
    labs(x = "Principal Component", y = "Variance Explained (%)") +
    ylim(0, max(var_top$variance_percent) * 1.15)
print(p_var)

# --- Export TSVs ---
output_dir <- file.path(tempdir(), "pca_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(scores, file.path(output_dir, "pca_scores.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(var_df, file.path(output_dir, "pca_variance.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

cat("Wrote results to:", output_dir, "\n")
list.files(output_dir)

Project 1: UMAP Embedding Explorer

Rationale: UMAP is a nonlinear dimensionality reduction method that reveals biological structure, such as cell types, that PCA may struggle to cleanly reveal. It is a standard visualization in single-cell RNA-seq analysis, and tuning its parameters (n_neighbors, min_dist) meaningfully changes the result.

For large datasets, PCA is often used as a pre-reduction step before UMAP to speed up computation and reduce noise. Evaluating the quality of a UMAP embedding is an open question. One common approach is to compute k-nearest-neighbor preservation, which quantifies how well the local neighborhood structure of the original high-dimensional data is preserved in the UMAP embedding.

Examining k-NN preservation per cell type (rather than only as a global mean) adds biological interpretability: cell types with low preservation scores may have diffuse or intermixed structure in the original PCA space, suggesting that subclustering or finer resolution analysis could be warranted for those populations.

Dataset: Zeisel Brain (Mouse scRNA-seq)

The Zeisel et al. (2015) dataset profiles 3,005 cells from the mouse somatosensory cortex and hippocampus, classified into 7 major cell types (interneurons, pyramidal neurons, oligodendrocytes, microglia, astrocytes, endothelial, and mural cells) and 47 subtypes. The clear cell-type structure makes UMAP parameters visually interpretable.

Data preparation code

## data-raw/example_sce.R
BiocManager::install("scRNAseq")
library(scRNAseq)
library(SingleCellExperiment)

sce <- ZeiselBrainData()
example_sce <- sce

# Keep only counts + key metadata
assays(example_sce) <- list(counts = counts(example_sce))
colData(example_sce) <- colData(example_sce)[, c("level1class", "level2class")]
colnames(colData(example_sce)) <- c("cell_type", "cell_subtype")

usethis::use_data(example_sce, overwrite = TRUE)

Data structure: SingleCellExperiment

What users should be able to do:

Select the most variable genes for downstream analysis
Run PCA as a pre-reduction step before UMAP (standard practice in scRNA-seq workflows — reduces noise and speeds up computation)
Run UMAP on the PCA-reduced data with configurable parameters (neighbors, min_dist, metric, number of PCs to use as input)
Compute an embedding quality metric: k-nearest-neighbor preservation (what fraction of each cell’s k nearest neighbors in the high-dimensional PCA space are preserved in UMAP space)
Break down k-NN preservation by cell type to identify populations whose local neighborhood structure is poorly captured by the embedding, which may indicate candidates for subclustering
Summarize the embedding (parameter values, spread statistics, global and per-cell-type quality scores)
Visualize cells in UMAP space, colored/shaped by metadata columns
Visualize per-cell-type k-NN preservation as a violin/box plot to flag populations with low embedding fidelity
Export UMAP coordinates, parameter summaries, and quality metrics to TSV files

Key parameters: n_top, n_pcs (number of PCs to use as UMAP input, default 30), n_neighbors, min_dist, metric (euclidean/cosine), color_by, shape_by

Potential CLI outputs: umap_embeddings.tsv (cell × UMAP1/UMAP2 + metadata + knn_preserved per-cell score), umap_params.tsv (parameter values + spread stats + global and per-cell-type quality scores)

Potential Shiny inputs: n_top slider, n_pcs slider (10–50), n_neighbors slider (5–50), min_dist slider (0.01–1.0), metric dropdown, color/shape by metadata, per-cell-type k-NN preservation violin plot

Dependencies: SingleCellExperiment, scuttle, uwot, scRNAseq (data only)

Raw analysis code

library(SingleCellExperiment)
library(scRNAseq)
library(scuttle)
library(uwot)
library(ggplot2)

set.seed(42) # UMAP has randomness so we set a seed for reproducibility

# --- Load data ---
sce <- ZeiselBrainData()
counts_mat <- counts(sce)
cell_type <- colData(sce)$level1class

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Log-normalize using scuttle ---
# Normalize on the full SCE so size factors reflect the complete library
sce <- logNormCounts(sce)

# --- Feature selection: top 2000 most variable genes ---
gene_vars <- apply(counts_mat, 1, var)
top_idx <- order(gene_vars, decreasing = TRUE)[seq_len(2000)]
mat_norm <- as.matrix(logcounts(sce)[top_idx, ])

# --- PCA on the top 2000 most variable genes using the transformed counts ---
n_pcs <- 30
pca_result <- prcomp(t(mat_norm), scale. = TRUE, center = TRUE)
pca_mat <- pca_result$x[, seq_len(n_pcs)]

# --- Run UMAP on PCA-reduced data ---
n_neighbors <- 15
min_dist    <- 0.1
metric      <- "euclidean"
umap_coords <- umap(pca_mat,
                     n_neighbors = n_neighbors,
                     min_dist    = min_dist,
                     metric      = metric,
                     n_components = 2)

# --- k-NN preservation: quality metric ---
# For each cell, check how many of its k nearest neighbors in PCA space
# are also neighbors in UMAP space
k_eval <- 15 # Note this should be <= n_neighbors used for UMAP - setting them the same makes sense
pca_dists  <- as.matrix(dist(pca_mat))
umap_dists <- as.matrix(dist(umap_coords))
n_cells_eval <- nrow(pca_mat)
preserved <- numeric(n_cells_eval)
for (i in seq_len(n_cells_eval)) {
    pca_nn  <- order(pca_dists[i, ])[2:(k_eval + 1)]
    umap_nn <- order(umap_dists[i, ])[2:(k_eval + 1)]
    preserved[i] <- length(intersect(pca_nn, umap_nn)) / k_eval
}
knn_preservation <- mean(preserved)
cat("k-NN preservation:", round(knn_preservation, 3), "\n")

# --- Per-cell-type k-NN preservation ---
knn_df <- data.frame(
    cell_id        = colnames(sce),
    cell_type      = cell_type,
    knn_preserved  = preserved
)

# Global mean per cell type — flag types with mean below overall mean
knn_by_type <- tapply(knn_df$knn_preserved, knn_df$cell_type, mean)
cat("Per-cell-type mean k-NN preservation:\n")
print(round(sort(knn_by_type), 3))

# Violin + box plot: per-cell-type distribution, sorted by median
cell_type_order <- names(sort(tapply(knn_df$knn_preserved, knn_df$cell_type, median)))
knn_df$cell_type <- factor(knn_df$cell_type, levels = cell_type_order)

p_knn <- ggplot(knn_df, aes(x = cell_type, y = knn_preserved, fill = cell_type)) +
    geom_violin(trim = FALSE, alpha = 0.6) +
    geom_boxplot(width = 0.15, outlier.size = 0.8, alpha = 0.9) +
    geom_hline(yintercept = knn_preservation, lty = 2, color = "black") +
    theme_bw(base_size = 14) +
    theme(axis.text.x = element_text(angle = 30, hjust = 1),
          legend.position = "none") +
    labs(x = "Cell Type",
         y = paste0("k-NN Preservation (k = ", k_eval, ")"),
         title = paste0("Per-Cell-Type k-NN Preservation (global mean = ",
                        round(knn_preservation, 3), ")"),
         caption = "Dashed line = global mean. Low-scoring types may benefit from subclustering.")
print(p_knn)

# --- Build embeddings data.frame ---
embeddings <- data.frame(
    cell_id       = colnames(sce),
    UMAP1         = umap_coords[, 1],
    UMAP2         = umap_coords[, 2],
    cell_type     = cell_type,
    knn_preserved = preserved
)

# --- Parameter summary ---
params <- data.frame(
    parameter = c("n_neighbors", "min_dist", "metric", "n_pcs",
                  "n_genes", "n_cells",
                  "UMAP1_range", "UMAP2_range",
                  "UMAP1_sd", "UMAP2_sd",
                  "knn_preservation"),
    value = c(n_neighbors, min_dist, metric, n_pcs,
              nrow(mat), ncol(mat),
              round(diff(range(umap_coords[, 1])), 2),
              round(diff(range(umap_coords[, 2])), 2),
              round(sd(umap_coords[, 1]), 2),
              round(sd(umap_coords[, 2]), 2),
              round(knn_preservation, 4))
)

# Append per-cell-type k-NN preservation to params
knn_type_rows <- data.frame(
    parameter = paste0("knn_pres_", names(knn_by_type)),
    value     = as.character(round(knn_by_type, 4))
)
params <- rbind(params, knn_type_rows)

# --- UMAP scatter plot ---
p <- ggplot(embeddings, aes(x = UMAP1, y = UMAP2, color = cell_type)) +
    geom_point(size = 0.8, alpha = 0.7) +
    theme_bw(base_size = 14) +
    labs(title = paste0("UMAP — Zeisel Brain (k-NN pres: ",
                        round(knn_preservation, 2), ")"),
         color = "Cell type")
print(p)

# --- Export ---
output_dir <- file.path(tempdir(), "umap_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(embeddings, file.path(output_dir, "umap_embeddings.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(params, file.path(output_dir, "umap_params.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 2: Sample Similarity & Clustering

Rationale: Before formal analysis, you often must check whether samples group by the expected biological variable (treatment, genotype) rather than technical artifacts (batch, patient). A sample-by-sample correlation heatmap with hierarchical clustering is a common diagnostic and can also be used to display sample groupings with additional metadata annotations. It can also be useful for identifying outlier samples that should perhaps be investigated more closely.

Dataset: Macrophage Stimulation (Human bulk RNA-seq)

The macrophage package (Alasoo et al., 2018) contains RNA-seq data from human monocyte-derived macrophages: 6 donors × 4 stimulation conditions (naive, IFNγ, Salmonella, and IFNγ + Salmonella) for 24 samples total. The dual grouping by stimulation and donor creates an interesting similarity structure where immune activation effects and donor-specific variation compete — stimulated conditions should cluster together, with donor effects visible within each treatment group.

Data preparation code

## data-raw/example_se.R
BiocManager::install(c("macrophage", "tximeta"))
library(macrophage)
library(tximeta)
library(SummarizedExperiment)

# Import salmon quantifications via tximeta
dir <- system.file("extdata", package = "macrophage")
coldata <- read.csv(file.path(dir, "coldata.csv"))
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
coldata$names <- coldata$sample_id

se <- tximeta(coldata)
gse <- summarizeToGene(se)

# Simplify colData
colData(gse) <- DataFrame(
  condition = factor(gse$condition_name),
  donor = factor(gse$line_id)
)

usethis::use_data(gse, overwrite = TRUE)

Data structure: SummarizedExperiment

What users should be able to do:

Select the most variable genes
Compute pairwise sample similarity (correlation or distance matrix)
Perform hierarchical clustering, cut at a chosen k, and compute silhouette widths to assess cluster quality
Compute cophenetic correlation to assess how well the dendrogram preserves the original dissimilarity structure
Assess cluster stability via bootstrap resampling: resample genes N times, recluster, and compute Jaccard overlap of cluster memberships
Visualize an annotated heatmap with dendrogram sidebars and metadata color bars
Export the similarity matrix, cluster assignments, and stability results to TSV files

Key parameters: n_top, method (pearson/spearman), linkage (ward.D2/complete/average), k, n_bootstrap (number of bootstrap iterations for stability)

Potential CLI outputs: similarity_matrix.tsv, cluster_assignments.tsv, cluster_stability.tsv

Potential Shiny inputs: Correlation method, linkage method, k slider, n_top slider, n_bootstrap slider, metadata columns for annotation bars, stability bar chart

Dependencies: SummarizedExperiment, macrophage + tximeta (data only)

Raw analysis code

library(SummarizedExperiment)
library(macrophage)
library(tximeta)
library(ComplexHeatmap)
library(cluster)
library(ggplot2)

# --- Load data via tximeta ---
dir <- system.file("extdata", package = "macrophage")
coldata <- read.csv(file.path(dir, "coldata.csv"))
coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf.gz")
coldata$names <- coldata$sample_id
se <- tximeta(coldata)
gse <- summarizeToGene(se)

# Simplify metadata
condition <- factor(gse$condition_name)
donor     <- factor(gse$line_id)

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Feature selection: top 2000 most variable genes ---
mat <- assay(gse, "counts")
gene_vars <- apply(mat, 1, var)
top_idx <- order(gene_vars, decreasing = TRUE)[seq_len(2000)]
mat_top <- mat[top_idx, ]

# Log-transform
mat_log <- log2(mat_top + 1)

# --- Pairwise sample correlation ---
cor_mat <- cor(mat_log, method = "pearson")

# --- Hierarchical clustering ---
hc <- hclust(as.dist(1 - cor_mat), method = "ward.D2")
k <- 4
clusters <- cutree(hc, k = k)

# --- Silhouette widths ---
sil <- silhouette(clusters, dist = as.dist(1 - cor_mat))
avg_sil <- mean(sil[, "sil_width"])
cat("Average silhouette width:", round(avg_sil, 3), "\n")

# --- Cophenetic correlation ---
coph <- cor(as.dist(1 - cor_mat), cophenetic(hc))
cat("Cophenetic correlation:", round(coph, 3), "\n")

# --- Bootstrap cluster stability ---
n_bootstrap <- 50
set.seed(42)
boot_jaccard <- matrix(NA, nrow = n_bootstrap, ncol = k)
for (b in seq_len(n_bootstrap)) {
    boot_genes <- sample(nrow(mat_top), replace = TRUE)
    boot_log <- log2(mat_top[boot_genes, ] + 1)
    boot_cor <- cor(boot_log, method = "pearson")
    boot_hc <- hclust(as.dist(1 - boot_cor), method = "ward.D2")
    boot_cl <- cutree(boot_hc, k = k)
    for (ci in seq_len(k)) {
        orig_members <- names(clusters[clusters == ci])
        boot_members <- names(boot_cl[boot_cl == ci])
        jaccard <- length(intersect(orig_members, boot_members)) /
                   length(union(orig_members, boot_members))
        boot_jaccard[b, ci] <- jaccard
    }
}
stability <- data.frame(
    cluster = seq_len(k),
    mean_jaccard = colMeans(boot_jaccard),
    sd_jaccard   = apply(boot_jaccard, 2, sd)
)

# --- Annotated heatmap ---
col_anno <- HeatmapAnnotation(
    condition = condition,
    donor     = donor,
    cluster   = factor(clusters)
)
ht <- Heatmap(cor_mat, name = "Pearson r",
              top_annotation = col_anno,
              show_row_names = TRUE, show_column_names = TRUE,
              column_title = "Sample Similarity")
draw(ht)

# --- Export ---
output_dir <- file.path(tempdir(), "similarity_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(as.data.frame(cor_mat),
            file.path(output_dir, "similarity_matrix.tsv"),
            sep = "\t", quote = FALSE)

assignments <- data.frame(
    sample    = colnames(gse),
    condition = as.character(condition),
    donor     = as.character(donor),
    cluster   = clusters,
    silhouette_width = sil[, "sil_width"]
)
write.table(assignments, file.path(output_dir, "cluster_assignments.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

write.table(stability, file.path(output_dir, "cluster_stability.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 3: Differential Expression with DESeq2

Rationale: One of the most common ’omics analyses is differential expression analysis. DESeq2 is a common package for differential analysis of count data (RNA-seq or otherwise) — it models the mean-variance relationship with a negative binomial distribution, shares information across features to stabilize fold-change estimates, and provides log2 fold-change shrinkage for reliable ranking. Volcano and MA plots of differential results are bog-standard visualizations for count-based data.

Dataset: Tissue-Resident Regulatory T Cells (Mouse bulk RNA-seq)

The tissueTreg package provides htseq count data from Delacher et al. (2017), profiling regulatory T cells (Tregs) isolated from four mouse tissues — fat, liver, lymph node, and skin — alongside conventional T cells (Tconv) from lymph nodes. This 5-group design lets students choose their own comparison (e.g., lymph-node Treg vs. Tconv, or fat vs. liver Tregs). The dataset is accessible directly via ExperimentHub (ID EH1075) and contains 21,935 genes × 15 samples with raw integer counts ready for DESeq2.

Data preparation code

## data-raw/example_se.R
BiocManager::install("ExperimentHub")
library(ExperimentHub)
library(SummarizedExperiment)

eh <- ExperimentHub()
se <- eh[["EH1075"]]  # Tissue Tregs — htseq counts, Mus musculus

# Parse tissue and cell type from the combined tissue_cell column
cd <- colData(se)
tissue_cell <- as.character(cd[["tissue_cell"]])
cell_type <- ifelse(grepl("Tcon", tissue_cell), "Tconv", "Treg")
tissue    <- sub("-Tre?[a-z]*", "", tissue_cell)  # strip cell-type suffix
tissue    <- sub("-Tcon", "", tissue)

example_se <- SummarizedExperiment(
  assays  = list(counts = assay(se, "counts")),
  colData = DataFrame(
    sample_id   = rownames(cd),
    cell_type   = factor(cell_type),
    tissue      = factor(tissue)
  )
)

usethis::use_data(example_se, overwrite = TRUE)

Data structure: SummarizedExperiment

What users should be able to do:

Filter out genes with low expression across samples (e.g., minimum total count threshold)
Run DESeq2 differential expression analysis between two groups, producing log2 fold changes, p-values, and BH-adjusted p-values
Apply log2 fold-change shrinkage for more reliable effect-size estimates
Summarize the number of significantly up/down/non-significant genes at given thresholds
Compute independent filtering diagnostics: evaluate how the pre-filtering threshold (minimum count per group) affects the number of discoveries, enabling users to choose an informed threshold
Visualize results as a volcano plot (log2FC vs. −log10 p) or MA plot (mean expression vs. log2FC), with significant genes highlighted
Export differential expression results, summary counts, and filtering diagnostics to TSV files

Key parameters: group_column, ref_level, fc_threshold, p_threshold, shrinkage (apeglm/ashr/normal/none), plot_type (volcano/ma), min_count_per_group (minimum mean count per group for gene filtering)

Potential CLI outputs: de_results.tsv, de_summary.tsv, filtering_diagnostics.tsv

Potential Shiny inputs: Grouping column, reference level, FC threshold slider, p-value threshold slider, shrinkage method dropdown, volcano ↔︎ MA radio button, min count per group slider, filtering diagnostics plot

Dependencies: SummarizedExperiment, DESeq2, ExperimentHub (data retrieval)

Raw analysis code

library(SummarizedExperiment)
library(ExperimentHub)
library(DESeq2)
library(apeglm)
library(ggplot2)

# --- Load data ---
eh <- ExperimentHub()
se <- eh[["EH1075"]]  # Tissue Tregs — 21,935 genes x 15 samples

# Parse tissue and cell type from tissue_cell column
cd <- colData(se)
tissue_cell <- as.character(cd[["tissue_cell"]])
cell_type <- ifelse(grepl("Tcon", tissue_cell), "Tconv", "Treg")
tissue    <- sub("-Tre?[a-z]*", "", tissue_cell)
tissue    <- sub("-Tcon", "", tissue)

# Build clean SE
se_clean <- SummarizedExperiment(
    assays  = list(counts = assay(se, "counts")),
    colData = DataFrame(cell_type = factor(cell_type),
                        tissue    = factor(tissue))
)

# --- Filter: keep lymph-node samples only, for Treg vs Tconv ---
keep <- se_clean$tissue == "Lymph-N"
se_ln <- se_clean[, keep]
se_ln$cell_type <- droplevels(se_ln$cell_type)

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# Filter low-count genes
keep_genes <- rowSums(assay(se_ln, "counts")) >= 10
se_ln <- se_ln[keep_genes, ]
cat("Genes after filtering:", nrow(se_ln), "\n")

# --- DESeq2 ---
dds <- DESeqDataSet(se_ln, design = ~ cell_type)
dds$cell_type <- relevel(dds$cell_type, ref = "Tconv")
dds <- DESeq(dds)

# Log2 fold-change shrinkage
coef_name <- resultsNames(dds)[2]  # "cell_type_Treg_vs_Tconv"
res_shrunk <- lfcShrink(dds, coef = coef_name, type = "apeglm")
res_df <- as.data.frame(res_shrunk)
res_df$gene <- rownames(res_df)

# --- Summary ---
p_threshold  <- 0.05
fc_threshold <- 0.5
res_df$direction <- "ns"
res_df$direction[res_df$padj < p_threshold & res_df$log2FoldChange >  fc_threshold] <- "up"
res_df$direction[res_df$padj < p_threshold & res_df$log2FoldChange < -fc_threshold] <- "down"

summary_df <- as.data.frame(table(res_df$direction))
colnames(summary_df) <- c("direction", "count")
print(summary_df)

# --- Volcano plot ---
res_df$neg_log10p <- -log10(res_df$pvalue)
p <- ggplot(res_df, aes(x = log2FoldChange, y = neg_log10p, color = direction)) +
    geom_point(size = 0.8, alpha = 0.6) +
    scale_color_manual(values = c(down = "blue", ns = "grey70", up = "red")) +
    theme_bw(base_size = 14) +
    labs(x = "log2 Fold Change (Treg vs Tconv)",
         y = expression(-log[10](p)),
         title = "Volcano Plot — Lymph Node Treg vs Tconv") +
    geom_vline(xintercept = c(-fc_threshold, fc_threshold), lty = 2) +
    geom_hline(yintercept = -log10(p_threshold), lty = 2)
print(p)

# --- Independent filtering diagnostics ---
# Evaluate how the minimum count threshold affects number of discoveries
count_thresholds <- c(0, 1, 5, 10, 20, 50, 100, 200, 500)
filtering_diag <- data.frame(threshold = count_thresholds, n_tested = NA_integer_,
                              n_significant = NA_integer_)
for (j in seq_along(count_thresholds)) {
    thresh <- count_thresholds[j]
    keep_t <- rowSums(assay(se_ln, "counts")) >= thresh
    if (sum(keep_t) < 2) {
        filtering_diag$n_tested[j] <- sum(keep_t)
        filtering_diag$n_significant[j] <- 0
        next
    }
    dds_t <- DESeqDataSet(se_ln[keep_t, ], design = ~ cell_type)
    dds_t$cell_type <- relevel(dds_t$cell_type, ref = "Tconv")
    dds_t <- DESeq(dds_t, quiet = TRUE)
    res_t <- results(dds_t, alpha = p_threshold)
    filtering_diag$n_tested[j] <- sum(keep_t)
    filtering_diag$n_significant[j] <- sum(res_t$padj < p_threshold, na.rm = TRUE)
}

# --- Export ---
output_dir <- file.path(tempdir(), "de_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(res_df, file.path(output_dir, "de_results.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(summary_df, file.path(output_dir, "de_summary.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(filtering_diag, file.path(output_dir, "filtering_diagnostics.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 4: K-means Cell Clustering

Rationale: K-means clustering partitions samples into k groups by minimizing within-cluster variance. Choosing k when you don’t know the real value is a real practical decision — the elbow plot of within-cluster sum of squares is one diagnostic. K-means clustering is also a simple way to split a dataset into a specific number of groups, which can be useful when qualitative groups are obvious (e.g. via PCA). Evaluating cluster quality is an open question, but silhouette widths and gap statistics are common metrics. When reference labels are available, the Adjusted Rand Index (ARI) quantifies how well the clusters match known groups.

Dataset: Baron Human Pancreas (scRNA-seq)

The Baron et al. (2016) dataset profiles ~8,500 cells from human pancreatic islets, capturing 14 cell types including alpha, beta, delta, gamma, acinar, and ductal cells. Students must build a project to provide elbow and silhouette diagnostics to help users select the optimal number of clusters.

Data preparation code

## data-raw/example_sce.R
BiocManager::install("scRNAseq")
library(scRNAseq)
library(SingleCellExperiment)

sce <- BaronPancreasData("human")
example_sce <- sce

assays(example_sce) <- list(counts = counts(example_sce))
colData(example_sce) <- colData(example_sce)[, "label", drop = FALSE]
colnames(colData(example_sce)) <- "cell_type"

usethis::use_data(example_sce, overwrite = TRUE)

Data structure: SingleCellExperiment

What users should be able to do:

Select the most variable genes
Run k-means clustering across a range of k values (2..max_k), see ?kmeans
Compute clustering quality metrics (total within-cluster sum of squares, average silhouette width, and gap statistic) for each k, see ?silhouette and ?clusGap
Evaluate cluster assignments against known cell type labels using the Adjusted Rand Index (ARI) when reference labels are available
Visualize clusters on a PCA projection, colored by cluster assignment
Generate an elbow plot (WSS vs. k) and silhouette/gap plots for choosing the optimal number of clusters
Export cluster assignments, quality metrics, and evaluation results to TSV files

Key parameters: n_top, max_k, n_starts, selected_k, reference_column (metadata column for ARI, optional)

Potential CLI outputs: cluster_assignments.tsv, kmeans_metrics.tsv, cluster_evaluation.tsv

Potential Shiny inputs: n_top slider, max_k slider, selected k slider, toggle between elbow plot and cluster scatter, color by cluster or cell_type metadata

Dependencies: SingleCellExperiment, scuttle, cluster, scRNAseq (data only)

Raw analysis code

library(SingleCellExperiment)
library(scRNAseq)
library(scuttle)
library(cluster)
library(ggplot2)
# --- Load data ---
sce <- BaronPancreasData("human")
counts_mat <- counts(sce)
cell_type <- colData(sce)$label
# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================
# --- Log-normalize using scuttle ---
sce <- logNormCounts(sce)
# --- Feature selection: top 2000 most variable genes ---
gene_vars <- apply(counts_mat, 1, var)
top_idx <- order(gene_vars, decreasing = TRUE)[seq_len(2000)]
mat_norm <- as.matrix(logcounts(sce)[top_idx, ])
# --- PCA ---
pca <- prcomp(t(mat_norm), scale. = TRUE, center = TRUE)
pca_mat <- pca$x[, 1:20]
# --- K-means across k = 5..20 ---
min_k <- 5
max_k <- 20
n_starts <- 25
set.seed(42)
metrics <- data.frame(k = min_k:max_k, wss = NA_real_, avg_silhouette = NA_real_)
km_list <- list()
for (k in min_k:max_k) {
    km <- kmeans(pca_mat, centers = k, nstart = n_starts)
    km_list[[as.character(k)]] <- km
    metrics$wss[k - min_k + 1] <- km$tot.withinss
    sil <- silhouette(km$cluster, dist(pca_mat))
    metrics$avg_silhouette[k - min_k + 1] <- mean(sil[, "sil_width"])
}
# --- Gap statistic ---
gap_stat <- clusGap(pca_mat, FUN = kmeans, nstart = n_starts,
                    K.max = max_k, B = 20)
gap_tab <- as.data.frame(gap_stat$Tab)
metrics$gap <- gap_tab$gap[min_k:max_k]
metrics$gap_se <- gap_tab$SE.sim[min_k:max_k]
# --- Adjusted Rand Index vs known cell types ---
evaluation <- data.frame(k = min_k:max_k, ari = NA_real_)
for (k in min_k:max_k) {
    pred <- km_list[[as.character(k)]]$cluster
    tab <- table(cell_type, pred)
    n <- sum(tab)
    sum_comb <- function(x) x * (x - 1) / 2
    sum_rows <- sum(sum_comb(rowSums(tab)))
    sum_cols <- sum(sum_comb(colSums(tab)))
    sum_all  <- sum(sum_comb(tab))
    expected <- sum_rows * sum_cols / sum_comb(n)
    max_index <- (sum_rows + sum_cols) / 2
    ari <- if (max_index == expected) 0 else (sum_all - expected) / (max_index - expected)
    evaluation$ari[k - min_k + 1] <- ari
}
# --- Elbow plot ---
p_elbow <- ggplot(metrics, aes(x = k, y = wss)) +
    geom_line() + geom_point(size = 2) +
    theme_bw(base_size = 14) +
    labs(x = "Number of Clusters (k)",
         y = "Total Within-Cluster Sum of Squares",
         title = "Elbow Plot")
print(p_elbow)
# --- Cluster scatter for selected k ---
selected_k <- 14
km_sel <- km_list[[as.character(selected_k)]]
scatter_df <- data.frame(
    PC1 = pca$x[, 1], PC2 = pca$x[, 2],
    cluster   = factor(km_sel$cluster),
    cell_type = cell_type
)
p_scatter <- ggplot(scatter_df, aes(x = PC1, y = PC2, color = cluster)) +
    geom_point(size = 0.6, alpha = 0.6) +
    theme_bw(base_size = 14) +
    labs(title = paste0("K-means (k=", selected_k, ") on PCA"))
print(p_scatter)


# --- Export ---
output_dir <- file.path(tempdir(), "kmeans_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

assignments <- data.frame(
    cell_id   = colnames(sce),
    cell_type = cell_type,
    cluster   = km_sel$cluster
)
write.table(assignments, file.path(output_dir, "cluster_assignments.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(metrics, file.path(output_dir, "kmeans_metrics.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(evaluation, file.path(output_dir, "cluster_evaluation.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 5: Cell QC Dashboard

Rationale: Quality control is the essential first step in scRNA-seq analysis. Before any biological interpretation, low-quality cells (those with too few detected genes, low library sizes, or high spike-in proportions) must be identified and flagged. ERCC spike-in controls provide a ground-truth reference for QC.

Dataset: Lun Spike-In (Mouse scRNA-seq)

The Lun et al. (2017) dataset contains 192 mouse cells with ERCC spike-in controls. It was specifically designed for benchmarking normalization and QC methods. The spike-in proportions and library size variation create natural QC outliers that students must detect.

Important note on data structure: LunSpikeInData() stores the ERCC spike-in counts in an alternative experiment (altExp), not alongside endogenous genes in the main assay. This is the standard Bioconductor convention for spike-ins: rownames(sce) contains only the ~9,000 endogenous mouse genes, while altExp(sce, "ERCC") is itself a SingleCellExperiment holding the 92 ERCC control counts for the same cells. To compute spike-in percentage per cell, sum across counts(altExp(sce, "ERCC")) and divide by the total library size (endogenous + spike-in counts).

Data preparation code

## data-raw/example_sce.R
BiocManager::install(c("scRNAseq", "org.Mm.eg.db", "AnnotationDbi"))
library(scRNAseq)
library(SingleCellExperiment)
library(org.Mm.eg.db)
library(AnnotationDbi)

sce <- LunSpikeInData()
example_sce <- sce

# ERCC spike-ins are already in altExp(example_sce, "ERCC") — the standard
# Bioconductor convention. Do NOT look for ERCC rows in rownames(example_sce);
# they live in the alternative experiment, accessed via:
#   altExp(example_sce, "ERCC")          # the ERCC SCE
#   counts(altExp(example_sce, "ERCC"))  # ERCC count matrix (92 controls x 192 cells)

# Keep key metadata — note column names use spaces in the raw object;
# rename to R-friendly names
colData(example_sce) <- DataFrame(
  cell_line = colData(sce)[["cell line"]],
  block     = colData(sce)[["block"]]
)

# Add gene symbols to rowData via annotation package, needed for mitochondrial gene QC
gene_ids <- rownames(example_sce)
gene_symbols <- mapIds(org.Mm.eg.db, keys = gene_ids, column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
rowData(example_sce)$symbol <- gene_symbols

usethis::use_data(example_sce, overwrite = TRUE)

Data structure: SingleCellExperiment

What users should be able to do:

Compute per-cell QC metrics: library size, number of genes detected, spike-in percentage (from altExp(sce, "ERCC")), mitochondrial gene percentage (the standard QC metric for detecting dying cells), and novelty score (log10 genes / log10 UMI — detects low-complexity cells)
Flag outlier cells using MAD-based thresholds on QC metrics
Visualize QC results in a multi-panel display: library size distribution, genes vs. library size scatter, spike-in percentage histogram, mitochondrial percentage violin plot, with outliers highlighted
Export QC metrics and pass/fail flags to TSV files

Key parameters: mad_threshold, min_genes, min_library_size, max_spike_pct, max_mito_pct, min_novelty

Potential CLI outputs: qc_metrics.tsv, qc_flags.tsv

PotentialShiny inputs: MAD threshold slider, minimum genes input, max spike-in % slider, max mito % slider, min novelty slider; panels update to reflect pass/fail counts

Dependencies: SingleCellExperiment, scRNAseq (data only)

Raw analysis code

library(SingleCellExperiment)
library(scRNAseq)
library(ggplot2)
library(patchwork)
library(AnnotationDbi)
library(org.Mm.eg.db)

# --- Load data ---
sce <- LunSpikeInData()

gene_ids <- rownames(sce)
gene_symbols <- mapIds(org.Mm.eg.db, keys = gene_ids, column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")
rowData(sce)$symbol <- gene_symbols


# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

counts_mat <- counts(sce)

# --- Compute per-cell QC metrics ---
lib_size     <- colSums(counts_mat)
n_genes      <- colSums(counts_mat > 0)

# Spike-in counts from altExp
ercc_counts  <- counts(altExp(sce, "ERCC"))
ercc_total   <- colSums(ercc_counts)
total_counts <- lib_size + ercc_total
spike_pct    <- ercc_total / total_counts * 100

# Shannon entropy of count distribution per cell
shannon <- apply(counts_mat, 2, function(x) {
    p <- x[x > 0] / sum(x[x > 0])
    -sum(p * log(p))
})

# Mitochondrial gene percentage
# In mouse data, mitochondrial genes typically start with "mt-"
is_mito <- grepl("^mt-", rowData(sce)$symbol, ignore.case = TRUE)
mito_total <- colSums(counts_mat[is_mito, , drop = FALSE])
mito_pct <- mito_total / lib_size * 100

# Novelty score: log10(genes detected) / log10(UMI counts)
# Cells with low complexity (e.g., dominated by a few genes) score low
novelty_score <- log10(n_genes) / log10(lib_size)
# Handle edge cases where lib_size is 0 or 1
novelty_score[!is.finite(novelty_score)] <- 0

qc_metrics <- data.frame(
    cell_id    = colnames(sce),
    cell_line  = colData(sce)[["cell line"]],
    block      = colData(sce)[["block"]],
    lib_size   = lib_size,
    n_genes    = n_genes,
    spike_pct  = spike_pct,
    mito_pct   = mito_pct,
    novelty    = novelty_score,
    entropy    = shannon
)

# --- MAD-based outlier flagging ---
mad_threshold <- 3
flag_low_lib   <- qc_metrics$lib_size   < median(lib_size)   - mad_threshold * mad(lib_size)
flag_low_genes <- qc_metrics$n_genes    < median(n_genes)    - mad_threshold * mad(n_genes)
flag_hi_spike  <- qc_metrics$spike_pct  > median(spike_pct)  + mad_threshold * mad(spike_pct)
flag_hi_mito   <- qc_metrics$mito_pct   > median(mito_pct)   + mad_threshold * mad(mito_pct)
flag_low_novelty <- qc_metrics$novelty  < median(novelty_score) - mad_threshold * mad(novelty_score)

qc_metrics$pass <- !(flag_low_lib | flag_low_genes | flag_hi_spike |
                      flag_hi_mito | flag_low_novelty)
cat("Cells passing QC:", sum(qc_metrics$pass), "/", nrow(qc_metrics), "\n")

# --- Multi-panel QC visualization ---
qc_metrics$status <- ifelse(qc_metrics$pass, "pass", "fail")

p1 <- ggplot(qc_metrics, aes(x = lib_size, fill = status)) +
    geom_histogram(bins = 30, alpha = 0.7) +
    theme_bw() + labs(x = "Library Size", title = "Library Size Distribution")

p2 <- ggplot(qc_metrics, aes(x = lib_size, y = n_genes, color = status)) +
    geom_point(size = 1.5, alpha = 0.7) +
    theme_bw() + labs(x = "Library Size", y = "Genes Detected")

p3 <- ggplot(qc_metrics, aes(x = spike_pct, fill = status)) +
    geom_histogram(bins = 30, alpha = 0.7) +
    theme_bw() + labs(x = "Spike-in %", title = "ERCC Spike-in Percentage")

p4 <- ggplot(qc_metrics, aes(x = n_genes, y = entropy, color = status)) +
    geom_point(size = 1.5, alpha = 0.7) +
    theme_bw() + labs(x = "Genes Detected", y = "Shannon Entropy")

# Print panels (use patchwork or gridExtra for combined layout)
(p1 | p2) / (p3 | p4)

# --- Export ---
output_dir <- file.path(tempdir(), "qc_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(qc_metrics, file.path(output_dir, "qc_metrics.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

qc_flags <- data.frame(
    cell_id        = qc_metrics$cell_id,
    pass           = qc_metrics$pass,
    flag_low_lib   = flag_low_lib,
    flag_low_genes = flag_low_genes,
    flag_hi_spike  = flag_hi_spike,
    flag_hi_mito   = flag_hi_mito,
    flag_low_novelty = flag_low_novelty
)
write.table(qc_flags, file.path(output_dir, "qc_flags.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 6: Gene Set Scoring

Rationale: Rather than analyzing genes individually, biologists often ask whether a predefined set of genes (a pathway, signature) is collectively up- or down-regulated. Per-sample gene set scores reduce genes composing a pathway or contributing to a given biological process to a single number per sample, enabling group comparisons. This project combines expression data with external pathway knowledge from MSigDB.

Dataset: Airway (Human bulk RNA-seq) + MSigDB Hallmark Gene Sets

The airway Bioconductor package provides RNA-seq count data from Himes et al. (2014), profiling four human airway smooth muscle cell lines treated with dexamethasone (a glucocorticoid) versus untreated controls (8 samples total). Dexamethasone activates inflammatory and stress-response pathways that are well-represented in MSigDB Hallmark gene sets — making this a compact, self-contained dataset for demonstrating pathway-level scoring without requiring large data downloads.

In this case, you wouldn’t really have to prepare the data yourself or include it in your package - it’s already readily available in the format you need. But for the sake of the exercise, go ahead and save it as an internal dataset in your package. This will give you practice with packaging and documenting data, even if it’s not strictly necessary for this particular dataset.

Data preparation code

## data-raw/example_se.R
library(airway)
library(SummarizedExperiment)

data(airway)

example_se <- airway

usethis::use_data(example_se, overwrite = TRUE)

Data structure: SummarizedExperiment

What users should be able to do:

Read gene sets from a GMT file or named list of character vectors
Compute per-sample gene set scores using two methods: mean z-score and rank-based scoring (ssGSEA-like: rank genes per sample, compute enrichment from ranks of set members) — comparing methods is essential since results can differ substantially
Compute per-gene contribution scores within a gene set (individual gene z-scores across samples) to understand which genes drive the overall score
Summarize mean scores per group and compute effect sizes between conditions
Visualize gene set scores as boxplots or violin plots, grouped by a clinical variable and faceted by gene set; also visualize per-gene contributions as a heatmap strip for a selected gene set
Export per-sample scores (with method column), group summaries, and per-gene contributions to TSV files

Key parameters: gene_sets (list or GMT path), score_method (mean_z, rank_based, or both), group_column

Potential CLI outputs: geneset_scores.tsv, scoring_summary.tsv, gene_contributions.tsv

Potential Shiny inputs: Select built-in sets or upload GMT, score method dropdown, grouping variable dropdown (dex treatment, cell line), gene set selector, boxplot ↔︎ violin toggle

Dependencies: SummarizedExperiment, airway (data only), msigdbr (gene sets)

Raw analysis code

library(airway)
library(SummarizedExperiment)
library(msigdbr)
library(ggplot2)

# --- Load airway data ---
data(airway)

# log2-CPM normalize counts for scoring
counts_mat <- assay(airway, "counts")
lib_sizes   <- colSums(counts_mat)
mat         <- log2(t(t(counts_mat) / lib_sizes * 1e6) + 1)
dex         <- colData(airway)$dex  # "trt" or "untrt"

# Convert Ensembl IDs to gene symbols for MSigDB overlap
library(org.Hs.eg.db)
library(AnnotationDbi)
symbols <- mapIds(org.Hs.eg.db, keys = rownames(mat),
                  column = "SYMBOL", keytype = "ENSEMBL",
                  multiVals = "first")
keep <- !is.na(symbols)
mat  <- mat[keep, ]
rownames(mat) <- symbols[keep]

# --- Get MSigDB Hallmark gene sets ---
hallmarks <- msigdbr(species = "Homo sapiens", collection = "H")
gene_sets <- split(hallmarks$gene_symbol, hallmarks$gs_name)

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Compute per-sample gene set scores ---
mat_z <- t(scale(t(mat)))  # z-score each gene across samples

# Method 1: mean z-score
scores_meanz <- lapply(names(gene_sets), function(gs_name) {
    genes_in_set <- intersect(gene_sets[[gs_name]], rownames(mat_z))
    if (length(genes_in_set) < 5) return(NULL)
    per_sample <- colMeans(mat_z[genes_in_set, , drop = FALSE], na.rm = TRUE)
    data.frame(sample_id = colnames(mat), gene_set = gs_name,
               score = per_sample, method = "mean_z",
               dex = as.character(dex))
})

# Method 2: rank-based (ssGSEA-like)
scores_rank <- lapply(names(gene_sets), function(gs_name) {
    genes_in_set <- intersect(gene_sets[[gs_name]], rownames(mat))
    if (length(genes_in_set) < 5) return(NULL)
    # Rank genes per sample (higher expression = higher rank)
    ranks <- apply(mat, 2, rank)
    n <- nrow(ranks)
    in_set <- rownames(ranks) %in% genes_in_set
    # Enrichment score: mean rank of set members, normalized
    per_sample <- colMeans(ranks[in_set, , drop = FALSE]) / n
    data.frame(sample_id = colnames(mat), gene_set = gs_name,
               score = per_sample, method = "rank_based",
               dex = as.character(dex))
})

scores_df <- do.call(rbind, c(scores_meanz, scores_rank))

# --- Per-gene contribution for a selected gene set ---
selected_set <- names(gene_sets)[1]
contrib_genes <- intersect(gene_sets[[selected_set]], rownames(mat_z))
gene_contrib <- as.data.frame(t(mat_z[contrib_genes, , drop = FALSE]))
gene_contrib$sample_id <- colnames(mat)
gene_contrib$dex <- as.character(dex)

# --- Group summary ---
summary_df <- aggregate(score ~ gene_set + dex, data = scores_df,
                        FUN = function(x) c(mean = mean(x), sd = sd(x)))
summary_df <- do.call(data.frame, summary_df)
colnames(summary_df) <- c("gene_set", "dex", "mean_score", "sd_score")

# --- Horizontal grouped bar chart (mean ± SE per treatment group, faceted by method) ---
method_labels <- c(mean_z = "Mean Z-Score", rank_based = "Rank-Based (ssGSEA-like)")

p <- ggplot(scores_df, aes(x = gene_set, y = score, fill = dex)) +
    stat_summary(fun = mean, geom = "bar",
                 position = position_dodge(1.0), width = 0.6) +
    stat_summary(fun.data = mean_se, geom = "errorbar",
                 position = position_dodge(1.0), width = 0.2) +
    scale_fill_manual(values = c(trt = "#E64B35", untrt = "#4DBBD5"),
                      labels = c(trt = "Treated", untrt = "Untreated")) +
    facet_wrap(~ method, scales = "free_x", labeller = labeller(method = method_labels)) +
    coord_flip() +
    theme_bw(base_size = 9) +
    theme(legend.position = "top") +
    labs(x = "Gene Set", y = "Mean Gene Set Score",
         fill = "Dexamethasone",
         title = "Hallmark Gene Set Scores by Dexamethasone Treatment")
print(p)

# --- Export ---
output_dir <- file.path(tempdir(), "geneset_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(scores_df, file.path(output_dir, "geneset_scores.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(summary_df, file.path(output_dir, "scoring_summary.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(gene_contrib, file.path(output_dir, "gene_contributions.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 7: Gene Expression Profiler

Rationale: After cell types are annotated, a common question is “How is gene X expressed across these groups?” Biologists routinely look up known marker genes to sanity-check annotations, explore candidate genes for follow-up experiments, or compare expression patterns between related genes. This tool provides per-group expression statistics and figures visualizations for any user-specified set of genes.

Dataset: Muraro Human Pancreas (Human scRNA-seq)

The Muraro et al. (2016) dataset from scRNAseq profiles ~2,300 cells from human donor pancreatic islets using CEL-Seq2. Cell types include alpha, beta, delta, acinar, ductal, and more. The well-characterized marker genes for each pancreatic cell type (INS for beta cells, GCG for alpha, SST for delta, PRSS1 for acinar, KRT19 for ductal) make results immediately interpretable — students can verify their tool produces biologically expected patterns.

Data preparation code

## data-raw/example_sce.R
BiocManager::install("scRNAseq")
library(scRNAseq)
library(SingleCellExperiment)

sce <- MuraroPancreasData()
example_sce <- sce

assays(example_sce) <- list(counts = counts(example_sce))

# --- Normalize using scuttle (standard Bioconductor scRNA-seq pipeline) ---
# logNormCounts() computes library-size-based size factors, normalizes,
# and log2-transforms, storing the result in logcounts(sce)
example_sce <- logNormCounts(example_sce)

# Note: rownames have a "__chrN" suffix (e.g. "INS__chr11") —
rownames(example_sce) <- sub("__chr.*$", "", rownames(example_sce))

# Remove cells with NA cell type (unlabeled)
example_sce <- example_sce[, !is.na(colData(example_sce)$label)]

usethis::use_data(example_sce, overwrite = TRUE)

Data structure: SingleCellExperiment

What users should be able to do:

Look up a set of genes by name, with validation against the dataset (report which genes were found and which were missing)
Compute per-group expression statistics for each queried gene: mean expression, median expression, and detection rate (fraction of cells in each group with non-zero counts)
Compute a gene specificity score for each gene — a Gini coefficient measuring how concentrated a gene’s expression is in one group versus spread evenly across all groups (0 = uniform, 1 = perfectly specific)
Visualize expression as faceted violin plots (one panel per gene, x = cell type, colored by group) and as a co-expression scatter plot (two selected genes plotted against each other, colored by cell type)
Export per-group expression statistics and gene specificity scores to TSV files

Key parameters: genes (character vector of gene names to query), group_column (metadata column for grouping, default "cell_type"), min_detection (minimum fraction of cells detecting a gene in at least one group to include it in results)

Potential CLI outputs: expression_stats.tsv, gene_specificity.tsv

Potential Shiny inputs: Gene name text input (comma-separated), grouping variable dropdown, plot type toggle (violin / co-expression scatter), gene pair selectors for scatter view, detection threshold slider

Dependencies: SingleCellExperiment, scuttle, scRNAseq (data only)

Raw analysis code

library(SingleCellExperiment)
library(scRNAseq)
library(scuttle)
library(ggplot2)

# --- Load data ---
sce <- MuraroPancreasData()

# Muraro rownames have a "__chrN" suffix — strip it for readable gene names
rownames(sce) <- sub("__chr.*$", "", rownames(sce))

# Remove cells with NA cell type (unlabeled)
sce <- sce[, !is.na(colData(sce)$label)]
cell_type <- colData(sce)$label

# --- Normalize using scuttle (standard Bioconductor scRNA-seq pipeline) ---
# logNormCounts() computes library-size-based size factors, normalizes,
# and log2-transforms, storing the result in logcounts(sce)
sce <- logNormCounts(sce)

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

mat_norm <- as.matrix(logcounts(sce))

# --- Define query genes (well-known pancreatic markers) ---
query_genes <- c("INS", "GCG", "SST", "PPY", "PRSS1", "KRT19")

# Validate: check which genes are present in the dataset
found <- query_genes %in% rownames(mat_norm)
cat("Found:", paste(query_genes[found], collapse = ", "), "\n")
if (any(!found)) cat("Missing:", paste(query_genes[!found], collapse = ", "), "\n")
query_genes <- query_genes[found]

# --- Build long-format expression data frame ---
n_cells <- ncol(mat_norm)
expr_vals <- c()
for (g in query_genes) expr_vals <- c(expr_vals, mat_norm[g, ])

expr_long <- data.frame(
    cell_id   = rep(colnames(mat_norm), length(query_genes)),
    gene      = rep(query_genes, each = n_cells),
    expr      = expr_vals,
    cell_type = rep(cell_type, length(query_genes)),
    stringsAsFactors = FALSE
)

# --- Per-group expression statistics ---
groups <- unique(cell_type)
counts_mat <- as.matrix(counts(sce))
stats_list <- vector("list", length(query_genes) * length(groups))
k <- 1
for (gene in query_genes) {
    vals <- mat_norm[gene, ]
    raw_vals <- counts_mat[gene, ]
    for (ct in groups) {
        idx <- which(cell_type == ct)
        stats_list[[k]] <- data.frame(
            gene           = gene,
            cell_type      = ct,
            mean_expr      = mean(vals[idx]),
            median_expr    = median(vals[idx]),
            detection_rate = mean(raw_vals[idx] > 0),
            n_cells        = length(idx),
            stringsAsFactors = FALSE
        )
        k <- k + 1
    }
}
expr_stats <- do.call(rbind, stats_list)
head(expr_stats)

# --- Gene specificity: Gini coefficient of mean expression across groups ---
# Gini = 0 means the gene is equally expressed in all groups
# Gini = 1 means the gene is expressed in only one group
gini <- function(x) {
    x <- sort(x[x >= 0])
    n <- length(x)
    if (n == 0 || sum(x) == 0) return(0)
    indices <- seq_len(n)
    2 * sum(indices * x) / (n * sum(x)) - (n + 1) / n
}

spec_list <- vector("list", length(query_genes))
for (i in seq_along(query_genes)) {
    gene <- query_genes[i]
    group_means <- tapply(mat_norm[gene, ], cell_type, mean)
    top_group <- names(which.max(group_means))
    spec_list[[i]] <- data.frame(
        gene         = gene,
        gini_score   = gini(as.numeric(group_means)),
        top_group    = top_group,
        top_mean     = max(group_means),
        overall_mean = mean(mat_norm[gene, ]),
        stringsAsFactors = FALSE
    )
}
specificity <- do.call(rbind, spec_list)
specificity <- specificity[order(specificity$gini_score, decreasing = TRUE), ]
print(specificity)

# --- Violin plot: one panel per gene, x = cell type ---
p_violin <- ggplot(expr_long, aes(x = cell_type, y = expr, fill = cell_type)) +
    geom_violin(trim = FALSE, alpha = 0.7, scale = "width") +
    geom_boxplot(width = 0.15, fill = "white", outlier.size = 0.3) +
    facet_wrap(~ gene, scales = "free_y", ncol = 3) +
    theme_bw(base_size = 13) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
          legend.position = "none") +
    labs(x = "Cell Type", y = "Normalized Expression (log2)",
         title = "Gene Expression Across Cell Types")
print(p_violin)

# --- Co-expression scatter: INS vs GCG, colored by cell type ---
gene_a <- "INS"
gene_b <- "GCG"
scatter_df <- data.frame(
    expr_a    = mat_norm[gene_a, ],
    expr_b    = mat_norm[gene_b, ],
    cell_type = cell_type
)

p_scatter <- ggplot(scatter_df, aes(x = expr_a, y = expr_b, color = cell_type)) +
    geom_point(size = 1.5, alpha = 1) +
    theme_bw(base_size = 13) +
    labs(x = paste0(gene_a, " (log-normalized)"),
         y = paste0(gene_b, " (log-normalized)"),
         color = "Cell Type",
         title = paste0("Co-Expression: ", gene_a, " vs. ", gene_b),
         subtitle = "Mutually exclusive in alpha vs. beta cells")
print(p_scatter)

# --- Export ---
output_dir <- file.path(tempdir(), "expr_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(expr_stats, file.path(output_dir, "expression_stats.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

write.table(specificity, file.path(output_dir, "gene_specificity.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 8: Gene Correlation Network

Rationale: Co-expressed genes often share biological functions. Building a gene-gene correlation network by computing pairwise correlations, thresholding to an adjacency matrix, and identifying hub genes is a lightweight introduction to network biology. Datasets with many samples provide stable pairwise gene correlations that reveal known co-regulation patterns.

Dataset: GTEx Skeletal Muscle (Human bulk RNA-seq)

The recount3 package provides access to the GTEx (Genotype-Tissue Expression) project. The skeletal muscle tissue has one of the largest sample sizes in GTEx (~800 samples), providing highly stable gene-gene correlations. Known co-expression modules — such as mitochondrial respiration genes, sarcomere components, and extracellular matrix genes — are recoverable, making this a biologically rich dataset for network analysis.

Data preparation code

## data-raw/example_se.R
BiocManager::install("recount3")
library(recount3)
library(SummarizedExperiment)

# Access GTEx skeletal muscle data
human_projects <- available_projects()
proj <- subset(human_projects, file_source == "gtex" &
                 project == "MUSCLE")
rse <- create_rse(proj)

# Transform raw counts
assay(rse, "counts") <- transform_counts(rse)

# Subset to top 2000 genes by variance for tractable network computation
vars <- apply(assay(rse, "counts"), 1, var)
keep_genes <- names(sort(vars, decreasing = TRUE))[1:2000]

# Subset to 200 samples for package size
set.seed(42)
keep_samples <- sample(ncol(rse), min(200, ncol(rse)))

example_se <- rse[keep_genes, keep_samples]
assays(example_se) <- list(counts = assay(example_se, "counts"))
colData(example_se) <- colData(example_se)[, c("gtex.age", "gtex.sex"),
                                            drop = FALSE]

usethis::use_data(example_se, overwrite = TRUE)

Data structure: SummarizedExperiment

What users should be able to do:

Select the most variable genes for network construction
Compute a pairwise gene-gene correlation matrix
Threshold the correlation matrix into a binary adjacency matrix and compute per-gene network statistics (degree, betweenness centrality)
Detect gene modules/communities in the network using Louvain or walktrap algorithms, and summarize per-module statistics (size, within-module density)
Summarize the network: total edge count, mean degree, modularity score, hub genes above a connectivity threshold
Visualize the network as a correlation heatmap with gene clustering, or as a force-directed layout colored by module, with top hub genes labeled by gene symbol
Export correlations, network summary, per-node statistics (degree, betweenness, module for all genes), hub gene lists, module assignments, and per-module summaries to TSV files

Key parameters: n_top, cor_method (pearson/spearman), cor_threshold, community_method (louvain/walktrap)

Potential CLI outputs: gene_correlations.tsv, network_summary.tsv, hub_genes.tsv, module_assignments.tsv, node_stats.tsv, module_summary.tsv

Potential Shiny inputs: Correlation threshold slider, method dropdown, n_top slider, hub gene table, heatmap redraws on threshold change

Dependencies: SummarizedExperiment, igraph (Suggests, for force-directed layout), recount3 (data only)

Raw analysis code

library(SummarizedExperiment)
library(recount3)
library(igraph)
library(ComplexHeatmap)
library(ggplot2)

# --- Load GTEx skeletal muscle data ---
human_projects <- available_projects()
proj <- subset(human_projects, file_source == "gtex" & project == "MUSCLE")
rse <- create_rse(proj)
assay(rse, "counts") <- transform_counts(rse)

# Subset to top 500 variable genes (for tractable pairwise correlations)
vars <- apply(assay(rse, "counts"), 1, var)
keep_genes <- names(sort(vars, decreasing = TRUE))[1:500]

# Subset to 200 random samples
set.seed(42) # Needed for network layout reproducibility
keep_samples <- sample(ncol(rse), min(200, ncol(rse)))
se <- rse[keep_genes, keep_samples]
mat <- log2(assay(se, "counts") + 1)

# --- Remap row identifiers to gene symbols ---
# rowData(rse)$gene_name holds HGNC symbols; fall back to Ensembl ID on NA
gene_symbols <- rowData(se)$gene_name
gene_symbols[is.na(gene_symbols) | gene_symbols == ""] <- rownames(se)[
    is.na(gene_symbols) | gene_symbols == ""
]
rownames(mat) <- make.unique(gene_symbols)  # resolve rare duplicates

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Pairwise Pearson correlation ---
cor_mat <- cor(t(mat), method = "pearson")

# --- Threshold to adjacency matrix ---
cor_threshold <- 0.7
adj_mat <- (abs(cor_mat) >= cor_threshold) * 1
diag(adj_mat) <- 0

# --- Network statistics via igraph ---
g <- graph_from_adjacency_matrix(adj_mat, mode = "undirected")
deg <- degree(g)
betw <- betweenness(g)

# --- Community/module detection ---
comm <- cluster_louvain(g)
modules <- membership(comm)
modularity_score <- modularity(comm)
cat("Modularity:", round(modularity_score, 3), "\n")
cat("Modules found:", max(modules), "\n")

module_assignments <- data.frame(
    gene   = names(modules),
    module = paste0("M", modules),
    degree = deg[names(modules)]
)

net_summary <- data.frame(
    metric = c("n_genes", "n_edges", "mean_degree",
               "median_degree", "n_hubs_deg20",
               "n_modules", "modularity"),
    value  = c(vcount(g), ecount(g), round(mean(deg), 2),
               median(deg), sum(deg >= 20),
               max(modules), round(modularity_score, 4))
)
print(net_summary)

# Hub genes: top 20 by degree
hub_df <- data.frame(
    gene        = names(deg),
    degree      = deg,
    betweenness = betw,
    module      = paste0("M", modules[names(deg)])
)
hub_df <- hub_df[order(hub_df$degree, decreasing = TRUE), ]
hub_genes <- head(hub_df, 20)

# --- Correlation heatmap ---
ht <- Heatmap(cor_mat, name = "Pearson r",
              show_row_names = FALSE, show_column_names = FALSE,
              column_title = "Gene-Gene Correlation (top 500 genes)")
draw(ht)

# --- Force-directed network graph (hub-focused subgraph) ---
# Subset to top 80 hub genes to keep layout readable
hub_names <- head(hub_df$gene, 80)
g_sub <- induced_subgraph(g, vids = hub_names)

# Module membership and degree for the subgraph
V(g_sub)$module   <- as.character(modules[V(g_sub)$name])
V(g_sub)$degree   <- degree(g_sub)
V(g_sub)$is_top20 <- V(g_sub)$name %in% hub_genes$gene

# Layout 
layout_fr <- layout_with_fr(g_sub)

# Build a ggplot-friendly data frame from the igraph layout
node_df <- data.frame(
    gene        = V(g_sub)$name,   # gene symbol (or Ensembl fallback)
    x           = layout_fr[, 1],
    y           = layout_fr[, 2],
    degree      = V(g_sub)$degree,
    module      = paste0("M", V(g_sub)$module),
    is_hub      = V(g_sub)$is_top20,
    stringsAsFactors = FALSE
)

edge_list <- as_edgelist(g_sub)
edge_df <- data.frame(
    x    = layout_fr[match(edge_list[, 1], V(g_sub)$name), 1],
    y    = layout_fr[match(edge_list[, 1], V(g_sub)$name), 2],
    xend = layout_fr[match(edge_list[, 2], V(g_sub)$name), 1],
    yend = layout_fr[match(edge_list[, 2], V(g_sub)$name), 2]
)

p_net <- ggplot() +
    geom_segment(data = edge_df,
                 aes(x = x, y = y, xend = xend, yend = yend),
                 color = "grey75", linewidth = 0.25, alpha = 0.5) +
    geom_point(data = node_df,
               aes(x = x, y = y, size = degree, color = module),
               alpha = 0.85) +
    geom_label(data = subset(node_df, is_hub),
               aes(x = x, y = y, label = gene),
               size = 2.5, label.padding = unit(0.15, "lines"),
               label.size = 0.2, alpha = 0.85) +
    scale_size_continuous(range = c(2, 8), name = "Degree") +
    scale_color_brewer(palette = "Set2", name = "Module") +
    theme_void() +
    labs(title = "Hub Gene Co-expression Network",
         subtitle = paste0("Top 80 genes by degree | cor threshold = ",
                           cor_threshold,
                           " | Louvain modules = ", max(modules)))
print(p_net)

# --- Export ---
output_dir <- file.path(tempdir(), "network_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Write top correlations (flattened, filtered to |r| >= threshold)
# rownames are gene symbols after the remap above
upper_idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= cor_threshold,
                   arr.ind = TRUE)
cor_edges <- data.frame(
    gene1       = rownames(cor_mat)[upper_idx[, 1]],
    gene2       = colnames(cor_mat)[upper_idx[, 2]],
    correlation = cor_mat[upper_idx]
)
write.table(cor_edges, file.path(output_dir, "gene_correlations.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(net_summary, file.path(output_dir, "network_summary.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(hub_genes, file.path(output_dir, "hub_genes.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(module_assignments, file.path(output_dir, "module_assignments.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# All-node statistics (degree + betweenness + module for every gene)
node_stats <- data.frame(
    gene        = names(deg),
    degree      = deg,
    betweenness = betw,
    module      = paste0("M", modules[names(deg)])
)
node_stats <- node_stats[order(node_stats$degree, decreasing = TRUE), ]
write.table(node_stats, file.path(output_dir, "node_stats.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Per-module summary: size, mean degree, mean betweenness
mod_summary <- do.call(rbind, lapply(unique(node_stats$module), function(m) {
    sub <- node_stats[node_stats$module == m, ]
    data.frame(
        module          = m,
        n_genes         = nrow(sub),
        mean_degree     = round(mean(sub$degree), 2),
        mean_betweenness = round(mean(sub$betweenness), 2),
        top_hub         = sub$gene[which.max(sub$degree)]
    )
}))
mod_summary <- mod_summary[order(mod_summary$n_genes, decreasing = TRUE), ]
write.table(mod_summary, file.path(output_dir, "module_summary.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 9: Expression Heatmap Builder

Rationale: The genes-by-samples heatmap is one of the most iconic figures in genomics. Building one from scratch — row scaling, hierarchical clustering of both axes, sample metadata annotation bars — requires integrating several analysis steps into a single publication-quality visualization.

Dataset: Fission yeast stress response (S. pombe RNA-seq)

The fission Bioconductor package provides RNA-seq count data for 36 Schizosaccharomyces pombe samples from Rhind et al. (2011). Cells were sampled at six time points (0, 15, 30, 60, 120, 180 minutes) following oxidative stress treatment, across two genetic backgrounds: wild-type and an atf21Δ mutant that knocks out a key stress-response transcription factor. The mutant completely fails to mount the normal transcriptional response, producing a dramatic, biologically interpretable separation between wt and mut columns in unsupervised clustering — gene modules corresponding to early-response, sustained-response, and housekeeping genes are immediately visible.

Data preparation code

## data-raw/example_se.R
BiocManager::install("fission")
library(fission)
library(SummarizedExperiment)

data(fission)

# fission is already a SummarizedExperiment with counts and colData
example_se <- fission

usethis::use_data(example_se, overwrite = TRUE)

Data structure: SummarizedExperiment

What users should be able to do:

Select the most variable genes
Row-scale the expression matrix (z-score or min-max normalization)
Perform k-means clustering on rows (genes) directly inside the Heatmap() call and extract the resulting module assignments post-draw
Generate a genes × samples heatmap with k-means row splits, column annotation bars from sample metadata (strain, minute), and column splits by strain
Export the scaled expression matrix, per-gene module assignments (long format), and per-module gene lists with gene counts (wide format) to TSV files

Key parameters: n_top, scale_method (zscore/minmax/none), gene_k, column_split_by (metadata column)

Potential CLI outputs: scaled_expression.tsv, gene_modules.tsv, module_gene_lists.tsv

Potential Shiny inputs: n_top slider, scale method dropdown, gene_k slider, metadata column selector for annotation and column splitting, heatmap updates reactively

Dependencies: SummarizedExperiment, ComplexHeatmap (Suggests), fission (data only)

Raw analysis code

library(fission)
library(SummarizedExperiment)
library(ComplexHeatmap)

# --- Load fission yeast data ---
data(fission)
mat    <- assay(fission, "counts")
strain <- fission$strain   # wt / mut
minute <- fission$minute   # 0, 15, 30, 60, 120, 180

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Log-normalize counts (log2-CPM) ---
# This reduces the dynamic range and makes the scaling more interpretable
lib_sizes  <- colSums(mat)
mat_norm   <- log2(t(t(mat) / lib_sizes * 1e6) + 1)

# --- Select top 500 most variable genes ---
gene_vars  <- apply(mat_norm, 1, var)
top_genes  <- names(sort(gene_vars, decreasing = TRUE))[1:500]
mat_top    <- mat_norm[top_genes, ]

# --- Z-score row scaling ---
mat_scaled <- t(scale(t(mat_top)))  # z-score each gene across samples

# --- Annotated heatmap with k-means row clustering ---
gene_k <- 5

# Explicit color palettes for each annotation track
strain_colors <- c(wt  = "#027a2a", mut = "#a50181")

minute_colors <- c("0"   = "#a1a1a0",
                   "15"  = "#FDCC8A",
                   "30"  = "#FC8D59",
                   "60"  = "#E34A33",
                   "120" = "#B30000",
                   "180" = "#7A0000")

# One color per k-means module (length must equal gene_k)
module_colors <- c("#3C5488", "#00A087", "#ffee01", "#f80404", "#00e1ff")

col_anno <- HeatmapAnnotation(
    strain = strain,
    minute = as.character(minute),
    na_col = "grey90",
    col    = list(strain = strain_colors,
                  minute = minute_colors)
)

# Left annotation block: colors each k-means split and labels it M1–M5
row_anno <- rowAnnotation(
    module = anno_block(
        gp        = gpar(fill = module_colors),
        labels    = paste0("M", seq_len(gene_k)),
        labels_gp = gpar(col = "white", fontsize = 10)
    )
)

ht <- Heatmap(mat_scaled, name = "Z-score",
              top_annotation    = col_anno,
              left_annotation   = row_anno,
              row_km            = gene_k,   # k-means on rows inside the heatmap
              column_split      = strain,
              show_row_names    = FALSE,
              show_column_names = FALSE,
              column_title      = "Fission Yeast Stress Response (top 500 genes)")
ht_drawn <- draw(ht)

# --- Extract k-means module assignments from the drawn heatmap ---
row_clusters <- row_order(ht_drawn)   # named list: one element per k-means group
gene_modules <- integer(nrow(mat_scaled))
for (i in seq_along(row_clusters)) {
    gene_modules[row_clusters[[i]]] <- i
}
names(gene_modules) <- rownames(mat_scaled)

module_df <- data.frame(
    gene   = names(gene_modules),
    module = paste0("M", gene_modules)
)

# --- Export ---
output_dir <- file.path(tempdir(), "heatmap_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

scaled_out       <- as.data.frame(mat_scaled)
scaled_out$gene  <- rownames(mat_scaled)
write.table(scaled_out, file.path(output_dir, "scaled_expression.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Per-gene module assignments (long format: one row per gene)
write.table(module_df, file.path(output_dir, "gene_modules.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Per-module gene lists (wide format: one row per module, genes comma-separated)
module_gene_list <- do.call(rbind, lapply(sort(unique(module_df$module)), function(m) {
    genes <- module_df$gene[module_df$module == m]
    data.frame(module   = m,
               n_genes  = length(genes),
               genes    = paste(genes, collapse = ","),
               stringsAsFactors = FALSE)
}))
write.table(module_gene_list, file.path(output_dir, "module_gene_lists.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 10: Dimensionality Estimation Tool

Rationale: Determining the meaningful number of principal components after PCA is a common task. Use too many and you’re effectively including noise; use too few and you may miss real variation. The broken-stick model, Kaiser criterion, and elbow heuristic each encode different assumptions. Looking at these metrics can help users determine the number of principal components to retain for downstream analyses like additional dimensionality reduction (e.g., t-SNE, UMAP) and clustering.

Dataset: PBMC 3k (Human scRNA-seq)

The 10X Genomics PBMC 3k dataset (~2,700 peripheral blood mononuclear cells) is the canonical Seurat tutorial dataset. It contains ~10 immune cell types (T cells, B cells, NK cells, monocytes, dendritic cells), and the meaningful number of PCs is unknown.

Data preparation code

## data-raw/example_sce.R
BiocManager::install("TENxPBMCData")
library(TENxPBMCData)
library(SingleCellExperiment)

sce <- TENxPBMCData("pbmc3k")
example_sce <- sce

# Minimal colData — no pre-computed labels; that's fine for dim estimation
colData(example_sce) <- colData(example_sce)[, "Barcode", drop = FALSE]

usethis::use_data(example_sce, overwrite = TRUE)

Data structure: SingleCellExperiment

What users should be able to do:

Select the most variable genes
Run PCA and extract eigenvalues (variance per component, cumulative variance)
Apply multiple dimensionality estimation methods (broken-stick model, Kaiser criterion, elbow heuristic) and report recommended number of components from each
Visualize a scree plot with overlays for each estimation method (broken-stick curve, Kaiser line, elbow annotation)
Export eigenvalue tables and dimensionality estimates to TSV files

Key parameters: n_top, methods (broken_stick/kaiser/elbow), max_pcs

Potential CLI outputs: eigenvalues.tsv, dimension_estimates.tsv

Potential Shiny inputs: n_top slider, checkboxes to toggle overlay of each method, max_pcs slider, summary table of recommendations

Dependencies: SingleCellExperiment, scuttle, TENxPBMCData (data only)

Raw analysis code

library(SingleCellExperiment)
library(TENxPBMCData)
library(scuttle)
library(ggplot2)
library(matrixStats)

# --- Load data ---
sce <- TENxPBMCData("pbmc3k")
counts_mat <- counts(sce)

# --- Log-normalize using scuttle ---
# Normalize on the full SCE so size factors reflect the complete library
sce <- logNormCounts(sce)

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Feature selection: top 2000 most variable genes ---
gene_vars <- rowVars(counts_mat)
top_idx <- order(gene_vars, decreasing = TRUE)[seq_len(2000)]
mat_norm <- as.matrix(logcounts(sce)[top_idx, ])

# --- PCA ---
pca <- prcomp(t(mat_norm), scale. = TRUE, center = TRUE)
max_pcs <- min(50, ncol(pca$x))
eigenvalues <- pca$sdev[1:max_pcs]^2
var_pct <- eigenvalues / sum(pca$sdev^2) * 100
cum_var <- cumsum(var_pct)

eigenvalue_df <- data.frame(
    PC         = seq_len(max_pcs),
    eigenvalue = eigenvalues,
    var_pct    = var_pct,
    cum_var    = cum_var
)

# --- Broken-stick model ---
bs_vals <- numeric(max_pcs)
for (i in seq_len(max_pcs)) bs_vals[i] <- sum(1 / (i:max_pcs))
bs_vals <- bs_vals / max_pcs * 100  # as percentage of total

# --- Kaiser criterion: eigenvalue > mean eigenvalue ---
kaiser_threshold <- mean(eigenvalues)
kaiser_n <- sum(eigenvalues > kaiser_threshold)

# --- Elbow heuristic: largest drop between consecutive PCs ---
diffs <- -diff(eigenvalues)
elbow_n <- which.max(diffs)

# --- Estimates ---
bs_n <- sum(var_pct > bs_vals)
estimates <- data.frame(
    method   = c("broken_stick", "kaiser", "elbow"),
    n_dims   = c(bs_n, kaiser_n, elbow_n),
    criterion = c("var% > broken-stick%",
                  paste0("eigenvalue > ", round(kaiser_threshold, 2)),
                  paste0("largest drop at PC", elbow_n))
)
print(estimates)

# --- Scree plot with overlays ---
plot_df <- data.frame(
    PC = seq_len(max_pcs),
    observed = var_pct,
    broken_stick = bs_vals
)
p <- ggplot(plot_df, aes(x = PC)) +
    geom_line(aes(y = observed), color = "black", linewidth = 1) +
    geom_point(aes(y = observed), size = 2) +
    geom_line(aes(y = broken_stick), color = "red", lty = 2) +
    geom_hline(yintercept = mean(var_pct), color = "blue", lty = 3) +
    geom_vline(xintercept = elbow_n, color = "darkgreen", lty = 4) +
    annotate("text", x = max_pcs * 0.7, y = max(var_pct) * 0.9,
             label = paste0("Broken-stick: ", bs_n, " PCs\n",
                            "Kaiser: ", kaiser_n, " PCs\n",
                            "Elbow: ", elbow_n, " PCs"),
             hjust = 0, size = 4) +
    theme_bw(base_size = 14) +
    labs(x = "Principal Component", y = "Variance Explained (%)",
         title = "Scree Plot with Dimensionality Estimates")
print(p)

# --- Export ---
output_dir <- file.path(tempdir(), "dim_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(eigenvalue_df, file.path(output_dir, "eigenvalues.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(estimates, file.path(output_dir, "dimension_estimates.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 11: Batch Effect Assessment

Rationale: Batch effects (systematic technical variation from processing date, lane, or operator) can muddy biological signal. Quantifying how much variance each PC attributes to batch (via linear regression R²) is a common diagnostic before deciding whether batch correction or modeling is needed.

This analysis takes several minutes to run.

Dataset: Tasic + Zeisel Mouse Brain (Mouse scRNA-seq, two technologies)

The scRNAseq package provides two independent mouse brain single-cell datasets: Zeisel et al. (2015) (~3,000 cells from mouse cortex and hippocampus sequenced with STRT-seq) and Tasic et al. (2016) (~1,800 cells from mouse primary visual cortex sequenced with SMART-Seq v4). Both datasets profile overlapping cell types — excitatory neurons, inhibitory neurons, astrocytes, oligodendrocytes, and microglia — but the dramatic technology difference (STRT-seq vs SMART-Seq) creates a textbook-visible batch effect: PCA before correction separates cells almost entirely by dataset rather than by cell type. After fastMNN correction the cell-type signal dominates, providing an unambiguous before/after demonstration. The datasets share thousands of common genes and load with single function calls from scRNAseq.

Data preparation code

## data-raw/example_sce.R
library(scRNAseq)
library(SingleCellExperiment)
library(scuttle)
library(batchelor)

sce_z <- ZeiselBrainData()
sce_t <- TasicBrainData()

# QC: remove low-quality cells using mitochondrial gene percentage
# Note: mito gene prefix differs between the two datasets
sce_z <- addPerCellQC(sce_z, subsets = list(Mito = grep("mt-", rownames(sce_z))))
qc_z  <- quickPerCellQC(colData(sce_z), sub.fields = "subsets_Mito_percent")
sce_z <- sce_z[, !qc_z$discard]

sce_t <- addPerCellQC(sce_t, subsets = list(Mito = grep("mt_", rownames(sce_t))))
qc_t  <- quickPerCellQC(colData(sce_t), sub.fields = "subsets_Mito_percent")
sce_t <- sce_t[, !qc_t$discard]

# Subset to the common set of genes
universe <- intersect(rownames(sce_z), rownames(sce_t))
sce_z    <- sce_z[universe, ]
sce_t    <- sce_t[universe, ]

# Cross-batch normalization: downscale to the least-sequenced batch
out   <- multiBatchNorm(sce_z, sce_t)
sce_z <- out[[1]]
sce_t <- out[[2]]

# Save cell-type labels before replacing colData
ct_z <- as.character(sce_z$level1class)
ct_t <- as.character(sce_t$broad_type)

# Unify cell-type vocabulary across the two datasets so matching types share
# the same label and colour in downstream plots.
# Zeisel uses lowercase/underscore conventions; Tasic uses title case.
ct_map_z <- c(
    "pyramidal CA1"        = "Excitatory Neuron",
    "pyramidal SS"         = "Excitatory Neuron",
    "interneurons"         = "Inhibitory Neuron",
    "oligodendrocytes"     = "Oligodendrocyte",
    "astrocytes_ependymal" = "Astrocyte",
    "microglia"            = "Microglia",
    "endothelial-mural"    = "Endothelial"
)
ct_map_t <- c(
    "Glutamatergic Neuron" = "Excitatory Neuron",
    "GABA-ergic Neuron"    = "Inhibitory Neuron",
    "Astrocyte"            = "Astrocyte",
    "Oligodendrocyte"      = "Oligodendrocyte",
    "Microglia"            = "Microglia",
    "Endothelial Cell"     = "Endothelial",
    "OPC"                  = "OPC"
)
ct_z <- unname(ct_map_z[ct_z]); ct_z[is.na(ct_z)] <- "Other"
ct_t <- unname(ct_map_t[ct_t]); ct_t[is.na(ct_t)] <- "Other"

# Drop altExps (ERCC spike-ins, repeats) so cbind works without error
altExps(sce_z) <- NULL
altExps(sce_t) <- NULL

# Harmonize colData to two columns used downstream
colData(sce_z) <- DataFrame(batch = "Zeisel", cell_type = ct_z)
colData(sce_t) <- DataFrame(batch = "Tasic",  cell_type = ct_t)

example_sce <- cbind(sce_z, sce_t)

usethis::use_data(example_sce, overwrite = TRUE)

Data structure: SingleCellExperiment

What users should be able to do:

Select highly variable genes jointly across both batches using scran (modelGeneVar, combineVar, getTopHVGs)
Quantify batch effects by fitting a linear model of each PC against the batch variable and extracting R² values
Apply batch correction using fastMNN() from the batchelor package (mutual nearest-neighbour alignment in PCA space across donor batches) and extract the corrected low-dimensional embedding
Visualize dual-panel scatter plots (PCA before vs. MNN-corrected embedding after) colored by batch, plus a side-by-side R² bar chart comparing batch variance before and after correction
Export the combined before/after R² table and the corrected embedding to TSV files

Key parameters: batch_column, n_top, correct (logical), bio_column

Potential CLI outputs: batch_variance.tsv, corrected_embedding.tsv

Potential Shiny inputs: Batch column dropdown, biological column dropdown, n_top slider, correction toggle, before/after scatter panels, R² bar chart

Dependencies: SingleCellExperiment, batchelor, scran, scuttle, scRNAseq (data only)

Raw analysis code

library(scRNAseq)
library(SingleCellExperiment)
library(scuttle)
library(batchelor)
library(scran)
library(ggplot2)
library(patchwork)

set.seed(42)  # For reproducibility of PCA and MNN results

# --- Load both datasets ---
sce_z <- ZeiselBrainData()
sce_t <- TasicBrainData()

# --- QC: remove low-quality cells ---
sce_z <- addPerCellQC(sce_z, subsets = list(Mito = grep("mt-", rownames(sce_z))))
qc_z  <- quickPerCellQC(colData(sce_z), sub.fields = "subsets_Mito_percent")
sce_z <- sce_z[, !qc_z$discard]

sce_t <- addPerCellQC(sce_t, subsets = list(Mito = grep("mt_", rownames(sce_t))))
qc_t  <- quickPerCellQC(colData(sce_t), sub.fields = "subsets_Mito_percent")
sce_t <- sce_t[, !qc_t$discard]

# --- Subset to common genes ---
universe <- intersect(rownames(sce_z), rownames(sce_t))
sce_z    <- sce_z[universe, ]
sce_t    <- sce_t[universe, ]

# --- Cross-batch normalization ---
out   <- multiBatchNorm(sce_z, sce_t)
sce_z <- out[[1]]
sce_t <- out[[2]]

# --- Combine into a single SCE and set batch / cell_type ---
ct_z <- as.character(sce_z$level1class)
ct_t <- as.character(sce_t$broad_type)

# Unify cell-type vocabulary across both datasets
ct_map_z <- c(
    "pyramidal CA1"        = "Excitatory Neuron",
    "pyramidal SS"         = "Excitatory Neuron",
    "interneurons"         = "Inhibitory Neuron",
    "oligodendrocytes"     = "Oligodendrocyte",
    "astrocytes_ependymal" = "Astrocyte",
    "microglia"            = "Microglia",
    "endothelial-mural"    = "Endothelial"
)
ct_map_t <- c(
    "Glutamatergic Neuron" = "Excitatory Neuron",
    "GABA-ergic Neuron"    = "Inhibitory Neuron",
    "Astrocyte"            = "Astrocyte",
    "Oligodendrocyte"      = "Oligodendrocyte",
    "Microglia"            = "Microglia",
    "Endothelial Cell"     = "Endothelial",
    "OPC"                  = "OPC"
)
ct_z <- unname(ct_map_z[ct_z]); ct_z[is.na(ct_z)] <- "Other"
ct_t <- unname(ct_map_t[ct_t]); ct_t[is.na(ct_t)] <- "Other"

altExps(sce_z) <- NULL
altExps(sce_t) <- NULL

colData(sce_z) <- DataFrame(batch = "Zeisel", cell_type = ct_z)
colData(sce_t) <- DataFrame(batch = "Tasic",  cell_type = ct_t)

sce_all   <- cbind(sce_z, sce_t)
batch     <- sce_all$batch
cell_type <- sce_all$cell_type

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- HVG selection: model gene variance per batch, combine, pick top 5000 ---
dec_z        <- modelGeneVar(sce_z)
dec_t        <- modelGeneVar(sce_t)
combined_dec <- combineVar(dec_z, dec_t)
chosen_hvgs  <- getTopHVGs(combined_dec, n = 5000)
cat("HVGs selected:", length(chosen_hvgs), "\n")

# --- PCA (before correction) on log-normalized HVGs ---
mat_all    <- as.matrix(logcounts(sce_all)[chosen_hvgs, ])
pca_before <- prcomp(t(mat_all), scale. = TRUE, center = TRUE)
n_pcs <- 10

# --- Quantify batch effect: R² of each PC ~ batch ---
r2_df <- data.frame(PC = paste0("PC", 1:n_pcs), R2 = NA_real_)
for (i in 1:n_pcs) {
    fit        <- lm(pca_before$x[, i] ~ batch)
    r2_df$R2[i] <- summary(fit)$r.squared
}
print(r2_df)

# --- Batch correction ---
out_mnn        <- fastMNN(sce_all, batch = sce_all$batch, subset.row = chosen_hvgs)
corrected_embed <- reducedDim(out_mnn, "corrected")  # cells × 50 matrix

# --- Quantify residual batch effect in the corrected embedding ---
n_dims      <- min(n_pcs, ncol(corrected_embed))
r2_after_df <- data.frame(dim = paste0("dim", seq_len(n_dims)), R2 = NA_real_)
for (i in seq_len(n_dims)) {
    fit               <- lm(corrected_embed[, i] ~ batch)
    r2_after_df$R2[i] <- summary(fit)$r.squared
}

# Combine before/after R²
r2_compare <- rbind(
    data.frame(dim = r2_df$PC,        R2 = r2_df$R2,       stage = "before"),
    data.frame(dim = r2_after_df$dim, R2 = r2_after_df$R2, stage = "after")
)
# Order dimensions numerically (PC1–PC10 / dim1–dim10) and set stage order
dim_order <- paste0(c(rep("PC", n_pcs), rep("dim", n_dims)),
                    c(seq_len(n_pcs), seq_len(n_dims)))
dim_order <- unique(dim_order[order(as.integer(sub("\\D+", "", dim_order)))])
r2_compare$dim   <- factor(r2_compare$dim,   levels = dim_order)
r2_compare$stage <- factor(r2_compare$stage, levels = c("before", "after"))
print(r2_compare)

# --- Before/after scatter plots ---
df_before <- data.frame(x = pca_before$x[, 1], y = pca_before$x[, 2],
                         batch = batch, cell_type = cell_type)
df_after  <- data.frame(x = corrected_embed[, 1], y = corrected_embed[, 2],
                         batch = batch, cell_type = cell_type)

# Row 1: colored by batch — reveals the technical confound
p1 <- ggplot(df_before, aes(x = x, y = y, color = batch)) +
    geom_point(size = 0.8, alpha = 0.5) +
    theme_bw(base_size = 12) +
    labs(x = "PC1", y = "PC2", title = "Before — colored by batch",
         subtitle = "Cells separate by technology")
p2 <- ggplot(df_after, aes(x = x, y = y, color = batch)) +
    geom_point(size = 0.8, alpha = 0.5) +
    theme_bw(base_size = 12) +
    labs(x = "MNN dim 1", y = "MNN dim 2", title = "After — colored by batch",
         subtitle = "Batch signal is reduced")

# Row 2: colored by cell type — shows the biological signal that emerges
p3 <- ggplot(df_before, aes(x = x, y = y, color = cell_type)) +
    geom_point(size = 0.8, alpha = 0.5) +
    theme_bw(base_size = 12) +
    labs(x = "PC1", y = "PC2", title = "Before — colored by cell type",
         subtitle = "Cell types mixed across batch clusters")
p4 <- ggplot(df_after, aes(x = x, y = y, color = cell_type)) +
    geom_point(size = 0.8, alpha = 0.5) +
    theme_bw(base_size = 12) +
    labs(x = "MNN dim 1", y = "MNN dim 2", title = "After — colored by cell type",
         subtitle = "Cell-type signal dominates")

# Row 3: R² bar chart
r2_compare$dim_num <- as.integer(sub("\\D+", "", as.character(r2_compare$dim)))

p_r2 <- ggplot(r2_compare, aes(x = dim_num, y = R2, fill = stage)) +
    geom_col(position = position_dodge(0.8), width = 0.7) +
    scale_x_continuous(breaks = seq_len(n_pcs), labels = seq_len(n_pcs)) +
    scale_fill_manual(values = c(before = "#fcc100", after = "#00b5dd")) +
    theme_bw(base_size = 12) +
    labs(x = "Dimension", y = expression(R^2 ~ "(batch)"),
         fill = NULL,
         title = "Batch Variance Before vs After fastMNN")

(p1 + p2) / (p3 + p4) / p_r2

# --- Export ---
output_dir <- file.path(tempdir(), "batch_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(r2_compare, file.path(output_dir, "batch_variance.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

corrected_out           <- as.data.frame(corrected_embed)
corrected_out$cell      <- colnames(sce_all)
corrected_out$batch     <- batch
corrected_out$cell_type <- cell_type
write.table(corrected_out, file.path(output_dir, "corrected_embedding.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Project 12: Marker Gene Identification

Rationale: After clustering samples in scRNA-seq data, identifying the genes that best distinguish each cluster from the others. One-vs-rest testing produces ranked marker lists that help you assign a cell type or state to each cluster (e.g. T cells, monocytes, B cells, etc). This task often requires manual review, expert knowledge, and some literature searching to validate the markers against known biology. The dot plot (size = detection rate, color = mean expression) is a common way to summarize marker expression patterns across clusters or cell types.

This code takes a few minutes to run, single cell analysis is computationally expensive.

Dataset: Baron Mouse Pancreas (Mouse scRNA-seq)

The Baron et al. (2016) mouse pancreatic islet dataset contains ~1,886 cells across 9 clearly defined cell types: alpha, beta, delta, gamma (PP), acinar, ductal, stellate (activated and quiescent), immune, and Schwann cells. Each cell type has well-known marker genes (e.g., Ins1/Ins2 for beta cells, Gcg for alpha cells), enabling a sanity check for what’s returned.

Note: this is the mouse version — Project 4 uses the human dataset from the same study, so code cannot be directly shared.

Data preparation code

## data-raw/example_sce.R
BiocManager::install("scRNAseq")
library(scRNAseq)
library(SingleCellExperiment)

sce <- BaronPancreasData("mouse")
example_sce <- sce

assays(example_sce) <- list(counts = counts(example_sce))
colData(example_sce) <- colData(example_sce)[, "label", drop = FALSE]
colnames(colData(example_sce)) <- "cell_type"

usethis::use_data(example_sce, overwrite = TRUE)

Data structure: SingleCellExperiment

What users should be able to do:

Filter out low-expression genes
Perform one-vs-rest statistical testing per cell type: compute log2 fold changes, Wilcoxon p-values, BH-adjusted p-values, and rank genes per group
Summarize the top N markers per group and compute overlap counts between groups
Visualize markers as a dot plot: genes on y-axis, groups on x-axis, dot size encodes detection rate, dot color encodes mean expression
Export marker gene lists and summaries to TSV files

Key parameters: group_column, n_markers, fc_threshold, detection_threshold

Potential CLI outputs: marker_genes.tsv, marker_summary.tsv

Potential Shiny inputs: Grouping variable dropdown, n_markers slider, FC threshold slider, detection threshold slider, dot plot + filterable gene table

Dependencies: SingleCellExperiment, scuttle, scRNAseq (data only)

Raw analysis code

library(SingleCellExperiment)
library(scRNAseq)
library(scuttle)
library(ggplot2)

# --- Load data ---
sce <- BaronPancreasData("mouse")
counts_mat <- counts(sce)
cell_type <- colData(sce)$label

# ============================================================================
# Everything above is upstream data preparation — belongs in data-raw/.
# Everything below is core analysis — should be functionized in R/.
# ============================================================================

# --- Filter genes detected in < 5% of cells ---
detect_rate_all <- rowSums(counts_mat > 0) / ncol(counts_mat)
keep_genes <- detect_rate_all >= 0.05
sce <- sce[keep_genes, ]
counts_mat <- counts(sce)
cat("Genes after filtering:", nrow(sce), "\n")

# --- Log-normalize using scuttle ---
sce <- logNormCounts(sce)
mat_norm <- as.matrix(logcounts(sce))

# --- One-vs-rest Wilcoxon testing per cell type ---
cell_types <- unique(cell_type)
n_markers <- 5

marker_list <- lapply(cell_types, function(ct) {
    is_target <- cell_type == ct
    # Log2 fold change: mean(target) - mean(rest)
    mean_target <- rowMeans(mat_norm[, is_target, drop = FALSE])
    mean_rest   <- rowMeans(mat_norm[, !is_target, drop = FALSE])
    log2fc <- mean_target - mean_rest

    # Detection rate in target vs rest
    detect_target <- rowSums(counts_mat[, is_target, drop = FALSE] > 0) / sum(is_target)
    detect_rest   <- rowSums(counts_mat[, !is_target, drop = FALSE] > 0) / sum(!is_target)

    # Wilcoxon rank-sum test (subset of genes for speed)
    pvals <- sapply(seq_len(nrow(mat_norm)), function(i) {
        wilcox.test(mat_norm[i, is_target], mat_norm[i, !is_target],
                    alternative = "greater")$p.value
    })

    data.frame(
        gene         = rownames(mat_norm),
        cell_type    = ct,
        log2fc       = log2fc,
        detect_target = detect_target,
        detect_rest  = detect_rest,
        pvalue       = pvals,
        padj         = p.adjust(pvals, method = "BH"),
        stringsAsFactors = FALSE
    )
})
all_markers <- do.call(rbind, marker_list)

# --- Select top N markers per cell type ---
top_markers <- do.call(rbind, lapply(cell_types, function(ct) {
    sub <- all_markers[all_markers$cell_type == ct, ]
    sub <- sub[order(sub$log2fc, decreasing = TRUE), ]
    head(sub, n_markers)
}))

# --- Summary: overlap counts between groups ---
marker_summary <- data.frame(
    cell_type   = cell_types,
    n_sig       = sapply(cell_types, function(ct) {
        sum(all_markers$cell_type == ct & all_markers$padj < 0.05 &
            all_markers$log2fc > 1)
    }),
    top_marker  = sapply(cell_types, function(ct) {
        sub <- all_markers[all_markers$cell_type == ct, ]
        sub$gene[which.max(sub$log2fc)]
    })
)

# --- Dot plot ---
# gene ordering: use the unique marker genes in their natural ranking order
gene_levels <- unique(as.character(top_markers$gene))
ct_levels   <- as.character(cell_types)

# Pull every (gene x cell_type) row from all_markers for the selected genes
# detect_target here is the detection rate of that gene *within* that cell type
plot_df <- all_markers[all_markers$gene %in% gene_levels, ]

# Add mean expression per (gene, cell_type) combination
plot_df$mean_expr <- mapply(function(g, ct) {
    mean(as.numeric(mat_norm[g, cell_type == ct]))
}, plot_df$gene, plot_df$cell_type)

plot_df$gene      <- factor(plot_df$gene,      levels = rev(gene_levels))
plot_df$cell_type <- factor(plot_df$cell_type, levels = ct_levels)

p <- ggplot(plot_df, aes(x = cell_type, y = gene,
                          size = detect_target, color = mean_expr)) +
    geom_point() +
    scale_size_continuous(range = c(1, 6), name = "Detection Rate") +
    scale_color_viridis_c(name = "Mean Expression") +
    theme_bw(base_size = 12) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Top Marker Genes per Cell Type")
print(p)

# --- Export ---
output_dir <- file.path(tempdir(), "marker_output")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

write.table(all_markers, file.path(output_dir, "marker_genes.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(marker_summary, file.path(output_dir, "marker_summary.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

list.files(output_dir)

Quick Comparison

#	Project	Dataset	Organism	Structure	Primary Plot	Extra Deps
1	UMAP Embedding	Zeisel brain	Mouse	SCE	2D scatter	uwot
2	Sample Similarity	Macrophage stimulation	Human	SE	heatmap + dendrogram	—
3	Differential Expression	Tissue Tregs (Treg vs Tconv)	Mouse	SE	volcano / MA	DESeq2
4	K-means Clustering	Baron human pancreas	Human	SCE	scatter + elbow	cluster
5	Cell QC Dashboard	Lun spike-in	Mouse	SCE	multi-panel	—
6	Gene Set Scoring	airway + MSigDB	Human	SE	boxplot / violin	msigdbr, airway
7	Normalization	Muraro pancreas	Human	SCE	density / boxplot	—
8	Gene Corr Network	GTEx skeletal muscle	Human	SE	network / heatmap	igraph (opt.), recount3
9	Expression Heatmap	Fission yeast stress response	S. pombe	SE	genes × samples heatmap	ComplexHeatmap, fission
10	Dim Estimation	PBMC 3k	Human	SCE	scree + overlays	—
11	Batch Effect	Tasic + Zeisel brain	Mouse	SCE	before/after scatter + R²	batchelor
12	Marker Genes	Baron mouse pancreas	Mouse	SCE	dot plot	—

12 unique datasets — all human or mouse, all RNA-seq or scRNA-seq
8 single-cell (SCE) + 4 bulk (SE) projects
No two projects share both the same computation and visualization

Custom Projects

You are welcome to propose your own project of similar scope. To be approved, your proposal must include:

One paragraph describing the biological or analytical rationale
A dataset with a data-raw/ script showing how to prepare the example data
A capability list describing what users should be able to do with the package (4–6 bullet points)
At least one non-trivial visualization (not just a base R plot() call)
Potential CLI outputs — what TSV/plot files will be produced?
Potential Shiny inputs — what will the user be able to adjust?

Submit your proposal as a GitHub Issue or by email before starting implementation. The instructor will respond within 48 hours with approval or suggested modifications.

Ground rules for custom projects:

Must operate on SummarizedExperiment or SingleCellExperiment input
Must use a real dataset; RNA-seq or scRNA-seq preferred
Must produce at least one ggplot-based visualization
Must not duplicate an existing project from the list above
Scope should be comparable

Jared Andrews

2026-04-21

Overview

How to Read Each Project Description

Project 0: PCA Explorer (Reference Implementation)

Dataset: Simulated RNA-seq

Project 1: UMAP Embedding Explorer

Dataset: Zeisel Brain (Mouse scRNA-seq)

Project 2: Sample Similarity & Clustering

Dataset: Macrophage Stimulation (Human bulk RNA-seq)

Project 3: Differential Expression with DESeq2

Dataset: Tissue-Resident Regulatory T Cells (Mouse bulk RNA-seq)

Project 4: K-means Cell Clustering

Dataset: Baron Human Pancreas (scRNA-seq)

Project 5: Cell QC Dashboard

Dataset: Lun Spike-In (Mouse scRNA-seq)

Project 6: Gene Set Scoring

Dataset: Airway (Human bulk RNA-seq) + MSigDB Hallmark Gene Sets

Project 7: Gene Expression Profiler

Dataset: Muraro Human Pancreas (Human scRNA-seq)

Project 8: Gene Correlation Network

Dataset: GTEx Skeletal Muscle (Human bulk RNA-seq)

Project 9: Expression Heatmap Builder

Dataset: Fission yeast stress response (S. pombe RNA-seq)

Project 10: Dimensionality Estimation Tool

Dataset: PBMC 3k (Human scRNA-seq)

Project 11: Batch Effect Assessment

Dataset: Tasic + Zeisel Mouse Brain (Mouse scRNA-seq, two technologies)

Project 12: Marker Gene Identification

Dataset: Baron Mouse Pancreas (Mouse scRNA-seq)

Quick Comparison

Custom Projects