Long-read RNA sequencing technologies offer unparalleled insights into transcriptomes by enabling full-length sequencing of RNA molecules, uncovering novel isoforms and alternative splicing events. Although long-read sequencing platforms, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have historically been associated with higher error rates than short-read sequencing, recent advances in both platforms have significantly improved read accuracy, broadening their applicability for transcriptomic studies.
With the rapid evolution of sequencing protocols and bioinformatics tools, the trade-offs between sequencing throughput, read length, accuracy, and cost make it challenging to select the optimal approach. Systematic benchmarking studies that compare these options are crucial to inform future research directions. However, many existing benchmarking datasets with matched data across multiple platforms have limitations, including (1) a lack of realistic biological replicates, which may restrict the generalisability of differential analysis results to real-world scenarios, and (2) the use of earlier sequencing kits, which may not reflect the latest advances in sequencing technology and so limits their relevance for future studies that typically use newer sequencing protocols.
To address these gaps, we present LongBench2, a comprehensive benchmarking dataset. Derived from eight lung cancer cell lines with synthetic RNA spike-ins, LongBench2 includes bulk, single-cell, and single-nucleus RNA-seq data from three state-of-the-art long-read sequencing platforms (ONT PCR-cDNA, ONT direct RNA sequencing, and PacBio Kinnex), alongside Illumina short-read data for robust cross-platform comparisons. The LongBench2 dataset is a valuable resource for benchmarking and improving sequencing protocols and bioinformatics tools. Using this dataset, we present a systematic evaluation of transcript capture, quantification, and differential expression analyses, examining the strengths and limitations of each sequencing platform in various biological contexts and enabling researchers to make more informed decisions on platform and method selection.