TandemTwister is a user-friendly tool for genotyping tandem repeats that can handle long-read data from various technologies — like CLR, CCS, and ONT — as well as Somatic and aligned genomes as input.
Versatile Compatibility: TandemTwister supports long-read sequencing data from CLR, CCS, and ONT technologies, ensuring adaptability to diverse genomic datasets.
Phasing Capabilities: The tool incorporates phasing algorithm, by leveraging distinctive features within TR regions.
Noise Correction for Short Motifs: TandemTwister includes specialized correction mechanisms for short motifs (≤3) in CLR and ONT reads, ensuring robust and accurate genotyping results in the presence of noisy data.
Speed and Scalability: Optimized for efficiency, TandemTwister supports multi-processing and can complete genotyping analyses for approximately 1.2 Mio regions in under 20 minutes using 32 threads.
TandemTwister comes with a companion visualization tool, ProleTRact, which enables interactive exploration and visualization of genotyped tandem repeats. After running TandemTwister, you can use PRoleTRact to visulize the regions you’re analyzing.
For comprehensive tandem repeat analyses, the following resources provide curated TR catalogs and annotations:
TRcompDB v1.0: A global reference of tandem repeat variation with motif sets and TR annotations from 360 long-read assemblies. Available for GRCh38 and T2T-CHM13 reference genomes in VCF and TSV formats.
STRchive: A catalog of 75 disease-associated tandem repeat loci with detailed annotations including motif sequences, genomic positions, and disease associations. Compatible with multiple TR genotyping tools and available for hg19, hg38, and T2T-CHM13.
Project Adotto Tandem-Repeat Regions: A catalog of tandem-repeat regions in human genomes, provided as BED format files with annotations for tandem repeat analysis.
TandemTwister can be installed via conda/bioconda (recommended) or built from source.
The easiest way to install TandemTwister is through conda/bioconda:
mamba install bioconda::tandemtwister
Follow these steps to build TandemTwister from source.
git clone https://github.com/Lionward/TandemTwist.git
cd TandemTwister
mamba create -n TandemTwist
mamba activate TandemTwist
Please ensure that you have these tools installed and available in your PATH before proceeding with the build process.
Tip: All dependencies can be installed using mamba for speed, but regular conda also works.
Note:
Before building TandemTwister, please ensure all required tools are installed and available in yourPATH.mamba install -c conda-forge libdeflate=1.21 mamba install bioconda::htslib=1.22.1 mamba install mlpack=4.5.0 mamba install make=4.4.1 mamba install gxx=14.3.0 mamba install cereal=1.3.2 mamba install spdlog=1.15.3
To install TandemTwister in /usr/local/bin:
make install
If you only want to build the executable in the current directory, just use:
make
Run tandemtwister from an activated environment using the command-first interface:
./tandemtwister [global options] <command> [command options]
Commands
germline – Genotype germline tandem repeats from long-read alignments.somatic – Profile somatic tandem-repeat expansions from long-read alignments.assembly – Genotype tandem repeats from aligned genome or assembly input.germline / somatic / assembly: Selects the analysis workflow.⚠️ Warning: Somatic mode is still experimental and has not been fully tested. Use with caution.
-b, --bam Path to the BAM file of the aligned reads to the reference genome.-r, --ref Path to the input reference file (e.g., .fa/.fna).-m, --motif_file Path to the file containing reference coordinates and motif sequence (BED/TSV/CSV).-o, --output_file Output file containing region, motif, hap1 and hap2 copy numbers.-s, --sex Sample sex (0 = female, 1 = male).-sn, --sample Name of the sample.-rt, --reads_type Type of reads (Default: CCS).-bt, --bam_type Type of BAM file (e.g., reads or assembly).-t, --threads: Number of threads to use (Default: 1)-v, --verbose Verbosity level (0 = error, 1 = critical, 2 = info, 3 = debug).-h, --help Display global help (or command-specific help when issued after a command).--version Print version information and exit.-mml, --min_match_ratio_l: Minimum match ratio for long motifs (Default: 0.5)-pad, --padding: Padding around the STR region to extract reads (Default: 0)-kcr, --keepCutReads: Keep cut reads (Default: false)-minR, --minReadsInRegion: Minimum number of reads that should span the region (Default: 2)-btg, --bamIsTagged: Reads in BAM are phased (Default: false)-qs, --quality_score: Minimum quality score for a read to be considered (Default: 10, Max: 60)Read-based Correction Parameters
-cor, --correct: Perform genotype calling correction based on the interval-based consensus from sequencing reads (CCS Default: false, CLR/ONT Default: true)-crs, --consensus_ratio_str: Minimum fraction of reads in a cluster required for a consensus call in STR regions (Default: 0.3)-crv, --consensus_ratio_vntr: Minimum fraction of reads in a cluster required for a consensus call in VNTR regions (Default: 0.3)-roz, --removeOutliersZscore: Remove outlier reads for phasing based on Z-score (Default: false)Reference-based Correction Parameters
-rtr, --refineTrRegions: Refine the coordinates of tandem repeat regions using the reference genome (Default: false)-tanCon, --tandem_run_threshold: Maximum number of bases for merging tandem-repeat runs during reference-based refinement (Default: 2 × motif size)-seps, --start_eps_str: Start radian for clustering in STR regions (Default: 0.2)-sepv, --start_eps_vntr: Start radian for clustering in VNTR regions (Default: 0.2)-minPF, --minPts_frac: Min fraction of reads that should be in one cluster (Default: 0.12)-nls, --noise_limit_str: Noise limit for clustering in STR regions (Default: 0.2)-nlv, --noise_limit_vntr: Noise limit for clustering in VNTR regions (Default: 0.35)-ci, --cluster_iter: Number of iterations for clustering (Default: 20)tandemtwister --help: Show the global overview with available commands and options.tandemtwister <command> --help: Show command-specific options (e.g., tandemtwister germline --help).
tandemtwister germline \
-b sample.bam \
-m motifs.bed \
-r reference.fna \
-o output.txt \
-s 1 \
-sn SampleName \
-rt CCS \
-t 4
| Field | Type | Description |
|---|---|---|
REF_SPAN |
INFO | Span intervals of the tandem repeat (TR) on the reference sequence. |
MOTIF_IDs_REF |
INFO | Motif IDs for the reference sequence, representing identifiers for each unique motif in the reference. |
CN_ref |
INFO | Number of repeat units (copy number) for the tandem repeat region in the reference sequence. |
CN |
FORMAT | Copy number of the TR for each called allele in the sample. |
MI |
FORMAT | Motif IDs for the haplotype(s) of each allele in the sample. |
SP |
FORMAT | Span of the TR for each allele. |
DP |
FORMAT | Number of reads supporting each allele. |
GT |
FORMAT | Genotype; indicates which alleles are present for this sample (e.g., 0/1). |
Below is an example of the output in VCF format:
chr1 60637 chr1:60636-60665 ATTGTAAAGTCAAACAATTATAAGTCAAAC ATTGTAAAGTCAAACAATTATAAGTCAAAC,ATTGTAAAGTCAAACAATTATAAGTCAAAC 1 PASS TR_type=VNTR;MOTIFS=AATTATAAGTCAAA,AATTATAAGTCAAAC,AATTGTAAGTCAAAC,ATTGTAAAGTCAAAC,TTGTAAAGTCAAAC;UNIT_LENGTH_AVG=14;MOTIF_IDs_REF=3_1;REF_SPAN=(1-15)_(16-30);CN_ref=2 GT:CN:MI:DP:SP 0/0:2,2:3_1:31:(1-15)_(16-30)
The test data is available in the test_data folder. The test data is in the form of a bam file, a reference file, and a motif file. The motif file is in the form of a bed file. The test data can be used to test the TandemTwister tool.
Run the following command to test the tool:
make test
We welcome contributions from the community! If you find any issues or have suggestions for improvement, please open an issue or create a pull request.
If you use TandemTwister in your research or analysis, please cite our work as follows:
TandemTwister: A fast accurate tool for genotyping tandem repeats
Al Raei L, Ghareghani M GitHub Repository
(Manuscript in preparation)
Once our manuscript is published, we will update this section with the official citation.
Thank you for acknowledging TandemTwister in your work!
Implementation of a Lookup Table for ONT Input Acceleration: Integrate a lookup table for ONT input to enhance processing speed, optimizing the tool’s performance.
Inclusion of Methylation Information: Integrate methylation information into the analysis, providing users with additional insights into the epigenetic characteristics of the tandem repeats.
Add Trio-analysis mode for better genotyping results in Trio samples.
We would like to express our appreciation to the IT team at Max Planck Institute for Molecular Genetics for their support with technical aspects related to this project.