tandemtwister


grafik


TandemTwister is a fast tool for tandem repeat genotyping!

Conda Version

· Report Bug · Request Feature

Table of Contents

Introducing TandemTwister

TandemTwister is a user-friendly tool for genotyping tandem repeats that can handle long-read data from various technologies — like CLR, CCS, and ONT — as well as Somatic and aligned genomes as input.

Key features

  1. Versatile Compatibility: TandemTwister supports long-read sequencing data from CLR, CCS, and ONT technologies, ensuring adaptability to diverse genomic datasets.

  2. Phasing Capabilities: The tool incorporates phasing algorithm, by leveraging distinctive features within TR regions.

  3. Noise Correction for Short Motifs: TandemTwister includes specialized correction mechanisms for short motifs (≤3) in CLR and ONT reads, ensuring robust and accurate genotyping results in the presence of noisy data.

  4. Speed and Scalability: Optimized for efficiency, TandemTwister supports multi-processing and can complete genotyping analyses for approximately 1.2 Mio regions in under 20 minutes using 32 threads.

    Visualization Tool: ProleTRact

TandemTwister comes with a companion visualization tool, ProleTRact, which enables interactive exploration and visualization of genotyped tandem repeats. After running TandemTwister, you can use PRoleTRact to visulize the regions you’re analyzing.

Tandem Repeat Catalogs

For comprehensive tandem repeat analyses, the following resources provide curated TR catalogs and annotations:

  1. TRcompDB v1.0: A global reference of tandem repeat variation with motif sets and TR annotations from 360 long-read assemblies. Available for GRCh38 and T2T-CHM13 reference genomes in VCF and TSV formats.

  2. STRchive: A catalog of 75 disease-associated tandem repeat loci with detailed annotations including motif sequences, genomic positions, and disease associations. Compatible with multiple TR genotyping tools and available for hg19, hg38, and T2T-CHM13.

  3. Project Adotto Tandem-Repeat Regions: A catalog of tandem-repeat regions in human genomes, provided as BED format files with annotations for tandem repeat analysis.

Installation

TandemTwister can be installed via conda/bioconda (recommended) or built from source.

The easiest way to install TandemTwister is through conda/bioconda:

mamba install bioconda::tandemtwister

Option 2: Build from Source

Follow these steps to build TandemTwister from source.

1. Clone the repository

git clone https://github.com/Lionward/TandemTwist.git
cd TandemTwister
mamba create -n TandemTwist
mamba activate TandemTwist

3. Install dependencies

Please ensure that you have these tools installed and available in your PATH before proceeding with the build process.

Tip: All dependencies can be installed using mamba for speed, but regular conda also works.

Note:
Before building TandemTwister, please ensure all required tools are installed and available in yourPATH.

mamba install -c conda-forge libdeflate=1.21
mamba install bioconda::htslib=1.22.1
mamba install mlpack=4.5.0
mamba install make=4.4.1
mamba install gxx=14.3.0
mamba install cereal=1.3.2
mamba install spdlog=1.15.3

4. Build and install TandemTwister

To install TandemTwister in /usr/local/bin:

make install

If you only want to build the executable in the current directory, just use:

make

Usage

Run tandemtwister from an activated environment using the command-first interface:

./tandemtwister [global options] <command> [command options]

Commands

Required Input

  1. Command
    • germline / somatic / assembly: Selects the analysis workflow.
    • ⚠️ Warning: Somatic mode is still experimental and has not been fully tested. Use with caution.

  2. Arguments
    • -b, --bam   Path to the BAM file of the aligned reads to the reference genome.
    • -r, --ref   Path to the input reference file (e.g., .fa/.fna).
    • -m, --motif_file   Path to the file containing reference coordinates and motif sequence (BED/TSV/CSV).
    • -o, --output_file   Output file containing region, motif, hap1 and hap2 copy numbers.
    • -s, --sex   Sample sex (0 = female, 1 = male).
    • -sn, --sample   Name of the sample.
    • -rt, --reads_type   Type of reads (Default: CCS).
    • -bt, --bam_type   Type of BAM file (e.g., reads or assembly).
    • -t, --threads: Number of threads to use (Default: 1)

Global Options

Dynamic Programming Alignment Parameters

Germline & Somatic Analysis Options

Read Extraction Parameters

Correction Parameters

Read-based Correction Parameters

Reference-based Correction Parameters

Clustering Parameters

Help

Example


tandemtwister germline \
  -b sample.bam \
  -m motifs.bed \
  -r reference.fna \
  -o output.txt \
  -s 1 \
  -sn SampleName \
  -rt CCS \
  -t 4

VCF INFO and FORMAT Field Descriptions

Field Type Description
REF_SPAN INFO Span intervals of the tandem repeat (TR) on the reference sequence.
MOTIF_IDs_REF INFO Motif IDs for the reference sequence, representing identifiers for each unique motif in the reference.
CN_ref INFO Number of repeat units (copy number) for the tandem repeat region in the reference sequence.
CN FORMAT Copy number of the TR for each called allele in the sample.
MI FORMAT Motif IDs for the haplotype(s) of each allele in the sample.
SP FORMAT Span of the TR for each allele.
DP FORMAT Number of reads supporting each allele.
GT FORMAT Genotype; indicates which alleles are present for this sample (e.g., 0/1).

Example Output

Below is an example of the output in VCF format:


chr1    60637   chr1:60636-60665        ATTGTAAAGTCAAACAATTATAAGTCAAAC  ATTGTAAAGTCAAACAATTATAAGTCAAAC,ATTGTAAAGTCAAACAATTATAAGTCAAAC   1       PASS    TR_type=VNTR;MOTIFS=AATTATAAGTCAAA,AATTATAAGTCAAAC,AATTGTAAGTCAAAC,ATTGTAAAGTCAAAC,TTGTAAAGTCAAAC;UNIT_LENGTH_AVG=14;MOTIF_IDs_REF=3_1;REF_SPAN=(1-15)_(16-30);CN_ref=2 GT:CN:MI:DP:SP  0/0:2,2:3_1:31:(1-15)_(16-30)

Test data

The test data is available in the test_data folder. The test data is in the form of a bam file, a reference file, and a motif file. The motif file is in the form of a bed file. The test data can be used to test the TandemTwister tool.

Run the following command to test the tool:

make test

Contributions

We welcome contributions from the community! If you find any issues or have suggestions for improvement, please open an issue or create a pull request.

Citation

If you use TandemTwister in your research or analysis, please cite our work as follows:

TandemTwister: A fast accurate tool for genotyping tandem repeats
Al Raei L, Ghareghani M GitHub Repository
(Manuscript in preparation)

Once our manuscript is published, we will update this section with the official citation.

Thank you for acknowledging TandemTwister in your work!

Upcoming Features

  1. Implementation of a Lookup Table for ONT Input Acceleration: Integrate a lookup table for ONT input to enhance processing speed, optimizing the tool’s performance.

  2. Inclusion of Methylation Information: Integrate methylation information into the analysis, providing users with additional insights into the epigenetic characteristics of the tandem repeats.

  3. Add Trio-analysis mode for better genotyping results in Trio samples.

Acknowledgements

We would like to express our appreciation to the IT team at Max Planck Institute for Molecular Genetics for their support with technical aspects related to this project.