CAGEcleaner
Welcome to CAGEcleaner’s documentation!
CAGEcleaner: A tool to reduce redundancy in gene mining hit sets.
CAGEcleaner reduces redundancy in gene cluster hit sets, easing downstream analyses and visualisation. It features a taxonomically conservative dereplication mode that acts at the full genome level, and a more aggressive mode that acts at the level of the genomic neighbourhood of the cluster. In addition, it prevents clusters from being discarded if they show remarkable diversity in gene cluster contents or homology scores. Sessions filtered by CAGEcleaner can be plugged back into the cblaster workflow.
In full genome mode, CAGEcleaner retrieves the full genome assemblies of the clusters’ host genomes, performs a fast ANI-based full genome dereplication using skDER, and only keeps clusters that were part of the retained genomes.
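As a rough illustration of the last step in full genome mode, the sketch below keeps only the clusters whose host assembly survived genome dereplication. All names here (ClusterHit, keep_clusters_from_retained_genomes, the placeholder accessions) are hypothetical and are not part of CAGEcleaner's API; the set of retained assemblies is assumed to come from the dereplication step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterHit:
    """A gene cluster hit and the assembly it was found in (hypothetical structure)."""
    cluster_id: str
    assembly: str  # placeholder accession of the host genome assembly


def keep_clusters_from_retained_genomes(hits, retained_assemblies):
    """Return the hits whose host assembly is among the retained genomes."""
    retained = set(retained_assemblies)
    return [hit for hit in hits if hit.assembly in retained]


# Toy data: two clusters in assembly A, one in assembly B.
hits = [
    ClusterHit("cluster_1", "ASSEMBLY_A"),
    ClusterHit("cluster_2", "ASSEMBLY_B"),
    ClusterHit("cluster_3", "ASSEMBLY_A"),
]

# Assume dereplication kept assembly A as the representative genome.
kept = keep_clusters_from_retained_genomes(hits, {"ASSEMBLY_A"})
print([hit.cluster_id for hit in kept])
```

Note that every cluster in a retained genome is kept, which is why this mode is the more conservative of the two.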
In region mode, CAGEcleaner retrieves the nucleotide sequence of each cluster with an optional sequence margin on both sides, dereplicates these sequences using MMseqs2, and only keeps clusters that are part of a representative region.
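To make the idea of a sequence margin concrete, the sketch below extends a cluster's coordinates by a margin on both sides and clamps the result to the contig boundaries. This is a minimal sketch and not CAGEcleaner's implementation: the function name region_with_margin and the 0-based, half-open coordinate convention are assumptions made for illustration.

```python
def region_with_margin(start, end, margin, contig_length):
    """Extend the region [start, end) by `margin` bases on both sides.

    Coordinates are assumed to be 0-based and half-open. The extended
    region is clamped so that it never runs past either end of the contig.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    if not (0 <= start < end <= contig_length):
        raise ValueError("region must lie within the contig")
    return max(0, start - margin), min(contig_length, end + margin)


# A cluster in the middle of a contig gets the full margin on both sides.
print(region_with_margin(5000, 8000, 1000, 20000))

# A cluster near the contig edges is truncated at the boundaries.
print(region_with_margin(500, 2000, 1000, 2500))
```

Clamping matters for clusters close to a contig edge, where less flanking sequence is available than the requested margin.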
CAGEcleaner integrates seamlessly with cblaster, as it was originally developed as an auxiliary tool for cblaster. Other inputs are now also supported via the helper tool cagecleaner-generate-session, which builds a session from formatted TSV files.
If you find CAGEcleaner useful, please cite:
De Vrieze, L., Biltjes, M., Lukashevich, S., Tsurumi, K., Masschelein, J. (2025).
CAGEcleaner: reducing genomic redundancy in gene cluster mining. Bioinformatics
https://doi.org/10.1093/bioinformatics/btaf373
User Guide
Comprehensive documentation for the full API exposed by CAGEcleaner:
API Reference