CAGEcleaner

main

cagecleaner.main.create_parser() Namespace[source]

This function creates a parser object that will collect the arguments given through the command line.

Parameters:

None

Returns:

An ArgumentParser object holding the CLI ready to collect the arguments when called

Return type:

parser (argparse.ArgumentParser)

Note

Also configures the logger.

cagecleaner.main.main()[source]
cagecleaner.main.setup_logging(verbosity: int) None[source]

Set up the root logger if it has not been set up yet.

Parameters:

verbosity (int) – Verbosity level (choices: 0,1,2,3,4).

Returns:

None
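The "only if it has not been set up yet" behaviour can be sketched as follows. This is a minimal illustration, not the CAGEcleaner implementation: the function name, the optional logger argument and the mapping from verbosity to logging level are all assumptions.

```python
import logging
from typing import Optional

# Assumed mapping from verbosity 0-4 to standard logging levels (illustrative only).
ASSUMED_LEVELS = {
    0: logging.CRITICAL,
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
}


def setup_logging_sketch(verbosity: int, logger: Optional[logging.Logger] = None) -> None:
    """Configure the given logger (root by default) unless it already has handlers."""
    target = logger if logger is not None else logging.getLogger()
    if target.handlers:
        # Already configured: leave the existing setup untouched.
        return
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    target.addHandler(handler)
    target.setLevel(ASSUMED_LEVELS[verbosity])
```

Calling the function a second time is a no-op, which is the property the description relies on.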

run

class cagecleaner.run.Run(parsed_args)[source]

Bases: ABC

Abstract base class orchestrating the complete CAGEcleaner dereplication workflow.

Handles which workflows to call for dereplicating genome mining hits across all search modes (local/remote sources, genome/region-based dereplication). In all workflows, all hits, their metadata and their dereplication status are recorded in a cblaster-style binary table extended with additional columns.

The typical high-level workflow is:

  1. Parse input session and create binary table

  2. Dereplicate sequences (implemented by subclasses)

  3. Map dereplication results to binary table

  4. Recover hits by score or content diversity

  5. Filter session and generate outputs

Note

This is the abstract grandparent class with globally shared methods. Check out parent classes (GenomeRun, RegionRun, RemoteRun, LocalRun) for partially shared methods. Subclasses (LocalGenomeRun, RemoteGenomeRun, LocalRegionRun, RemoteRegionRun) provide concrete implementations and inherit from these parent classes through multiple inheritance.

See also

LocalGenomeRun: Full-genome dereplication for hits in local sequences. RemoteGenomeRun: Full-genome dereplication for hits in remote sequences. LocalRegionRun: Region-based dereplication for hits in local sequences. RemoteRegionRun: Region-based dereplication for hits in remote sequences.
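The class layout described in the note can be sketched with stripped-down stand-ins. The classes below reuse the documented names but contain only placeholder bodies; they are an illustration of the inheritance pattern, not the real classes.

```python
from abc import ABC, abstractmethod


class Run(ABC):
    """Stand-in for the abstract grandparent class."""

    @abstractmethod
    def join_dereplication_with_binary(self) -> None: ...

    @abstractmethod
    def run(self) -> None: ...


class LocalRun(Run):
    """Stand-in parent: would hold local file handling. Still abstract."""


class GenomeRun(Run):
    """Stand-in parent: would hold whole-genome dereplication. Still abstract."""

    def dereplicate_genomes(self) -> str:
        return "dereplicated"  # placeholder result


class LocalGenomeRun(LocalRun, GenomeRun):
    """Concrete class: implements both abstract methods."""

    def join_dereplication_with_binary(self) -> None:
        self.joined = True

    def run(self) -> None:
        self.result = self.dereplicate_genomes()
        self.join_dereplication_with_binary()


# Method resolution order of the concrete class.
mro_names = [cls.__name__ for cls in LocalGenomeRun.__mro__]
```

Because the parent classes do not override the abstract methods, only the concrete subclass can be instantiated.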

filter_session() None[source]

Filter the original cblaster session to retain only dereplicated hits.

Removes every scaffold whose hits were all marked as ‘redundant’ during dereplication from the original cblaster session object. This produces a cleaned session containing only representative hits.

The filtering operates at two levels:

  1. Scaffold level: Remove scaffolds without hits in the dereplicated set

  2. Organism level: Remove organisms with no remaining scaffolds

This method respects bypass flags set by the user, allowing certain organisms or scaffolds to bypass filtering (i.e., always retained).

Parameters:

None (uses class attributes)

Mutates:

self.filtered_session (Session): Newly created filtered session object.

Raises:

RuntimeError – If filtered session is empty.

Returns:

None
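The two filtering levels can be sketched on a plain nested dictionary. The data layout (organism name to scaffold ID to hit list), the function name and the bypass argument are assumptions for illustration; the real method works on a cblaster Session object.

```python
def filter_session_sketch(organisms, retained_scaffolds, bypass_scaffolds=frozenset()):
    """Keep retained or bypassed scaffolds, then drop organisms left empty."""
    filtered = {}
    for organism, scaffolds in organisms.items():
        # Scaffold level: keep scaffolds in the dereplicated set or on the bypass list.
        kept = {
            scaffold: hits
            for scaffold, hits in scaffolds.items()
            if scaffold in retained_scaffolds or scaffold in bypass_scaffolds
        }
        # Organism level: skip organisms with no remaining scaffolds.
        if kept:
            filtered[organism] = kept
    if not filtered:
        raise RuntimeError("Filtered session is empty.")
    return filtered
```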

generate_output()[source]

Generate all output files from the filtered session and results.

Creates the complete set of output files in the specified output directory, including the filtered cblaster session, binary table, and summary file.

Output files generated (always):
  • filtered_session.json: Filtered cblaster session object (JSON format)

  • filtered_binary.txt: Binary presence/absence table (tab-separated)

  • filtered_summary.txt: Summary for each hit per organism (text file)

  • retained_cluster_numbers.txt: Comma-separated list of retained cluster IDs from the cblaster session

  • cluster_sizes.txt: Number of hits per representative (tab-separated)

  • extended_binary.txt: Binary table with dereplication status annotations

Optional intermediate files (if keep flags enabled):
  • downloads/: Copy of downloaded sequences (remote mode only)

  • dereplication/: skDER or MMseqs2 dereplication output files

Parameters:

None (uses class attributes)

Returns:

None
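As an illustration of the "number of hits per representative" table, the sketch below counts hits per representative and formats them as tab-separated lines. The exact column layout and ordering of cluster_sizes.txt are assumptions.

```python
from collections import Counter


def cluster_sizes_lines(representatives):
    """Return one 'representative<TAB>count' line per representative, sorted by name."""
    counts = Counter(representatives)
    return [f"{rep}\t{count}" for rep, count in sorted(counts.items())]
```

The input is the 'representative' value of every hit, one entry per hit.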

initialise_binary(session: Path)[source]

Initialise the binary table from the provided session file (either obtained from a search run, or from cagecleaner-generate-session).

Always generates a cblaster Session instance, which is stored in the self.session attribute. Then parses the binary table generated from this Session and joins it with more cluster information retrieved using the cluster extraction function get_sorted_cluster_hierarchies. Generates cluster layout groups from this and corrects them for strand location.

Parameters:

session (pathlib.Path) – Path of the session file from which the initial binary table will be created.

Raises:

ValueError – If the initial binary is empty, and thus the supplied session was empty.

Returns:

None

abstractmethod join_dereplication_with_binary() None[source]

Joins dereplication results with the binary table.

Parses the dereplication clustering table generated by skDER or MMseqs2. Each row gets tagged with:
  • representative: Which genome is the dereplication representative of this assembly?

  • dereplication_status: ‘dereplication_representative’ or ‘redundant’

Subclass implementations vary by mode:
  • Local: Join on genome file names (after suffix removal)

  • Remote: Join on assembly accession numbers

This method must:

  1. Read dereplication output from self.DEREP_OUT_DIR

  2. Parse clustering results (which genomes are reps vs redundant)

  3. Join with self.binary_df on appropriate column

  4. Handle unmatched entries gracefully (log warnings, drop if needed)

  5. Update self.binary_df with new columns

Mutates:

self.binary_df: Adds ‘representative’ and ‘dereplication_status’ columns.

Raises:
  • FileNotFoundError – If dereplication output file not found.

  • ValueError – If output format is unexpected.

Expected Result:

self.binary_df should now have columns:
  • representative: Genome ID of the dereplication representative

  • dereplication_status: ‘dereplication_representative’ | ‘redundant’

Returns:

None
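The join contract above can be sketched without pandas, using a list of row dictionaries in place of self.binary_df. The function name, the 'assembly' key and the clustering mapping (member to representative) are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def join_with_clusters(rows, clusters, key="assembly"):
    """Tag each row with its representative and status; drop unmatched rows."""
    joined = []
    for row in rows:
        representative = clusters.get(row[key])
        if representative is None:
            # Unmatched entries are logged and dropped rather than raising.
            logger.warning("No dereplication result for %s; dropping row", row[key])
            continue
        status = (
            "dereplication_representative"
            if row[key] == representative
            else "redundant"
        )
        joined.append(
            {**row, "representative": representative, "dereplication_status": status}
        )
    return joined
```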

Notes

This is an abstract method that the parent classes pass on unimplemented. Only the concrete subclasses define the method for their use case.

recover_hits() None[source]

Recovers redundant hits based on cluster layout and/or outlier homology scores.

This method attempts to restore hits that were marked as redundant during dereplication if they represent genuinely distinct gene cluster compositions or if they have outlier homology scores. Recovery occurs at two levels:

  1. Score-based recovery: Within each representative genome’s clusters, identify hits with outlier cblaster scores (Z-score based). These hits may represent functionally important evolutionary variants.

  2. Content-based recovery: For each cluster layout (set of homologs with a particular homolog order), select one hit at random to recover to represent that layout, unless the dereplication representative is a member of that set.

Recovery is applied hierarchically: hits are grouped by representative and cluster layout, then outliers are identified within these groups.

Parameters:

None (uses class attributes)

Mutates:

self.binary_df: Updates the ‘dereplication_status’ column for recovered hits:
  • ‘readded_by_score’: Hit with outlier score in its group

  • ‘readded_by_content’: Hit representing unique gene cluster layout

Returns:

None
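Both recovery levels can be sketched for the hits of a single representative. The Z-score threshold, the function names and the hit dictionaries (keys 'id', 'layout', 'status') are assumptions; the sketch uses the population standard deviation, which may differ from the actual implementation.

```python
import random
from statistics import mean, pstdev


def score_outlier_indices(scores, z_threshold=2.0):
    """Return indices of scores whose absolute Z-score exceeds the threshold."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:  # identical scores: nothing can be an outlier
        return []
    return [i for i, s in enumerate(scores) if abs(s - mu) / sigma > z_threshold]


def recover_by_content(hits, rng=None):
    """Re-add one random hit per layout that the representative does not cover."""
    rng = rng or random.Random()
    by_layout = {}
    for hit in hits:
        by_layout.setdefault(hit["layout"], []).append(hit)
    recovered = []
    for group in by_layout.values():
        # A layout that already contains the representative needs no recovery.
        if any(h["status"] == "dereplication_representative" for h in group):
            continue
        chosen = rng.choice(group)
        chosen["status"] = "readded_by_content"
        recovered.append(chosen["id"])
    return recovered
```

Passing a seeded random.Random instance makes the content-based choice reproducible.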

abstractmethod run()[source]

Execute the complete CAGEcleaner dereplication and recovery workflow.

This is the main orchestration method that coordinates all workflow steps in the correct sequence. Subclasses must implement this to define their specific mode (local/remote, genome/region-based).

The typical workflow executed by subclasses is:

  1. Prepare inputs: Validate and stage genomes/regions for dereplication

  2. Dereplication: Run skDER (genome) or MMseqs2 (region) clustering

  3. Integrate dereplication: Join dereplication output with binary table

  4. Hit recovery: Restore hits with outlier scores or unique gene content

  5. Session filtering: Remove redundant scaffolds from session

  6. Output: Generate all result files

  7. Cleanup: Remove temporary files

Implementation details differ by mode:
  • Local/Genome: Use local sequences and skDER for full-genome ANI clustering

  • Local/Region: Use local sequences and MMseqs2 for region sequence clustering

  • Remote/Genome: Download genomes from NCBI Assembly and use skDER ANI clustering

  • Remote/Region: Download regions from NCBI Nucleotide and use MMseqs2 sequence clustering

Parameters:

None (uses class attributes)

Mutates:

  • self.binary_df: Updated through dereplication and recovery steps.

  • self.filtered_session: Populated by filter_session().

Returns:

None

Expected Results After Execution:
  • self.binary_df: Contains dereplication_status column with values:
    • ‘dereplication_representative’: Genome/region kept as representative

    • ‘redundant’: Marked for removal

    • ‘readded_by_score’: Recovered due to outlier score

    • ‘readded_by_content’: Recovered due to unique gene content

  • self.filtered_session: Filtered Session object ready for output (may be None if no hits/clusters identified)

  • OUT_DIR: Contains all output files (see generate_output for details)

  • TEMP_DIR: Contains intermediate files

genome_run

class cagecleaner.genome_run.GenomeRun(args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving whole-genome dereplication.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:

LocalGenomeRun: Full-genome dereplication for hits in local sequences. RemoteGenomeRun: Full-genome dereplication for hits in remote sequences.

dereplicate_genomes()[source]

Dereplicate the gathered genome files using whole-genome ANI similarity with skDER.

Sets the dereplication input directory to the full genome folder, and runs the skDER dereplication command. skDER output is stored in TEMP_DIR/dereplication.

Returns:

None

Raises:

RuntimeError – If the input folder is empty or does not exist, or if the skDER command run fails.
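The error handling described above (empty or missing input folder, failing command) can be sketched as below. The function name is hypothetical and the command is supplied by the caller, so no skDER options are assumed here.

```python
import subprocess
from pathlib import Path


def run_dereplication_tool(input_dir, command):
    """Check the input folder, then run the given command, raising RuntimeError on failure."""
    folder = Path(input_dir)
    if not folder.is_dir() or not any(folder.iterdir()):
        raise RuntimeError(f"Input folder {folder} is empty or does not exist.")
    try:
        # check=True turns a non-zero exit status into CalledProcessError.
        subprocess.run(command, check=True, capture_output=True, text=True)
    except (subprocess.CalledProcessError, FileNotFoundError) as err:
        raise RuntimeError(f"Dereplication command failed: {err}") from err
```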

local_genome_run

class cagecleaner.local_genome_run.LocalGenomeRun(parsed_args)[source]

Bases: LocalRun, GenomeRun

Subclass orchestrating the workflow for dereplication by whole-genome similarity using genomes from local sources.

This class combines local genome file handling with whole-genome dereplication workflows. It stages the local genome assemblies for dereplication (converting genbanks to fastas), performs whole-genome dereplication using skDER, and integrates the dereplication results back into the binary table. Unmapped scaffolds (scaffolds for which no assembly file was found) are tracked and reported separately.

Inherits from:

LocalRun: Intermediary class providing local file handling utilities. GenomeRun: Intermediary class providing genome dereplication utilities.

join_dereplication_with_binary() None[source]

Join dereplication clustering results with the binary table.

Reads the skDER clustering output file, converts it to a DataFrame, and joins it with the binary table based on assembly ID. This associates each genome in the binary table with its dereplication status (representative or redundant) and representative assembly. The resulting table is sorted by representative and dereplication status for clarity.

Mutates:

self.binary_df (pd.DataFrame): Updated in-place with additional columns for ‘representative’ and ‘dereplication_status’, and sorted by these columns.

Returns:

None

Raises:
  • FileNotFoundError – If the skDER clustering file cannot be found at the expected path.

  • RuntimeError – If the dereplication table is empty.

  • RuntimeError – If the binary table is empty after joining with the dereplication table.

Notes

This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.

run()[source]

Execute the complete local genome dereplication pipeline.

Orchestrates all processing steps in sequence: stages local genome assemblies for dereplication, runs skDER dereplication on the staged genomes, joins dereplication clustering results with the binary table, recovers hit diversity information from the dereplication output, filters the original session based on dereplication results, and generates final output files with dereplication metadata. Cleans up temporary working directories upon completion.

Returns:

None

local_region_run

class cagecleaner.local_region_run.LocalRegionRun(parsed_args)[source]

Bases: LocalRun, RegionRun

Subclass orchestrating the workflow for dereplication by region similarity using regions extracted from local sources.

This class combines local file handling with region-based dereplication workflows. It extracts genomic regions of interest (with optional sequence margins) from local assembly files, performs MMseqs2-based sequence dereplication on the extracted regions, and integrates dereplication results back into the binary table. Handles contig edge cases where regions with sequence margins extend beyond scaffold boundaries according to user-specified behavior (keep but clip them (permissive), or discard them (strict)).

Inherits from:

LocalRun: Intermediary class providing local file handling utilities. RegionRun: Intermediary class providing region dereplication utilities.

extract_regions()[source]

Extract genomic regions surrounding cluster hits using sequence margins.

Processes each cluster hit in the binary table to extract the genomic region with sequence margins from the local assembly files. Extraction is performed in parallel using multiple worker threads. Regions that extend beyond contig boundaries are treated as specified by the user (strict_regions flag).

When strict_regions is enabled, regions at contig edges are excluded from downstream dereplication analysis. When disabled (permissive mode), such regions are retained but clipped to the contig boundaries.

The extracted regions are written to DEREP_IN_DIR for use in the dereplication step.

Mutates:

Writes extracted region sequences to temporary files in DEREP_IN_DIR.

Returns:

None
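The strict and permissive treatment of contig edges can be sketched as a small coordinate helper. The function name and the 1-based inclusive coordinate convention are assumptions for illustration.

```python
def apply_margin(start, end, contig_length, margin, strict):
    """Extend a region by a margin; return None (strict) or clipped coordinates (permissive)."""
    new_start = start - margin
    new_end = end + margin
    at_edge = new_start < 1 or new_end > contig_length
    if at_edge and strict:
        # Strict mode: regions touching a contig edge are excluded.
        return None
    # Permissive mode: clip to the contig boundaries.
    return max(new_start, 1), min(new_end, contig_length)
```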

join_dereplication_with_binary() None[source]

Join dereplication clustering results with the binary table.

Reads the MMseqs2 clustering output file and joins it with the binary table based on scaffold ID and region coordinates (Start, End). This associates each extracted region in the binary table with its dereplication status (representative or redundant) and representative region identifier. The resulting table is sorted by representative and dereplication status for clarity.

The clustering table is parsed to extract scaffold and coordinate information from compound region identifiers. Dereplication status is determined by comparing each region’s identifier with its assigned representative.

Mutates:

self.binary_df (pd.DataFrame): Updated in-place with additional columns for ‘representative’ and ‘dereplication_status’, and sorted by these columns. The temporary ‘Region’ column is removed after processing.

Returns:

None

Raises:
  • FileNotFoundError – If the dereplication table cannot be read

  • RuntimeError – If the dereplication table is empty, or has empty values due to an invalid filename formatting.

  • RuntimeError – If the binary table is empty after joining with the dereplication table

Notes

The Region temporary column in self.binary_df was added by joining with the MMseqs2 dereplication table. This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.
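Parsing a compound region identifier and deriving the status can be sketched as follows. The identifier format "scaffold:start-end" is purely an assumption, since the actual format is not documented here; the function names are hypothetical as well.

```python
def parse_region_id(region_id):
    """Split an assumed 'scaffold:start-end' identifier into its three parts."""
    scaffold, _, coords = region_id.rpartition(":")
    start, _, end = coords.partition("-")
    if not scaffold or not start.isdigit() or not end.isdigit():
        raise RuntimeError(f"Unexpected region identifier: {region_id!r}")
    return scaffold, int(start), int(end)


def status_for(region_id, representative_id):
    """A region is the representative exactly when it equals its assigned representative."""
    if region_id == representative_id:
        return "dereplication_representative"
    return "redundant"
```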

run()[source]

Execute the complete local region-based dereplication pipeline.

Orchestrates all processing steps in sequence: stages local genome assemblies for region extraction, extracts genomic regions of interest from the staged genomes, runs MMseqs2 dereplication on the extracted regions, joins the dereplication clustering results with the binary table, recovers hit diversity information from the dereplication output, filters the original session based on dereplication results, and generates final output files with dereplication metadata.

Cleans up temporary working directories upon completion.

Returns:

None

local_run

class cagecleaner.local_run.LocalRun(parsed_args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving local sequence files.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:

LocalRegionRun: Region-based dereplication for hits in local sequences. LocalGenomeRun: Whole-genome dereplication for hits in local sequences.

abstractmethod join_dereplication_with_binary()[source]

Join dereplication results with the binary table.

Mutates:

self.binary_df: Adds ‘representative’ and ‘dereplication_status’ columns.

Expected Result:

self.binary_df should now have columns:
  • representative: Genome ID of the dereplication representative

  • dereplication_status: ‘dereplication_representative’ | ‘redundant’

Returns:

None

Notes

This is the abstract method inherited from the Run parent class. It is not meant to be implemented at this level. Only the child classes inheriting this method are expected to provide a workflow-dependent specific implementation.

prepare_genomes() None[source]

Prepare the genome sequence files in the specified genome directory for dereplication.

Checks whether the filenames of the genome sequence files are among the names of the organisms in the Session object, ignoring file extensions. Checks whether there are fasta and genbank files in the user-specified genome folder, converting genbank files to fasta files on-the-fly.

Adds a column assembly_file to binary table specifying the filepath of each scaffold’s associated genome assembly. In Genbank mode, this will point to converted files in the temporary directories.

Mutates:

self.binary_df (pd.DataFrame): Updated in-place with additional columns ‘assembly_file’ and ‘dereplication_status’.

Returns:

None

Raises:
  • ValueError – If the genome of an organism is not present in the user-supplied genome directory.

  • RuntimeError – If no fasta or genbank files have been found in the supplied genome directory.

Notes

The sequence files in the user genome folder should be either all fasta files or all genbank files. Mixing the two formats is not supported.
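Matching organism names to genome files while ignoring file extensions can be sketched as below. The suffix list and both function names are assumptions; the real method additionally converts genbank files to fasta.

```python
from pathlib import Path

# Assumed set of sequence-file suffixes to ignore (illustrative only).
ASSUMED_SUFFIXES = {".fasta", ".fa", ".fna", ".gbk", ".gb", ".gbff", ".gz"}


def strip_sequence_suffixes(filename):
    """Remove trailing sequence and compression suffixes from a file name."""
    name = Path(filename).name
    while Path(name).suffix.lower() in ASSUMED_SUFFIXES:
        name = Path(name).stem
    return name


def match_organisms_to_files(organisms, filenames):
    """Map each organism name to its genome file, or raise if one is missing."""
    by_stem = {strip_sequence_suffixes(f): f for f in filenames}
    missing = [org for org in organisms if org not in by_stem]
    if missing:
        raise ValueError(f"No genome file found for: {', '.join(missing)}")
    return {org: by_stem[org] for org in organisms}
```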

region_run

class cagecleaner.region_run.RegionRun(parsed_args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving region-based dereplication.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:

LocalRegionRun: Region-based dereplication for hits in local sequences. RemoteRegionRun: Region-based dereplication for hits in remote sequences.

dereplicate_regions()[source]

Dereplicates the sequences in the genomic regions folder using MMseqs2. MMseqs2 output is stored in TEMP_DIR/derep_out.

Returns:

None

Raises:

RuntimeError – If the MMseqs2 command run fails, or if the input folder is empty or does not exist.

remote_genome_run

class cagecleaner.remote_genome_run.RemoteGenomeRun(parsed_args)[source]

Bases: RemoteRun, GenomeRun

Subclass orchestrating the workflow for dereplication by whole-genome similarity using genomes downloaded from NCBI.

This class combines remote genome downloading functionality with local genome dereplication. It links NCBI assembly accession IDs to the scaffold IDs in the binary table, downloads the corresponding genome assemblies, maps the scaffold IDs back to the downloaded assemblies, performs whole-genome dereplication using skDER, and integrates dereplication results back into the binary table. Unmapped scaffolds (scaffolds for which no assembly file was found) are tracked and reported separately.

Inherits from:

RemoteRun: Intermediary class providing common downloading configurations. GenomeRun: Intermediary class providing genome dereplication utilities.

fetch_assembly_ids() None[source]

Fetch NCBI assembly accession IDs for all scaffold IDs in the binary table.

Extracts scaffold IDs from the binary table and categorises them as RefSeq or GenBank IDs. For each category, retrieves corresponding assembly accession IDs using NCBI Entrez Direct utilities. Combines and deduplicates the results, storing them in self.assembly_accessions.

Mutates:

self.assembly_accessions (list): Populated with deduplicated NCBI assembly accession IDs.

Raises:

RuntimeError – If no assembly IDs have been retrieved.

Returns:

None
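The RefSeq/GenBank categorisation step can be sketched with the accession prefix: RefSeq accessions start with two capital letters followed by an underscore (for example NZ_ or NC_). Treating every other ID as GenBank, and the function name itself, are assumptions of this sketch.

```python
import re

# RefSeq accessions carry a two-letter prefix plus underscore, e.g. "NZ_" or "NC_".
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_")


def split_by_source(scaffold_ids):
    """Return (refseq_ids, genbank_ids), preserving input order."""
    refseq = [s for s in scaffold_ids if REFSEQ_PATTERN.match(s)]
    genbank = [s for s in scaffold_ids if not REFSEQ_PATTERN.match(s)]
    return refseq, genbank
```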

fetch_genomes() None[source]

Fetch genome assemblies from NCBI in batches using the datasets command-line tool.

Downloads assemblies in batches (default 300 per batch) using the NCBI Datasets CLI. For each batch, fetches genome data using NCBI datasets, rehydrates the dehydrated files with gzip compression, and moves all resulting genome files to the temporary genome directory. Removes version digits from assembly IDs to ensure the latest versions are downloaded.

Mutates:

Populates self.TEMP_GENOME_DIR with downloaded gzipped genome files.

Returns:

None
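Version stripping and batching can be sketched independently of the download itself. The helper names are hypothetical, and no NCBI Datasets command is invoked here.

```python
def strip_version(accession):
    """Drop the version suffix, e.g. 'GCF_000005845.2' becomes 'GCF_000005845'."""
    return accession.split(".", 1)[0]


def batches(items, size=300):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Each batch would then be handed to the download step in turn.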

join_assemblies_with_binary() None[source]

Join the scaffold-to-assembly mapping with the binary table and remove unmapped scaffolds.

Converts self.scaffold_assembly_pairs dictionary to a DataFrame and joins with self.binary_df on the Scaffold column. Identifies scaffolds that could not be linked to any assembly file, logs a warning, saves them to an output file, and removes them from the binary table.

Mutates:

self.binary_df (pd.DataFrame): Adds ‘assembly_file’ column and removes rows with unmatched scaffolds.

Returns:

None

Raises:

RuntimeError – If the binary table is empty after joining with the mapping table.

join_dereplication_with_binary() None[source]

Join the genome dereplication clustering results with the binary table.

Reads the skDER clustering output file and converts it to a DataFrame with assembly filenames and dereplication status. Joins this data with self.binary_df on the assembly_file column. Sorts the resulting table by representative genome and dereplication status.

Mutates:

self.binary_df (pd.DataFrame): Adds ‘representative’ and ‘dereplication_status’ columns and sorts by these values.

Returns:

None

Raises:
  • FileNotFoundError – If the dereplication table cannot be read

  • RuntimeError – If the dereplication table is empty.

  • RuntimeError – If the binary table is empty after joining with the dereplication table.

Note

This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.

map_scaffolds_to_assemblies() None[source]

Map each scaffold ID from the binary table to its corresponding downloaded assembly file.

Iterates through all genome files in the temporary genome directory and extracts scaffold IDs using BioPython’s SeqIO. For each assembly file, compares its scaffolds (with and without prefixes) to those in the binary table. When matches are found, stores the mapping in self.scaffold_assembly_pairs. Prefixes are stripped during comparison because NCBI sometimes omits them in downloaded genomes.

Mutates:

self.scaffold_assembly_pairs (dict): Populated with (scaffold_id: assembly_filename) pairs.

Returns:

None

Raises:

RuntimeError – If the dereplication input folder does not contain any fasta file.
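The prefix-insensitive comparison can be sketched on in-memory data. Here the scaffold IDs per assembly file are passed in as a dictionary instead of being read with SeqIO, and the prefix is assumed to be two capital letters followed by an underscore; both choices and the function names are illustrative.

```python
import re

# Assumed prefix shape: two capital letters and an underscore, e.g. "NZ_".
PREFIX_PATTERN = re.compile(r"^[A-Z]{2}_")


def strip_prefix(scaffold_id):
    return PREFIX_PATTERN.sub("", scaffold_id)


def map_scaffolds(assembly_contents, wanted_scaffolds):
    """Return {scaffold_id: assembly_filename} for scaffolds found in any assembly."""
    wanted_by_stripped = {strip_prefix(s): s for s in wanted_scaffolds}
    pairs = {}
    for filename, scaffold_ids in assembly_contents.items():
        for scaffold_id in scaffold_ids:
            # Compare without prefixes so 'NZ_X.1' matches 'X.1' in either direction.
            original = wanted_by_stripped.get(strip_prefix(scaffold_id))
            if original is not None:
                pairs[original] = filename
    return pairs
```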

run()[source]

Execute the complete remote genome dereplication pipeline.

Orchestrates all processing steps in sequence: fetches NCBI assembly IDs for scaffold IDs, downloads genomes, maps scaffold IDs to assembly IDs, performs whole-genome dereplication, joins dereplication results with the binary table, recovers hit diversity according to user parameters, filters the original session file, and generates output files. Cleans up the temporary directory upon completion.

Returns:

None

remote_region_run

class cagecleaner.remote_region_run.RemoteRegionRun(parsed_args)[source]

Bases: RemoteRun, RegionRun

Subclass orchestrating the workflow for dereplication by region sequence similarity using regions downloaded from NCBI.

This class combines remote region downloading functionality with local region dereplication. It downloads genomic regions (with optional sequence margins) from NCBI based on scaffold IDs and coordinates in a binary table, performs MMseqs2-based sequence dereplication on the downloaded regions, and integrates dereplication results back into the binary table. Handles contig edge cases where regions with sequence margins extend beyond the scaffold boundaries according to user-specified behavior (keep but clip them (permissive), or discard them (strict)).

Inherits from:

RemoteRun: Intermediary class providing common downloading configurations. RegionRun: Intermediary class providing region dereplication utilities

fetch_regions() None[source]

Fetch genomic regions from NCBI based on scaffold coordinates with optional sequence margins.

Extracts region coordinates from the binary table and applies user-specified margins (upstream and downstream sequence extensions). Handles regions that extend beyond contig boundaries by either discarding them (strict mode) or clipping them (permissive mode). Fetches contig lengths from NCBI using Entrez utilities. Logs statistics about discarded or clipped regions. Downloads the regions.

Mutates:

  • self.binary_df (pd.DataFrame): Adds ‘Contig_length’ column from NCBI data.

  • self.DEREP_IN_DIR: Populates folder with downloaded gzipped genomic region FASTA files.

Returns:

None

join_dereplication_with_binary() None[source]

Join MMseqs2 dereplication clustering results with the binary table.

Reads the dereplication clustering output file and parses region coordinates. Assigns dereplication status (‘dereplication_representative’ or ‘redundant’) based on whether each region’s accession matches its representative. Handles four possible cases of region boundary mismatches due to region boundary adjustments during preprocessing:

  1. No contig edges: Exact coordinate match after removing margins.

  2. Upstream edge: Match on scaffold and end coordinate (contig was clipped at the upstream edge).

  3. Downstream edge: Match on scaffold and start coordinate (contig was clipped at the downstream edge).

  4. Both edges: Interval containment where the original cluster region is within the dereplicated region (contig was clipped at both edges).

Concatenates results from all cases and removes temporary helper columns. Sorts the final table by representative ID and dereplication status.

Mutates:

self.binary_df (pd.DataFrame): Joined with dereplication results; adds ‘representative’ and ‘dereplication_status’ columns; removes ‘Contig_length’ and ‘Region’ columns.

Returns:

None

Raises:
  • FileNotFoundError – If the dereplication table cannot be read

  • RuntimeError – If the dereplication table is empty, or has empty values due to an invalid filename formatting.

  • RuntimeError – If the binary table is empty after joining with the dereplication table

Notes

The Region temporary column in self.binary_df was added by joining with the MMseqs2 dereplication table. The Contig_length temporary column in self.binary_df was added by fetch_regions when the sequence margins were added on-the-fly when fetching the regions. This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.
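The four matching cases can be sketched for a single cluster and a single downloaded region on the same scaffold. The function name, the tuple representation and the order in which the cases are tested are assumptions; the real method performs these matches as table joins.

```python
def match_case(cluster, region, margin):
    """Classify how a cluster (start, end) matches a downloaded region (start, end)."""
    c_start, c_end = cluster
    r_start, r_end = region
    # Remove the margins from the downloaded region's coordinates.
    inner_start, inner_end = r_start + margin, r_end - margin
    if (inner_start, inner_end) == (c_start, c_end):
        return "no_edge"
    if inner_end == c_end:
        return "upstream_edge"    # start was clipped, end still matches
    if inner_start == c_start:
        return "downstream_edge"  # end was clipped, start still matches
    if r_start <= c_start and c_end <= r_end:
        return "both_edges"       # clipped on both sides: fall back to containment
    return None
```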

run()[source]

Execute the complete remote region dereplication pipeline.

Orchestrates all processing steps in sequence: fetches genomic regions from NCBI with optional sequence margins and contig boundary handling, performs MMseqs2-based dereplication on the downloaded regions, joins the dereplication results with the binary table, recovers hit diversity according to user parameters, filters the original session file, and generates output files. Cleans up the temporary directory upon completion.

Returns:

None

remote_run

class cagecleaner.remote_run.RemoteRun(parsed_args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving remote sequence files.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:

RemoteRegionRun: Region-based dereplication for hits in remote sequences. RemoteGenomeRun: Whole-genome dereplication for hits in remote sequences.

abstractmethod join_dereplication_with_binary()[source]

Join dereplication results with the binary table.

Mutates:

self.binary_df: Adds ‘representative’ and ‘dereplication_status’ columns.

Expected Result:

self.binary_df should now have columns:
  • representative: Genome ID of the dereplication representative

  • dereplication_status: ‘dereplication_representative’ | ‘redundant’

Returns:

None

Notes

This is the abstract method inherited from the Run parent class. It is not meant to be implemented at this level. Only the child classes inheriting this method are expected to provide a workflow-dependent specific implementation.

communication

cagecleaner.communication.download_regions(regions: DataFrame, directory: Path, download_workers: int, no_progress: bool = False) None[source]

Download multiple genomic sequence regions in parallel from NCBI.

Accepts a DataFrame of genomic regions and uses multi-threaded downloads to retrieve each region from NCBI. Automatically limits concurrent downloads to a maximum of 2.

Parameters:
  • regions (pd.DataFrame) – A DataFrame with at minimum the following columns: - ‘Scaffold’: NCBI nucleotide accession identifiers - ‘Start’: Start position of the region (integer or convertible to int) - ‘End’: End position of the region (integer or convertible to int)

  • directory (Path) – Directory where downloaded files will be saved.

  • download_workers (int) – Number of concurrent download threads requested. If greater than 2, will be automatically reduced to 2.

  • no_progress (bool, optional) – If True, suppress progress bar display. Defaults to False.

Raises:
  • ValueError – If the provided regions dataframe is empty.

  • IOError – If the download directory does not exist.

Returns:

None
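Capping the number of download threads can be sketched with a thread pool and a caller-supplied worker. The function name and the stand-in worker are hypothetical; no NCBI request is made in this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_DOWNLOADS = 2  # cap stated in the description above


def run_capped(tasks, requested_workers, worker):
    """Run worker(task) for every task with at most two threads; results keep input order."""
    n_workers = max(1, min(requested_workers, MAX_CONCURRENT_DOWNLOADS))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(worker, tasks))
    return n_workers, results
```

Requesting eight workers, for instance, still runs only two downloads at a time.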

cagecleaner.communication.fetch_contig_lengths(contig_ids: list, max_attempts=3)[source]

Retrieve sequence lengths for given NCBI contig accessions using BioPython’s Entrez API.

Fetches summary information for a list of nucleotide contig IDs from NCBI using Efetch and extracts their sequence lengths. Returns results as a deduplicated DataFrame.

Parameters:
  • contig_ids (list) – A list of NCBI nucleotide accession identifiers (contig IDs) for which to retrieve sequence lengths.

  • max_attempts (int, optional) – Maximum number of retry attempts on network/server errors. Defaults to 3.

Raises:

RuntimeError – If the fetch fails the maximum number of times, or if it returns an empty response

Returns:

A DataFrame with columns:
  • ’Scaffold’: NCBI accession version identifiers

  • ’Contig_length’: Sequence length in base pairs (integer)

Return type:

pd.DataFrame
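
The retry and deduplication behaviour described above might look roughly as follows. This is a sketch under stated assumptions: fetch_with_retries and deduplicate are hypothetical helpers, the fetch callable stands in for the Entrez request, and catching OSError is only a placeholder for whatever network/server errors the real code handles.

```python
def fetch_with_retries(fetch, max_attempts: int = 3) -> list:
    """Call fetch() until it succeeds; fail on exhaustion or on an empty response."""
    last_error = None
    for _ in range(max_attempts):
        try:
            records = fetch()
        except OSError as err:  # placeholder for network/server errors
            last_error = err
            continue
        if not records:
            raise RuntimeError("The fetch returned an empty response.")
        return records
    raise RuntimeError(f"The fetch failed {max_attempts} times.") from last_error


def deduplicate(records: list) -> list:
    """Keep one row per scaffold, with the length converted to an integer."""
    seen = {}
    for scaffold, length in records:
        seen.setdefault(scaffold, int(length))
    return [{"Scaffold": s, "Contig_length": n} for s, n in seen.items()]
```

The rows produced by deduplicate mirror the two columns listed under Returns; building the actual DataFrame from them is left out to keep the example dependency-free.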

cagecleaner.communication.get_assembly_accessions(scaffolds: list, source: str, no_progress: bool = False, max_attempts: int = 3, batch_size: int = 100) list[source]

Retrieve Assembly accession IDs from NCBI given a list of Nucleotide scaffold IDs.

Converts Nucleotide accession IDs to their associated Assembly IDs using BioPython’s Entrez elink and esummary services. Automatically redirects WGS (Whole Genome Shotgun) records to their master records when requesting Genbank IDs.

Parameters:
  • scaffolds (list) – A list of NCBI Nucleotide accession identifiers (e.g., GenBank or RefSeq accessions). WGS Genbank records (pattern: XXXX12345678.1) are automatically converted to their master records.

  • source (str) – The database to extract IDs from. The only accepted values are ‘Genbank’ and ‘RefSeq’. This determines which synonym field is extracted from the Assembly summary.

  • no_progress (bool, optional) – If True, suppress progress bar display. Defaults to False.

  • max_attempts (int) – Number of retries in case of a failure. Defaults to 3.

  • batch_size (int) – Number of accession IDs to link in one batch. Defaults to 100.

Returns:

A list of Assembly accession IDs associated with the input scaffolds. Returns an empty list if the retrieval fails after all retry attempts.

Return type:

list

Raises:
  • ValueError – If an invalid NCBI Nucleotide database name was supplied, or if the supplied scaffold ID list is empty.

  • RuntimeError – If the NCBI Assembly IDs could not be fetched from NCBI.

Notes

  • Processes scaffolds in batches of batch_size (100 by default).

  • Retries up to max_attempts times (3 by default) on network/server errors.
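
As an illustration of the batching and of the WGS accession pattern mentioned in the parameter description, a small standard-library sketch follows. The helper names are hypothetical, and the regular expression assumes that the X characters in XXXX12345678.1 stand for uppercase letters; how the package derives the master record from such an accession is not shown here.

```python
import re

# Assumed reading of the pattern XXXX12345678.1:
# four uppercase letters, eight digits, a dot and a version number.
WGS_PATTERN = re.compile(r"^[A-Z]{4}\d{8}\.\d+$")


def is_wgs_record(accession: str) -> bool:
    """Tell whether an accession looks like a WGS contig record."""
    return WGS_PATTERN.match(accession) is not None


def batches(items: list, batch_size: int = 100):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1.")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 250 scaffolds and the default batch size, the generator yields three batches, the last one holding the remaining 50 IDs.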

file_utils

cagecleaner.file_utils.convert_genbanks_to_fastas(in_dir: Path, out_dir: Path, workers: int = 1, no_progress: bool = False) None[source]

Convert all genbank files in an input directory to fasta files.

All genbank files in the input directory are converted to fasta files in parallel using any2fasta.

Parameters:
  • in_dir (Path) – input directory containing the genbank files

  • out_dir (Path) – output directory where the new fasta files will be written

  • workers (int) – number of threads for parallelisation

  • no_progress (bool) – flag to disable showing a progress bar while converting

Returns:

None

Raises:

RuntimeError – If the input directory is empty.

cagecleaner.file_utils.is_fasta(file: str | Path) bool[source]

Check whether the file name ends in any of the accepted fasta suffixes. Gzipped files are allowed.

Parameters:

file (str | Path) – filename to check

Returns:

Boolean result of the check

Return type:

bool

cagecleaner.file_utils.is_genbank(file: str | Path) bool[source]

Check whether the file name ends in any of the accepted genbank suffixes. Gzipped files are allowed.

Parameters:

file (str | Path) – filename to check

Returns:

Boolean result of the check

Return type:

bool

cagecleaner.file_utils.remove_suffixes(string: str | Path) str[source]

Split off any valid sequence file suffix either in the form of .<suffix> or .<suffix>.gz.

Parameters:

string (str | Path) – string from which the suffix should be split off, if any.

Returns:

the string without the sequence file suffix

Return type:

str
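
The three suffix helpers above share one idea: ignore an optional trailing .gz, then compare the remaining suffix against a list of accepted ones. A minimal sketch of that idea follows. The suffix sets and function names are assumptions made for illustration; the suffixes actually accepted by the package are not listed in this reference.

```python
# Assumed suffix sets, for illustration only.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}
GENBANK_SUFFIXES = {".gbk", ".gb", ".gbff"}


def split_sequence_suffix(file) -> tuple:
    """Return (name without suffix, matched suffix); the suffix is '' if none matches."""
    name = str(file)
    # Look past an optional gzip extension before matching.
    base = name[:-3] if name.endswith(".gz") else name
    for suffix in FASTA_SUFFIXES | GENBANK_SUFFIXES:
        if base.endswith(suffix):
            return base[: -len(suffix)], suffix
    return name, ""


def looks_like_fasta(file) -> bool:
    """True if the name ends in an assumed fasta suffix, gzipped or not."""
    return split_sequence_suffix(file)[1] in FASTA_SUFFIXES


def strip_sequence_suffix(file) -> str:
    """Drop .<suffix> or .<suffix>.gz; unrelated names are returned unchanged."""
    return split_sequence_suffix(file)[0]
```

Names without a recognised suffix come back untouched, so an accession such as GCF_000001.2 keeps its version number.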

utils

cagecleaner.utils.correct_layouts(binary_df: DataFrame) DataFrame[source]

Correct cluster layouts on strands complementary to ones with existing layouts.

Identifies and resolves equivalent cases where a certain cluster layout is represented on both strands in a set of genomes. Uses a directed graph to detect flipped layouts and corrects them accordingly.

Example

Assembly 1 has a cluster with a gene layout ABC on the positive strand (strand location layout +++). Assembly 2 has a cluster with a gene layout CBA on the negative strand (strand location layout ---).

Since there is no way to recognise a strand as the positive or negative one, assembly 1’s positive strand is likely homologous to assembly 2’s negative strand.

Hence, the layout of the cluster of assembly 2 is flipped to ABC (strand location layout +++).

Correction strategy:

Make a directed network and remove backloops.
  • Nodes: pairs of cluster layout and strand location layout strings.
  • Edges: one node being the complement of an existing layout in the dataset.

Example:

Reconsidering the previous example, we will have two nodes, (ABC,+++) and (CBA,---), and two edges, (ABC,+++) -> (CBA,---) and (CBA,---) -> (ABC,+++).

These two edges form a backloop, so they are pruned to just one edge. For example, (ABC,+++) -> (CBA,---).

The remaining edges in the network are used to correct the layouts, so in this case layouts (ABC,+++) will be corrected to (CBA,---). In other words, all gene layouts ABC that are on the positive strand are equivalent to gene layouts CBA on the negative strand.

Parameters:

binary_df (pd.DataFrame) – A DataFrame containing ‘Strand’ and ‘Layout_group’ columns where Strand is a tuple of integers and Layout_group is a tuple of layout identifiers.

Returns:

The input DataFrame with corrected ‘Layout_group’ column where equivalent layouts on complementary strands have been flipped.

Return type:

pd.DataFrame

Notes

  • Modifies the ‘Layout_group’ column in-place within the returned DataFrame.

  • Uses NetworkX to detect bidirectional edges (back-loops) between layout pairs.
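
The correction strategy can be illustrated without NetworkX. The sketch below is not the package's implementation: complement and correction_map are hypothetical helpers, nodes are (layout, strand) tuple pairs with strands encoded as +1/-1, and the choice of which edge of a backloop survives (the lexicographically smaller node is kept as the representative) is an arbitrary assumption of this example.

```python
def complement(node: tuple) -> tuple:
    """Return the same cluster as read from the opposite strand."""
    layout, strands = node
    # Reading the other strand reverses the gene order and flips every sign.
    return tuple(reversed(layout)), tuple(-s for s in reversed(strands))


def correction_map(nodes) -> dict:
    """Map each node that should be flipped to the representative of its pair."""
    present = set(nodes)
    corrections = {}
    for node in present:
        partner = complement(node)
        # An edge exists only if the complementary layout occurs in the data.
        if partner in present and partner != node:
            # Both directions exist (a backloop); keep a single edge by
            # always pointing at the smaller node. Assumed tie-break.
            representative = min(node, partner)
            if node != representative:
                corrections[node] = representative
    return corrections
```

For the two clusters of the example, (("A", "B", "C"), (1, 1, 1)) and (("C", "B", "A"), (-1, -1, -1)), exactly one of the two nodes ends up as a key of the returned mapping, so each equivalent pair collapses to a single layout.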

cagecleaner.utils.generate_cblaster_session(hits: Path, clusters: Path, queries: Path, mode: str) Session[source]

Generate a cblaster Session object from TSV tabular data files.

Reads cluster hit data from a folder containing hits, clusters, and queries TSV files and reconstructs a hierarchical cblaster Session object. The function parses tabular data into the nested structure required by cblaster (organisms > scaffolds > clusters > subjects).

Parameters:
  • hits (Path) – Path to the hits TSV file, containing individual hit records with columns db_id, query, scaff, strand, coords, evalue, score, seqid, tcov

  • clusters (Path) – Path to the clusters TSV file, containing cluster records with columns number, hits, start, end, length, score, scaff, taxon_name, taxon_id

  • queries (Path) – Path to the queries TSV file, containing query sequence records with at least id, start, end columns

  • mode (str) – The search mode to be set in the session parameters (e.g., ‘remote’, ‘local’).

Returns:

A cblaster Session object populated with organisms, scaffolds, clusters, and subjects (hits) reconstructed from the input TSV files.

Return type:

Session

Notes

  • Query subjects are artificially positioned with a 500 amino acid margin between them, as in the original cblaster implementation.

  • Hit coordinates are validated to ensure they fall within the cluster’s defined interval.

  • The session includes placeholder fields (e.g., sequence data set to None/empty strings) that are not essential for the plotting and file export functionalities.

  • Organisms are indexed by (taxon_id, taxon_name) pairs internally but stored as a flat list in the final session object.
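
The nesting described above (organisms > scaffolds > clusters > subjects) can be pictured with plain dictionaries. This sketch does not use cblaster's classes: nest_records and flatten_organisms are hypothetical helpers, hits are assumed to carry integer ‘start’ and ‘end’ keys already parsed from the coords column, and hits outside a cluster's interval are simply skipped, which may differ from how the package validates them.

```python
def nest_records(clusters: list, hits: list) -> dict:
    """Group flat cluster and hit records into organism > scaffold > cluster > hits."""
    organisms = {}
    for cluster in clusters:
        # Organisms are keyed by (taxon_id, taxon_name) while building.
        key = (cluster["taxon_id"], cluster["taxon_name"])
        inside = [
            hit for hit in hits
            if hit["scaff"] == cluster["scaff"]
            and cluster["start"] <= hit["start"]
            and hit["end"] <= cluster["end"]
        ]
        scaffolds = organisms.setdefault(key, {})
        scaffolds.setdefault(cluster["scaff"], {})[cluster["number"]] = inside
    return organisms


def flatten_organisms(organisms: dict) -> list:
    """Turn the keyed organisms into the flat list stored in the final object."""
    return [{"name": name, "scaffolds": scaffolds} for (_, name), scaffolds in organisms.items()]
```

Keying by the (taxon_id, taxon_name) pair during construction makes it cheap to attach several scaffolds to the same organism before the index is dropped.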

cagecleaner.utils.run_command(cmd_list: list, max_attempts: int = 3) None[source]

Execute an externally composed command with automatic retry logic and streaming output logging.

Runs a subprocess command with up to max_attempts retries. Captures both stdout and stderr in separate daemon threads, logging output in real-time. Logs debug information for stdout, warnings for stderr, and errors if the command fails.

Parameters:
  • cmd_list (list) – A list where the first element is the executable name (resolved via shutil.which) and remaining elements are command arguments.

  • max_attempts (int, optional) – Maximum number of times to retry the command if it fails (non-zero exit code). Defaults to 3.

Returns:

None

Raises:

RuntimeError – If the supplied command fails the maximum number of times.

Notes

  • The first element of cmd_list must be an executable available in the system PATH.

  • Retries only occur on non-zero exit codes.

  • Output is logged using the module’s LOG logger.

  • Daemon threads ensure subprocess output is captured and logged in real-time.
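
The behaviour described in these notes can be approximated with subprocess and threading. The following is a sketch under stated assumptions rather than the module's code: run_with_retries and _log_stream are hypothetical names, and raising RuntimeError when the executable cannot be resolved is a choice made for this example.

```python
import logging
import shutil
import subprocess
import threading

LOG = logging.getLogger(__name__)


def _log_stream(stream, level: int) -> None:
    """Forward every line of a subprocess pipe to the logger as it arrives."""
    for line in stream:
        LOG.log(level, line.rstrip())
    stream.close()


def run_with_retries(cmd_list: list, max_attempts: int = 3) -> None:
    """Run a command, logging its output, and retry on a non-zero exit code."""
    executable = shutil.which(cmd_list[0])
    if executable is None:  # assumed behaviour for an unresolvable executable
        raise RuntimeError(f"Executable {cmd_list[0]} was not found.")
    for attempt in range(1, max_attempts + 1):
        process = subprocess.Popen(
            [executable, *cmd_list[1:]],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        # One daemon thread per pipe: stdout at DEBUG level, stderr at WARNING.
        readers = [
            threading.Thread(target=_log_stream, args=(process.stdout, logging.DEBUG), daemon=True),
            threading.Thread(target=_log_stream, args=(process.stderr, logging.WARNING), daemon=True),
        ]
        for reader in readers:
            reader.start()
        return_code = process.wait()
        for reader in readers:
            reader.join()
        if return_code == 0:
            return
        LOG.error("Attempt %d/%d failed with exit code %d.", attempt, max_attempts, return_code)
    raise RuntimeError(f"Command failed {max_attempts} times: {cmd_list}")
```

Reading the two pipes in separate threads keeps either one from filling up and blocking the child process while the other is being drained.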