CAGEcleaner

Comprehensive documentation for the API exposed by CAGEcleaner:

main

cagecleaner.main.create_parser() → Namespace[source]

This function creates a parser object that will collect the arguments given through the command line.

Parameters:: None
Returns:: An ArgumentParser object holding the CLI ready to collect the arguments when called
Return type:: parser (argparse.ArgumentParser)

Note

Also configures the logger.
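As a rough illustration of a parser-building function of this kind, the sketch below returns an argparse.ArgumentParser and only produces a Namespace once parse_args() is called. The function name build_example_parser and the --verbosity flag are hypothetical and are not taken from CAGEcleaner; only the 0-4 range mirrors the verbosity choices documented for setup_logging.

```python
import argparse


def build_example_parser() -> argparse.ArgumentParser:
    """Build a small CLI parser (hypothetical flags, not CAGEcleaner's real ones)."""
    parser = argparse.ArgumentParser(prog="example-tool")
    # Hypothetical option; the 0-4 range mirrors the documented verbosity choices.
    parser.add_argument("-v", "--verbosity", type=int, choices=range(5), default=2)
    return parser


# The parser object itself holds the CLI definition; parse_args() yields the Namespace.
args = build_example_parser().parse_args(["--verbosity", "3"])
print(args.verbosity)
```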

cagecleaner.main.main()[source]

cagecleaner.main.setup_logging(verbosity: int) → None[source]

Set up the root logger if it has not been set up yet.

Parameters:: verbosity (int) – Verbosity level (choices: 0,1,2,3,4).
Returns:: None
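A minimal sketch of how a 0-4 verbosity value could be turned into a root logger configuration that is only applied once. The mapping in LEVELS, the log format and the name setup_logging_sketch are assumptions made for illustration; the levels CAGEcleaner actually uses are not stated here.

```python
import logging

# Assumed mapping from the documented 0-4 range to standard logging levels.
LEVELS = {
    0: logging.CRITICAL,
    1: logging.ERROR,
    2: logging.WARNING,
    3: logging.INFO,
    4: logging.DEBUG,
}


def setup_logging_sketch(verbosity: int) -> None:
    """Configure the root logger once; later calls leave it untouched."""
    root = logging.getLogger()
    if root.handlers:  # already configured, do nothing
        return
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    root.addHandler(handler)
    root.setLevel(LEVELS[verbosity])
```

Checking root.handlers is one common way to make the set-up idempotent, so that calling it from both the parser and the entry point does not duplicate log lines.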

run

class cagecleaner.run.Run(parsed_args)[source]

Bases: ABC

Abstract base class orchestrating the complete CAGEcleaner dereplication workflow.

Handles which workflows to call for dereplicating genome mining hits across all search modes (local/remote sources, genome/region-based dereplication). In all workflows, all hits, their metadata and their dereplication status are recorded in a cblaster-style binary table extended with additional columns.

The typical high-level workflow is:

1. Parse input session and create binary table
2. Dereplicate sequences (implemented by subclasses)
3. Map dereplication results to binary table
4. Recover hits by score or content diversity
5. Filter session and generate outputs

Note

This is the abstract grandparent class with globally shared methods. Check out parent classes (GenomeRun, RegionRun, RemoteRun, LocalRun) for partially shared methods. Subclasses (LocalGenomeRun, RemoteGenomeRun, LocalRegionRun, RemoteRegionRun) provide concrete implementations and inherit from these parent classes through multiple inheritance.
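The inheritance layout described in this note can be sketched with empty placeholder classes. The class names follow the documentation, but the bodies are stand-ins and do not reflect the real implementations.

```python
from abc import ABC, abstractmethod


class Run(ABC):
    """Grandparent: globally shared steps plus the abstract join hook."""

    def __init__(self, parsed_args):
        self.args = parsed_args

    @abstractmethod
    def join_dereplication_with_binary(self) -> None: ...


class LocalRun(Run):
    """Parent: would hold the local-file helpers."""


class GenomeRun(Run):
    """Parent: would hold the whole-genome dereplication helpers."""


class LocalGenomeRun(LocalRun, GenomeRun):
    """Concrete subclass combining both parents through multiple inheritance."""

    def join_dereplication_with_binary(self) -> None:
        self.joined = True  # placeholder body


# Method resolution order: subclass, both parents, then the shared grandparent.
print([cls.__name__ for cls in LocalGenomeRun.__mro__])
```

Because Run appears only once in the resolution order, methods defined on it are shared by both parent branches, and only the concrete subclass can be instantiated.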

genome_run

class cagecleaner.genome_run.GenomeRun(args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving whole-genome dereplication.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:: LocalGenomeRun: Full-genome dereplication for hits in local sequences. RemoteGenomeRun: Full-genome dereplication for hits in remote sequences.

dereplicate_genomes()[source]

Dereplicate the gathered genome files using whole-genome ANI similarity with skDER.

Sets the dereplication input directory to the full genome folder, and runs the skDER dereplication command. skDER output is stored in TEMP_DIR/dereplication.

Returns:: None
Raises:: RuntimeError – If the input folder is empty or does not exist, or if the skDER command run fails.

local_genome_run

class cagecleaner.local_genome_run.LocalGenomeRun(parsed_args)[source]

Bases: LocalRun, GenomeRun

Subclass orchestrating the workflow for dereplication by whole-genome similarity using genomes from local sources.

This class combines local genome file handling with whole-genome dereplication workflows. It stages the local genome assemblies for dereplication (converting genbanks to fastas), performs whole-genome dereplication using skDER, and integrates the dereplication results back into the binary table. Unmapped scaffolds (scaffolds for which no assembly file was found) are tracked and reported separately.

Inherits from:: LocalRun: Intermediary class providing local file handling utilities. GenomeRun: Intermediary class providing genome dereplication utilities.

join_dereplication_with_binary() → None[source]

Join dereplication clustering results with the binary table.

Reads the skDER clustering output file, converts it to a DataFrame, and joins it with the binary table based on assembly ID. This associates each genome in the binary table with its dereplication status (representative or redundant) and representative assembly. The resulting table is sorted by representative and dereplication status for clarity.

Mutates:

self.binary_df (pd.DataFrame): Updated in-place with additional columns for: ‘representative’ and ‘dereplication_status’, and sorted by these columns.

Returns:

None

Raises:

FileNotFoundError – If the skDER clustering file cannot be found at the expected path.
RuntimeError – If the dereplication table is empty.
RuntimeError – If the binary table is empty after joining with the dereplication table.

Notes

This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.

run()[source]

Execute the complete local genome dereplication pipeline.

Orchestrates all processing steps in sequence: stages local genome assemblies for dereplication, runs skDER dereplication on the staged genomes, joins dereplication clustering results with the binary table, recovers hit diversity information from the dereplication output, filters the original session based on dereplication results, and generates final output files with dereplication metadata. Cleans up temporary working directories upon completion.

Returns:: None

local_region_run

class cagecleaner.local_region_run.LocalRegionRun(parsed_args)[source]

Bases: LocalRun, RegionRun

Subclass orchestrating the workflow for dereplication by region similarity using regions extracted from local sources.

This class combines local file handling with region-based dereplication workflows. It extracts genomic regions of interest (with optional sequence margins) from local assembly files, performs MMseqs2-based sequence dereplication on the extracted regions, and integrates dereplication results back into the binary table. Handles contig edge cases where regions with sequence margins extend beyond scaffold boundaries according to user-specified behavior (keep but clip them (permissive), or discard them (strict)).

Inherits from:: LocalRun: Intermediary class providing local file handling utilities. RegionRun: Intermediary class providing region dereplication utilities.

extract_regions()[source]

Extract genomic regions surrounding cluster hits using sequence margins.

Processes each cluster hit in the binary table to extract the genomic region with sequence margins from the local assembly files. Extraction is performed in parallel using multiple worker threads. Regions that extend beyond contig boundaries are treated as specified by the user (strict_regions flag).

When strict_regions is enabled, regions at contig edges are excluded from downstream dereplication analysis. When disabled (permissive mode), such regions are retained but clipped to the contig boundaries.

The extracted regions are written to DEREP_IN_DIR for use in the dereplication step.

Mutates:: Writes extracted region sequences to temporary files in DEREP_IN_DIR.

Returns:: None
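The strict and permissive behaviours can be illustrated with a small helper. This is a sketch under stated assumptions: the function name region_with_margin, a single symmetric margin, and 1-based inclusive coordinates are all hypothetical and are not taken from the CAGEcleaner source.

```python
from typing import Optional, Tuple


def region_with_margin(start: int, end: int, contig_length: int,
                       margin: int, strict: bool) -> Optional[Tuple[int, int]]:
    """Return the margin-extended region, or None if strict mode discards it.

    Coordinates are assumed to be 1-based and inclusive.
    """
    new_start, new_end = start - margin, end + margin
    hits_edge = new_start < 1 or new_end > contig_length
    if hits_edge and strict:
        return None  # strict mode: drop regions that run off the contig
    # permissive mode: clip to the contig boundaries
    return max(new_start, 1), min(new_end, contig_length)


print(region_with_margin(500, 900, 1000, 200, strict=False))  # (300, 1000)
print(region_with_margin(500, 900, 1000, 200, strict=True))   # None
```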

join_dereplication_with_binary() → None[source]

Join dereplication clustering results with the binary table.

Reads the MMseqs2 clustering output file and joins it with the binary table based on scaffold ID and region coordinates (Start, End). This associates each extracted region in the binary table with its dereplication status (representative or redundant) and representative region identifier. The resulting table is sorted by representative and dereplication status for clarity.

The clustering table is parsed to extract scaffold and coordinate information from compound region identifiers. Dereplication status is determined by comparing each region’s identifier with its assigned representative.

Mutates:

self.binary_df (pd.DataFrame): Updated in-place with additional columns for: ‘representative’ and ‘dereplication_status’, and sorted by these columns. The temporary ‘Region’ column is removed after processing.

Returns:

None

Raises:

FileNotFoundError – If the dereplication table cannot be read
RuntimeError – If the dereplication table is empty, or has empty values due to an invalid filename formatting.
RuntimeError – If the binary table is empty after joining with the dereplication table

Notes

The Region temporary column in self.binary_df was added by joining with the MMseqs2 dereplication table. This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.

run()[source]

Execute the complete local region-based dereplication pipeline.

Orchestrates all processing steps in sequence: stages local genome assemblies for region extraction, extracts genomic regions of interest from the staged genomes, runs MMseqs2 dereplication on the extracted regions, joins the dereplication clustering results with the binary table, recovers hit diversity information from the dereplication output, filters the original session based on dereplication results, and generates final output files with dereplication metadata.

Cleans up temporary working directories upon completion.

Returns:: None

local_run

class cagecleaner.local_run.LocalRun(parsed_args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving local sequence files.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:: LocalRegionRun: Region-based dereplication for hits in local sequences. LocalGenomeRun: Whole-genome dereplication for hits in local sequences.

abstractmethod join_dereplication_with_binary()[source]

Join dereplication results with the binary table.

Mutates:

self.binary_df: Adds ‘representative’ and ‘dereplication_status’ columns.

Expected Result:: self.binary_df should now have the columns:
- representative: Genome ID of the dereplication representative
- dereplication_status: ‘dereplication_representative’ | ‘redundant’
Returns:: None

Notes

This is the abstract method inherited from the Run parent class. It is not meant to be implemented at this level. Only the child classes inheriting this method are expected to provide a workflow-dependent specific implementation.

prepare_genomes() → None[source]

Prepare the genome sequence files in the specified genome directory for dereplication.

Checks whether the filenames of the genome sequence files, ignoring file extensions, are among the names of the organisms in the Session object. Also checks whether there are fasta and genbank files in the user-specified genome folder, converting genbank files to fasta files on-the-fly.

Adds a column assembly_file to binary table specifying the filepath of each scaffold’s associated genome assembly. In Genbank mode, this will point to converted files in the temporary directories.

Mutates:

self.binary_df (pd.DataFrame): Updated in-place with an additional column for: ‘assembly_file’ and ‘dereplication_status’.

Returns:

None

Raises:

ValueError – If the genome of an organism is not present in the user-supplied genome directory.
RuntimeError – If no fasta or genbank files have been found in the supplied genome directory.

Notes

The sequence files in the user genome folder should be either all fasta files or all genbank files. A mix of both formats is not supported.

region_run

class cagecleaner.region_run.RegionRun(parsed_args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving region-based dereplication.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:: LocalRegionRun: Region-based dereplication for hits in local sequences. RemoteRegionRun: Region-based dereplication for hits in remote sequences.

dereplicate_regions()[source]

Dereplicate the genomic regions in the dereplication input folder using MMseqs2. MMseqs2 output is stored in TEMP_DIR/derep_out.

Returns:: None
Raises:: RuntimeError – If the MMseqs2 command run fails, or if the input folder is empty or does not exist.

remote_genome_run

class cagecleaner.remote_genome_run.RemoteGenomeRun(parsed_args)[source]

Bases: RemoteRun, GenomeRun

Subclass orchestrating the workflow for dereplication by whole-genome similarity using genomes downloaded from NCBI.

This class combines remote genome downloading functionality with local genome dereplication. It links NCBI assembly accession IDs to the scaffold IDs in the binary table, downloads the corresponding genome assemblies, maps the scaffold IDs back to the downloaded assemblies, performs whole-genome dereplication using skDER, and integrates dereplication results back into the binary table. Unmapped scaffolds (scaffolds for which no assembly file was found) are tracked and reported separately.

Inherits from:: RemoteRun: Intermediary class providing common downloading configurations. GenomeRun: Intermediary class providing genome dereplication utilities.

fetch_assembly_ids() → None[source]

Fetch NCBI assembly accession IDs for all scaffold IDs in the binary table.

Extracts scaffold IDs from the binary table and categorises them as RefSeq or GenBank IDs. For each category, retrieves corresponding assembly accession IDs using NCBI Entrez Direct utilities. Combines and deduplicates the results, storing them in self.assembly_accessions.

Mutates:: self.assembly_accessions (list): Populated with deduplicated NCBI assembly accession IDs.

Raises:: RuntimeError – If no assembly IDs have been retrieved.
Returns:: None

fetch_genomes() → None[source]

Fetch genome assemblies from NCBI in batches using the datasets command-line tool.

Downloads assemblies in batches (default 300 per batch) using the NCBI Datasets CLI. For each batch, fetches genome data using NCBI datasets, rehydrates the dehydrated files with gzip compression, and moves all resulting genome files to the temporary genome directory. Removes version digits from assembly IDs to ensure the latest versions are downloaded.

Mutates:: Populates self.TEMP_GENOME_DIR with downloaded gzipped genome files.

Returns:: None
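The batching and version-stripping steps can be sketched without touching NCBI at all. The helper names strip_version and batched are hypothetical, the accessions are generated placeholders, and the actual download step with the datasets tool is deliberately left out.

```python
from typing import Iterator, List


def strip_version(accession: str) -> str:
    """Drop the trailing '.N' version from an accession such as 'GCF_000005845.2'."""
    return accession.split(".", 1)[0]


def batched(items: List[str], size: int = 300) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Placeholder accessions, generated only to show the batch sizes.
accessions = [f"GCF_{n:09d}.1" for n in range(1, 701)]
batches = list(batched([strip_version(a) for a in accessions]))
print([len(b) for b in batches])  # [300, 300, 100]
```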

join_assemblies_with_binary() → None[source]

Join the scaffold-to-assembly mapping obtained by map_scaffolds_to_assemblies() with the binary table and remove unmapped scaffolds.

Converts self.scaffold_assembly_pairs dictionary to a DataFrame and joins with self.binary_df on the Scaffold column. Identifies scaffolds that could not be linked to any assembly file, logs a warning, saves them to an output file, and removes them from the binary table.

Mutates:: self.binary_df (pd.DataFrame): Adds ‘assembly_file’ column and removes rows with unmatched scaffolds.

Returns:: None
Raises:: RuntimeError – If the binary table is empty after joining with the mapping table.

join_dereplication_with_binary() → None[source]

Join the genome dereplication clustering results with the binary table.

Reads the skDER clustering output file and converts it to a DataFrame with assembly filenames and dereplication status. Joins this data with self.binary_df on the assembly_file column. Sorts the resulting table by representative genome and dereplication status.

Mutates:: self.binary_df (pd.DataFrame): Adds ‘representative’ and ‘dereplication_status’ columns and sorts by these values.

Returns:

None

Raises:

FileNotFoundError – If the dereplication table cannot be read
RuntimeError – If the dereplication table is empty.
RuntimeError – If the binary table is empty after joining with the dereplication table.

Note

This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.

map_scaffolds_to_assemblies() → None[source]

Map each scaffold ID from the binary table to its corresponding downloaded assembly file.

Iterates through all genome files in the temporary genome directory and extracts scaffold IDs using BioPython’s SeqIO. For each assembly file, compares its scaffolds (with and without prefixes) to those in the binary table. When matches are found, stores the mapping in self.scaffold_assembly_pairs. Prefixes are stripped during comparison because NCBI sometimes omits them in downloaded genomes.

Mutates:: self.scaffold_assembly_pairs (dict): Populated with (scaffold_id: assembly_filename) pairs.

Returns:: None
Raises:: RuntimeError – If the dereplication input folder does not contain any fasta file.
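The prefix-insensitive comparison can be sketched as follows. Several things here are assumptions: the prefix pattern (two capital letters plus an underscore, as in "NZ_"), the helper names, and the input dictionary of scaffold IDs per file, which stands in for what would be read from the assembly files with SeqIO.

```python
import re
from typing import Dict, Iterable

# Assumption: a prefix is two capital letters plus an underscore, e.g. "NZ_" or "NC_".
PREFIX = re.compile(r"^[A-Z]{2}_")


def strip_prefix(scaffold_id: str) -> str:
    """Remove a leading RefSeq-style prefix, if there is one."""
    return PREFIX.sub("", scaffold_id)


def map_scaffolds(assemblies: Dict[str, Iterable[str]],
                  wanted: Iterable[str]) -> Dict[str, str]:
    """Map each wanted scaffold ID to the assembly file that contains it."""
    by_bare_id = {strip_prefix(s): s for s in wanted}
    pairs: Dict[str, str] = {}
    for filename, scaffolds in assemblies.items():
        for scaffold in scaffolds:
            original = by_bare_id.get(strip_prefix(scaffold))
            if original is not None:
                pairs[original] = filename
    return pairs
```

Scaffolds that never appear in the result are the unmapped ones that a later step would report and remove.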

run()[source]

Execute the complete remote genome dereplication pipeline.

Orchestrates all processing steps in sequence: fetches NCBI assembly IDs for scaffold IDs, downloads genomes, maps scaffold IDs to assembly IDs, performs whole-genome dereplication, joins dereplication results with the binary table, recovers hit diversity according to user parameters, filters the original session file, and generates output files. Cleans up the temporary directory upon completion.

Returns:: None

remote_region_run

class cagecleaner.remote_region_run.RemoteRegionRun(parsed_args)[source]

Bases: RemoteRun, RegionRun

Subclass orchestrating the workflow for dereplication by region sequence similarity using regions downloaded from NCBI.

This class combines remote region downloading functionality with local region dereplication. It downloads genomic regions (with optional sequence margins) from NCBI based on scaffold IDs and coordinates in a binary table, performs MMseqs2-based sequence dereplication on the downloaded regions, and integrates dereplication results back into the binary table. Contig edge cases, where a region with sequence margins extends beyond the scaffold boundaries, are handled according to user-specified behavior: such regions are either kept but clipped (permissive) or discarded (strict).

Inherits from:: RemoteRun: Intermediary class providing common downloading configurations. RegionRun: Intermediary class providing region dereplication utilities

fetch_regions() → None[source]

Fetch genomic regions from NCBI based on scaffold coordinates with optional sequence margins.

Extracts region coordinates from the binary table and applies user-specified margins (upstream and downstream sequence extensions). Handles regions that extend beyond contig boundaries by either discarding them (strict mode) or clipping them (permissive mode). Fetches contig lengths from NCBI using Entrez utilities. Logs statistics about discarded or clipped regions. Downloads the regions.

Mutates:: self.binary_df (pd.DataFrame): Adds ‘Contig_length’ column from NCBI data. self.DEREP_IN_DIR: Populates folder with downloaded gzipped genomic region FASTA files.

Returns:: None

join_dereplication_with_binary() → None[source]

Join MMseqs2 dereplication clustering results with the binary table.

Reads the dereplication clustering output file and parses region coordinates. Assigns dereplication status (‘dereplication_representative’ or ‘redundant’) based on whether each region’s accession matches its representative. Handles four possible cases of region boundary mismatches due to region boundary adjustments during preprocessing:

No contig edges: Exact coordinate match after removing margins.
Upstream edge: Match on scaffold and end coordinate (contig was clipped at the upstream edge).
Downstream edge: Match on scaffold and start coordinate (contig was clipped at the downstream edge).
Both edges: Interval containment where the original cluster region is within the dereplicated region (contig was clipped at both edges).

Concatenates results from all cases and removes temporary helper columns. Sorts the final table by representative ID and dereplication status.

Mutates:

self.binary_df (pd.DataFrame): Joined with dereplication results; adds ‘representative’ and ‘dereplication_status’ columns; removes ‘Contig_length’ and ‘Region’ columns.

Returns:

None

Raises:

FileNotFoundError – If the dereplication table cannot be read
RuntimeError – If the dereplication table is empty, or has empty values due to an invalid filename formatting.
RuntimeError – If the binary table is empty after joining with the dereplication table

Notes

The temporary Region column in self.binary_df was added by joining with the MMseqs2 dereplication table. The temporary Contig_length column in self.binary_df was added by fetch_regions, when the sequence margins were applied while fetching the regions. This is the workflow-specific implementation of the abstract method inherited from its grandparent class Run.
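The four matching cases listed for this method can be sketched as a classifier over two intervals. This is an interpretation, not the actual implementation: it assumes a single symmetric margin, that both intervals lie on the same scaffold, and that the fetched region still includes its margins. The name match_case and the returned labels are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]


def match_case(cluster: Interval, region: Interval, margin: int) -> str:
    """Classify how a cluster interval lines up with a fetched region.

    `region` is the fetched interval including margins; it is assumed to
    be on the same scaffold as `cluster`.
    """
    start_ok = region[0] + margin == cluster[0]  # upstream margin intact
    end_ok = region[1] - margin == cluster[1]    # downstream margin intact
    if start_ok and end_ok:
        return "no_edge"
    if end_ok:
        return "upstream_edge"    # start was clipped, so match on end only
    if start_ok:
        return "downstream_edge"  # end was clipped, so match on start only
    if region[0] <= cluster[0] and cluster[1] <= region[1]:
        return "both_edges"       # fall back to interval containment
    return "no_match"


print(match_case((1000, 2000), (500, 2500), margin=500))  # no_edge
print(match_case((200, 900), (1, 1000), margin=500))      # both_edges
```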

run()[source]

Execute the complete remote region dereplication pipeline.

Orchestrates all processing steps in sequence: fetches genomic regions from NCBI with optional sequence margins and contig boundary handling, performs MMseqs2-based dereplication on the downloaded regions, joins the dereplication results with the binary table, recovers hit diversity according to user parameters, filters the original session file, and generates output files. Cleans up the temporary directory upon completion.

Returns:: None

remote_run

class cagecleaner.remote_run.RemoteRun(parsed_args)[source]

Bases: Run

Abstract intermediary class grouping the methods shared by every run involving remote sequence files.

Inherits from:

Run: Base class providing argument parsing, hit recovery, session filtering and output generation functionalities

See Also:: RemoteRegionRun: Region-based dereplication for hits in remote sequences. RemoteGenomeRun: Whole-genome dereplication for hits in remote sequences.

abstractmethod join_dereplication_with_binary()[source]

Join dereplication results with the binary table.

Mutates:

self.binary_df: Adds ‘representative’ and ‘dereplication_status’ columns.

Expected Result:: self.binary_df should now have the columns:
- representative: Genome ID of the dereplication representative
- dereplication_status: ‘dereplication_representative’ | ‘redundant’
Returns:: None

Notes

This is the abstract method inherited from the Run parent class. It is not meant to be implemented at this level. Only the child classes inheriting this method are expected to provide a workflow-dependent specific implementation.

communication

cagecleaner.communication.download_regions(regions: DataFrame, directory: Path, download_workers: int, no_progress: bool = False) → None[source]

Download multiple genomic sequence regions in parallel from NCBI.

Accepts a DataFrame of genomic regions and uses multi-threaded downloads to retrieve each region from NCBI. Automatically limits concurrent downloads to a maximum of 2.

Parameters:

regions (pd.DataFrame) – A DataFrame with at minimum the following columns:
- ‘Scaffold’: NCBI nucleotide accession identifiers
- ‘Start’: Start position of the region (integer or convertible to int)
- ‘End’: End position of the region (integer or convertible to int)
directory (Path) – Directory where downloaded files will be saved.
download_workers (int) – Number of concurrent download threads requested. If greater than 2, will be automatically reduced to 2.
no_progress (bool, optional) – If True, suppress progress bar display. Defaults to False.

Raises:

ValueError – If the provided regions dataframe is empty.
IOError – If the download directory does not exist.

Returns:

None
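The worker cap described for download_workers can be sketched with a thread pool. The function run_capped and the echo function standing in for a real download are hypothetical; only the limit of 2 comes from the description of this function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_DOWNLOAD_WORKERS = 2  # cap stated in the function description


def run_capped(tasks: List[str], fetch: Callable[[str], str],
               requested_workers: int) -> List[str]:
    """Run `fetch` over `tasks` with at most MAX_DOWNLOAD_WORKERS threads."""
    workers = max(1, min(requested_workers, MAX_DOWNLOAD_WORKERS))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, tasks))


# Stand-in for a real download: it only echoes the region identifier.
print(run_capped(["A:1-10", "B:5-50"], lambda region: f"fetched {region}", 8))
```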

cagecleaner.communication.fetch_contig_lengths(contig_ids: list, max_attempts=3)[source]

Retrieve sequence lengths for given NCBI contig accessions using BioPython’s Entrez API.

Fetches summary information for a list of nucleotide contig IDs from NCBI using Efetch and extracts their sequence lengths. Returns results as a deduplicated DataFrame.

Parameters:

contig_ids (list) – A list of NCBI nucleotide accession identifiers (contig IDs) for which to retrieve sequence lengths.
max_attempts (int, optional) – Maximum number of retry attempts on network/server errors. Defaults to 3.

Raises:

RuntimeError – If the fetch fails the maximum number of times, or if it returns an empty response

Returns:

A DataFrame with columns:

‘Scaffold’: NCBI accession version identifiers
‘Contig_length’: Sequence length in base pairs (integer)

Return type:

pd.DataFrame
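The retry behaviour controlled by max_attempts can be sketched with a generic wrapper. The name with_retries, the exception types caught, and the choice to treat an empty response as a failed attempt are assumptions; the real function may handle an empty response differently.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], max_attempts: int = 3, delay: float = 0.0) -> T:
    """Call `fetch` until it succeeds or `max_attempts` is exhausted."""
    last_error = None
    for _ in range(max_attempts):
        try:
            result = fetch()
            if not result:
                # Assumption: an empty response counts as a failed attempt.
                raise RuntimeError("empty response")
            return result
        except (OSError, RuntimeError) as error:  # network/server-style failures
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"fetch failed after {max_attempts} attempts") from last_error
```

Passing the network call in as a function keeps the retry logic independent of the Entrez call it wraps.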

cagecleaner.communication.get_assembly_accessions(scaffolds: list, source: str, no_progress: bool = False, max_attempts: int = 3, batch_size: int = 100) → list[source]

Retrieve Assembly accession IDs from NCBI given a list of Nucleotide scaffold IDs.

Converts Nucleotide accession IDs to their associated Assembly IDs using BioPython’s Entrez elink and esummary services. Automatically redirects WGS (Whole Genome Shotgun) records to their master records when requesting Genbank IDs.

Parameters:

scaffolds (list) – A list of NCBI Nucleotide accession identifiers (e.g., GenBank or RefSeq accessions). WGS Genbank records (pattern: XXXX12345678.1) are automatically converted to their master records.
source (str) – The database to extract IDs from. The only valid values are ‘Genbank’ and ‘RefSeq’. This determines which synonym field is extracted from the Assembly summary.
no_progress (bool, optional) – If True, suppress progress bar display. Defaults to False.
max_attempts (int) – Number of retries in case of a failure. Defaults to 3.
batch_size (int) – Number of accession IDs to link in one batch. Defaults to 100.

Returns:

A list of Assembly accession IDs associated with the input scaffolds. Returns an empty list if the retrieval fails after all retry attempts.

Return type:

list

Raises:

ValueError – If an invalid NCBI Nucleotide database name was supplied, or if the supplied scaffold ID list is empty.
RuntimeError – If the NCBI Assembly IDs could not be fetched from NCBI.

Notes

Processes scaffolds in batches of batch_size (100 by default).
Retries up to max_attempts times (3 by default) on network/server errors.

file_utils

cagecleaner.file_utils.convert_genbanks_to_fastas(in_dir: Path, out_dir: Path, workers: int = 1, no_progress: bool = False) → None[source]

Convert all genbank files in an input directory to compressed fasta files.

All genbank files in the input directory are converted to compressed fasta files in parallel using any2fasta and gzip.

Parameters:

in_dir (Path) – input directory containing the genbank files
out_dir (Path) – output directory where the new fasta files will be written
workers (int) – number of threads for parallelisation
no_progress (bool) – flag to disable showing a progress bar while converting

Returns:

None

Raises:

RuntimeError – If the input directory is empty.

cagecleaner.file_utils.is_fasta(file: str | Path) → bool[source]

Check whether the filename ends in any of the accepted fasta suffixes. Gzipped files are allowed.

Parameters:: file (str | Path) – filename to check
Returns:: Boolean result of the check
Return type:: bool

cagecleaner.file_utils.is_genbank(file: str | Path) → bool[source]

Check whether the filename ends in any of the accepted genbank suffixes. Gzipped files are allowed.

Parameters:: file (str | Path) – filename to check
Returns:: Boolean result of the check
Return type:: bool

cagecleaner.file_utils.read_genome(file: str | Path)[source]

Open the appropriate file handle for a genome file.

Automatically distinguishes between compressed and uncompressed files based on the file extension.

Parameters:: file (str | Path) – genome file to open
Returns:: An open file handle for the genome file
Return type:: handle
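Choosing a handle based on the file extension can be sketched as below. The name open_genome and the decision to open both variants in text mode are assumptions made for illustration.

```python
import gzip
from pathlib import Path
from typing import IO, Union


def open_genome(file: Union[str, Path]) -> IO[str]:
    """Open a genome file in text mode, transparently handling .gz files."""
    path = Path(file)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")
```

The returned handle can be used in a with statement, or passed to a parser that accepts an open handle, regardless of whether the file is compressed.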

cagecleaner.file_utils.remove_suffixes(string: str | Path) → str[source]

Split off any valid sequence file suffix either in the form of .<suffix> or .<suffix>.gz.

Parameters:: string (str | Path) – string from which the suffix should be split off, if any.
Returns:: the string without the sequence file suffix
Return type:: str
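The suffix checks and suffix removal documented in this module can be sketched together. The suffix lists below are assumptions (the sets CAGEcleaner actually accepts may differ), and the helper names has_suffix and strip_sequence_suffix are hypothetical.

```python
from pathlib import Path
from typing import Union

# Assumed suffix lists; the sets CAGEcleaner actually accepts may differ.
FASTA_SUFFIXES = (".fasta", ".fna", ".fa")
GENBANK_SUFFIXES = (".gbff", ".gbk", ".gb")


def has_suffix(file: Union[str, Path], suffixes: tuple) -> bool:
    """True if the name ends in one of `suffixes`, optionally followed by .gz."""
    name = str(file)
    if name.endswith(".gz"):
        name = name[:-3]
    return name.endswith(suffixes)


def strip_sequence_suffix(file: Union[str, Path]) -> str:
    """Remove a trailing sequence suffix (and .gz, if present); else return as-is."""
    name = str(file)
    bare = name[:-3] if name.endswith(".gz") else name
    for suffix in FASTA_SUFFIXES + GENBANK_SUFFIXES:
        if bare.endswith(suffix):
            return bare[: -len(suffix)]
    return name


print(strip_sequence_suffix("GCF_1.fna.gz"))  # GCF_1
```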

utils

cagecleaner.utils.correct_layouts(binary_df: DataFrame) → DataFrame[source]

Correct cluster layouts on strands complementary to ones with existing layouts.

Identifies and resolves equivalent cases where a certain cluster layout is represented on both strands in a set of genomes. Uses a directed graph to detect flipped layouts and corrects them accordingly.

Example

Assembly 1 has a cluster with a gene layout ABC on the positive strand (strand location layout +++). Assembly 2 has a cluster with a gene layout CBA on the negative strand (strand location layout ---).

Since there is no way to recognise a strand as the positive or negative one, assembly 1’s positive strand is likely homologous to assembly 2’s negative strand.

Hence, the layout of the cluster of assembly 2 is flipped to ABC (strand location layout +++).

Correction strategy:

Make a directed network and remove backloops.
Nodes: pairs of cluster layout and strand location layout strings
Edges: one node being the complement of an existing layout in the dataset

Example:

Reconsidering the previous example, we will have two nodes, (ABC,+++) and (CBA,---), and two edges, (ABC,+++) -> (CBA,---) and (CBA,---) -> (ABC,+++).

These two edges form a backloop, so they are pruned to just one edge. For example, (ABC,+++) -> (CBA,---).

The remaining edges in the network will be used to correct the layouts, so in this case layouts (ABC,+++) will be corrected to (CBA,---). In other words, all gene layouts ABC that are on the positive strand are equivalent to gene layouts CBA on the negative strand.

Parameters:

binary_df (pd.DataFrame) – A DataFrame containing ‘Strand’ and ‘Layout_group’ columns where Strand is a tuple of integers and Layout_group is a tuple of layout identifiers.

Returns:

The input DataFrame with corrected ‘Layout_group’ column where equivalent layouts on complementary strands have been flipped.

Return type:

pd.DataFrame

Notes

Modifies the ‘Layout_group’ column in-place within the returned DataFrame.
Uses NetworkX to detect bidirectional edges (back-loops) between layout pairs.
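The correction strategy can be sketched in plain Python, without NetworkX, to show the idea of pruning back-loops. This is not the library's implementation: the helper names are hypothetical, strands are assumed to be encoded as 1 and -1, and the rule for which edge of a back-loop survives (the one starting at the smaller node in sort order) is an arbitrary choice made here for determinism.

```python
from typing import Dict, List, Tuple

Node = Tuple[Tuple[str, ...], Tuple[int, ...]]  # (gene layout, strand layout)


def complement(node: Node) -> Node:
    """Same cluster read from the opposite strand: reverse order, flip signs."""
    layout, strands = node
    return layout[::-1], tuple(-s for s in reversed(strands))


def correction_map(nodes: List[Node]) -> Dict[Node, Node]:
    """Map one member of every complementary pair onto the other."""
    present = set(nodes)
    # Directed edges: node -> complement, when the complement also occurs.
    edges = {(n, complement(n)) for n in present if complement(n) in present}
    corrections: Dict[Node, Node] = {}
    for source, target in sorted(edges):
        if source == target:
            continue  # layout is its own complement, nothing to correct
        if (target, source) in edges and target in corrections:
            continue  # back-loop: the reverse edge was already kept
        corrections[source] = target
    return corrections


abc = (("A", "B", "C"), (1, 1, 1))
cba = (("C", "B", "A"), (-1, -1, -1))
print(correction_map([abc, cba]))
```

With the two nodes from the example, only the edge from (ABC,+++) to (CBA,---) survives, so every (ABC,+++) layout would be rewritten as (CBA,---).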

cagecleaner.utils.generate_cblaster_session(hits: Path, clusters: Path, queries: Path, mode: str) → Session[source]

Generate a cblaster Session object from TSV tabular data files.

Reads cluster hit data from a folder containing hits, clusters, and queries TSV files and reconstructs a hierarchical cblaster Session object. The function parses tabular data into the nested structure required by cblaster (organisms > scaffolds > clusters > subjects).

Parameters:

hits (Path) – Path to the hits TSV file, containing individual hit records with columns db_id, query, scaff, strand, coords, evalue, score, seqid, tcov
clusters (Path) – Path to the clusters TSV file, containing cluster records with columns number, hits, start, end, length, score, scaff, taxon_name, taxon_id
queries (Path) – Path to the queries TSV file, containing query sequence records with at least id, start, end columns
mode (str) – The search mode to be set in the session parameters (e.g., ‘remote’, ‘local’).

Returns:

A cblaster Session object populated with organisms, scaffolds, clusters, and subjects (hits) reconstructed from the input TSV files.

Return type:

Session

Notes

Query subjects are artificially positioned with a 500 amino acid margin between them, as in the original cblaster implementation.
Hit coordinates are validated to ensure they fall within the cluster’s defined interval.
The session includes placeholder fields (e.g., sequence data set to None/empty strings) that are not essential for the plotting and file export functionalities.
Organisms are indexed by (taxon_id, taxon_name) pairs internally but stored as a flat list in the final session object.
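The step from flat TSV rows to a nested organisms > scaffolds > clusters structure can be sketched with the standard library. The rows below are made-up toy data using a subset of the documented clusters columns, and the resulting dictionaries only imitate the nesting; they are not cblaster Session objects.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for a clusters TSV, using a subset of the documented columns.
CLUSTERS_TSV = (
    "number\tstart\tend\tscaff\ttaxon_name\ttaxon_id\n"
    "1\t100\t900\tscaffold_A\tStreptomyces sp. X\t1\n"
    "2\t5000\t7000\tscaffold_A\tStreptomyces sp. X\t1\n"
    "3\t10\t800\tscaffold_B\tBacillus sp. Y\t2\n"
)

# Organisms keyed by (taxon_id, taxon_name) -> scaffolds -> list of clusters.
organisms = defaultdict(lambda: defaultdict(list))
for row in csv.DictReader(io.StringIO(CLUSTERS_TSV), delimiter="\t"):
    key = (row["taxon_id"], row["taxon_name"])
    organisms[key][row["scaff"]].append(
        {"number": int(row["number"]), "start": int(row["start"]), "end": int(row["end"])}
    )

# Flatten the index into a list, one entry per organism.
session_like = [
    {"name": name, "scaffolds": dict(scaffolds)}
    for (_, name), scaffolds in organisms.items()
]
print([organism["name"] for organism in session_like])
```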

cagecleaner.utils.run_command(cmd_list: list, max_attempts: int = 3) → None[source]

Execute an externally composed command with automatic retry logic and streaming output logging.

Runs a subprocess command with up to max_attempts retries. Captures both stdout and stderr in separate daemon threads, logging output in real-time. Logs debug information for stdout, warnings for stderr, and errors if the command fails.

Parameters:

cmd_list (list) – A list where the first element is the executable name (resolved via shutil.which) and remaining elements are command arguments.
max_attempts (int, optional) – Maximum number of times to retry the command if it fails (non-zero exit code). Defaults to 3.

Returns:

None

Raises:

RuntimeError – If the supplied command fails the maximum number of times.

Notes

The first element of cmd_list must be an executable available in the system PATH.
Retries only occur on non-zero exit codes.
Output is logged using the module’s LOG logger.
Daemon threads ensure subprocess output is captured and logged in real-time.
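The combination of executable lookup, streamed logging and retries can be sketched as follows. The names run_with_retries and _pump are hypothetical, and raising RuntimeError for a missing executable is an assumption rather than documented behaviour.

```python
import logging
import shutil
import subprocess
import threading

LOG = logging.getLogger(__name__)


def _pump(stream, log) -> None:
    """Forward each line of a subprocess stream to a logging function."""
    for line in stream:
        log(line.rstrip())
    stream.close()


def run_with_retries(cmd_list: list, max_attempts: int = 3) -> None:
    """Run a command, retrying on non-zero exit codes."""
    executable = shutil.which(cmd_list[0])
    if executable is None:  # assumption: treat a missing executable as an error
        raise RuntimeError(f"{cmd_list[0]} not found on PATH")
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.Popen([executable, *cmd_list[1:]], stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, text=True)
        # One daemon thread per stream, so neither pipe can fill up and block.
        threads = [
            threading.Thread(target=_pump, args=(proc.stdout, LOG.debug), daemon=True),
            threading.Thread(target=_pump, args=(proc.stderr, LOG.warning), daemon=True),
        ]
        for thread in threads:
            thread.start()
        exit_code = proc.wait()
        for thread in threads:
            thread.join()
        if exit_code == 0:
            return
        LOG.error("attempt %d/%d failed with exit code %d", attempt, max_attempts, exit_code)
    raise RuntimeError(f"command failed {max_attempts} times: {cmd_list}")
```

Reading stdout and stderr in separate threads is what allows output to be logged while the command is still running, rather than only after it exits.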