Usage

Normally, you don’t have to choose between remote and local modes. CAGEcleaner automatically deduces this from the session file.

Remote mode

The remote mode does not require other file inputs than your session file. A dereplication run using default settings can be started as simply as:

cagecleaner -s session.json

A slightly more complex example to run CAGEcleaner at 14 cores in full-genome dereplication mode with an identity threshold of 90 % and a coverage threshold of 50 %:

cagecleaner -s session.json --cores 14 -i 90 -c 50

A more complicated example to run CAGEcleaner at 14 cores in region dereplication mode with both thresholds at 95 %, adding a sequence margin of 5 kb to both sides of each cluster, and discarding clusters that are at a contig edge.

cagecleaner -s session.json --cores 14 -i 90 -c 90 --method regions -m 5000 --strict

A complex example that runs CAGEcleaner at 14 cores in region dereplication mode with the identity threshold at 95 %, the coverage threshold at 80 %, a sequence margin of size 10 kb, saving the results in a subfolder results, keeping all intermediate files, excluding a scaffold with a duplicate ID from the analysis, bypassing another one with a unique label, and disabling the hit recovery by outlier homology scores.

cagecleaner -s session.json --cores 14 -i 95 -c 80 --method regions -m 10000 -o results --keep_intermediate --exs assembly1:ctg1 --bys ctg131 --no_recovery_score

Local mode

To get a local mode session, you probably have generated a local cblaster database from a folder of fasta or genbank files. CAGEcleaner needs the path to that folder to be passed on via the -g flag. For each one of the examples in section above, you can just add the -g flag and that path to run CAGEcleaner from a local search session.

A dereplication run using default settings with genome folder genomes can be started as simply as:

cagecleaner -s session.json -g genomes

A slightly more complex example to run CAGEcleaner at 14 cores in full-genome dereplication mode with an identity threshold of 90 % and a coverage threshold of 50 %.

cagecleaner -s session.json -g genomes --cores 14 -i 90 -c 50

A more complicated example to run CAGEcleaner at 14 cores in region dereplication mode with both thresholds at 95 %, adding a sequence margin of 5 kb to both sides of each cluster, discarding clusters that are at a contig edge.

cagecleaner -s session.json -g genomes --cores 14 -i 95 -c 95 --method regions -m 5000 --strict

A complex example that runs CAGEcleaner at 14 cores in region dereplication mode with the identity threshold at 95 %, the coverage threshold at 80 %, a sequence margin of size 10 kb, saving the results in a subfolder results, keeping all intermediate files, excluding a scaffold with a duplicate ID from the analysis, bypassing another one with a unique label, and disabling the hit recovery by outlier homology scores.

cagecleaner -s session.json -g genomes --cores 14 -i 95 -c 80 --method regions -m 10000 -o results --keep_intermediate --exs assembly1:ctg1 --bys ctg131 --no_recovery_score

Output files

This tool produces at least six output files:

filtered_session.json a filtered session file
filtered_binary.txt a cblaster-like binary presence/absence table, containing only the retained hits
filtered_summary.txt a cblaster-like summary file, containing only the retained hits
extended_binary.txt a cblaster-like binary table extended with the following columns
- Number Number of the cluster in the original session
- Strand Tuple representing on which strand each query gene homolog was found.
- Layout_group Tuple representing the synteny layout
- representative the representative genome or region
- dereplication_status status of this cluster (Was it removed/kept? Why?)
- assembly_file fasta file of the hosting genome or region
retained_cluster_numbers.txt the cluster numbers of each selected hit
genome_cluster_sizes.txt a tab-separated text file with the number of genomes in each dereplication genome cluster

In genome dereplication mode, the file unmapped.scaffolds.txt may be generated. This file contains scaffold IDs for which no associated genome file was found. There are several reasons this might happen: ID mismatch between the ID in the session and the ID in the downloaded fasta file, a genome/region failed downloading…

Optionally, the downloaded sequence files and the dereplication output files generated by skDER or MMseqs2 can be saved in an additional subfolder downloads or dereplication, resp. (see also flags keep_downloads, --keep_dereplication, and --keep_intermediate).

In extended_binary.txt, there are four possible dereplication statuses.

dereplication_representative	Hit part of the sequence selected as sequence cluster representative.
readded_by_content	Kept due to a different group layout than the dereplication representative.
readded_by_score	Kept due to an outlier homology score.
redundant	Not retained and therefore removed.

Input from TSV files

CAGEcleaner can process outputs from other gene mining tools than cblaster, although you need to wrangle your non-cblaster output into three TSV files with specific formatting (see also the example files). Using our provided helper tool cagecleaner-generate-session, generate then a new cblaster session file, which you can use as input for CAGEcleaner.

The formatting of these three TSV files is described below.

*Don’t forget the header line in each file!*

hits.tsv

Column	Description
db_id	ID of the homolog (e.g. an NCBI Protein ID, an arbitrary gene label from a Bakta annotation)
query	ID of the matching query
scaff	ID of the scaffold that harbours this homolog (e.g. an NCBI Nucleotide ID, a contig label of your own local genome assembly)
strand	strand location of the homolog
coords	start and end coordinates of the homolog on the scaffold, separated by two dots. For multiple exons, use the following formatting `join(start1..end1,start2..end2)`.
evalue	a probabilistic value for the homolog
score	a score for the homolog
seqid	sequence identity of the homolog with its matching query
tcov	target coverage of the homolog with its matching query

clusters.tsv

Column	Description
number	an arbitrary but unique number to identify each cluster
hits	a comma-separated list of hit IDs (e.g. an NCBI Protein ID, an arbitrary gene label from a Bakta annotation)
start	starting coordinate of the cluster on the scaffold
end	ending coordinate of the cluster on the scaffold
length	length of the cluster in bp
score	a score for the cluster (e.g. the cblaster score)
scaff	an ID of the scaffold that harbours this cluster (e.g. an NCBI Nucleotide ID, a contig label of your own local genome assembly)
strand	strand location of the cluster
taxon_name	name of the organism (can be a human-readable name, or a NCBI Assembly ID)
taxon_id	a taxon ID (e.g. the NCBI taxon ID, an arbitrary ID)

queries.tsv

Column	Description
id	a label for the query
start	a starting coordinate for the query (can be arbitrary)
end	an ending coordinate for the query (can be arbitrary)

Hit recovery

The sequence diversity in genomes or genomic neighbourhoods does not always correlate well with the diversity in the gene clusters themselves, resulting some interesting clusters being discarded unnecessarily when not taking precautions. CAGEcleaner automatically recovers cluster diversity using the two strategies outlined below.

Cluster content

This strategy assesses the number of homologs that were found for each query gene.

For example, suppose you found two groups of hits. Group 1 has one gene A and one gene B, and contains the hit that was retained due to the dereplication. On the other hand, group 2 has two genes A and one gene B. If you would stick to the hit reduction after the dereplication, group 2 would not be represented anymore.

In each content group not containing the dereplication representative hit, the hit to be retained is chosen randomly.

Note

This recovery method may also recover redundant cluster hits of which a query homolog was missed by your mining tool, for example because of a too high e-value.

Homology score

This strategy assesses the homology score of each hit.

For example, suppose you found a group of gene clusters with similar gene contents. However, there is a cluster X that has evolved quite rapidly, or that has swapped one of its genes for a paralog. This shift translates in a more remote homology and thus a lower cblaster score. This diverging score is identified using outlier statistics and the hit is retained. If you would stick to the hit reduction after the dereplication, cluster X would have been lost.

Note

When recovery by cluster content is enabled, outliers are determined per content layout group.

Note

By default, outliers scores are required to be sufficiently different from the mean score, implemented using an absolute score threshold. Without it, you may find an outlier with a score of 3.17 in a group of hundreds of hits with a score of 3.16.

Example hit recovery

Let’s consider a (simplified) extended binary table output table where CAGEcleaner has added the representative, and the dereplication status for each row:

Organism	Scaffold	Score	Gene1	Gene2	Gene3	Representative	Dereplication status
org1	scaff1	3.17	1	1	1	org1	dereplication_representative
org2	scaff2	3.16	1	1	1	org1	redundant
org3	scaff3	3.17	1	0	1	org1	redundant
org4	scaff4	3.17	1	1	1	org4	dereplication_representative
org5	scaff5	2.92	1	0	1	org4	redundant
org6	scaff6	3.12	1	0	1	org4	redundant
org7	scaff7	3.16	1	1	1	org4	redundant
org8	scaff8	3.14	1	0	1	org4	redundant

The first step is to group the rows based on their corresponding representatives. In this case, there are two groups; org1 and org4. This will result in the following two tables:

REPRESENTATIVE = org1:

Organism	Scaffold	Score	Gene1	Gene2	Gene3	Representative	Dereplication status
org1	scaff1	3.17	1	1	1	org1	dereplication_representative
org2	scaff2	3.16	1	1	1	org1	redundant
org3	scaff3	3.17	1	0	1	org1	redundant

REPRESENTATIVE = org4:

Organism	Scaffold	Score	Gene1	Gene2	Gene3	Representative	Dereplication status
org4	scaff4	3.17	1	1	1	org4	dereplication_representative
org5	scaff5	2.92	1	0	1	org4	redundant
org6	scaff6	3.12	1	0	1	org4	redundant
org7	scaff7	3.16	1	1	1	org4	redundant
org8	scaff8	3.14	1	0	1	org4	redundant

Now notice that the gene cluster composition varies within the groupings, something that might also occur in a real-life example. Some hits contain all the genes from the queries, whereas others are missing gene2. In the first grouping, org3 is missing gene2. If recovery by content is enabled, CAGEcleaner will make two groups and select a random representative for each group if it doesn’t contain a dereplication representative. Hence, in this case, two hits are retained: org1 as representative for the structural group of org1 and org2, and org3 for its own structural group.

ALL GENES PRESENT:

Organism	Scaffold	Score	Gene1	Gene2	Gene3	Representative	Dereplication status
org1	scaff1	3.17	1	1	1	org1	dereplication_representative
org2	scaff2	3.16	1	1	1	org1	redundant

GENE2 MISSING:

Organism	Scaffold	Score	Gene1	Gene2	Gene3	Representative	Dereplication status
org3	scaff3	3.17	1	0	1	org1	redundant

Now let’s consider the second grouping with org4 as dereplication representative and divide this table by gene cluster content:

ALL GENES PRESENT:

Organism	Scaffold	Score	Gene1	Gene2	Gene3	Representative	Dereplication status
org4	scaff4	3.17	1	1	1	org4	dereplication_representative
org7	scaff7	3.16	1	1	1	org4	redundant

GENE2 MISSING:

Organism	Scaffold	Score	Gene1	Gene3	Representative	Dereplication status
org5	scaff5	2.92	1	1	org4	redundant
org6	scaff6	3.12	1	1	org4	redundant
org8	scaff8	3.14	1	1	org4	redundant

The grouping where all genes are present contains a representative, so no recovery is required. The other grouping contains hits where gene2 is missing. Notice that org5’s cblaster score (2.92) significantly deviates from the scores in the rest of this grouping. If recovery by score is enabled, CAGEcleaner will retain this score outlier. Equivalent to the org1 case, a random non-outlier hit will be selected to represent this grouping if recovery by content is enabled (in this case org6).

To summarise, the (simplified) binary table for this example case will look like below:

Organism	Scaffold	Score	Gene1	Gene2	Gene3	Representative	Dereplication status
org1	scaff1	3.17	1	1	1	org1	dereplication_representative
org3	scaff3	3.17	1	0	1	org1	redundant
org4	scaff4	3.17	1	1	1	org4	dereplication_representative
org5	scaff5	2.92	1	0	1	org4	redundant
org6	scaff6	3.12	1	0	1	org4	redundant