autometa.taxonomy package

Submodules

autometa.taxonomy.lca module

This module defines the LCA class, which provides methods to determine the Lowest Common Ancestor given a tab-delimited BLAST table, a fasta file, or an iterable of SeqRecords.

Note: LCA will assume the BLAST results table is in output format 6.

class autometa.taxonomy.lca.LCA(dbdir: str, verbose: bool = False)

Bases: autometa.taxonomy.ncbi.NCBI

LCA class containing methods to retrieve the Lowest Common Ancestor.

LCAs may be computed given taxids, a fasta or BLAST results.

Parameters:
  • dbdir (str) – </path/to/ncbi/databases/directory>
  • outdir (str) – </path/to/output/directory>
  • usepickle (bool, optional) – Whether to serialize intermediate files to disk for later lookup (the default is True).
  • verbose (bool, optional) – Add verbosity to logging stream (the default is False).
disable

Opposite of verbose. Used to disable tqdm module.

Type:bool
tour_fp

</path/to/serialized/file/eulerian/tour.pkl.gz>

Type:str
tour

Eulerian tour containing branches and leaves information from tree traversal.

Type:list
level_fp

</path/to/serialized/file/level.pkl.gz>

Type:str
level

Lengths from root corresponding to tour during tree traversal.

Type:list
occurrence_fp

</path/to/serialized/file/occurrence.pkl.gz>

Type:str
occurrence

Contains first occurrence of each taxid while traversing tree (index in tour). e.g. {taxid:index, taxid: index, …}

Type:dict
sparse_fp

</path/to/serialized/file/sparse.pkl.gz>

Type:str
sparse

Precomputed LCA values corresponding to tour, level and occurrence.

Type:numpy.ndarray
lca_prepared

Whether LCA internals have been computed (e.g. tour, level, occurrence, sparse).

Type:bool
blast2lca(blast: str, out: str, force: bool = False) → str

Determine lowest common ancestor of provided amino-acid ORFs.

Parameters:
  • blast (str) – </path/to/diamond/outfmt6/blastp.tsv>.
  • out (str) – </path/to/output/lca.tsv>.
  • force (bool, optional) – Force overwrite of existing out.
Returns:

out </path/to/output/lca.tsv>.

Return type:

str

convert_sseqids_to_taxids(sseqids: Dict[str, str]) → Dict[str, int]

Translates subject sequence ids to taxids from prot.accession2taxid.gz.

Note

If an accession number is no longer available in prot.accession2taxid.gz (because it was suppressed, deprecated or removed by NCBI), then the root taxid (1) is returned as the taxid for the corresponding sseqid.

Parameters:sseqids (dict) – {qseqid: {sseqid, …}, …}
Returns:{qseqid: {taxid, taxid, …}, …}
Return type:dict
Raises:FileNotFoundError – prot.accession2taxid.gz database is required for sseqid to taxid conversion.
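The root fallback described in the note can be sketched with a plain dictionary lookup. This is a minimal sketch, not Autometa's implementation: the function name sseqids_to_taxids, the in-memory accession2taxid mapping, and the accession strings are hypothetical stand-ins for the contents of prot.accession2taxid.gz.

```python
from typing import Dict, Set

ROOT_TAXID = 1  # root of the NCBI taxonomy, used as the fallback


def sseqids_to_taxids(
    sseqids: Dict[str, Set[str]], accession2taxid: Dict[str, int]
) -> Dict[str, Set[int]]:
    """Map each query's subject accessions to taxids, falling back to root."""
    return {
        qseqid: {accession2taxid.get(sseqid, ROOT_TAXID) for sseqid in hits}
        for qseqid, hits in sseqids.items()
    }


# Hypothetical accessions; "GONE_0001.1" is absent from the lookup table,
# so it is assigned the root taxid.
accession2taxid = {"ACC_0001.1": 562, "ACC_0002.1": 561}
hits = {"orf_1": {"ACC_0001.1", "ACC_0002.1"}, "orf_2": {"GONE_0001.1"}}
taxids = sseqids_to_taxids(hits, accession2taxid)
```

Because missing accessions collapse to root, a query whose hits are all missing ends up with an LCA of 1.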
lca(node1, node2)

Performs Range Minimum Query between 2 taxids.

Parameters:
  • node1 (int) – taxid
  • node2 (int) – taxid
Returns:

LCA taxid

Return type:

int

Raises:

ValueError – Provided taxid is not in the nodes.dmp tree.

parse(lca_fpath: str, orfs_fpath: str = None) → Dict[str, Dict[str, Dict[int, int]]]

Retrieve and construct contig dictionary from provided lca_fpath.

Parameters:
  • lca_fpath (str) – </path/to/lcas.tsv> tab-delimited ordered columns: qseqid, name, rank, lca_taxid
  • orfs_fpath (str, optional (required if using prodigal version <2.6)) – </path/to/prodigal/called/orfs.fasta> Note: These ORFs should correspond to the ORFs provided in the BLAST table.
Returns:

{contig:{rank:{taxid:counts, …}, rank:{…}, …}, …}

Return type:

dict

Raises:
  • FileNotFoundError – lca_fpath does not exist.
  • FileNotFoundError – orfs_fpath does not exist.
  • ValueError – If prodigal version is under 2.6, orfs_fpath is a required input.
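The nested return structure can be illustrated with a small counting routine. This is a sketch under stated assumptions, not the method's actual code: it assumes the table has a header row with the four column names listed above, and that each ORF ID is the contig ID followed by an underscore and an ORF number, so the contig can be recovered by splitting on the last underscore. The function name count_lcas and the toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def count_lcas(handle):
    """Build {contig: {rank: {taxid: count}}} from a tab-delimited LCA table."""
    contigs = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: ORF IDs look like "<contig>_<orf number>".
        contig = row["qseqid"].rsplit("_", 1)[0]
        contigs[contig][row["rank"]][int(row["lca_taxid"])] += 1
    return contigs


# Toy table held in memory instead of a file on disk.
toy_table = io.StringIO(
    "qseqid\tname\trank\tlca_taxid\n"
    "contig_1_1\tEscherichia coli\tspecies\t562\n"
    "contig_1_2\tEscherichia coli\tspecies\t562\n"
    "contig_1_3\tEscherichia\tgenus\t561\n"
    "contig_2_1\tBacteria\tsuperkingdom\t2\n"
)
counts = count_lcas(toy_table)
```

Here counts["contig_1"] holds two species-level votes for taxid 562 and one genus-level vote for taxid 561.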
prepare_lca()

Prepare LCA internal data structures for lca().

e.g. self.tour, self.level, self.occurrence, self.sparse are all ready.

Returns:Prepares all LCA internals and if successful sets self.lca_prepared to True.
Return type:NoneType
prepare_tree()

Performs Eulerian tour of nodes.dmp taxids and constructs three data structures:

  1. tour : list of branches and leaves.
  2. level: list of distances from the root.
  3. occurrence: dict mapping each taxid to the index of its first occurrence in tour.

Notes

For more information on why we construct these three data structures see references below:

Returns:sets internals to be used for LCA lookup
Return type:NoneType
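The three structures can be illustrated on a toy tree. The following is a minimal sketch of an Eulerian tour, not the method's actual code: the function name euler_tour, the {parent: [children]} input format, and the five-node tree are assumptions made for illustration. It uses recursion for brevity; an implementation over the full nodes.dmp tree may prefer an explicit stack.

```python
from typing import Dict, List, Tuple


def euler_tour(
    children: Dict[int, List[int]], root: int = 1
) -> Tuple[List[int], List[int], Dict[int, int]]:
    """Return (tour, level, occurrence) for a tree given as {parent: [children]}."""
    tour: List[int] = []
    level: List[int] = []
    occurrence: Dict[int, int] = {}

    def visit(node: int, depth: int) -> None:
        occurrence.setdefault(node, len(tour))  # index of the first visit
        tour.append(node)
        level.append(depth)
        for child in children.get(node, []):
            visit(child, depth + 1)
            # Record the parent again each time we return from a child.
            tour.append(node)
            level.append(depth)

    visit(root, 0)
    return tour, level, occurrence


# Hypothetical toy tree: 1 is the root, 2 and 3 are its children,
# and 4 and 5 are children of 2.
toy_children = {1: [2, 3], 2: [4, 5]}
tour, level, occurrence = euler_tour(toy_children)
# tour       -> [1, 2, 4, 2, 5, 2, 1, 3, 1]
# level      -> [0, 1, 2, 1, 2, 1, 0, 1, 0]
# occurrence -> {1: 0, 2: 1, 4: 2, 5: 4, 3: 7}
```

For a tree with n nodes the tour has 2n - 1 entries, and the LCA of two nodes is the shallowest node visited between their first occurrences.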
preprocess_minimums()

Preprocesses all possible LCAs.

This constructs a sparse table to be used for LCA/Range Minimum Query using the self.level array associated with its respective eulerian self.tour. For more information on these data structures see prepare_tree().

Sparse table size:
n = number of elements in level list
rows range = (0 to n)
columns range = (0 to log n)
Returns:sets self.sparse internal to be used for LCA lookup.
Return type:NoneType
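A sparse table for Range Minimum Query can be sketched as follows. This is an illustrative sketch, not Autometa's code: it uses nested Python lists rather than a numpy.ndarray, the helper names build_sparse and lca are chosen here for illustration, and the tour, level and occurrence values are hard-coded for a hypothetical five-node tree (root 1 with children 2 and 3; node 2 with children 4 and 5).

```python
from typing import Dict, List

# Toy Eulerian tour structures for the hypothetical tree described above.
tour = [1, 2, 4, 2, 5, 2, 1, 3, 1]
level = [0, 1, 2, 1, 2, 1, 0, 1, 0]
occurrence: Dict[int, int] = {1: 0, 2: 1, 4: 2, 5: 4, 3: 7}


def build_sparse(level: List[int]) -> List[List[int]]:
    """sparse[i][j] is the index of the minimum of level[i : i + 2**j]."""
    n = len(level)
    ncols = max(1, n.bit_length())  # about log2(n) + 1 columns
    sparse = [[i] * ncols for i in range(n)]
    for j in range(1, ncols):
        span = 1 << j
        for i in range(n - span + 1):
            left = sparse[i][j - 1]
            right = sparse[i + span // 2][j - 1]
            sparse[i][j] = left if level[left] <= level[right] else right
    return sparse


sparse = build_sparse(level)


def lca(node1: int, node2: int) -> int:
    """Answer an LCA query with two overlapping power-of-two blocks."""
    if node1 not in occurrence or node2 not in occurrence:
        raise ValueError("taxid is not in the tree")
    low, high = sorted((occurrence[node1], occurrence[node2]))
    j = (high - low + 1).bit_length() - 1  # floor(log2(range length))
    left = sparse[low][j]
    right = sparse[high - (1 << j) + 1][j]
    return tour[left if level[left] <= level[right] else right]
```

Building the table takes O(n log n) time and memory, after which each query is answered in constant time; for example, lca(4, 5) gives 2 and lca(4, 3) gives 1 on this toy tree.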
reduce_taxids_to_lcas(taxids: Dict[str, int]) → Dict[str, int]

Retrieves the lowest common ancestor for each set of taxids in taxids.

Parameters:taxids (dict) – {qseqid: {taxid, …}, qseqid: {taxid, …}, …}
Returns:{qseqid: lca, qseqid: lca, …}
Return type:dict
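Because a pairwise LCA is associative, a set of taxids can be collapsed by folding the pairwise query over the set. The sketch below assumes this approach and is not taken from the source: reduce_to_lcas is a hypothetical name, and naive_lca walks a toy parent map as a slow stand-in for the lca() method.

```python
from functools import reduce
from typing import Callable, Dict, List, Set

# Hypothetical toy tree as {child: parent}; the root (1) is its own parent.
PARENT = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}


def ancestors(taxid: int) -> List[int]:
    """Path from taxid up to and including the root."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def naive_lca(a: int, b: int) -> int:
    """Slow pairwise LCA used only as a stand-in for a real lookup."""
    seen = set(ancestors(a))
    return next(t for t in ancestors(b) if t in seen)


def reduce_to_lcas(
    taxids: Dict[str, Set[int]], lca: Callable[[int, int], int]
) -> Dict[str, int]:
    """Fold the pairwise LCA over each query's set of taxids."""
    return {qseqid: reduce(lca, hits) for qseqid, hits in taxids.items() if hits}


lcas = reduce_to_lcas(
    {"orf_1": {4, 5}, "orf_2": {3, 4, 5}, "orf_3": {4}}, naive_lca
)
```

A query with a single taxid keeps that taxid, and queries with empty sets are skipped in this sketch.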
write_lcas(lcas: Dict[str, int], out: str) → str

Write lcas to tab-delimited file: out.

Ordered columns are:

  • qseqid : query seqid
  • name : LCA name
  • rank : LCA rank
  • lca : LCA taxid
Parameters:
  • lcas (dict) – {qseqid:lca_taxid, qseqid:lca_taxid, …}
  • out (str) – </path/to/output/file.tsv>
Returns:

out

Return type:

str

autometa.taxonomy.lca.main()

autometa.taxonomy.majority_vote module

This script contains the modified majority vote algorithm used in Autometa version 1.0

autometa.taxonomy.majority_vote.is_consistent_with_other_orfs(taxid: int, rank: str, rank_counts: Dict[str, Dict[KT, VT]], ncbi: autometa.taxonomy.ncbi.NCBI) → bool

Determines whether the majority of proteins in a contig, with rank equal to or above the given rank, are common ancestors of the taxid.

If the majority are, this function returns True, otherwise it returns False.

Parameters:
  • taxid (int) – taxid to search against other taxids at rank in rank_counts.
  • rank (str) – Canonical rank to search in rank_counts. Choices: species, genus, family, order, class, phylum, superkingdom.
  • rank_counts (dict) – LCA canonical rank counts retrieved from ORFs respective to a contig. e.g. {canonical_rank: {taxid: num_hits, …}, …}
  • ncbi (NCBI instance) – Instance or subclass of NCBI from autometa.taxonomy.ncbi.
Returns:

If the majority of ORFs in a contig are equal or above given rank then return True, otherwise return False.

Return type:

boolean
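One reading of the description above is a tally of ORF hits that do and do not share a common ancestor with the candidate taxid, restricted to the given rank and the ranks above it. The sketch below follows that reading and should not be taken as the function's exact logic: is_consistent, the is_common_ancestor callable, and the abbreviated toy parent map are all assumptions for illustration.

```python
from typing import Callable, Dict, Set

CANONICAL_RANKS = [
    "species", "genus", "family", "order",
    "class", "phylum", "superkingdom", "root",
]

# Abbreviated toy taxonomy as {child: parent}; intermediate ranks are omitted.
PARENT = {1: 1, 2: 1, 2157: 1, 561: 2, 562: 561}


def non_root_ancestors(taxid: int) -> Set[int]:
    """The taxid and its ancestors, excluding the root."""
    found = set()
    while taxid != 1:
        found.add(taxid)
        taxid = PARENT[taxid]
    return found


def is_common_ancestor(taxid_a: int, taxid_b: int) -> bool:
    """Toy stand-in: True when the taxids share a non-root ancestor."""
    return bool(non_root_ancestors(taxid_a) & non_root_ancestors(taxid_b))


def is_consistent(
    taxid: int,
    rank: str,
    rank_counts: Dict[str, Dict[int, int]],
    related: Callable[[int, int], bool],
) -> bool:
    """True when consistent hits outnumber inconsistent hits at rank and above."""
    consistent = inconsistent = 0
    for current in CANONICAL_RANKS[CANONICAL_RANKS.index(rank):]:
        for other, hits in rank_counts.get(current, {}).items():
            if related(other, taxid):
                consistent += hits
            else:
                inconsistent += hits
    return consistent > inconsistent


rank_counts = {"species": {562: 3}, "genus": {561: 1}, "superkingdom": {2157: 1}}
```

With these toy counts, taxid 562 at rank species is consistent with four of the five hits, whereas taxid 2157 is consistent with only one.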

autometa.taxonomy.majority_vote.lowest_majority(rank_counts: Dict[str, Dict[KT, VT]], ncbi: autometa.taxonomy.ncbi.NCBI) → int

Determine the lowest majority given rank_counts by first attempting to get a taxid that leads in counts with the highest specificity in terms of canonical rank.

Parameters:
  • rank_counts (dict) – {canonical_rank:{taxid:num_hits, …}, rank2: {…}, …}
  • ncbi (NCBI instance) – NCBI object from autometa.taxonomy.ncbi
Returns:

Taxid above the lowest majority threshold.

Return type:

int

autometa.taxonomy.majority_vote.main()
autometa.taxonomy.majority_vote.majority_vote(lca_fpath: str, out: str, ncbi_dir: str, verbose: bool = False, orfs: str = None, force: bool = False) → str

Wrapper for modified majority voting algorithm from Autometa 1.0

Parameters:
  • lca_fpath (str) – Path to lowest common ancestor assignments table.
  • out (str) – Path to write assigned taxids.
  • ncbi_dir (str) – Path to NCBI databases directory.
  • verbose (bool, optional) – Increase verbosity of logging stream
  • orfs (str, optional) – Path to prodigal called orfs corresponding to LCA table computed from BLAST output
  • force (bool, optional) – Whether to overwrite existing LCA results.
Returns:

Path to assigned taxids table.

Return type:

str

autometa.taxonomy.majority_vote.rank_taxids(ctg_lcas: dict, ncbi: Union[autometa.taxonomy.ncbi.NCBI, autometa.taxonomy.lca.LCA], verbose: bool = False) → Dict[str, int]

Votes for taxids based on modified majority vote system where if a majority does not exist, the lowest majority is voted.

Parameters:
  • ctg_lcas (dict) – {ctg1:{canonical_rank:{taxid:num_hits,…},…}, ctg2:{…},…}
  • ncbi (ncbi.NCBI or lca.LCA object) – instance of NCBI subclass or NCBI containing NCBI methods.
  • verbose (bool) – Increase verbosity of logging stream (the default is False).
Returns:

{contig:voted_taxid, contig:voted_taxid, …}

Return type:

dict

autometa.taxonomy.majority_vote.write_votes(results: Dict[str, int], out: str) → str

Writes voting results to the provided out path.

Parameters:
  • results (dict) – {contig:voted_taxid, contig:voted_taxid, …}
  • out (str) – </path/to/results.tsv>.
Returns:

</path/to/results.tsv>

Return type:

str

Raises:

FileExistsError – Voting results file already exists

autometa.taxonomy.ncbi module

Module containing the definition of the NCBI class and functions useful for handling NCBI taxonomy databases.

class autometa.taxonomy.ncbi.NCBI(dirpath, verbose=False)

Bases: object

Taxonomy utilities for NCBI databases.

CANONICAL_RANKS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom', 'root']
__repr__()

Operator overloading to return the string representation of the class object

Returns:String representation of the class object
Return type:str
__str__()

Operator overloading to return the directory path of the class object

Returns:Directory path of the class object
Return type:str
convert_taxid_dtype(taxid: int) → int
  1. Converts the given taxid to an integer and checks whether it is positive.
  2. Checks whether taxid is present in both nodes.dmp and names.dmp.
  3a. If (2) is false, checks for the corresponding taxid in merged.dmp, converts to it, then redoes (2).
  3b. If (2) is true, returns the converted taxid.

Parameters:

taxid (int) – identifier for a taxon in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp

Returns:

taxid if the taxid is a positive integer and present in either nodes.dmp or names.dmp or taxid recovered from merged.dmp

Return type:

int

Raises:
  • ValueError – Provided taxid is not a positive integer
  • DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
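The numbered steps can be sketched with in-memory lookups. This is a minimal sketch, not the method's code: convert_taxid is a hypothetical free function, the local DatabaseOutOfSyncError class is a stand-in for the exception named above, and the toy tables (including the merged entry 999999) are made up for illustration. Raising the out-of-sync error when a taxid is found in none of the three tables is an assumption.

```python
from typing import Dict


class DatabaseOutOfSyncError(Exception):
    """Local stand-in for the exception named in the Raises section."""


def convert_taxid(
    taxid,
    nodes: Dict[int, dict],
    names: Dict[int, str],
    merged: Dict[int, int],
) -> int:
    # Step 1: coerce to int and require a positive value.
    try:
        taxid = int(taxid)
    except (TypeError, ValueError) as err:
        raise ValueError(f"taxid must be a positive integer: {taxid!r}") from err
    if taxid <= 0:
        raise ValueError(f"taxid must be a positive integer: {taxid}")

    def in_sync(candidate: int) -> bool:
        # Step 2: the taxid must be known to both tables.
        return candidate in nodes and candidate in names

    if in_sync(taxid):
        return taxid  # Step 3b
    # Step 3a: look for a replacement in merged, then repeat the check.
    replacement = merged.get(taxid)
    if replacement is not None and in_sync(replacement):
        return replacement
    raise DatabaseOutOfSyncError(f"{taxid} is missing from nodes, names or merged")


# Toy tables; the merged entry is hypothetical.
nodes = {1: {}, 561: {}, 562: {}}
names = {1: "root", 561: "Escherichia", 562: "Escherichia coli"}
merged = {999999: 562}
```

With these tables, convert_taxid("562", nodes, names, merged) and convert_taxid(999999, nodes, names, merged) both return 562.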
get_lineage_dataframe(taxids: Iterable[T_co], fillna: bool = True) → pandas.core.frame.DataFrame

Given an iterable of taxids generate a pandas DataFrame of their canonical lineages

Parameters:
  • taxids (iterable) – taxids whose lineage dataframe is being returned
  • fillna (bool, optional) – Whether to fill the empty cells with ‘unclassified’ or not, default True
Returns:

index = taxid, columns = [superkingdom, phylum, class, order, family, genus, species]

Return type:

pd.DataFrame

Example

To merge the returned DataFrame (‘this_df’) with another DataFrame (‘your_df’), for example the one from which you retrieved your taxids:

merged_df = pd.merge(
    left=your_df,
    right=this_df,
    how='left',
    left_on=<taxid_column>,
    right_index=True)
is_common_ancestor(taxid_A: int, taxid_B: int) → bool

Determines whether the provided taxids have a non-root common ancestor

Parameters:
  • taxid_A (int) – taxid in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
  • taxid_B (int) – taxid in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
Returns:

True if taxids share a common ancestor else False

Return type:

boolean

lineage(taxid: int, canonical: bool = True) → List[Dict[KT, VT]]

Returns the lineage of taxids encountered when traversing to root

Parameters:
  • taxid (int) – taxid in nodes.dmp, whose lineage is being returned
  • canonical (bool, optional) – Lineage includes both canonical and non-canonical ranks when False, and only the canonical ranks when True. Canonical ranks include: species, genus, family, order, class, phylum, superkingdom, root.
Returns:

[{‘taxid’: taxid, ‘rank’: rank, ‘name’: name}, …]

Return type:

ordered list of dicts

name(taxid: int, rank: str = None) → str

Parses through the names.dmp in search of the given taxid and returns its name. If the taxid is deprecated, suppressed, withdrawn from NCBI (basically old) the updated name will be retrieved

Parameters:
  • taxid (int) – taxid whose name is being returned
  • rank (str, optional) – If provided, will return the taxid name at rank, by default None. Must be a canonical rank; choices: species, genus, family, order, class, phylum, superkingdom. E.g. self.name(562, ‘genus’) would return ‘Escherichia’, where 562 is the taxid for Escherichia coli.
Returns:

Name of provided taxid if taxid is found in names.dmp else ‘unclassified’

Return type:

str

Raises:

DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other

parent(taxid: int) → int

Retrieve the parent taxid of provided taxid. If the taxid is deprecated, suppressed, withdrawn from NCBI (basically old) the updated parent will be retrieved

Parameters:taxid (int) – child taxid to retrieve parent
Returns:Parent taxid if found in nodes.dmp otherwise 1
Return type:int
Raises:DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
parse_merged() → Dict[int, int]

Parse the merged.dmp database. Note: This is performed when a new NCBI class instance is constructed.

Returns:{old_taxid: new_taxid, …}
Return type:dict
parse_names() → Dict[int, str]

Parses through names.dmp database and loads taxids with scientific names

Returns:{taxid:name, …}
Return type:dict
parse_nodes() → Dict[int, str]

Parse the nodes.dmp database to be used later by autometa.taxonomy.ncbi.NCBI.parent() and autometa.taxonomy.ncbi.NCBI.rank(). Note: This is performed when a new NCBI class instance is constructed.

Returns:{child_taxid: {‘parent’: parent_taxid, ‘rank’: rank}, …}
Return type:dict
rank(taxid: int) → str

Return the respective rank of provided taxid. If the taxid is deprecated, suppressed, withdrawn from NCBI (basically old) the updated rank will be retrieved

Parameters:taxid (int) – taxid to retrieve rank from nodes.dmp
Returns:rank name if taxid is found in nodes.dmp else “unclassified”
Return type:str
Raises:DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other

autometa.taxonomy.vote module

Script to split metagenome assembly by kingdoms given the input votes. The lineages of the provided voted taxids will also be added and written to taxonomy.tsv

autometa.taxonomy.vote.add_ranks(df: pandas.core.frame.DataFrame, ncbi: Union[autometa.taxonomy.ncbi.NCBI, str]) → pandas.core.frame.DataFrame

Add canonical ranks to df.

Parameters:
  • df (pd.DataFrame) – index=”contig”, column=”taxid”
  • ncbi (str or NCBI) – Path to NCBI databases directory, or autometa NCBI instance.
Returns:

index=”contig”, columns=[“taxid”, *canonical_ranks]

Return type:

pd.DataFrame

autometa.taxonomy.vote.assign(out: str, method: str = 'majority_vote', assembly: str = None, prot_orfs: str = None, nucl_orfs: str = None, blast: str = None, lca_fpath: str = None, ncbi_dir: str = './autometa/databases/ncbi', force: bool = False, verbose: bool = False, parallel: bool = True, cpus: int = 0) → pandas.core.frame.DataFrame

Assign taxonomy using method and write to out.

Parameters:
  • out (str) – Path to write taxonomy table of votes
  • method (str, optional) – Method to assign contig taxonomy, by default “majority_vote”. choices include “majority_vote”, …
  • assembly (str, optional) – Path to assembly fasta file (nucleotide), by default None
  • prot_orfs (str, optional) – Path to amino-acid ORFs called from assembly, by default None
  • nucl_orfs (str, optional) – Path to nucleotide ORFs called from assembly, by default None
  • blast (str, optional) – Path to blastp table, by default None
  • lca_fpath (str, optional) – Path to output of LCA analysis, by default None
  • ncbi_dir (str, optional) – Path to NCBI databases directory, by default NCBI_DIR
  • force (bool, optional) – Overwrite existing annotations, by default False
  • verbose (bool, optional) – Increase verbosity, by default False
  • parallel (bool, optional) – Whether to perform annotations using multiprocessing and GNU parallel, by default True
  • cpus (int, optional) – Number of cpus to use if parallel is True, by default will try to use all available.
Returns:

index=”contig”, columns=[“taxid”]

Return type:

pd.DataFrame

Raises:
  • NotImplementedError – Provided method has not yet been implemented.
  • ValueError – Assembly file is required if no other annotations are provided.
autometa.taxonomy.vote.get(filepath_or_dataframe: Union[str, pandas.core.frame.DataFrame], kingdom: str, ncbi: Union[autometa.taxonomy.ncbi.NCBI, str] = './autometa/databases/ncbi') → pandas.core.frame.DataFrame

Retrieve specific kingdom voted taxa for assembly from filepath

Parameters:
  • filepath (str) – Path to tab-delimited taxonomy table. cols=[‘contig’,’taxid’, *canonical_ranks]
  • kingdom (str) – rank to retrieve from superkingdom column in taxonomy table.
  • ncbi (str or autometa.taxonomy.NCBI instance, optional) – Path to NCBI database directory or NCBI instance, by default NCBI_DIR. This is necessary only if filepath does not already contain columns of canonical ranks.
Returns:

DataFrame of contigs pertaining to retrieved kingdom.

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – Provided filepath does not exist or is empty.
  • TableFormatError – Provided filepath does not contain the ‘superkingdom’ column.
  • KeyError – kingdom is absent in provided taxonomy table.
autometa.taxonomy.vote.main()
autometa.taxonomy.vote.write_ranks(taxonomy: pandas.core.frame.DataFrame, assembly: str, outdir: str, rank: str = 'superkingdom', prefix: str = None) → List[str]

Write fastas split by rank

Parameters:
  • taxonomy (pd.DataFrame) – dataframe containing canonical ranks of contigs assigned from autometa.taxonomy.vote.assign(…)
  • assembly (str) – Path to assembly fasta file
  • outdir (str) – Path to output directory to write fasta files
  • rank (str, optional) – canonical rank column in taxonomy table to split by, by default “superkingdom”
  • prefix (str, optional) – Prefix each of the paths written with prefix string.
Returns:

[rank_name_fpath, …]

Return type:

list

Raises:

ValueError – rank not in canonical ranks

Module contents