autometa.taxonomy package¶

Submodules¶

autometa.taxonomy.lca module¶

This script contains the LCA class, whose methods determine the Lowest Common Ancestor given a tab-delimited BLAST table, fasta file, or iterable of SeqRecords.

Note: LCA will assume the BLAST results table is in output format 6.
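As a rough illustration of what output format 6 implies, the sketch below reads the twelve default tabular columns and collects the subject ids hit by each query. The helper name collect_sseqids and the example rows are made up for illustration; this is not Autometa's own parser.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST/DIAMOND tabular output (outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def collect_sseqids(handle):
    """Return {qseqid: {sseqid, ...}} from an outfmt 6 handle (hypothetical helper)."""
    hits = defaultdict(set)
    reader = csv.DictReader(handle, fieldnames=OUTFMT6_COLUMNS, delimiter="\t")
    for row in reader:
        hits[row["qseqid"]].add(row["sseqid"])
    return dict(hits)


# Made-up rows: two hits for one ORF and one hit for another.
example = io.StringIO(
    "contig1_1\tWP_000001.1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t400\n"
    "contig1_1\tWP_000002.1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-80\t350\n"
    "contig1_2\tWP_000003.1\t75.0\t150\t37\t1\t1\t150\t5\t154\t1e-40\t180\n"
)
sseqids = collect_sseqids(example)
```

The resulting mapping has the same {qseqid: {sseqid, …}} shape that convert_sseqids_to_taxids() is documented to accept.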

class autometa.taxonomy.lca.LCA(dbdir: str, verbose: bool = False)¶

Bases: autometa.taxonomy.ncbi.NCBI

LCA class containing methods to retrieve the Lowest Common Ancestor.

LCAs may be computed given taxids, a fasta or BLAST results.

Parameters:	dbdir (str) – </path/to/ncbi/databases/directory>
	outdir (str) – </path/to/output/directory>
	usepickle (bool, optional) – Whether to serialize intermediate files to disk for later lookup (the default is True).
	verbose (bool, optional) – Add verbosity to logging stream (the default is False).

disable¶

Opposite of verbose. Used to disable tqdm module.

Type:	bool

tour_fp¶

</path/to/serialized/file/eulerian/tour.pkl.gz>

Type:	str

tour¶

Eulerian tour containing branches and leaves information from tree traversal.

Type:	list

level_fp¶

</path/to/serialized/file/level.pkl.gz>

Type:	str

level¶

Lengths from root corresponding to tour during tree traversal.

Type:	list

occurrence_fp¶

</path/to/serialized/file/occurrence.pkl.gz>

Type:	str

occurrence¶

Contains first occurrence of each taxid while traversing tree (index in tour). e.g. {taxid:index, taxid: index, …}

Type:	dict

sparse_fp¶

</path/to/serialized/file/sparse.pkl.gz>

Type:	str

sparse¶

Precomputed LCA values corresponding to tour, level and occurrence.

Type:	numpy.ndarray

lca_prepared¶

Whether LCA internals have been computed (e.g. tour, level, occurrence, sparse).

Type:	bool

blast2lca(blast: str, out: str, force: bool = False) → str¶

Determine lowest common ancestor of provided amino-acid ORFs.

Parameters:	blast (str) – </path/to/diamond/outfmt6/blastp.tsv>.
	out (str) – </path/to/output/lca.tsv>.
	force (bool, optional) – Force overwrite of existing out.
Returns:	out </path/to/output/lca.tsv>.
Return type:	str

convert_sseqids_to_taxids(sseqids: Dict[str, str]) → Dict[str, int]¶

Translates subject sequence ids to taxids from prot.accession2taxid.gz.

Note

If an accession number is no longer available in prot.accession2taxid.gz (because it was suppressed, deprecated or removed by NCBI), then the root taxid (1) is returned as the taxid for the corresponding sseqid.

Parameters:	sseqids (dict) – {qseqid: {sseqid, …}, …}
Returns:	{qseqid: {taxid, taxid, …}, …}
Return type:	dict
Raises:	`FileNotFoundError` – prot.accession2taxid.gz database is required for sseqid to taxid conversion.
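The root fallback described in the note can be sketched with a plain dictionary lookup. The function name sseqids_to_taxids and the accessions in the toy lookup table are made up; the real method reads prot.accession2taxid.gz, which is not modeled here.

```python
ROOT_TAXID = 1  # taxid of the root node in the NCBI taxonomy


def sseqids_to_taxids(sseqids, accession2taxid):
    """Map {qseqid: {sseqid, ...}} to {qseqid: {taxid, ...}} (hypothetical helper).

    Accessions missing from the lookup table fall back to the root taxid.
    """
    return {
        qseqid: {accession2taxid.get(sseqid, ROOT_TAXID) for sseqid in subjects}
        for qseqid, subjects in sseqids.items()
    }


# Toy lookup table standing in for prot.accession2taxid.gz (made-up accessions).
accession2taxid = {"WP_000001.1": 562, "WP_000002.1": 561}
taxids = sseqids_to_taxids(
    {"orf_1": {"WP_000001.1", "WP_000002.1"}, "orf_2": {"WP_999999.1"}},
    accession2taxid,
)
```

Here orf_2 only hits an accession that is absent from the table, so its taxid set collapses to {1}.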

lca(node1, node2)¶

Performs Range Minimum Query between 2 taxids.

Parameters:	node1 (int) – taxid
	node2 (int) – taxid
Returns:	LCA taxid
Return type:	int
Raises:	`ValueError` – Provided taxid is not in the nodes.dmp tree.

parse(lca_fpath: str, orfs_fpath: str = None) → Dict[str, Dict[str, Dict[int, int]]]¶

Retrieve and construct contig dictionary from provided lca_fpath.

Parameters:	lca_fpath (str) – </path/to/lcas.tsv> tab-delimited ordered columns: qseqid, name, rank, lca_taxid
	orfs_fpath (str, optional (required if using prodigal version <2.6)) – </path/to/prodigal/called/orfs.fasta> Note: These ORFs should correspond to the ORFs provided in the BLAST table.
Returns:	{contig:{rank:{taxid:counts, …}, rank:{…}, …}, …}
Return type:	dict
Raises:	`FileNotFoundError` – lca_fpath does not exist.
	`FileNotFoundError` – orfs_fpath does not exist.
	`ValueError` – If prodigal version is under 2.6, orfs_fpath is a required input.
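The nested return structure can be illustrated with a small sketch that tallies LCA taxids per contig and rank. It assumes the table starts with a header row and that each ORF id is the contig id followed by an underscore and an ORF number, so the contig can be recovered by stripping the last suffix. Both are assumptions for illustration; the helper name count_lcas_per_contig is hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict


def count_lcas_per_contig(handle):
    """Build {contig: {rank: {taxid: counts}}} from an LCA table handle."""
    contigs = defaultdict(lambda: defaultdict(Counter))
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # assumption: the first row is a header
    for qseqid, _name, rank, lca_taxid in reader:
        # Assumption: ORF ids look like <contig>_<orf number>.
        contig = qseqid.rsplit("_", 1)[0]
        contigs[contig][rank][int(lca_taxid)] += 1
    # Convert the nested defaultdicts/Counters to plain dicts.
    return {
        contig: {rank: dict(counts) for rank, counts in ranks.items()}
        for contig, ranks in contigs.items()
    }


example = io.StringIO(
    "qseqid\tname\trank\tlca_taxid\n"
    "contig1_1\tEscherichia coli\tspecies\t562\n"
    "contig1_2\tEscherichia coli\tspecies\t562\n"
    "contig1_3\tEscherichia\tgenus\t561\n"
    "contig2_1\tBacteria\tsuperkingdom\t2\n"
)
contig_lcas = count_lcas_per_contig(example)
```

When ORF ids do not encode their contig, a separate ORF-to-contig mapping is needed, which is presumably why orfs_fpath is required for older prodigal versions.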

prepare_lca()¶

Prepare LCA internal data structures for lca().

Afterwards, self.tour, self.level, self.occurrence and self.sparse are all ready.

Returns:	Prepares all LCA internals and if successful sets self.lca_prepared to True.
Return type:	NoneType

prepare_tree()¶

Performs Eulerian tour of nodes.dmp taxids and constructs three data structures:

tour : list of branches and leaves.
level: list of distances from the root.
occurrence: dict mapping each taxid to the index of its first occurrence in tour.

Notes

For more information on why we construct these three data structures see references below:

Geeksforgeeks: Find LCA in Binary Tree using RMQ (https://www.geeksforgeeks.org/find-lca-in-binary-tree-using-rmq/)
Topcoder: Another easy solution in <O(N logN, O(logN)> (https://www.topcoder.com/community/competitive-programming/tutorials/range-minimum-query-and-lowest-common-ancestor/#Another%20easy%20solution%20in%20O(N%20logN,%20O(logN))

Returns:	sets internals to be used for LCA lookup
Return type:	NoneType
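To make the three data structures concrete, here is a minimal sketch of an Eulerian tour over a five-node toy tree. The parents mapping and the function euler_tour are illustrative assumptions, not the method's actual code; a recursive walk is used for brevity, whereas a tree the size of nodes.dmp would likely call for an iterative traversal.

```python
from collections import defaultdict


def euler_tour(parents, root=1):
    """Return (tour, level, occurrence) for a {child: parent} tree."""
    children = defaultdict(list)
    for child, parent in parents.items():
        if child != root:
            children[parent].append(child)

    tour, level, occurrence = [], [], {}

    def visit(node, depth):
        # Record the index of the first time this node appears in the tour.
        occurrence.setdefault(node, len(tour))
        tour.append(node)
        level.append(depth)
        for child in sorted(children[node]):
            visit(child, depth + 1)
            # Re-record the parent each time the walk returns from a child.
            tour.append(node)
            level.append(depth)

    visit(root, 0)
    return tour, level, occurrence


# Toy tree: 1 is the root, 2 and 3 are its children, 4 and 5 are children of 2.
parents = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
tour, level, occurrence = euler_tour(parents)
```

For a tree with n nodes the tour has 2n - 1 entries, and the LCA of two nodes is the shallowest node visited between their first occurrences.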

preprocess_minimums()¶

Preprocesses all possible LCAs.

This constructs a sparse table to be used for LCA/Range Minimum Query using the self.level array associated with its respective eulerian self.tour. For more information on these data structures see prepare_tree().

Sparse table size:
n = number of elements in level list
rows range = (0 to n)
columns range = (0 to logn)

Returns:	sets self.sparse internal to be used for LCA lookup.
Return type:	NoneType
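A sparse table for Range Minimum Query can be sketched as follows. The tour, level and occurrence values are hard-coded for a toy tree (root 1 with children 2 and 3, where 2 has children 4 and 5), and nested lists are used in place of a numpy array to keep the example dependency-free. The function names are illustrative only.

```python
# Eulerian tour of a toy tree: 1 -> (2 -> (4, 5), 3).
tour = [1, 2, 4, 2, 5, 2, 1, 3, 1]
level = [0, 1, 2, 1, 2, 1, 0, 1, 0]
occurrence = {1: 0, 2: 1, 4: 2, 5: 4, 3: 7}


def build_sparse(level):
    """sparse[i][j] holds the index of the minimum level in level[i : i + 2**j]."""
    n = len(level)
    ncols = n.bit_length()  # enough columns for every power of two <= n
    sparse = [[0] * ncols for _ in range(n)]
    for i in range(n):
        sparse[i][0] = i
    j = 1
    while (1 << j) <= n:
        for i in range(n - (1 << j) + 1):
            left = sparse[i][j - 1]
            right = sparse[i + (1 << (j - 1))][j - 1]
            sparse[i][j] = left if level[left] <= level[right] else right
        j += 1
    return sparse


def lca(node1, node2, sparse):
    """Answer the query with two overlapping power-of-two ranges."""
    low, high = sorted((occurrence[node1], occurrence[node2]))
    j = (high - low + 1).bit_length() - 1  # floor(log2(range length))
    left = sparse[low][j]
    right = sparse[high - (1 << j) + 1][j]
    return tour[left if level[left] <= level[right] else right]


sparse = build_sparse(level)
```

Building the table costs O(n log n) time and space, after which each lca() query is answered in constant time.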

reduce_taxids_to_lcas(taxids: Dict[str, int]) → Dict[str, int]¶

Retrieves the lowest common ancestor for each set of taxids in the taxids dict.

Parameters:	taxids (dict) – {qseqid: {taxid, …}, qseqid: {taxid, …}, …}
Returns:	{qseqid: lca, qseqid: lca, …}
Return type:	dict
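Reducing a set of taxids to a single LCA amounts to folding a pairwise LCA function over the set. The sketch below uses a naive walk-to-root pairwise LCA on a toy tree instead of the Range Minimum Query machinery; PARENTS, ancestors, pairwise_lca and reduce_to_lcas are all illustrative names.

```python
from functools import reduce

# Toy tree: 1 is the root, 2 and 3 are its children, 4 and 5 are children of 2.
PARENTS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}


def ancestors(taxid):
    """Return the path from taxid up to the root, starting with taxid itself."""
    path = [taxid]
    while PARENTS[path[-1]] != path[-1]:
        path.append(PARENTS[path[-1]])
    return path


def pairwise_lca(a, b):
    """Naive LCA: the first node on b's path to root that is also on a's path."""
    seen = set(ancestors(a))
    return next(node for node in ancestors(b) if node in seen)


def reduce_to_lcas(taxids):
    """Fold the pairwise LCA over each query's set of taxids."""
    return {qseqid: reduce(pairwise_lca, hits) for qseqid, hits in taxids.items()}


lcas = reduce_to_lcas({"orf_1": {4, 5}, "orf_2": {4, 3}, "orf_3": {5}})
```

Because the LCA operation is commutative and associative, the iteration order of each set does not affect the result.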

write_lcas(lcas: Dict[str, int], out: str) → str¶

Write lcas to tab-delimited file: out.

Ordered columns are:

qseqid : query seqid

name : LCA name

rank : LCA rank

lca : LCA taxid

Parameters:	lcas (dict) – {qseqid:lca_taxid, qseqid:lca_taxid, …}
	out (str) – </path/to/output/file.tsv>
Returns:	out
Return type:	str

autometa.taxonomy.lca.main()¶

autometa.taxonomy.majority_vote module¶

This script contains the modified majority vote algorithm used in Autometa version 1.0

autometa.taxonomy.majority_vote.is_consistent_with_other_orfs(taxid: int, rank: str, rank_counts: Dict[str, Dict[KT, VT]], ncbi: autometa.taxonomy.ncbi.NCBI) → bool¶

Determines whether the majority of proteins in a contig, with rank equal to or above the given rank, are common ancestors of the taxid.

If the majority are, this function returns True, otherwise it returns False.

Parameters:	taxid (int) – taxid to search against other taxids at rank in rank_counts.
	rank (str) – Canonical rank to search in rank_counts. Choices: species, genus, family, order, class, phylum, superkingdom.
	rank_counts (dict) – LCA canonical rank counts retrieved from ORFs respective to a contig. e.g. {canonical_rank: {taxid: num_hits, …}, …}
	ncbi (NCBI instance) – Instance or subclass of NCBI from autometa.taxonomy.ncbi.
Returns:	If the majority of ORFs in a contig are equal or above given rank then return True, otherwise return False.
Return type:	boolean
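One possible reading of this check is sketched below: among the hits at the given rank and all broader ranks, count those whose taxid lies on the candidate taxid's path to root, and require that they make up more than half. The strict 0.5 threshold, the collapsed toy tree and the names PARENTS, ancestors and is_consistent are assumptions for illustration, not the function's confirmed logic.

```python
CANONICAL_RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]

# Toy tree with intermediate ranks collapsed: child taxid -> parent taxid.
PARENTS = {1: 1, 2: 1, 561: 2, 562: 561, 570: 2}


def ancestors(taxid):
    """Return the set of taxids on the path from taxid to root, inclusive."""
    lineage = {taxid}
    while PARENTS[taxid] != taxid:
        taxid = PARENTS[taxid]
        lineage.add(taxid)
    return lineage


def is_consistent(taxid, rank, rank_counts):
    """True if most hits at `rank` or broader ranks sit on taxid's lineage."""
    lineage = ancestors(taxid)
    consistent = total = 0
    # Ranks are ordered most specific first, so slice from `rank` onwards.
    for broader_rank in CANONICAL_RANKS[CANONICAL_RANKS.index(rank):]:
        for hit_taxid, num_hits in rank_counts.get(broader_rank, {}).items():
            total += num_hits
            if hit_taxid in lineage:
                consistent += num_hits
    return total > 0 and consistent / total > 0.5  # assumed threshold


rank_counts = {"species": {562: 1}, "genus": {561: 3, 570: 1}, "superkingdom": {2: 1}}
coli_ok = is_consistent(562, "species", rank_counts)
other_genus_ok = is_consistent(570, "genus", rank_counts)
```

In this toy contig, five of six hits agree with the lineage of taxid 562, whereas only two of the five hits at genus or above agree with taxid 570.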

autometa.taxonomy.majority_vote.lowest_majority(rank_counts: Dict[str, Dict[KT, VT]], ncbi: autometa.taxonomy.ncbi.NCBI) → int¶

Determine the lowest majority given rank_counts by first attempting to get a taxid that leads in counts with the highest specificity in terms of canonical rank.

Parameters:	rank_counts (dict) – {canonical_rank:{taxid:num_hits, …}, rank2: {…}, …}
	ncbi (NCBI instance) – NCBI object from autometa.taxonomy.ncbi
Returns:	Taxid above the lowest majority threshold.
Return type:	int
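The idea of preferring the most specific rank can be illustrated with a deliberately simplified sketch: walk the canonical ranks from species upwards and return the taxid with the most hits at the first rank that has any. It omits any consistency check or threshold, so it should be read as an approximation of the description above rather than the function's logic; leading_taxid is a hypothetical name.

```python
CANONICAL_RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]
ROOT_TAXID = 1


def leading_taxid(rank_counts):
    """Return the taxid with the most hits at the most specific populated rank."""
    for rank in CANONICAL_RANKS:
        counts = rank_counts.get(rank)
        if counts:
            return max(counts, key=counts.get)
    return ROOT_TAXID  # nothing to vote on, fall back to root


voted = leading_taxid({"genus": {561: 3, 570: 1}, "superkingdom": {2: 5}})
empty_vote = leading_taxid({})
```

Even though the superkingdom has more hits overall, the genus-level leader is chosen because genus is the more specific rank.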

autometa.taxonomy.majority_vote.main()¶

autometa.taxonomy.majority_vote.majority_vote(lca_fpath: str, out: str, ncbi_dir: str, verbose: bool = False, orfs: str = None, force: bool = False) → str¶

Wrapper for modified majority voting algorithm from Autometa 1.0

Parameters:	lca_fpath (str) – Path to lowest common ancestor assignments table.
	out (str) – Path to write assigned taxids.
	ncbi_dir (str) – Path to NCBI databases directory.
	verbose (bool, optional) – Increase verbosity of logging stream
	orfs (str, optional) – Path to prodigal called orfs corresponding to LCA table computed from BLAST output
	force (bool, optional) – Whether to overwrite existing LCA results.
Returns:	Path to assigned taxids table.
Return type:	str

autometa.taxonomy.majority_vote.rank_taxids(ctg_lcas: dict, ncbi: Union[autometa.taxonomy.ncbi.NCBI, autometa.taxonomy.lca.LCA], verbose: bool = False) → Dict[str, int]¶

Votes for taxids based on modified majority vote system where if a majority does not exist, the lowest majority is voted.

Parameters:	ctg_lcas (dict) – {ctg1:{canonical_rank:{taxid:num_hits,…},…}, ctg2:{…},…}
	ncbi (ncbi.NCBI or lca.LCA object) – instance of NCBI subclass or NCBI containing NCBI methods.
	verbose (bool) – Increase verbosity of logging stream (the default is False).
Returns:	{contig:voted_taxid, contig:voted_taxid, …}
Return type:	dict

autometa.taxonomy.majority_vote.write_votes(results: Dict[str, int], out: str) → str¶

Writes voting results to the provided out path.

Parameters:	results (dict) – {contig:voted_taxid, contig:voted_taxid, …}
	out (str) – </path/to/results.tsv>.
Returns:	</path/to/results.tsv>
Return type:	str
Raises:	`FileExistsError` – Voting results file already exists

autometa.taxonomy.ncbi module¶

File containing the definition of the NCBI class and functions useful for handling NCBI taxonomy databases

class autometa.taxonomy.ncbi.NCBI(dirpath, verbose=False)¶

Bases: object

Taxonomy utilities for NCBI databases.

CANONICAL_RANKS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom', 'root']¶

__repr__()¶

Operator overloading to return the string representation of the class object

Returns:	String representation of the class object
Return type:	str

__str__()¶

Operator overloading to return the directory path of the class object

Returns:	Directory path of the class object
Return type:	str

convert_taxid_dtype(taxid: int) → int¶

1. Converts the given taxid to an integer and checks whether it is positive.
2. Checks whether taxid is present in both nodes.dmp and names.dmp.
3a. If (2) is false, will check for corresponding taxid in merged.dmp and convert to this then redo (2).
3b. If (2) is true, will return converted taxid.

Parameters:	taxid (int) – identifier for a taxon in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
Returns:	taxid if the taxid is a positive integer and present in either nodes.dmp or names.dmp or taxid recovered from merged.dmp
Return type:	int
Raises:	`ValueError` – Provided taxid is not a positive integer
	`DatabaseOutOfSyncError` – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
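The documented steps can be sketched with plain dictionaries standing in for the parsed nodes.dmp, names.dmp and merged.dmp. The DatabaseOutOfSyncError class below is a local stand-in for the exception named above, the function convert_taxid is hypothetical, and the old taxid 999999 is made up.

```python
class DatabaseOutOfSyncError(Exception):
    """Local stand-in for the exception named in the documentation."""


def convert_taxid(taxid, nodes, names, merged):
    """Validate taxid, falling back to merged when it is missing from nodes/names."""
    taxid = int(taxid)  # int() itself raises ValueError on non-numeric input
    if taxid <= 0:
        raise ValueError(f"taxid must be a positive integer, got {taxid}")
    if taxid not in nodes or taxid not in names:
        if taxid not in merged:
            raise DatabaseOutOfSyncError(f"{taxid} is absent from nodes, names and merged")
        taxid = merged[taxid]
        # Redo the presence check with the updated taxid.
        if taxid not in nodes or taxid not in names:
            raise DatabaseOutOfSyncError(f"merged taxid {taxid} is absent from nodes or names")
    return taxid


# Toy stand-ins for the parsed databases (tree structure simplified).
nodes = {1: 1, 561: 1, 562: 561}
names = {1: "root", 561: "Escherichia", 562: "Escherichia coli"}
merged = {999999: 562}  # made-up old taxid pointing at its replacement

current = convert_taxid("562", nodes, names, merged)
recovered = convert_taxid(999999, nodes, names, merged)
```

A taxid that is absent from all three tables cannot be resolved, which is the situation the out-of-sync error signals.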

get_lineage_dataframe(taxids: Iterable[T_co], fillna: bool = True) → pandas.core.frame.DataFrame¶

Given an iterable of taxids generate a pandas DataFrame of their canonical lineages

Parameters:	taxids (iterable) – taxids whose lineage dataframe is being returned
	fillna (bool, optional) – Whether to fill the empty cells with ‘unclassified’ or not, default True
Returns:	index = taxid, columns = [superkingdom,phylum,class,order,family,genus,species]
Return type:	pd.DataFrame

Example

Suppose you would like to merge the returned DataFrame (‘this_df’) with another DataFrame (‘your_df’), e.g. the one from which you retrieved your taxids:

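import pandas as pd

# your_df holds your taxids in a column; this_df is indexed by taxid.
merged_df = pd.merge(
    left=your_df,
    right=this_df,
    how='left',
    left_on='taxid',  # replace with the name of the taxid column in your_df
    right_index=True)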

is_common_ancestor(taxid_A: int, taxid_B: int) → bool¶

Determines whether the provided taxids have a non-root common ancestor

Parameters:	taxid_A (int) – taxid in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
	taxid_B (int) – taxid in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
Returns:	True if taxids share a common ancestor else False
Return type:	boolean

lineage(taxid: int, canonical: bool = True) → List[Dict[KT, VT]]¶

Returns the lineage of taxids encountered when traversing to root

Parameters:	taxid (int) – taxid in nodes.dmp, whose lineage is being returned
	canonical (bool, optional) – Lineage includes both canonical and non-canonical ranks when False, and only the canonical ranks when True. Canonical ranks include: species, genus, family, order, class, phylum, superkingdom, root
Returns:	[{‘taxid’:taxid, ‘rank’:rank, ‘name’:name}, …]
Return type:	ordered list of dicts
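A lineage lookup of this shape can be sketched by following parent pointers until the root is reached. The NODES and NAMES tables below use made-up taxids and names, and the toy root is given the rank "root" so that it passes the canonical filter; how the real method treats the root's rank is not stated here.

```python
CANONICAL_RANKS = [
    "species", "genus", "family", "order", "class", "phylum", "superkingdom", "root",
]

# Toy tables with made-up taxids, shaped like the parse_nodes()/parse_names() output.
NODES = {
    1: {"parent": 1, "rank": "root"},  # toy assumption: root carries rank "root"
    20: {"parent": 1, "rank": "superkingdom"},
    30: {"parent": 20, "rank": "clade"},  # a non-canonical rank
    40: {"parent": 30, "rank": "genus"},
    50: {"parent": 40, "rank": "species"},
}
NAMES = {1: "root", 20: "kingdom A", 30: "clade B", 40: "genus C", 50: "species D"}


def lineage(taxid, canonical=True):
    """Return [{'taxid', 'rank', 'name'}, ...] ordered from taxid up to root."""
    steps = []
    while True:
        node = NODES[taxid]
        if not canonical or node["rank"] in CANONICAL_RANKS:
            steps.append({"taxid": taxid, "rank": node["rank"], "name": NAMES[taxid]})
        if node["parent"] == taxid:  # the root is its own parent
            break
        taxid = node["parent"]
    return steps


canonical_taxids = [step["taxid"] for step in lineage(50)]
all_taxids = [step["taxid"] for step in lineage(50, canonical=False)]
```

With canonical=True the clade-level node is skipped, while canonical=False keeps every node on the path.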

name(taxid: int, rank: str = None) → str¶

Parses through names.dmp in search of the given taxid and returns its name. If the taxid has been deprecated, suppressed or withdrawn from NCBI, the updated name will be retrieved.

Parameters:	taxid (int) – taxid whose name is being returned
	rank (str, optional) – If provided, will return taxid name at rank, by default None. Must be a canonical rank, choices: species, genus, family, order, class, phylum, superkingdom. E.g. self.name(562, ‘genus’) would return ‘Escherichia’, where 562 is the taxid for Escherichia coli
Returns:	Name of provided taxid if taxid is found in names.dmp else ‘unclassified’
Return type:	str
Raises:	`DatabaseOutOfSyncError` – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other

parent(taxid: int) → int¶

Retrieve the parent taxid of provided taxid. If the taxid has been deprecated, suppressed or withdrawn from NCBI, the updated parent will be retrieved.

Parameters:	taxid (int) – child taxid to retrieve parent
Returns:	Parent taxid if found in nodes.dmp otherwise 1
Return type:	int
Raises:	`DatabaseOutOfSyncError` – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other

parse_merged() → Dict[int, int]¶

Parse the merged.dmp database. Note: This is performed when a new NCBI class instance is constructed.

Returns:	{old_taxid: new_taxid, …}
Return type:	dict

parse_names() → Dict[int, str]¶

Parses through names.dmp database and loads taxids with scientific names

Returns:	{taxid:name, …}
Return type:	dict

parse_nodes() → Dict[int, str]¶

Parse the nodes.dmp database to be used later by autometa.taxonomy.ncbi.NCBI.parent() and autometa.taxonomy.ncbi.NCBI.rank(). Note: This is performed when a new NCBI class instance is constructed.

Returns:	{child_taxid:{‘parent’:parent_taxid, ‘rank’:rank}, …}
Return type:	dict

rank(taxid: int) → str¶

Return the respective rank of provided taxid. If the taxid has been deprecated, suppressed or withdrawn from NCBI, the updated rank will be retrieved.

Parameters:	taxid (int) – taxid to retrieve rank from nodes.dmp
Returns:	rank name if taxid is found in nodes.dmp else “unclassified”
Return type:	str
Raises:	`DatabaseOutOfSyncError` – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other

autometa.taxonomy.vote module¶

Script to split metagenome assembly by kingdoms given the input votes. The lineages of the provided voted taxids will also be added and written to taxonomy.tsv

autometa.taxonomy.vote.add_ranks(df: pandas.core.frame.DataFrame, ncbi: Union[autometa.taxonomy.ncbi.NCBI, str]) → pandas.core.frame.DataFrame¶

Add canonical rank columns to df

Parameters:	df (pd.DataFrame) – index="contig", column="taxid"
	ncbi (str or NCBI) – Path to NCBI databases directory, or autometa NCBI instance.
Returns:	index="contig", columns=["taxid", *canonical_ranks]
Return type:	pd.DataFrame

autometa.taxonomy.vote.assign(out: str, method: str = 'majority_vote', assembly: str = None, prot_orfs: str = None, nucl_orfs: str = None, blast: str = None, lca_fpath: str = None, ncbi_dir: str = './autometa/databases/ncbi', force: bool = False, verbose: bool = False, parallel: bool = True, cpus: int = 0) → pandas.core.frame.DataFrame¶

Assign taxonomy using method and write to out.

Parameters:	out (str) – Path to write taxonomy table of votes
	method (str, optional) – Method to assign contig taxonomy, by default “majority_vote”. choices include “majority_vote”, …
	assembly (str, optional) – Path to assembly fasta file (nucleotide), by default None
	prot_orfs (str, optional) – Path to amino-acid ORFs called from assembly, by default None
	nucl_orfs (str, optional) – Path to nucleotide ORFs called from assembly, by default None
	blast (str, optional) – Path to blastp table, by default None
	lca_fpath (str, optional) – Path to output of LCA analysis, by default None
	ncbi_dir (str, optional) – Path to NCBI databases directory, by default NCBI_DIR
	force (bool, optional) – Overwrite existing annotations, by default False
	verbose (bool, optional) – Increase verbosity, by default False
	parallel (bool, optional) – Whether to perform annotations using multiprocessing and GNU parallel, by default True
	cpus (int, optional) – Number of cpus to use if parallel is True, by default will try to use all available.
Returns:	index="contig", columns=["taxid"]
Return type:	pd.DataFrame
Raises:	`NotImplementedError` – Provided method has not yet been implemented.
	`ValueError` – Assembly file is required if no other annotations are provided.

autometa.taxonomy.vote.get(filepath_or_dataframe: Union[str, pandas.core.frame.DataFrame], kingdom: str, ncbi: Union[autometa.taxonomy.ncbi.NCBI, str] = './autometa/databases/ncbi') → pandas.core.frame.DataFrame¶

Retrieve specific kingdom voted taxa for assembly from filepath

Parameters:	filepath_or_dataframe (str or pd.DataFrame) – Path to tab-delimited taxonomy table, or the table itself as a DataFrame. cols=[‘contig’, ‘taxid’, canonical_ranks]
	kingdom (str) – rank to retrieve from superkingdom column in taxonomy table.
	ncbi (str or autometa.taxonomy.NCBI instance, optional) – Path to NCBI database directory or NCBI instance, by default NCBI_DIR. This is necessary only if filepath does not already contain columns of canonical ranks.
Returns:	DataFrame of contigs pertaining to retrieved kingdom.
Return type:	pd.DataFrame
Raises:	`FileNotFoundError` – Provided filepath does not exist or is empty.
	`TableFormatError` – Provided filepath does not contain the ‘superkingdom’ column.
	`KeyError` – kingdom is absent in provided taxonomy table.

autometa.taxonomy.vote.main()¶

autometa.taxonomy.vote.write_ranks(taxonomy: pandas.core.frame.DataFrame, assembly: str, outdir: str, rank: str = 'superkingdom', prefix: str = None) → List[str]¶

Write fastas split by rank

Parameters:	taxonomy (pd.DataFrame) – dataframe containing canonical ranks of contigs assigned from autometa.taxonomy.vote.assign()
	assembly (str) – Path to assembly fasta file
	outdir (str) – Path to output directory to write fasta files
	rank (str, optional) – canonical rank column in taxonomy table to split by, by default “superkingdom”
	prefix (str, optional) – Prefix each of the paths written with prefix string.
Returns:	[rank_name_fpath, …]
Return type:	list
Raises:	`ValueError` – rank not in canonical ranks
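Splitting an assembly by rank boils down to grouping contigs by their value in the chosen rank column and writing one fasta per group. The sketch below uses plain dictionaries instead of a DataFrame and a fasta parser; the function split_by_rank, the ".fna" extension and the "prefix.rank_name" file naming are assumptions for illustration.

```python
import os
import tempfile
from collections import defaultdict

CANONICAL_RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]


def split_by_rank(taxonomy, sequences, outdir, rank="superkingdom", prefix=None):
    """Write one fasta per value of `rank` and return the written paths."""
    if rank not in CANONICAL_RANKS:
        raise ValueError(f"{rank} is not a canonical rank")
    # Group contig ids by their value at the requested rank.
    groups = defaultdict(list)
    for contig, ranks in taxonomy.items():
        groups[ranks[rank]].append(contig)
    written = []
    for rank_name in sorted(groups):
        fname = f"{rank_name}.fna"  # assumed naming scheme
        if prefix:
            fname = f"{prefix}.{fname}"
        fpath = os.path.join(outdir, fname)
        with open(fpath, "w") as fh:
            for contig in groups[rank_name]:
                fh.write(f">{contig}\n{sequences[contig]}\n")
        written.append(fpath)
    return written


# Toy taxonomy table and assembly.
taxonomy = {
    "contig1": {"superkingdom": "bacteria"},
    "contig2": {"superkingdom": "archaea"},
    "contig3": {"superkingdom": "bacteria"},
}
sequences = {"contig1": "ACGT", "contig2": "GGCC", "contig3": "TTAA"}

with tempfile.TemporaryDirectory() as outdir:
    paths = split_by_rank(taxonomy, sequences, outdir, prefix="sample")
    basenames = [os.path.basename(path) for path in paths]
    with open(paths[1]) as fh:
        bacteria_fasta = fh.read()
```

Validating the rank before any file is opened mirrors the documented ValueError and avoids leaving partial output behind.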