autometa.taxonomy package¶
Submodules¶
autometa.taxonomy.lca module¶
This script contains the LCA class, which provides methods to determine the Lowest Common Ancestor given a tab-delimited BLAST table, fasta file, or iterable of SeqRecords.
Note: LCA will assume the BLAST results table is in output format 6.
-
class autometa.taxonomy.lca.LCA(dbdir: str, verbose: bool = False)¶
Bases: autometa.taxonomy.ncbi.NCBI
LCA class containing methods to retrieve the Lowest Common Ancestor.
LCAs may be computed given taxids, a fasta or BLAST results.
Parameters: - dbdir (str) – </path/to/ncbi/databases/directory>
- outdir (str) – </path/to/output/directory>
- usepickle (bool, optional) – Whether to serialize intermediate files to disk for later lookup (the default is True).
- verbose (bool, optional) – Add verbosity to logging stream (the default is False).
-
disable¶ Opposite of verbose. Used to disable tqdm module.
Type: bool
-
tour_fp¶ </path/to/serialized/file/eulerian/tour.pkl.gz>
Type: str
-
tour¶ Eulerian tour containing branches and leaves information from tree traversal.
Type: list
-
level_fp¶ </path/to/serialized/file/level.pkl.gz>
Type: str
-
level¶ Lengths from root corresponding to tour during tree traversal.
Type: list
-
occurrence_fp¶ </path/to/serialized/file/occurrence.pkl.gz>
Type: str
-
occurrence¶ Contains first occurrence of each taxid while traversing tree (index in tour). e.g. {taxid:index, taxid: index, …}
Type: dict
-
sparse_fp¶ </path/to/serialized/file/sparse.pkl.gz>
Type: str
-
sparse¶ Precomputed LCA values corresponding to tour, level and occurrence.
Type: numpy.ndarray
-
lca_prepared¶ Whether LCA internals have been computed (e.g. tour, level, occurrence, sparse).
Type: bool
-
blast2lca(blast: str, out: str, force: bool = False) → str¶ Determine lowest common ancestor of provided amino-acid ORFs.
Parameters: - blast (str) – </path/to/diamond/outfmt6/blastp.tsv>.
- out (str) – </path/to/output/lca.tsv>.
- force (bool, optional) – Force overwrite of existing out.
Returns: out </path/to/output/lca.tsv>.
Return type: str
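As a rough illustration of the input this method expects, the sketch below groups subject ids by query id from an outfmt 6 table. The helper name `read_hits` and the sample rows are hypothetical; this is not Autometa's own parser.

```python
import io
from collections import defaultdict


def read_hits(handle):
    """Group subject ids by query id from outfmt 6 text (hypothetical helper)."""
    hits = defaultdict(set)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # outfmt 6 columns: qseqid, sseqid, pident, length, mismatch, gapopen,
        # qstart, qend, sstart, send, evalue, bitscore
        qseqid, sseqid, *_ = line.split("\t")
        hits[qseqid].add(sseqid)
    return dict(hits)


# Made-up rows, for illustration only.
table = io.StringIO(
    "contig1_1\tWP_000001.1\t98.2\t120\t2\t0\t1\t120\t1\t120\t1e-50\t230\n"
    "contig1_1\tWP_000002.1\t91.0\t118\t10\t1\t1\t118\t3\t120\t1e-40\t200\n"
    "contig1_2\tWP_000003.1\t75.5\t90\t22\t0\t5\t94\t1\t90\t1e-20\t110\n"
)
hits = read_hits(table)
```

The resulting mapping has the {qseqid: {sseqid, …}, …} shape that convert_sseqids_to_taxids() is documented to accept.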
-
convert_sseqids_to_taxids(sseqids: Dict[str, str]) → Dict[str, int]¶ Translates subject sequence ids to taxids from prot.accession2taxid.gz.
Note
If an accession number is no longer available in prot.accession2taxid.gz (either due to being suppressed, deprecated or removed by NCBI), then root taxid (1) is returned as the taxid for the corresponding sseqid.
Parameters: sseqids (dict) – {qseqid: {sseqid, …}, …}
Returns: {qseqid: {taxid, taxid, …}, …}
Return type: dict
Raises: FileNotFoundError – prot.accession2taxid.gz database is required for sseqid to taxid conversion.
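The root fallback described in the note can be sketched with an in-memory lookup. This is a minimal sketch: `sseqids_to_taxids` and the toy `accession2taxid` dict are hypothetical stand-ins for the lookup against prot.accession2taxid.gz.

```python
ROOT_TAXID = 1


def sseqids_to_taxids(sseqids, accession2taxid):
    """Map each query's subject ids to taxids, falling back to root (1)."""
    return {
        qseqid: {accession2taxid.get(sseqid, ROOT_TAXID) for sseqid in subjects}
        for qseqid, subjects in sseqids.items()
    }


# Toy lookup table (assumption); the real data lives in prot.accession2taxid.gz.
accession2taxid = {"WP_000001.1": 562, "WP_000002.1": 561}
sseqids = {
    "orf1": {"WP_000001.1", "WP_000002.1"},
    "orf2": {"WP_REMOVED.1"},  # accession that is absent from the lookup
}
taxids = sseqids_to_taxids(sseqids, accession2taxid)
```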
-
lca(node1, node2)¶ Performs Range Minimum Query between 2 taxids.
Parameters: - node1 (int) – taxid
- node2 (int) – taxid
Returns: LCA taxid
Return type: int
Raises: ValueError– Provided taxid is not in the nodes.dmp tree.
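To show what a Range Minimum Query over the tour means, here is a minimal sketch on a five-node toy tree (root 1 with children 2 and 3; node 2 with children 4 and 5). The function `rmq_lca` is hypothetical and scans the range linearly, whereas the class relies on its precomputed sparse attribute.

```python
# Toy Eulerian tour data (assumption), shaped like the tour, level and
# occurrence attributes described above.
tour = [1, 2, 4, 2, 5, 2, 1, 3, 1]
level = [0, 1, 2, 1, 2, 1, 0, 1, 0]
occurrence = {1: 0, 2: 1, 4: 2, 5: 4, 3: 7}


def rmq_lca(node1, node2):
    """Return the LCA of two nodes with a linear-scan range minimum query."""
    for node in (node1, node2):
        if node not in occurrence:
            raise ValueError(f"{node} is not in the tree")
    low, high = sorted((occurrence[node1], occurrence[node2]))
    # The shallowest node visited between the two first occurrences is the LCA.
    idx = min(range(low, high + 1), key=level.__getitem__)
    return tour[idx]
```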
-
parse(lca_fpath: str, orfs_fpath: str = None) → Dict[str, Dict[str, Dict[int, int]]]¶ Retrieve and construct contig dictionary from provided lca_fpath.
Parameters: - lca_fpath (str) – </path/to/lcas.tsv> tab-delimited ordered columns: qseqid, name, rank, lca_taxid
- orfs_fpath (str, optional (required if using prodigal version <2.6)) – </path/to/prodigal/called/orfs.fasta> Note: These ORFs should correspond to the ORFs provided in the BLAST table.
Returns: {contig:{rank:{taxid:counts, …}, rank:{…}, …}, …}
Return type: dict
Raises: - FileNotFoundError – lca_fpath does not exist.
- FileNotFoundError – orfs_fpath does not exist.
- ValueError – If prodigal version is under 2.6, orfs_fpath is a required input.
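The nested return value can be pictured with a short sketch. It assumes the table has a header row named as in lca_fpath above and that ORF ids follow a `<contig>_<number>` pattern; `count_lcas_by_contig` is a hypothetical helper, not the method itself.

```python
import csv
import io
from collections import defaultdict


def count_lcas_by_contig(handle):
    """Build {contig: {rank: {taxid: count}}} from an LCA table (sketch)."""
    contigs = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: the contig id is the ORF id without its trailing "_<n>".
        contig = row["qseqid"].rsplit("_", 1)[0]
        contigs[contig][row["rank"]][int(row["lca_taxid"])] += 1
    # Convert the nested defaultdicts to plain dicts.
    return {
        contig: {rank: dict(counts) for rank, counts in ranks.items()}
        for contig, ranks in contigs.items()
    }


# Made-up table contents, for illustration only.
table = io.StringIO(
    "qseqid\tname\trank\tlca_taxid\n"
    "contig1_1\tEscherichia coli\tspecies\t562\n"
    "contig1_2\tEscherichia coli\tspecies\t562\n"
    "contig1_3\tEscherichia\tgenus\t561\n"
    "contig2_1\tBacteria\tsuperkingdom\t2\n"
)
contig_lcas = count_lcas_by_contig(table)
```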
-
prepare_lca()¶ Prepare the LCA internal data structures used by lca(), i.e. ensure self.tour, self.level, self.occurrence and self.sparse are all ready.
Returns: Prepares all LCA internals and if successful sets self.lca_prepared to True.
Return type: NoneType
-
prepare_tree()¶ Performs Eulerian tour of nodes.dmp taxids and constructs three data structures:
- tour : list of branches and leaves.
- level: list of distances from the root.
- occurrence: dict of the first occurrence (index in tour) of each taxid.
Notes
For more information on why we construct these three data structures see references below:
- Geeksforgeeks: Find LCA in Binary Tree using RMQ (https://www.geeksforgeeks.org/find-lca-in-binary-tree-using-rmq/)
- Topcoder: Another easy solution in <O(N logN), O(logN)> (https://www.topcoder.com/community/competitive-programming/tutorials/range-minimum-query-and-lowest-common-ancestor/#Another%20easy%20solution%20in%20O(N%20logN,%20O(logN))
Returns: sets internals to be used for LCA lookup
Return type: NoneType
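The three structures can be produced by a depth-first walk that records a node every time it is entered or returned to. The sketch below does this for a toy child map; `euler_tour` and `children` are hypothetical names. The recursion is fine for a small tree, though a tree as large as the NCBI taxonomy may call for an iterative walk.

```python
def euler_tour(children, root=1):
    """Return (tour, level, occurrence) for a tree given as {node: [children]}."""
    tour, level, occurrence = [], [], {}

    def visit(node, depth):
        # Record the index of the first time this node appears in the tour.
        occurrence.setdefault(node, len(tour))
        tour.append(node)
        level.append(depth)
        for child in children.get(node, []):
            visit(child, depth + 1)
            # Coming back up from the child, the parent is visited again.
            tour.append(node)
            level.append(depth)

    visit(root, 0)
    return tour, level, occurrence


# Toy tree (assumption): root 1 has children 2 and 3; node 2 has children 4 and 5.
children = {1: [2, 3], 2: [4, 5]}
tour, level, occurrence = euler_tour(children)
```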
-
preprocess_minimums()¶ Preprocesses all possible LCAs.
This constructs a sparse table to be used for LCA/Range Minimum Query using the self.level array associated with its respective Eulerian self.tour. For more information on these data structures see prepare_tree().
Sparse table size:
- n = number of elements in level list
- rows range = (0 to n)
- columns range = (0 to log n)
Returns: sets self.sparse internal to be used for LCA lookup.
Return type: NoneType
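A sparse table of this shape can be sketched in pure Python as follows. Entry [i][j] holds the index of the minimum level in the window of length 2**j starting at i. The names `build_sparse` and `query_min_index` are hypothetical, and nested lists are used here where the class stores a numpy.ndarray.

```python
def build_sparse(level):
    """Build a sparse table of argmin indices over the level list."""
    n = len(level)
    ncols = n.bit_length()  # floor(log2(n)) + 1 columns
    sparse = [[0] * ncols for _ in range(n)]
    for i in range(n):
        sparse[i][0] = i  # a window of length 1 is its own minimum
    for j in range(1, ncols):
        span = 1 << j
        for i in range(n - span + 1):
            left = sparse[i][j - 1]
            right = sparse[i + span // 2][j - 1]
            sparse[i][j] = left if level[left] <= level[right] else right
    return sparse


def query_min_index(sparse, level, low, high):
    """Return the index of the minimum level in the inclusive range low..high."""
    j = (high - low + 1).bit_length() - 1
    # Two overlapping windows of length 2**j cover the whole range.
    left = sparse[low][j]
    right = sparse[high - (1 << j) + 1][j]
    return left if level[left] <= level[right] else right


# Level list of a toy Eulerian tour (assumption).
level = [0, 1, 2, 1, 2, 1, 0, 1, 0]
sparse = build_sparse(level)
```

Each query then costs two table lookups, regardless of how far apart the two tour positions are.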
-
reduce_taxids_to_lcas(taxids: Dict[str, int]) → Dict[str, int]¶ Retrieves the lowest common ancestor for each set of taxids in taxids.
Parameters: taxids (dict) – {qseqid: {taxid, …}, qseqid: {taxid, …}, …}
Returns: {qseqid: lca, qseqid: lca, …}
Return type: dict
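Reducing a set of taxids to a single LCA amounts to folding a pairwise LCA function over the set. The sketch below assumes a toy parent map and a simple path-walking `toy_lca`; both are hypothetical and stand in for the class's lca() method.

```python
from functools import reduce

# Toy tree (assumption): child -> parent, with root 1 as its own parent.
PARENTS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}


def path_to_root(taxid):
    """Return the list of taxids from taxid up to the root."""
    path = [taxid]
    while PARENTS[path[-1]] != path[-1]:
        path.append(PARENTS[path[-1]])
    return path


def toy_lca(a, b):
    """Pairwise LCA by walking b's path until it meets a's path."""
    ancestors = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors:
            return node
    return 1


def reduce_to_lcas(taxids, lca=toy_lca):
    """Fold the pairwise LCA over each query's taxids.

    Queries with no taxids are skipped, which is a choice of this sketch.
    """
    return {qseqid: reduce(lca, hits) for qseqid, hits in taxids.items() if hits}


lcas = reduce_to_lcas({"orf1": {4, 5}, "orf2": {4, 5, 3}, "orf3": {4}})
```

Because the pairwise LCA is associative and commutative, the iteration order of each set does not affect the result.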
-
write_lcas(lcas: Dict[str, int], out: str) → str¶ Write lcas to tab-delimited file: out.
Ordered columns are:
- qseqid : query seqid
- name : LCA name
- rank : LCA rank
- lca : LCA taxid
Parameters: - lcas (dict) – {qseqid:lca_taxid, qseqid:lca_taxid, …}
- out (str) – </path/to/output/file.tsv>
Returns: out
Return type: str
-
autometa.taxonomy.lca.main()¶
autometa.taxonomy.majority_vote module¶
This script contains the modified majority vote algorithm used in Autometa version 1.0
-
autometa.taxonomy.majority_vote.is_consistent_with_other_orfs(taxid: int, rank: str, rank_counts: Dict[str, Dict[KT, VT]], ncbi: autometa.taxonomy.ncbi.NCBI) → bool¶ Determines whether the majority of proteins in a contig, with rank equal to or above the given rank, are common ancestors of the taxid.
If the majority are, this function returns True, otherwise it returns False.
Parameters: - taxid (int) – taxid to search against other taxids at rank in rank_counts.
- rank (str) – Canonical rank to search in rank_counts. Choices: species, genus, family, order, class, phylum, superkingdom.
- rank_counts (dict) – LCA canonical rank counts retrieved from ORFs respective to a contig. e.g. {canonical_rank: {taxid: num_hits, …}, …}
- ncbi (NCBI instance) – Instance or subclass of NCBI from autometa.taxonomy.ncbi.
Returns: If the majority of ORFs in a contig are equal or above given rank then return True, otherwise return False.
Return type: boolean
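One way to read this check is sketched below. It assumes that "common ancestor of the taxid" means the other taxid lies on the given taxid's path to root, and it uses a heavily simplified toy parent map; `is_consistent` and `is_ancestor_or_self` are hypothetical names, and the actual function delegates its ancestry test to the NCBI instance.

```python
RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]

# Simplified toy tree (assumption): child -> parent, intermediate ranks omitted.
PARENTS = {1: 1, 2: 1, 1224: 2, 561: 1224, 562: 561, 1239: 2, 1386: 1239}


def is_ancestor_or_self(ancestor, taxid):
    """Return True if ancestor lies on the path from taxid to root."""
    while True:
        if taxid == ancestor:
            return True
        if taxid == 1:
            return False
        taxid = PARENTS[taxid]


def is_consistent(taxid, rank, rank_counts):
    """Return True if most hits at rank or above lie on taxid's lineage."""
    consistent = total = 0
    for current_rank in RANKS[RANKS.index(rank):]:
        for other, count in rank_counts.get(current_rank, {}).items():
            total += count
            if is_ancestor_or_self(other, taxid):
                consistent += count
    return consistent > total / 2


# Made-up counts for a single contig.
rank_counts = {"species": {562: 3}, "genus": {1386: 1}, "phylum": {1224: 1}}
```

With these counts, taxid 562 at rank species is supported by four of five hits, while taxid 1386 at rank genus is supported by only one of the two hits at genus or above.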
-
autometa.taxonomy.majority_vote.lowest_majority(rank_counts: Dict[str, Dict[KT, VT]], ncbi: autometa.taxonomy.ncbi.NCBI) → int¶ Determine the lowest majority given rank_counts by first attempting to get a taxid that leads in counts with the highest specificity in terms of canonical rank.
Parameters: - rank_counts (dict) – {canonical_rank:{taxid:num_hits, …}, rank2: {…}, …}
- ncbi (NCBI instance) – NCBI object from autometa.taxonomy.ncbi
Returns: Taxid above the lowest majority threshold.
Return type: int
-
autometa.taxonomy.majority_vote.main()¶
-
autometa.taxonomy.majority_vote.majority_vote(lca_fpath: str, out: str, ncbi_dir: str, verbose: bool = False, orfs: str = None, force: bool = False) → str¶ Wrapper for modified majority voting algorithm from Autometa 1.0
Parameters: - lca_fpath (str) – Path to lowest common ancestor assignments table.
- out (str) – Path to write assigned taxids.
- ncbi_dir (str) – Path to NCBI databases directory.
- verbose (bool, optional) – Increase verbosity of logging stream
- orfs (str, optional) – Path to prodigal called orfs corresponding to LCA table computed from BLAST output
- force (bool, optional) – Whether to overwrite existing LCA results.
Returns: Path to assigned taxids table.
Return type: str
-
autometa.taxonomy.majority_vote.rank_taxids(ctg_lcas: dict, ncbi: Union[autometa.taxonomy.ncbi.NCBI, autometa.taxonomy.lca.LCA], verbose: bool = False) → Dict[str, int]¶ Votes for taxids based on modified majority vote system where if a majority does not exist, the lowest majority is voted.
Parameters: - ctg_lcas (dict) – {ctg1:{canonical_rank:{taxid:num_hits,…},…}, ctg2:{…},…}
- ncbi (ncbi.NCBI or lca.LCA object) – instance of NCBI subclass or NCBI containing NCBI methods.
- verbose (bool) – Increase verbosity of logging stream (the default is False).
Returns: {contig:voted_taxid, contig:voted_taxid, …}
Return type: dict
-
autometa.taxonomy.majority_vote.write_votes(results: Dict[str, int], out: str) → str¶ Writes voting results to provided outfpath.
Parameters: - results (dict) – {contig:voted_taxid, contig:voted_taxid, …}
- out (str) – </path/to/results.tsv>.
Returns: </path/to/results.tsv>
Return type: str
Raises: FileExistsError– Voting results file already exists
autometa.taxonomy.ncbi module¶
File containing definition of the NCBI class and containing functions useful for handling NCBI taxonomy databases
-
class autometa.taxonomy.ncbi.NCBI(dirpath, verbose=False)¶
Bases: object
Taxonomy utilities for NCBI databases.
-
CANONICAL_RANKS= ['species', 'genus', 'family', 'order', 'class', 'phylum', 'superkingdom', 'root']¶
-
__repr__()¶ Operator overloading to return the string representation of the class object
Returns: String representation of the class object
Return type: str
-
__str__()¶ Operator overloading to return the directory path of the class object
Returns: Directory path of the class object
Return type: str
-
convert_taxid_dtype(taxid: int) → int¶
1. Converts the given taxid to an integer and checks whether it is positive.
2. Checks whether taxid is present in both nodes.dmp and names.dmp.
3a. If (2) is false, will check for corresponding taxid in merged.dmp and convert to this then redo (2).
3b. If (2) is true, will return converted taxid.
Parameters: taxid (int) – identifier for a taxon in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
Returns: taxid if the taxid is a positive integer and present in either nodes.dmp or names.dmp or taxid recovered from merged.dmp
Return type: int
Raises: - ValueError – Provided taxid is not a positive integer
- DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
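The numbered steps can be sketched with plain dictionaries standing in for the parsed databases. `convert_taxid`, the toy dicts and the locally defined `DatabaseOutOfSyncError` are assumptions for illustration, not the method's actual code.

```python
class DatabaseOutOfSyncError(Exception):
    """Stand-in for the exception named in the documentation."""


def convert_taxid(taxid, nodes, names, merged):
    """Validate a taxid and resolve it through merged if needed (sketch)."""
    # Step 1: convert to a positive integer.
    if isinstance(taxid, float) and not taxid.is_integer():
        raise ValueError(f"{taxid!r} is not a positive integer")
    try:
        taxid = int(taxid)
    except (TypeError, ValueError) as err:
        raise ValueError(f"{taxid!r} is not a positive integer") from err
    if taxid <= 0:
        raise ValueError(f"{taxid!r} is not a positive integer")
    # Steps 2 and 3b: present in both nodes and names, so return it.
    if taxid in nodes and taxid in names:
        return taxid
    # Step 3a: look up a replacement in merged, then repeat the check.
    replacement = merged.get(taxid)
    if replacement in nodes and replacement in names:
        return replacement
    raise DatabaseOutOfSyncError(f"{taxid} not found in nodes, names or merged")


# Toy databases (assumption); 10000001 is a made-up retired taxid.
nodes = {1: {"parent": 1}, 562: {"parent": 561}}
names = {1: "root", 562: "Escherichia coli"}
merged = {10000001: 562}
```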
-
get_lineage_dataframe(taxids: Iterable[T_co], fillna: bool = True) → pandas.core.frame.DataFrame¶ Given an iterable of taxids generate a pandas DataFrame of their canonical lineages
Parameters: - taxids (iterable) – taxids whose lineage dataframe is being returned
- fillna (bool, optional) – Whether to fill the empty cells with ‘unclassified’ or not, default True
Returns: index = taxid columns = [superkingdom,phylum,class,order,family,genus,species]
Return type: pd.DataFrame
Example
If you would like to merge the returned DataFrame (‘this_df’) with another DataFrame (‘your_df’), for example the one you retrieved your taxids from:
merged_df = pd.merge( left=your_df, right=this_df, how='left', left_on=<taxid_column>, right_index=True)
-
is_common_ancestor(taxid_A: int, taxid_B: int) → bool¶ Determines whether the provided taxids have a non-root common ancestor
Parameters: - taxid_A (int) – taxid in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
- taxid_B (int) – taxid in NCBI taxonomy databases - nodes.dmp, names.dmp or merged.dmp
Returns: True if taxids share a common ancestor else False
Return type: boolean
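A minimal sketch of a non-root common ancestor test, assuming a simplified toy parent map: collect each taxid's path to root without the root itself, then check whether the two paths intersect. `share_non_root_ancestor` and `path_without_root` are hypothetical names.

```python
# Simplified toy tree (assumption): child -> parent, intermediate ranks omitted.
PARENTS = {2: 1, 2157: 1, 1224: 2, 561: 1224, 562: 561, 1239: 2, 1386: 1239}


def path_without_root(taxid):
    """Return the taxid and all of its ancestors, excluding root (1)."""
    seen = set()
    while taxid != 1:
        seen.add(taxid)
        taxid = PARENTS.get(taxid, 1)
    return seen


def share_non_root_ancestor(taxid_a, taxid_b):
    """Return True if the two paths to root meet below the root."""
    return bool(path_without_root(taxid_a) & path_without_root(taxid_b))
```

Here 562 and 1386 meet at 2, whereas 562 and 2157 only meet at the root.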
-
lineage(taxid: int, canonical: bool = True) → List[Dict[KT, VT]]¶ Returns the lineage of taxids encountered when traversing to root
Parameters: - taxid (int) – taxid in nodes.dmp, whose lineage is being returned
- canonical (bool, optional) – Lineage includes both canonical and non-canonical ranks when False, and only the canonical ranks when True. Canonical ranks include: species, genus, family, order, class, phylum, superkingdom, root
Returns: [{‘taxid’: taxid, ‘rank’: rank, ‘name’: name}, …]
Return type: ordered list of dicts
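The traversal and the effect of canonical can be sketched with toy tables. `NODES`, `NAMES` and this `lineage` function are hypothetical stand-ins with intermediate ranks omitted, and labelling taxid 1 with the rank "root" is a choice of this sketch.

```python
CANONICAL_RANKS = [
    "species", "genus", "family", "order", "class", "phylum", "superkingdom", "root",
]

# Toy tables (assumption), far smaller than nodes.dmp and names.dmp.
NODES = {
    1: {"parent": 1, "rank": "no rank"},
    131567: {"parent": 1, "rank": "no rank"},
    2: {"parent": 131567, "rank": "superkingdom"},
    1224: {"parent": 2, "rank": "phylum"},
    561: {"parent": 1224, "rank": "genus"},
    562: {"parent": 561, "rank": "species"},
}
NAMES = {
    1: "root", 131567: "cellular organisms", 2: "Bacteria",
    1224: "Proteobacteria", 561: "Escherichia", 562: "Escherichia coli",
}


def lineage(taxid, canonical=True):
    """Walk from taxid to root, optionally keeping only canonical ranks."""
    path = []
    while True:
        rank = "root" if taxid == 1 else NODES[taxid]["rank"]
        if not canonical or rank in CANONICAL_RANKS:
            path.append({"taxid": taxid, "rank": rank, "name": NAMES[taxid]})
        if taxid == 1:
            break
        taxid = NODES[taxid]["parent"]
    return path
```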
-
name(taxid: int, rank: str = None) → str¶ Parses through the names.dmp in search of the given taxid and returns its name. If the taxid is deprecated, suppressed or withdrawn from NCBI (basically old), the updated name will be retrieved.
Parameters: - taxid (int) – taxid whose name is being returned
- rank (str, optional) – If provided, will return taxid name at rank, by default None. Must be a canonical rank; choices: species, genus, family, order, class, phylum, superkingdom. E.g. self.name(562, ‘genus’) would return ‘Escherichia’, where 562 is the taxid for Escherichia coli.
Returns: Name of provided taxid if taxid is found in names.dmp else ‘unclassified’
Return type: str
Raises: DatabaseOutOfSyncError– NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
-
parent(taxid: int) → int¶ Retrieve the parent taxid of provided taxid. If the taxid is deprecated, suppressed or withdrawn from NCBI (basically old), the updated parent will be retrieved.
Parameters: taxid (int) – child taxid to retrieve parent
Returns: Parent taxid if found in nodes.dmp otherwise 1
Return type: int
Raises: DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
-
parse_merged() → Dict[int, int]¶ Parse the merged.dmp database Note: This is performed when a new NCBI class instance is constructed
Returns: {old_taxid: new_taxid, …}
Return type: dict
-
parse_names() → Dict[int, str]¶ Parses through names.dmp database and loads taxids with scientific names
Returns: {taxid:name, …}
Return type: dict
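For orientation, names.dmp rows carry four fields (tax_id, name_txt, unique name, name class) separated by a tab, pipe, tab sequence. A minimal sketch that keeps only scientific names might look like this; the function below is a hypothetical standalone version and the sample rows are illustrative.

```python
import io


def parse_names(handle):
    """Return {taxid: scientific name} from names.dmp-style text (sketch)."""
    names = {}
    for line in handle:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        fields = line.split("\t|\t")
        if len(fields) < 4:
            continue
        taxid, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            names[int(taxid)] = name
    return names


# Illustrative rows in names.dmp layout.
sample = io.StringIO(
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
    "562\t|\tBacillus coli\t|\t\t|\tsynonym\t|\n"
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n"
)
names = parse_names(sample)
```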
-
parse_nodes() → Dict[int, str]¶ Parse the nodes.dmp database to be used later by autometa.taxonomy.ncbi.NCBI.parent() and autometa.taxonomy.ncbi.NCBI.rank(). Note: This is performed when a new NCBI class instance is constructed
Returns: {child_taxid:{‘parent’:parent_taxid, ‘rank’:rank}, …}
Return type: dict
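The first three fields of each nodes.dmp row are tax_id, parent tax_id and rank, which is all the documented return value needs. The sketch below is a hypothetical standalone version run on abbreviated sample rows (real rows carry more fields).

```python
import io


def parse_nodes(handle):
    """Return {child_taxid: {"parent": parent_taxid, "rank": rank}} (sketch)."""
    nodes = {}
    for line in handle:
        if not line.strip():
            continue
        # Fields are separated by tab, pipe, tab; only the first three are used.
        child, parent, rank = line.split("\t|\t")[:3]
        nodes[int(child)] = {"parent": int(parent), "rank": rank}
    return nodes


# Abbreviated rows in nodes.dmp layout, for illustration only.
sample = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\t\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\t\t|\n"
    "562\t|\t561\t|\tspecies\t|\tEC\t|\n"
)
nodes = parse_nodes(sample)
```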
-
rank(taxid: int) → str¶ Return the respective rank of provided taxid. If the taxid is deprecated, suppressed or withdrawn from NCBI (basically old), the updated rank will be retrieved.
Parameters: taxid (int) – taxid to retrieve rank from nodes.dmp
Returns: rank name if taxid is found in nodes.dmp else “unclassified”
Return type: str
Raises: DatabaseOutOfSyncError – NCBI databases nodes.dmp, names.dmp and merged.dmp are out of sync with each other
-
autometa.taxonomy.vote module¶
Script to split metagenome assembly by kingdoms given the input votes. The lineages of the provided voted taxids will also be added and written to taxonomy.tsv
-
autometa.taxonomy.vote.add_ranks(df: pandas.core.frame.DataFrame, ncbi: Union[autometa.taxonomy.ncbi.NCBI, str]) → pandas.core.frame.DataFrame¶ Add canonical ranks to df and write to out
Parameters: - df (pd.DataFrame) – index=”contig”, column=”taxid”
- ncbi (str or NCBI) – Path to NCBI databases directory, or autometa NCBI instance.
Returns: index=”contig”, columns=[“taxid”, *canonical_ranks]
Return type: pd.DataFrame
-
autometa.taxonomy.vote.assign(out: str, method: str = 'majority_vote', assembly: str = None, prot_orfs: str = None, nucl_orfs: str = None, blast: str = None, lca_fpath: str = None, ncbi_dir: str = './autometa/databases/ncbi', force: bool = False, verbose: bool = False, parallel: bool = True, cpus: int = 0) → pandas.core.frame.DataFrame¶ Assign taxonomy using method and write to out.
Parameters: - out (str) – Path to write taxonomy table of votes
- method (str, optional) – Method to assign contig taxonomy, by default “majority_vote”. choices include “majority_vote”, …
- assembly (str, optional) – Path to assembly fasta file (nucleotide), by default None
- prot_orfs (str, optional) – Path to amino-acid ORFs called from assembly, by default None
- nucl_orfs (str, optional) – Path to nucleotide ORFs called from assembly, by default None
- blast (str, optional) – Path to blastp table, by default None
- lca_fpath (str, optional) – Path to output of LCA analysis, by default None
- ncbi_dir (str, optional) – Path to NCBI databases directory, by default NCBI_DIR
- force (bool, optional) – Overwrite existing annotations, by default False
- verbose (bool, optional) – Increase verbosity, by default False
- parallel (bool, optional) – Whether to perform annotations using multiprocessing and GNU parallel, by default True
- cpus (int, optional) – Number of cpus to use if parallel is True, by default will try to use all available.
Returns: index=”contig”, columns=[“taxid”]
Return type: pd.DataFrame
Raises: - NotImplementedError – Provided method has not yet been implemented.
- ValueError – Assembly file is required if no other annotations are provided.
-
autometa.taxonomy.vote.get(filepath_or_dataframe: Union[str, pandas.core.frame.DataFrame], kingdom: str, ncbi: Union[autometa.taxonomy.ncbi.NCBI, str] = './autometa/databases/ncbi') → pandas.core.frame.DataFrame¶ Retrieve specific kingdom voted taxa for assembly from filepath
Parameters: - filepath (str) – Path to tab-delimited taxonomy table. cols=[‘contig’,’taxid’, *canonical_ranks]
- kingdom (str) – rank to retrieve from superkingdom column in taxonomy table.
- ncbi (str or autometa.taxonomy.NCBI instance, optional) – Path to NCBI database directory or NCBI instance, by default NCBI_DIR. This is necessary only if filepath does not already contain columns of canonical ranks.
Returns: DataFrame of contigs pertaining to retrieved kingdom.
Return type: pd.DataFrame
Raises: - FileNotFoundError – Provided filepath does not exist or is empty.
- TableFormatError – Provided filepath does not contain the ‘superkingdom’ column.
- KeyError – kingdom is absent in provided taxonomy table.
-
autometa.taxonomy.vote.main()¶
-
autometa.taxonomy.vote.write_ranks(taxonomy: pandas.core.frame.DataFrame, assembly: str, outdir: str, rank: str = 'superkingdom', prefix: str = None) → List[str]¶ Write fastas split by rank
Parameters: - taxonomy (pd.DataFrame) – dataframe containing canonical ranks of contigs assigned from autometa.taxonomy.vote.assign(…)
- assembly (str) – Path to assembly fasta file
- outdir (str) – Path to output directory to write fasta files
- rank (str, optional) – canonical rank column in taxonomy table to split by, by default “superkingdom”
- prefix (str, optional) – Prefix each of the paths written with prefix string.
Returns: [rank_name_fpath, …]
Return type: list
Raises: ValueError– rank not in canonical ranks