autometa.common package
Subpackages
Submodules
autometa.common.coverage module
Calculates coverage of contigs
-
autometa.common.coverage.from_spades_names(records)
Retrieve coverages from SPAdes scaffold headers.
Example SPAdes header: NODE_83_length_162517_cov_224.639
Parameters: records (list) – [SeqRecord,…]
Returns: index=contig, name=’coverage’, dtype=float
Return type: pd.Series
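The header format above can be parsed with plain string handling. The following is only a sketch of the idea, not Autometa's implementation: the helper name coverage_from_spades_name is hypothetical, and a plain dict stands in for the pd.Series that from_spades_names returns.

```python
def coverage_from_spades_name(name: str) -> float:
    """Return the k-mer coverage encoded in a SPAdes contig name (hypothetical helper)."""
    # SPAdes names look like NODE_<n>_length_<bp>_cov_<coverage>;
    # keep the text after the final "_cov_" marker.
    _, sep, cov = name.rpartition("_cov_")
    if not sep:
        raise ValueError(f"no '_cov_' field in {name!r}")
    return float(cov)


# Example names; the second one is made up for illustration.
names = ["NODE_83_length_162517_cov_224.639", "NODE_1_length_500_cov_3.5"]
coverages = {name: coverage_from_spades_name(name) for name in names}
```

Names that were not produced by SPAdes raise ValueError here, which mirrors the note that header parsing is only compatible with SPAdes assemblies.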
-
autometa.common.coverage.get(fasta, out, from_spades=False, fwd_reads=None, rev_reads=None, se_reads=None, sam=None, bam=None, lengths=None, bed=None, cpus=1)
Get coverages for the assembly fasta file using the provided files. If the metagenome assembly was generated by SPAdes, the k-mer coverages in each contig’s header can be used instead by specifying from_spades=True.
If from_spades=False, at least one of the following must be provided: fwd_reads and rev_reads and/or se_reads, sam, bam, or bed.
Notes
Will begin coverage calculation based on files provided checking in the following order:
- bed
- bam
- sam
- fwd_reads and rev_reads and/or se_reads
Event sequence to calculate contig coverages:
- align reads to generate alignment.sam
- sort samfile to generate alignment.bam
- calculate assembly coverages to generate alignment.bed
- calculate contig coverages to generate coverage.tsv
Parameters: - fasta (str) – </path/to/assembly.fasta>
- out (str) – </path/to/output/coverages.tsv>
- from_spades (bool, optional) – If True, will attempt to parse record ids for coverage information. This is only compatible with SPAdes assemblies (the default is False).
- fwd_reads (list, optional) – [</path/to/forward_reads.fastq>, …]
- rev_reads (list, optional) – [</path/to/reverse_reads.fastq>, …]
- se_reads (list, optional) – [</path/to/single_end_reads.fastq>, …]
- sam (str, optional) – </path/to/alignments.sam>
- bam (str, optional) – </path/to/alignments.bam>
- lengths (str, optional) – </path/to/lengths.tsv>
- bed (str, optional) – </path/to/alignments.bed>
- cpus (int, optional) – Number of cpus to use for coverage calculation.
Returns: index=contig cols=[‘coverage’]
Return type: pd.DataFrame
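The file-checking order and the event sequence described in the Notes for get can be combined into a small dispatch. This is a sketch under assumptions: first_available_input, remaining_steps and the step labels are hypothetical names, and the idea that a more processed input skips the earlier steps is inferred from the Notes rather than taken from the source code.

```python
def first_available_input(bed=None, bam=None, sam=None,
                          fwd_reads=None, rev_reads=None, se_reads=None):
    """Return the label of the first supplied input, in the documented order."""
    # Documented order: bed, then bam, then sam, then reads.
    if bed:
        return "bed"
    if bam:
        return "bam"
    if sam:
        return "sam"
    if (fwd_reads and rev_reads) or se_reads:
        return "reads"
    raise ValueError("provide reads, sam, bam, or bed when from_spades=False")


# Event sequence from the Notes; the starting index per input is an assumption.
STEPS = ["align reads", "sort samfile", "assembly coverages", "contig coverages"]
START = {"reads": 0, "sam": 1, "bam": 2, "bed": 3}


def remaining_steps(**inputs):
    """List the steps still needed for the given inputs."""
    return STEPS[START[first_available_input(**inputs)]:]
```

For example, remaining_steps(bam="alignment.bam") leaves only the two coverage steps, while paired reads require all four.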
-
autometa.common.coverage.main()
-
autometa.common.coverage.make_length_table(fasta, out)
Writes a tab-delimited length table to out given an input fasta.
Parameters: - fasta (str) – </path/to/assembly.fasta>
- out (str) – </path/to/lengths.tsv>
Returns: </path/to/lengths.tsv>
Return type: str
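A length table of this kind can be produced with the standard library alone. The sketch below is an illustration, not the module's code: write_length_table is a hypothetical name, and the column header names (contig, length) are assumed because the documentation does not state them.

```python
import os
import tempfile


def write_length_table(fasta: str, out: str) -> str:
    """Write a tab-delimited contig/length table for a FASTA file (sketch)."""
    lengths = {}
    contig = None
    with open(fasta) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                # Record ID is the header text up to the first whitespace.
                contig = line[1:].split()[0]
                lengths[contig] = 0
            elif contig is not None:
                lengths[contig] += len(line)
    with open(out, "w") as fh:
        fh.write("contig\tlength\n")  # header names are an assumption
        for contig, length in lengths.items():
            fh.write(f"{contig}\t{length}\n")
    return out


# Small demonstration on a temporary two-contig assembly.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "assembly.fasta")
    with open(fasta, "w") as fh:
        fh.write(">contig_1 description\nACGT\nAC\n>contig_2\nGGG\n")
    out = write_length_table(fasta, os.path.join(tmp, "lengths.tsv"))
    with open(out) as fh:
        table = fh.read()
```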
-
autometa.common.coverage.normalize(df)
autometa.common.exceptions module
File containing customized AutometaErrors for more specific exception handling
-
exception autometa.common.exceptions.AutometaError
Bases: Exception
Base class for Autometa Errors.
-
exception autometa.common.exceptions.BinningError
Bases: autometa.common.exceptions.AutometaError
BinningError exception class.
Exception called when issues arise during or after the binning process.
This is usually a result of no clusters being recovered.
-
exception autometa.common.exceptions.ChecksumMismatchError
Bases: autometa.common.exceptions.AutometaError
ChecksumMismatchError exception class.
Exception called when checksums do not match.
-
exception autometa.common.exceptions.DatabaseOutOfSyncError(value)
Bases: autometa.common.exceptions.AutometaError
Raised when the NCBI database files nodes.dmp, names.dmp and merged.dmp are out of sync with each other.
Parameters: AutometaError (class) – Base class for other exceptions
-
__str__()
Overridden to return the message supplied when the error was raised, rather than the message produced by the base exception’s __str__.
Returns: Message written alongside raising the exception
Return type: str
-
-
exception autometa.common.exceptions.ExternalToolError(cmd, err)
Bases: autometa.common.exceptions.AutometaError
Raised when samtools sort is not executed properly.
Parameters: AutometaError (class) – Base class for other exceptions
-
exception autometa.common.exceptions.TableFormatError
Bases: autometa.common.exceptions.AutometaError
TableFormatError exception class.
Exception called when Table format is incorrect.
This is usually a result of a table missing the ‘contig’ column as this is often used as the index.
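Because every exception in this module derives from AutometaError, callers can catch the base class to handle any of them. The sketch below defines local stand-in classes that only mirror the documented hierarchy so that it runs without Autometa installed. Their bodies, and the helpers require_contig_column and describe_failure, are assumptions for illustration.

```python
class AutometaError(Exception):
    """Stand-in for the documented base class."""


class TableFormatError(AutometaError):
    """Stand-in: raised when a table lacks an expected column."""


class DatabaseOutOfSyncError(AutometaError):
    """Stand-in: stores a message and returns it from __str__."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value


def require_contig_column(columns):
    """Hypothetical check for the 'contig' column mentioned above."""
    if "contig" not in columns:
        raise TableFormatError("table is missing the 'contig' column")
    return columns


def describe_failure(func, *args):
    """Run func and report any Autometa-style error as text."""
    try:
        func(*args)
    except AutometaError as err:  # catches every subclass
        return f"{type(err).__name__}: {err}"
    return "ok"
```

With the real package, the same pattern would import the classes from autometa.common.exceptions instead of defining them locally.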
autometa.common.kmers module
autometa.common.markers module
Autometa Marker class consisting of various methods to annotate sequences with marker sets depending on sequence set taxonomy
-
autometa.common.markers.get(kingdom: str, orfs: str, dbdir: str, scans: str = None, out: str = None, force: bool = False, format: str = 'wide', cpus: int = 2, parallel: bool = True, gnu_parallel: bool = False, seed: int = 42) → pandas.core.frame.DataFrame
Retrieve contigs’ markers from the markers database that pass the cutoffs filter.
Parameters: - kingdom (str) – Kingdom whose markers are annotated; choices = [‘bacteria’, ‘archaea’]
- orfs (str) – Path to amino-acid ORFs file
- dbdir (str) – Directory containing the hmmpressed marker genes database files.
- scans (str, optional) – Path to existing hmmscan table to filter by cutoffs
- out (str, optional) – Path to write annotated markers table.
- force (bool, optional) – Whether to overwrite existing out file path, by default False.
- format (str, optional) –
- wide - returns wide dataframe of contig PFAM counts (default)
- long - returns long dataframe of contig PFAM counts
- list - returns list of pfams for each contig
- counts - returns count of pfams for each contig
- cpus (int, optional) – Number of cores to use if running in parallel, by default 2.
- parallel (bool, optional) – Whether to run hmmscan using its parallel option, by default True.
- gnu_parallel (bool, optional) – Whether to run hmmscan using gnu parallel, by default False.
- seed (int, optional) – Seed to pass into hmmscan for determinism, by default 42.
Returns: - wide - pd.DataFrame(index_col=contig, columns=[PFAM,…])
- long - pd.DataFrame(index_col=contig, columns=[‘sacc’,’count’])
- list - {contig:[pfam,pfam,…],contig:[…],…}
- counts - {contig:count, contig:count,…}
Return type: pd.DataFrame or dict
Raises: ValueError– Why the exception is raised.
-
autometa.common.markers.load(fpath, format='wide')
Read markers table into specified format.
Parameters: - fpath (str) – </path/to/kingdom.markers.tsv>
- format (str, optional) –
- wide - index=contig, cols=[domain sacc,..] (default)
- long - index=contig, cols=[‘sacc’,’count’]
- list - {contig:[sacc,…],…}
- counts - {contig:len([sacc,…]), …}
Returns: - wide - index=contig, cols=[domain sacc,..] (default)
- long - index=contig, cols=[‘sacc’,’count’]
- list - {contig:[sacc,…],…}
- counts - {contig:len([sacc,…]), …}
Return type: pd.DataFrame or dict
Raises: - FileNotFoundError – Provided fpath does not exist
- ValueError – Provided format is not in choices: choices = wide, long, list or counts
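The four formats are different views of the same contig-to-marker hits. The sketch below derives each view from a list of (contig, sacc) pairs using plain Python containers in place of pandas. The accession values are made up for illustration, and counting repeated hits separately in the counts view is an assumption based on the len([sacc,…]) description.

```python
from collections import Counter, defaultdict

# Made-up (contig, sacc) hits for illustration only.
hits = [
    ("contig_1", "PF00001"),
    ("contig_1", "PF00001"),
    ("contig_1", "PF00002"),
    ("contig_2", "PF00002"),
]

# list: {contig: [sacc, ...]}
grouped = defaultdict(list)
for contig, sacc in hits:
    grouped[contig].append(sacc)
as_list = dict(grouped)

# counts: {contig: len([sacc, ...])}
counts = {contig: len(saccs) for contig, saccs in as_list.items()}

# long: one (contig, sacc, count) row per contig/marker pair
long_rows = sorted((c, s, n) for (c, s), n in Counter(hits).items())

# wide: one row per contig, one column per sacc, zero where absent
columns = sorted({sacc for _, sacc in hits})
wide = {
    contig: {sacc: Counter(saccs)[sacc] for sacc in columns}
    for contig, saccs in as_list.items()
}
```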
-
autometa.common.markers.main()
autometa.common.metagenome module
Script containing Metagenome class for general handling of metagenome assembly
-
class autometa.common.metagenome.Metagenome(assembly)
Bases: object
Autometa Metagenome Class.
Parameters: assembly (str) – </path/to/metagenome/assembly.fasta>
-
sequences
[seq,…]
Type: list
-
seqrecords
[SeqRecord,…]
Type: list
-
nseqs
Number of sequences in assembly.
Type: int
-
length_weighted_gc
Length weighted average GC% of assembly.
Type: float
-
size
Total assembly size in bp.
Type: int
-
largest_seq
ID of longest sequence in assembly.
Type: str
-
Methods:
- self.fragmentation_metric()
- self.describe()
- self.length_filter()
-
describe() → pandas.core.frame.DataFrame
Return dataframe of details.
- assembly : Assembly input into Metagenome(…) [index column]
- nseqs : Number of sequences in assembly
- size : Size or total sum of all sequence lengths
- N50 :
- N10 :
- N90 :
- length_weighted_gc_content : Length weighted average GC content
- largest_seq : Largest sequence in assembly
Return type: pd.DataFrame
-
fragmentation_metric(quality_measure: float = 0.5) → int
Describes the quality of assembled genomes that are fragmented in contigs of different length.
Note
For more information see this metagenomics wiki from Matthias Scholz
Parameters: quality_measure (0 < float < 1) – Fraction of the total assembly length to cover (the default is .50). I.e. the default measure is N50, but .1 gives N10 and .9 gives N90.
Returns: Minimum contig length to cover quality_measure of genome (i.e. length weighted median)
Return type: int
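The usual way to compute such a metric is to sort contig lengths from longest to shortest and accumulate them until the chosen fraction of the total length is reached. The following is a sketch of that standard N50-style calculation, assuming a plain list of lengths as input. The function name nx_length is hypothetical and this is not the method's actual source.

```python
def nx_length(lengths, quality_measure=0.5):
    """Return the contig length at which the running total, longest first,
    reaches quality_measure of the total assembly length."""
    if not 0 < quality_measure < 1:
        raise ValueError("quality_measure must be between 0 and 1")
    target = sum(lengths) * quality_measure
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    raise ValueError("lengths must not be empty")


contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total of 54 bp
n50 = nx_length(contig_lengths)        # 10 + 9 + 8 = 27 reaches half of 54
n10 = nx_length(contig_lengths, 0.1)
n90 = nx_length(contig_lengths, 0.9)
```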
-
gc_content() → pandas.core.frame.DataFrame
Retrieves GC content from sequences in assembly
Returns: index=”contig”, columns=[“gc_content”,”length”]
Return type: pd.DataFrame
-
largest_seq
Retrieve the name of the largest sequence in the provided assembly.
Returns: record ID of the largest sequence in assembly.
Return type: str
-
length_filter(out: str, cutoff: int = 3000, force: bool = False)
Filters sequences by length with provided cutoff.
Note
A WARNING will be emitted and the original metagenome will be returned if no contigs pass the length filter cutoff.
Parameters: - out (str) – Path to write length filtered output fasta file.
- cutoff (int, optional) – Sequences with lengths greater than or equal to cutoff are retained (the default is 3000).
- force (bool, optional) – Overwrite existing out file (the default is False).
Returns: autometa Metagenome object with only assembly sequences above the cutoff threshold.
Return type: Metagenome
Raises: - TypeError – cutoff value must be a float or integer
- ValueError – cutoff value must be a positive real number
- FileExistsError – filepath consisting of sequences that passed filter already exists
-
length_weighted_gc
Retrieve the length weighted average GC percentage of provided assembly.
Returns: GC percentage weighted by contig length.
Return type: float
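A length-weighted average lets long contigs contribute more than short ones: each contig's GC percentage is multiplied by its share of the total assembly length. The sketch below works on plain strings and is not the property's source. The name weighted_gc_percent is hypothetical, and counting every character (including ambiguous bases) toward the length is an assumption.

```python
def weighted_gc_percent(sequences):
    """Return GC% averaged over sequences, weighted by sequence length."""
    total_length = sum(len(seq) for seq in sequences)
    if total_length == 0:
        raise ValueError("no sequence data provided")
    weighted = 0.0
    for seq in sequences:
        if not seq:
            continue  # empty records carry no weight
        upper = seq.upper()
        gc_percent = 100 * (upper.count("G") + upper.count("C")) / len(seq)
        weighted += gc_percent * len(seq) / total_length
    return weighted


# A 4 bp contig at 100% GC and an 8 bp contig at 0% GC.
contigs = ["GGCC", "ATATATAT"]
result = weighted_gc_percent(contigs)
```

The unweighted mean of the two contigs would be 50%, whereas weighting by length gives one third of 100%, because the AT-only contig is twice as long.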
-
nseqs
Retrieve the number of sequences in provided assembly.
Returns: Number of sequences parsed from assembly
Return type: int
-
seqrecords
Retrieve SeqRecord objects from provided assembly.
Returns: [SeqRecord, SeqRecord, …]
Return type: list
-
sequences
Retrieve the sequences from provided assembly.
Returns: [seq, seq, …]
Return type: list
-
size
Retrieve the summation of sizes for each contig in the provided assembly.
Returns: Total summation of contig sizes in assembly
Return type: int
-
-
autometa.common.metagenome.main()
autometa.common.utilities module
File containing common utilities functions to be used by Autometa scripts.
-
autometa.common.utilities.calc_checksum(fpath: str) → str
Retrieve md5 checksum from provided fpath.
Parameters: fpath (str) – </path/to/file>
Returns: space-delimited hexdigest of fpath using md5sum and basename of fpath, e.g. ‘hash filename\n’
Return type: str
Raises: - FileNotFoundError – Provided fpath does not exist
- TypeError – fpath is not a string
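A checksum line of the described shape can be built with Python's hashlib. This sketch substitutes hashlib.md5 for the md5sum tool named above, so it is not necessarily how calc_checksum is implemented. The function name md5_checksum_line and the block_size parameter are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_checksum_line(fpath: str, block_size: int = 65536) -> str:
    """Return '<md5 hexdigest> <basename>' followed by a newline for fpath."""
    if not isinstance(fpath, str):
        raise TypeError(f"fpath must be a string, got {type(fpath).__name__}")
    if not os.path.isfile(fpath):
        raise FileNotFoundError(fpath)
    digest = hashlib.md5()
    with open(fpath, "rb") as fh:
        # Read in blocks so large files are not loaded into memory at once.
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return f"{digest.hexdigest()} {os.path.basename(fpath)}\n"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as fh:
        fh.write(b"hello\n")
    line = md5_checksum_line(path)
```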
-
autometa.common.utilities.file_length(fpath: str, approximate: bool = False) → int
Retrieve the number of lines in fpath
See: https://stackoverflow.com/q/845058/13118765
Parameters: - fpath (str) – Path to the file whose lines are counted.
- approximate (bool) – If True, will approximate the length of the file from the file size.
Returns: Number of lines in fpath
Return type: int
Raises: FileNotFoundError– provided fpath does not exist
-
autometa.common.utilities.gunzip(infpath: str, outfpath: str, delete_original: bool = False, block_size: int = 65536) → str
Decompress gzipped infpath to outfpath and write checksum of outfpath upon successful decompression.
Parameters: - infpath (str) – </path/to/file.gz>
- outfpath (str) – </path/to/file>
- delete_original (bool) – Will delete the original file after successfully decompressing infpath (Default is False).
- block_size (int) – Amount of infpath to read in to memory before writing to outfpath (Default is 65536 bytes).
Returns: </path/to/file>
Return type: str
Raises: FileExistsError– outfpath already exists and is not empty
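Block-wise decompression keeps memory use bounded by block_size regardless of file size. The following is a minimal sketch using the standard gzip module. The name gunzip_blockwise is hypothetical, and the checksum-writing step described above is omitted.

```python
import gzip
import os
import tempfile


def gunzip_blockwise(infpath, outfpath, delete_original=False, block_size=65536):
    """Decompress infpath to outfpath, reading block_size bytes at a time."""
    if os.path.exists(outfpath) and os.path.getsize(outfpath) > 0:
        raise FileExistsError(outfpath)
    with gzip.open(infpath, "rb") as src, open(outfpath, "wb") as dst:
        # iter() with a sentinel stops once read() returns empty bytes.
        for block in iter(lambda: src.read(block_size), b""):
            dst.write(block)
    if delete_original:
        os.remove(infpath)
    return outfpath


with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "reads.txt.gz")
    with gzip.open(gz_path, "wb") as fh:
        fh.write(b"ACGT\n" * 1000)
    out_path = gunzip_blockwise(gz_path, os.path.join(tmp, "reads.txt"),
                                delete_original=True, block_size=1024)
    with open(out_path, "rb") as fh:
        restored = fh.read()
    original_removed = not os.path.exists(gz_path)
```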
-
autometa.common.utilities.make_pickle(obj: Any, outfpath: str) → str
Serialize a python object (obj) to outfpath.
Note: Opposite of unpickle()
Parameters: - obj (any) – Python object to serialize to outfpath.
- outfpath (str) – </path/to/pickled/file>.
Returns: </path/to/pickled/file.pkl>
Return type: str
Raises: ExceptionName– Why the exception is raised.
-
autometa.common.utilities.read_checksum(fpath: str) → str
Read checksum from provided checksum formatted fpath.
Note: See write_checksum for how a checksum file is generated.
Parameters: fpath (str) – </path/to/file.md5>
Returns: checksum retrieved from fpath.
Return type: str
Raises: - TypeError – Provided fpath was not a string.
- FileNotFoundError – Provided fpath does not exist.
-
autometa.common.utilities.tarchive_results(outfpath: str, src_dirpath: str) → str
Generate a tar archive of Autometa Results
See: https://stackoverflow.com/questions/2032403/how-to-create-full-compressed-tar-file-using-python
Parameters: - outfpath (str) – </path/to/output/tar/archive.tar.gz> or </path/to/output/tar/archive.tgz>
- src_dirpath (str) – </path/to/directory/to/archive>
Returns: </path/to/output/tar/archive.tar.gz> or </path/to/output/tar/archive.tgz>
Return type: str
Raises: FileExistsError– outfpath already exists
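A gzipped tar archive of a results directory can be written with the standard tarfile module. This is a sketch of the general approach, not the function's source: make_tarball is a hypothetical name, and storing members under the directory's basename is an assumed design choice.

```python
import os
import tarfile
import tempfile


def make_tarball(outfpath: str, src_dirpath: str) -> str:
    """Write src_dirpath into a gzip-compressed tar archive at outfpath."""
    if os.path.exists(outfpath):
        raise FileExistsError(outfpath)
    with tarfile.open(outfpath, "w:gz") as tar:
        # arcname keeps archive members relative to the directory name
        # instead of embedding the full source path.
        tar.add(src_dirpath, arcname=os.path.basename(src_dirpath))
    return outfpath


with tempfile.TemporaryDirectory() as tmp:
    results = os.path.join(tmp, "results")
    os.mkdir(results)
    with open(os.path.join(results, "bins.tsv"), "w") as fh:
        fh.write("contig\tcluster\n")
    archive = make_tarball(os.path.join(tmp, "results.tar.gz"), results)
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())
```

Note that os.path.basename returns an empty string for a path with a trailing separator, so src_dirpath should be passed without one in this sketch.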
-
autometa.common.utilities.timeit(func: function) → function
Time function run time (to be used as a decorator), i.e. when defining a function, use Python’s decorator syntax.
Example
@timeit
def your_function(args): ...
Notes
See: https://docs.python.org/2/library/functools.html#functools.wraps
Parameters: func (function) – function to decorate with a timer
Returns: timer-decorated func.
Return type: function
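A timing decorator of this kind is typically built on functools.wraps, which the Notes link to, so that the wrapped function keeps its name and docstring. The sketch below is a stand-in written for illustration. The name simple_timeit and the printed message format are assumptions, not Autometa's implementation.

```python
import functools
import time


def simple_timeit(func):
    """Report how long func takes each time it is called (stand-in sketch)."""

    @functools.wraps(func)  # preserve func.__name__ and func.__doc__
    def timer(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.4f} seconds")
        return result

    return timer


@simple_timeit
def add(a, b):
    """Add two numbers."""
    return a + b


total = add(1, 2)
```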
-
autometa.common.utilities.unpickle(fpath: str) → Any
Load a serialized fpath from make_pickle().
Parameters: fpath (str) – </path/to/file.pkl>.
Returns: Python object that was serialized to file via make_pickle()
Return type: any
Raises: ExceptionName – Why the exception is raised.
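The round trip between make_pickle() and unpickle() can be illustrated with the standard pickle module. The helpers save_pickle and load_pickle below are hypothetical stand-ins that show the pattern only, and the example object is made up.

```python
import os
import pickle
import tempfile


def save_pickle(obj, outfpath: str) -> str:
    """Serialize obj to outfpath and return the path."""
    with open(outfpath, "wb") as fh:
        pickle.dump(obj, fh)
    return outfpath


def load_pickle(fpath: str):
    """Load and return the object serialized at fpath."""
    with open(fpath, "rb") as fh:
        return pickle.load(fh)


# Made-up example object for the round trip.
markers = {"contig_1": ["PF00001", "PF00002"], "contig_2": []}
with tempfile.TemporaryDirectory() as tmp:
    path = save_pickle(markers, os.path.join(tmp, "markers.pkl"))
    restored = load_pickle(path)
```

As with any use of pickle, files should only be loaded from trusted sources, since unpickling can execute arbitrary code.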
-
autometa.common.utilities.untar(tarchive: str, outdir: str, member: str = None) → str
Decompress a tar archive (may be gzipped or bzipped). Passing in member requires that an outdir also be provided.
See: https://docs.python.org/3.8/library/tarfile.html#module-tarfile
Parameters: - tarchive (str) – </path/tarchive.tar.[compression]>
- outdir (str) – </path/to/output/directory>
- member (str, optional) – member file to extract.
Returns: </path/to/extracted/member/file> if member else </path/to/output/directory>
Return type: str
Raises: - IsADirectoryError – outdir already exists
- ValueError – tarchive is not a tar archive
- KeyError – member was not found in tarchive
-
autometa.common.utilities.write_checksum(infpath: str, outfpath: str) → str
Calculate checksum for infpath and write to outfpath.
Parameters: - infpath (str) – </path/to/input/file>
- outfpath (str) – </path/to/output/checksum/file>
Returns: Description of returned object.
Return type: NoneType
Raises: - FileNotFoundError – Provided infpath does not exist
- TypeError – infpath or outfpath is not a string