pycallingcards.preprocessing.annotation
- pycallingcards.preprocessing.annotation(peaks_frame=None, peaks_path=None, reference='hg38', save_annotation=None, bedtools_path=None, refGene=None)
Annotate the peak data using bedtools [Quinlan and Hall, 2010].
- Parameters:
peaks_frame (Optional[DataFrame] (default: None)) – pd.DataFrame whose first three columns are chromosome, start and end. Ignored if peaks_path is provided.
peaks_path (Optional[str] (default: None)) – The path to the peak data. Because this function calls an external program (bedtools), peaks_path is preferred over peaks_frame.
reference (Optional[Literal['hg38', 'mm10', 'sacCer3']] (default: 'hg38')) – Reference genome of the annotation data. Currently, only 'hg38', 'mm10' and 'sacCer3' are provided.
save_annotation (Optional[str] (default: None)) – The path and file name under which the annotation results are saved.
bedtools_path (Optional[str] (default: None)) – The path to bedtools. If None, the default path for bedtools is used.
refGene (Optional[DataFrame] (default: None)) – If None, the saved reference annotation matching reference is used. If a DataFrame is passed, it is used instead. The DataFrame should follow the same format as https://github.com/The-Mitra-Lab/pycallingcards_data/releases/download/data/refGene.hg38.Sorted.bed, which contains six columns in total (Chrom, Start, End, Refseq, Name, Direction).
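The six-column layout described for refGene can be sketched with pandas. This is a minimal illustration, not the library's own loader: the helper name load_refgene, the placeholder rows, the assumption that the table is headerless and tab-separated, and the choice to sort by position are all assumptions made here.

```python
import io

import pandas as pd

# Column names taken from the refGene parameter description.
REFGENE_COLUMNS = ["Chrom", "Start", "End", "Refseq", "Name", "Direction"]

# Placeholder rows for illustration only; these are not real annotations.
REFGENE_TEXT = (
    "chr1\t5000\t6000\tREFSEQ_B\tGENE_B\t-\n"
    "chr1\t1000\t2000\tREFSEQ_A\tGENE_A\t+\n"
)


def load_refgene(source):
    """Read a headerless, tab-separated table and check it has six columns.

    `source` may be a file path or a file-like object (hypothetical helper).
    """
    frame = pd.read_csv(source, sep="\t", header=None)
    if frame.shape[1] != len(REFGENE_COLUMNS):
        raise ValueError(
            f"expected {len(REFGENE_COLUMNS)} columns, got {frame.shape[1]}"
        )
    frame.columns = REFGENE_COLUMNS
    # Assumption: rows are kept in sorted order, as the example file name suggests.
    return frame.sort_values(["Chrom", "Start", "End"]).reset_index(drop=True)


custom_refgene = load_refgene(io.StringIO(REFGENE_TEXT))

# With pycallingcards and bedtools installed, the frame could then be passed as:
# peak_annotation = cc.pp.annotation(peak_data, reference="mm10", refGene=custom_refgene)
```

Checking the column count before calling the function surfaces a malformed table early, rather than inside the external bedtools step.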
- Returns:
pd.DataFrame whose first three columns are chromosome, start and end, followed by the peak annotation columns.
Chr - The chromosome of the peak.
Start - The start point of the peak.
End - The end point of the peak.
Nearest Refseq1 - The Refseq of the closest gene.
Nearest Refseq2 - The Refseq of the second closest gene.
Gene Name1 - The name of the closest gene.
Gene Name2 - The name of the second closest gene.
- Return type:
DataFrame
- Example:
>>> import pycallingcards as cc
>>> qbed_data = cc.datasets.mousecortex_data(data="qbed")
>>> peak_data = cc.pp.callpeaks(qbed_data, method="CCcaller", reference="mm10", record=True)
>>> peak_annotation = cc.pp.annotation(peak_data, reference="mm10")
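Since peaks_path is described as preferred over peaks_frame, the following sketch writes peaks to a file first and passes the path instead. It assumes the file is expected to be a headerless, tab-separated table with chromosome, start and end columns. The helper write_peaks_bed, the peak coordinates and the output file name are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical peaks (chromosome, start, end), made up for illustration.
peaks = [("chr1", 3000, 3500), ("chr1", 1000, 1200), ("chr2", 500, 900)]


def write_peaks_bed(rows, path):
    """Write peaks as a headerless, tab-separated file sorted by position."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(sorted(rows))
    return path


peaks_path = write_peaks_bed(peaks, os.path.join(tempfile.mkdtemp(), "peaks.bed"))

# With pycallingcards and bedtools installed, the path could then be used as:
# import pycallingcards as cc
# peak_annotation = cc.pp.annotation(
#     peaks_path=peaks_path,
#     reference="mm10",
#     save_annotation="peak_annotation.txt",  # hypothetical output file name
# )
```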