pycallingcards.preprocessing.annotation

pycallingcards.preprocessing.annotation(peaks_frame=None, peaks_path=None, reference='hg38', save_annotation=None, bedtools_path=None, refGene=None)

Annotate the peak data using bedtools [Quinlan and Hall, 2010].

Parameters:
  • peaks_frame (Optional[DataFrame] (default: None)) – pd.DataFrame whose first three columns are chromosome, start and end. Ignored if peaks_path is provided.

  • peaks_path (Optional[str] (default: None)) – The path to the peak data. Because this function calls an external program, peaks_path is preferred over peaks_frame.

  • reference (Optional[Literal['hg38', 'mm10', 'sacCer3']] (default: 'hg38')) – Reference genome for the annotation data. Currently, only 'hg38', 'mm10' and 'sacCer3' are provided.

  • save_annotation (Optional[str] (default: None)) – The path and file name under which the annotation results are saved.

  • bedtools_path (Optional[str] (default: None)) – Path to bedtools. If None, the default bedtools path is used.

  • refGene (Optional[DataFrame] (default: None)) – If None, the saved reference annotation for the given reference is used. If a DataFrame is passed, it is used instead. The DataFrame should follow the same format as https://github.com/The-Mitra-Lab/pycallingcards_data/releases/download/data/refGene.hg38.Sorted.bed, which contains 6 columns in total (Chrom, Start, End, Refseq, Name, Direction).
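
As an illustration of the peaks_frame and refGene inputs described above, the following minimal sketch builds both tables with pandas. All coordinates, RefSeq IDs and gene names are placeholders invented for the example. Sorting the annotation table is an assumption based on the "Sorted" in the reference file name, and the documentation does not state whether particular column names are required. The final call is left commented out because it needs pycallingcards and bedtools to be installed.

```python
import pandas as pd

# Hypothetical peaks: the first three columns are chromosome, start and end.
peaks = pd.DataFrame(
    {
        "Chr": ["chr1", "chr1", "chr2"],
        "Start": [1000, 5000, 200],
        "End": [1500, 5600, 900],
    }
)

# Hypothetical custom annotation table with the six documented columns.
# The RefSeq IDs and gene names are placeholders, not real records.
custom_refgene = pd.DataFrame(
    [
        ["chr1", 7000, 9000, "NM_000002", "GeneB", "-"],
        ["chr1", 900, 2000, "NM_000001", "GeneA", "+"],
        ["chr2", 100, 1200, "NM_000003", "GeneC", "+"],
    ],
    columns=["Chrom", "Start", "End", "Refseq", "Name", "Direction"],
)

# Assumption: the table should be sorted by chromosome and start,
# mirroring the "Sorted" reference file.
custom_refgene = custom_refgene.sort_values(["Chrom", "Start"]).reset_index(drop=True)

# Requires pycallingcards and bedtools, so it is not executed here:
# import pycallingcards as cc
# annotated = cc.pp.annotation(peaks_frame=peaks, refGene=custom_refgene)
```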

Returns:

pd.DataFrame whose first three columns are chromosome, start and end, followed by the peak annotation columns.

Chr - The chromosome of the peak.
Start - The start point of the peak.
End - The end point of the peak.
Nearest Refseq1 - The Refseq of the closest gene.
Nearest Refseq2 - The Refseq of the second closest gene.
Gene Name1 - The name of the closest gene.
Gene Name2 - The name of the second closest gene.
Return type:

DataFrame
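
A short sketch of how the returned table might be queried. The DataFrame below is a hand-built mock that only borrows the column names listed above. Its values are invented, and the real output may order the columns differently or include more of them.

```python
import pandas as pd

# Mock of the documented return layout (values are placeholders).
annotated = pd.DataFrame(
    {
        "Chr": ["chr1", "chr1", "chr2"],
        "Start": [1000, 5000, 200],
        "End": [1500, 5600, 900],
        "Nearest Refseq1": ["NM_000001", "NM_000002", "NM_000003"],
        "Nearest Refseq2": ["NM_000002", "NM_000001", "NM_000004"],
        "Gene Name1": ["GeneA", "GeneB", "GeneC"],
        "Gene Name2": ["GeneB", "GeneA", "GeneD"],
    }
)

# Select peaks whose closest or second closest gene is "GeneA".
gene = "GeneA"
hits = annotated[(annotated["Gene Name1"] == gene) | (annotated["Gene Name2"] == gene)]
```

Checking both gene columns matters because a peak lying between two genes is reported with both neighbours.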

Example:

>>> import pycallingcards as cc
>>> qbed_data = cc.datasets.mousecortex_data(data="qbed")
>>> peak_data = cc.pp.callpeaks(qbed_data, method = "CCcaller", reference = "mm10", record = True)
>>> peak_annotation = cc.pp.annotation(peak_data, reference = "mm10")
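
Since peaks_path is preferred over peaks_frame, peaks held in memory can be written to disk first. The sketch below assumes that a tab-separated, headerless, BED-like file is an acceptable input. The documentation does not specify the file format, so verify this against your installation. The file name and peak values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical peaks with chromosome, start and end as the first three columns.
peaks = pd.DataFrame(
    {"Chr": ["chr1", "chr2"], "Start": [1000, 200], "End": [1500, 900]}
)

# Write a tab-separated file without header or index (assumed BED-like layout).
tmp_dir = tempfile.mkdtemp()
peaks_path = os.path.join(tmp_dir, "peaks.bed")
peaks.to_csv(peaks_path, sep="\t", header=False, index=False)

# Read the file back to confirm the three columns survived the round trip.
check = pd.read_csv(peaks_path, sep="\t", header=None)

# Requires pycallingcards and bedtools, so it is not executed here:
# import pycallingcards as cc
# peak_annotation = cc.pp.annotation(peaks_path=peaks_path, reference="mm10")
```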