pycallingcards.preprocessing.call_peaks

pycallingcards.preprocessing.call_peaks(expdata, background=None, method='CCcaller', reference='hg38', pvalue_cutoff=0.0001, pvalue_cutoffbg=0.0001, pvalue_cutoffTTAA=1e-05, pvalue_adj_cutoff=None, min_insertions=5, minlen=0, extend=200, maxbetween=2000, minnum=0, test_method='poisson', window_size=1500, lam_win_size=100000, step_size=500, pseudocounts=0.2, min_length=None, max_length=None, record=True, save=None)

Call peaks from qbed data.

Parameters:
  • expdata (DataFrame) – pd.DataFrame with the first three columns as chromosome, start and end.

  • background (Optional[DataFrame] (default: None)) – pd.DataFrame with the first three columns as chromosome, start and end. The default, None, is for the background-free situation.

  • method (Optional[Literal['CCcaller', 'MACCs', 'Blockify']] (default: 'CCcaller')) – ‘CCcaller’ considers the maximum distance between insertions in the data. ‘MACCs’ uses an idea adapted from [Zhang et al., 2008]. ‘Blockify’ uses the method from [Moudgil et al., 2020].

  • reference (Optional[Literal['hg38', 'mm10', 'sacCer3']] (default: 'hg38')) – We currently have ‘hg38’ for human data, ‘mm10’ for mouse data and ‘sacCer3’ for yeast data.

  • pvalue_cutoff (float (default: 0.0001)) – The P-value cutoff for the background-free situation.

  • pvalue_cutoffbg (float (default: 0.0001)) – The P-value cutoff against the background data when a background exists.

  • pvalue_cutoffTTAA (float (default: 1e-05)) – The P-value cutoff against the reference data when a background exists. It is recommended that pvalue_cutoffTTAA be lower than pvalue_cutoffbg.

  • pvalue_adj_cutoff (Optional[float] (default: None)) – The cutoff for the adjusted P-value. If None, the cutoff is the same as pvalue_cutoff (background-free) or pvalue_cutoffTTAA (with background).

  • min_insertions (int (default: 5)) – The minimum number of insertions for each peak.

  • minlen (int (default: 0)) – Valid only for method = ‘CCcaller’. The minimum length of a peak before extension.

  • extend (int (default: 200)) – Valid for method = ‘CCcaller’ and ‘MACCs’. The length (bp) by which peaks are extended on both sides.

  • maxbetween (int (default: 2000)) – Valid only for method = ‘CCcaller’. The maximum distance between neighboring positions within one peak.

  • minnum (int (default: 0)) – Valid only for method = ‘CCcaller’. The minimum number of insertions at a neighboring position.

  • test_method (Optional[Literal['poisson', 'binomial']] (default: 'poisson')) – The statistical test used for hypothesis testing.

  • window_size (int (default: 1500)) – Valid only for method = ‘MACCs’. The length of the window used to search for peaks.

  • lam_win_size (Optional[int] (default: 100000)) – Valid for method = ‘CCcaller’ and ‘MACCs’. The length of the area around a peak considered when calling peaks.

  • step_size (int (default: 500)) – Valid only for ‘MACCs’. The length of each step.

  • pseudocounts (float (default: 0.2)) – The number of pseudocounts added for the hypothesis test.

  • min_length (Optional[int] (default: None)) – Valid for method = ‘Blockify’. The minimum length of a peak.

  • max_length (Optional[int] (default: None)) – Valid for method = ‘Blockify’. The maximum length of a peak.

  • record (bool (default: True)) – Controls if information is recorded. If False, the output would only have three columns: Chromosome, Start, End.

  • save (Optional[str] (default: None)) – The name of the file in which the result is saved.
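
The interplay of maxbetween, extend and min_insertions can be illustrated with a minimal sketch. It is not the library's implementation: the helper name group_insertions, the tuple input format and the exact chaining rule are assumptions made only for illustration. Insertions on the same chromosome are chained while the gap to the previous one is at most maxbetween, each chain is padded by extend on both sides, and chains with fewer than min_insertions insertions are dropped.

```python
from itertools import groupby


def group_insertions(rows, maxbetween=2000, extend=200, min_insertions=5):
    """Hypothetical helper (not part of pycallingcards).

    rows: iterable of (chromosome, start, end) tuples, one per insertion.
    Returns a list of (chromosome, peak_start, peak_end, n_insertions).
    """
    peaks = []
    # Only the start coordinate is used in this sketch.
    ordered = sorted((chrom, start) for chrom, start, _end in rows)
    for chrom, items in groupby(ordered, key=lambda r: r[0]):
        block = []
        for _, pos in items:
            if block and pos - block[-1] > maxbetween:
                # Gap too large: close the current block and start a new one.
                peaks.append((chrom, block))
                block = []
            block.append(pos)
        peaks.append((chrom, block))
    return [
        (chrom, max(block[0] - extend, 0), block[-1] + extend, len(block))
        for chrom, block in peaks
        if len(block) >= min_insertions
    ]


# Toy coordinates, invented for illustration only.
toy_rows = [
    ("chr1", 100, 101), ("chr1", 150, 151), ("chr1", 300, 301),
    ("chr1", 5000, 5001), ("chr1", 5100, 5101), ("chr2", 40, 41),
]
peaks = group_insertions(toy_rows, maxbetween=500, extend=10, min_insertions=3)
print(peaks)  # [('chr1', 90, 310, 3)]
```

For a qbed DataFrame, rows in this shape could be obtained with expdata.iloc[:, :3].itertuples(index=False, name=None), since the first three columns are chromosome, start and end.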

Returns:
Chr - The chromosome of the peak.
Start - The start point of the peak.
End - The end point of the peak.
Experiment Insertions - The total number of insertions within a peak in the experiment data.
Reference Insertions - The total number of insertions within a peak in the reference data.
Background insertions - The total number of insertions within a peak in the background data.
Expected Insertions - The total number of expected insertions under the null hypothesis from the reference data (in a background-free situation).
Expected Insertions background - The total number of expected insertions under the null hypothesis from the background data (in a background situation).
Expected Insertions Reference - The total number of expected insertions under the null hypothesis from the reference data (in a background situation).
pvalue - The P-value calculated under the null hypothesis (in a background-free situation or with method = ‘Blockify’).
pvalue Reference - The P-value calculated under the null hypothesis from the reference data (in a background situation).
pvalue Background - The P-value calculated under the null hypothesis from the background data (in a background situation).
Fraction Experiment - The fraction of insertions in the experiment data.
TPH Experiment - Transpositions per hundred million insertions in the experiment data (for both mammalian references and sacCer3).
Fraction Background - The fraction of insertions in the background data.
TPH Background - Transpositions per hundred million insertions in the background data (for both mammalian references and sacCer3).
TPH Background subtracted - The difference between TPH Experiment and TPH Background.
Return type:

DataFrame
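
To make the relationship between the Expected Insertions and pvalue columns more concrete, here is a rough sketch of a one-sided Poisson test with pseudocounts. The scaling formula for the expected count, the helper names poisson_sf and peak_pvalue, and the toy numbers are all assumptions; the library's exact computation may differ.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    # pmf(k), computed in log space to avoid overflow in lam**k / k!.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while True:
        total += term
        i += 1
        term *= lam / i  # pmf(i) from pmf(i - 1)
        if i > lam and term <= total * 1e-16:
            break
    return min(total, 1.0)


def peak_pvalue(exp_in_peak, ref_in_peak, exp_total, ref_total, pseudocounts=0.2):
    """Hypothetical helper: expected count and P-value for one peak.

    Assumption: the expected count is the reference count in the peak plus
    pseudocounts, rescaled by the ratio of total experiment insertions to
    total reference insertions. pseudocounts must be positive here.
    """
    expected = (ref_in_peak + pseudocounts) * exp_total / ref_total
    return expected, poisson_sf(exp_in_peak, expected)


# Toy numbers, invented for illustration only.
expected, pvalue = peak_pvalue(exp_in_peak=10, ref_in_peak=2,
                               exp_total=1000, ref_total=1000)
print(expected, pvalue)
```

Under these assumptions, a peak would be kept when its P-value falls below pvalue_cutoff in the background-free case.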

Examples:

>>> import pycallingcards as cc
>>> qbed_data = cc.datasets.mousecortex_data(data="qbed")
>>> peak_data = cc.pp.call_peaks(qbed_data, method="CCcaller", reference="mm10", maxbetween=2000, pvalue_cutoff=0.01, pseudocounts=1, record=True)
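
Because the result is an ordinary DataFrame, it can be post-processed with standard pandas operations. The sketch below uses a small hand-made table in place of a real call_peaks result; the values are invented, and only the column names Chr, Start, End, Experiment Insertions and pvalue are taken from the Returns list above.

```python
import pandas as pd

# Stand-in for a call_peaks result; all values are invented for illustration.
peak_data = pd.DataFrame({
    "Chr": ["chr1", "chr1", "chr2"],
    "Start": [1000, 50000, 3000],
    "End": [2400, 51200, 4100],
    "Experiment Insertions": [12, 6, 30],
    "pvalue": [1e-6, 5e-3, 1e-12],
})

# Keep well-supported peaks and order them by significance.
strong = (
    peak_data[(peak_data["pvalue"] < 1e-4)
              & (peak_data["Experiment Insertions"] >= 10)]
    .sort_values("pvalue")
    .reset_index(drop=True)
)
strong["Length"] = strong["End"] - strong["Start"]
print(strong)
```

The thresholds used here are arbitrary; in practice they would be chosen to match the cutoffs passed to call_peaks.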