Tutorial: Yeast TYE7 bulk calling cards data#

In this tutorial, we analyze a calling card experiment that mapped the binding of the yeast transcription factor TYE7. The dataset is from Shively CA, PNAS. (2020) and can be downloaded from GEO. One of the main findings of that paper was that Tye7p cooperatively binds with the transcription factor Gcr2p. Here we will analyze Tye7p binding in wild-type and gcr2 KO cells.

In this tutorial, we will call peaks, annotate them, and perform a differential peak analysis.

[1]:
import pycallingcards as cc
import numpy as np
import pandas as pd
import scanpy as sc
from matplotlib import pyplot as plt
plt.rcParams['figure.dpi'] = 150

We start by reading in a abed datafile. Because this data was published several years ago, the data is in the old format: one row is one insertion and columns indicate the chromosome, start point and read number. Thus, we first convert it to the new file format.

Use cc.rd.read_qbed(filename) to read your own qbed data.

[2]:
TYE7 = cc.rd.read_qbed("https://github.com/The-Mitra-Lab/pycallingcards_data/releases/download/data/GSM3946397_TEF1p_TYE7_ALL.gnashy.txt")
TYE7
[2]:
Chr Start End Reads Direction Barcodes
0 1 1501 111 NaN NaN NaN
1 1 1502 1 NaN NaN NaN
2 1 5320 1 NaN NaN NaN
3 1 5544 199 NaN NaN NaN
4 1 32962 1 NaN NaN NaN
... ... ... ... ... ... ...
11450 16 928369 1 NaN NaN NaN
11451 16 928381 14 NaN NaN NaN
11452 16 928555 34 NaN NaN NaN
11453 16 930647 330 NaN NaN NaN
11454 16 939536 22 NaN NaN NaN

11455 rows × 6 columns

[3]:
TYE7_gcr2ko = cc.rd.read_qbed("https://github.com/The-Mitra-Lab/pycallingcards_data/releases/download/data/GSM3946398_TEF1p_TYE7_gcr2ko_ALL.gnashy.txt")
TYE7_gcr2ko
[3]:
Chr Start End Reads Direction Barcodes
0 1 3108 1 NaN NaN NaN
1 1 8064 9 NaN NaN NaN
2 1 30957 2 NaN NaN NaN
3 1 30989 1 NaN NaN NaN
4 1 31530 1 NaN NaN NaN
... ... ... ... ... ... ...
12283 16 930468 2 NaN NaN NaN
12284 16 930698 13 NaN NaN NaN
12285 16 936191 4 NaN NaN NaN
12286 16 938479 3 NaN NaN NaN
12287 16 941830 116 NaN NaN NaN

12288 rows × 6 columns

Define a function to transfer data to new version.

[4]:
def transfer(data):
    for i in range(len(data)):
        data.iloc[i,3] = data.iloc[i,2]
        data.iloc[i,2] = data.iloc[i,1] + 1
        num = data.iloc[i,0]
        if num == 1:
            data.iloc[i,0] = 'chrI'
        elif num == 2:
            data.iloc[i,0] = 'chrII'
        elif num == 3:
            data.iloc[i,0] = 'chrIII'
        elif num == 4:
            data.iloc[i,0] = 'chrIV'
        elif num == 5:
            data.iloc[i,0] = 'chrV'
        elif num == 6:
            data.iloc[i,0] = 'chrVI'
        elif num == 7:
            data.iloc[i,0] = 'chrVII'
        elif num == 8:
            data.iloc[i,0] = 'chrVIII'
        elif num == 9:
            data.iloc[i,0] = 'chrIX'
        elif num == 10:
            data.iloc[i,0] = 'chrX'
        elif num == 11:
            data.iloc[i,0] = 'chrXI'
        elif num == 12:
            data.iloc[i,0] = 'chrXII'
        elif num == 13:
            data.iloc[i,0] = 'chrXIII'
        elif num == 14:
            data.iloc[i,0] = 'chrXIV'
        elif num == 15:
            data.iloc[i,0] = 'chrXV'
        elif num == 16:
            data.iloc[i,0] = 'chrXVI'
    return data
[5]:
TYE7 = transfer(TYE7)
TYE7['group'] = 'TYE7'
TYE7
[5]:
Chr Start End Reads Direction Barcodes group
0 chrI 1501 1502 111.0 NaN NaN TYE7
1 chrI 1502 1503 1.0 NaN NaN TYE7
2 chrI 5320 5321 1.0 NaN NaN TYE7
3 chrI 5544 5545 199.0 NaN NaN TYE7
4 chrI 32962 32963 1.0 NaN NaN TYE7
... ... ... ... ... ... ... ...
11450 chrXVI 928369 928370 1.0 NaN NaN TYE7
11451 chrXVI 928381 928382 14.0 NaN NaN TYE7
11452 chrXVI 928555 928556 34.0 NaN NaN TYE7
11453 chrXVI 930647 930648 330.0 NaN NaN TYE7
11454 chrXVI 939536 939537 22.0 NaN NaN TYE7

11455 rows × 7 columns

[6]:
TYE7_gcr2ko = transfer(TYE7_gcr2ko)
TYE7_gcr2ko['group'] = 'TYE7_gcr2ko'
TYE7_gcr2ko
[6]:
Chr Start End Reads Direction Barcodes group
0 chrI 3108 3109 1.0 NaN NaN TYE7_gcr2ko
1 chrI 8064 8065 9.0 NaN NaN TYE7_gcr2ko
2 chrI 30957 30958 2.0 NaN NaN TYE7_gcr2ko
3 chrI 30989 30990 1.0 NaN NaN TYE7_gcr2ko
4 chrI 31530 31531 1.0 NaN NaN TYE7_gcr2ko
... ... ... ... ... ... ... ...
12283 chrXVI 930468 930469 2.0 NaN NaN TYE7_gcr2ko
12284 chrXVI 930698 930699 13.0 NaN NaN TYE7_gcr2ko
12285 chrXVI 936191 936192 4.0 NaN NaN TYE7_gcr2ko
12286 chrXVI 938479 938480 3.0 NaN NaN TYE7_gcr2ko
12287 chrXVI 941830 941831 116.0 NaN NaN TYE7_gcr2ko

12288 rows × 7 columns

In order to call differential peaks, we conbine two qbed files together.

[7]:
qbed_data = cc.rd.combine_qbed([TYE7, TYE7_gcr2ko])
qbed_data
[7]:
Chr Start End Reads Direction Barcodes group
0 chrI 1501 1502 111.0 NaN NaN TYE7
1 chrI 1502 1503 1.0 NaN NaN TYE7
2 chrI 1557 1558 304.0 NaN NaN TYE7_gcr2ko
3 chrI 1687 1688 8.0 NaN NaN TYE7_gcr2ko
4 chrI 1690 1691 7.0 NaN NaN TYE7_gcr2ko
... ... ... ... ... ... ... ...
23738 chrXVI 939075 939076 120.0 NaN NaN TYE7_gcr2ko
23739 chrXVI 939536 939537 22.0 NaN NaN TYE7
23740 chrXVI 940225 940226 1.0 NaN NaN TYE7_gcr2ko
23741 chrXVI 941830 941831 116.0 NaN NaN TYE7_gcr2ko
23742 chrXVI 943167 943168 170.0 NaN NaN TYE7

23743 rows × 7 columns

Because insertions are discrete, we now need to call peaks to deduce potential binding sites. Three different methods (CCcaller, MACCs, Blockify) are available along with three different species (hg38, mm10, sacCer3).

Because of the particularity of yeast calling cards data, we use MACCs in yeast(‘sacCer3’) data. window_size is the most important parameter for MACCs, it is highly related to the length of a peak. 100-200 is a good fit for window_size. step_size is another important paramenter and it controls how careful we are at looking into the chromosomes. 30-100 is good for step_size. pvalue_cutoff is also an important parameter and numbers from 0.0001 to 0.01 are strongly advised. pseudocounts is advised to be 0.1-1.

[8]:
peak_data = cc.pp.call_peaks(qbed_data, method = "MACCs", reference = "sacCer3", extend = 50, window_size = 125,
                             step_size = 60, pvalue_cutoff = 0.0001, lam_win_size = None, pseudocounts = 1,
                             record = True, save = "peak.bed")
peak_data
For the MACCS method without background, [expdata, reference, pvalue_cutoff, lam_win_size, window_size, step_size, extend, pseudocounts, test_method, min_insertions, record] would be utilized.
100%|██████████| 16/16 [00:00<00:00, 19.97it/s]
[8]:
Chr Start End Center pvalue Experiment Insertions Reference Insertions Fraction Experiment TPH Experiment Expect insertions pvalue_adj
0 chrI 68120 68335 68249.0 1.607677e-09 27 25 0.001137 1.137177e+05 7.429007 1.040712e-06
1 chrI 71058 71450 71268.0 0.000000e+00 57 27 0.002401 2.400708e+05 7.943327 0.000000e+00
2 chrI 229758 229966 229890.0 1.183165e-12 43 45 0.001811 1.811060e+05 12.572212 8.633875e-10
3 chrII 29937 30481 30158.5 0.000000e+00 71 0 0.002990 2.990355e+05 1.252814 0.000000e+00
4 chrII 221245 221577 221472.0 0.000000e+00 86 8 0.003622 3.622120e+05 3.022509 0.000000e+00
... ... ... ... ... ... ... ... ... ... ... ...
64 chrXV 987325 988055 987562.0 0.000000e+00 224 64 0.009434 9.434360e+05 20.135194 0.000000e+00
65 chrXV 988153 988367 988290.0 1.295000e-07 24 22 0.001011 1.010824e+05 7.577723 7.532581e-05
66 chrXVI 411349 411915 411592.0 0.000000e+00 110 34 0.004633 4.632944e+05 11.167010 0.000000e+00
67 chrXVI 412048 412354 412177.0 0.000000e+00 75 19 0.003159 3.158826e+05 6.681564 0.000000e+00
68 chrXVI 855654 856341 855836.0 0.000000e+00 623 12 0.026239 2.623931e+06 4.588356 0.000000e+00

69 rows × 11 columns

In order to choose the suitable method and parameters for peak calling, please take a look at genome areas. We stongly advise to adjust the parameters for cc.pp.call_peaks() to call better peaks.

In this plot, the colored ones are the experiment qbed data and the middle section is the distribution of insertions. The bottom section represents reference genes and peaks.

In yeast, one peak might be divided into two and the true binding site is in the middle space. Thus, please carefully modify the window_size and step_size parameters.

[9]:
cc.pl.draw_area("chrXVI", 860297, 861116, 8000, peak_data, qbed_data, "sacCer3", font_size=2,
                figsize = (30,8), peak_line = 1, save = False, plotsize = [1,1,5], example_length = 1000)
../../_images/tutorials_notebooks_yeast_16_0.png
[10]:
cc.pl.draw_area("chrI", 68926, 67722, 6000, peak_data, qbed_data, "sacCer3", font_size=2,
                figsize = (30,6), peak_line = 1, save = False, plotsize = [1,1,4], example_length = 1000)
../../_images/tutorials_notebooks_yeast_17_0.png

We can also visualize our data directly in the WashU Epigenome Browser. This can be useful for overlaying your data with other published datasets. Please note that this link only valid for 24hrs, so you will have to rerun it if you want to use it after this time period.

For bedgraph file, it is aimed to show the distribution of qbed data. Please set the the Aggregate method of bedgraph file to count or sum and fix the Y-axis if you want to compare several datasets.

[11]:
qbed= {"TYE7":TYE7, "TYE7_gcr2ko":TYE7_gcr2ko}
bed = {'PEAK':peak_data}
cc.pl.WashU_browser_url(qbed, bed, genome = 'sacCer3')
All qbed addressed
All bed addressed
Uploading files
Please click the following link to see the data on WashU Epigenome Browser directly.
https://epigenomegateway.wustl.edu/browser/?genome=sacCer3&hub=https://companion.epigenomegateway.org//task/6480d2f42f1177624560a55e7d51258f/output//datahub.json

Pycallingcards can be used to visual peak locations acorss the genome to see that the distribution of peaks is unbiased and that all chromosomes are represented.

[12]:
cc.pl.whole_peaks(peak_data, reference = 'sacCer3', height_scale = 1.7)
../../_images/tutorials_notebooks_yeast_21_0.png

In the next step, we annotate the peaks by their closest genes using bedtools and pybedtools. Make sure they are all previously installed before using.

[13]:
peak_annotation = cc.pp.annotation(peak_data, reference = 'sacCer3')
peak_annotation = cc.pp.combine_annotation(peak_data, peak_annotation)
peak_annotation
In the bedtools method, we would use bedtools in the default path. Set bedtools path by 'bedtools_path' if needed.
[13]:
Chr Start End Center pvalue Experiment Insertions Reference Insertions Fraction Experiment TPH Experiment Expect insertions pvalue_adj Nearest Refseq1 Gene Name1 Direction1 Distance1 Nearest Refseq2 Gene Name2 Direction2 Distance2
0 chrI 68120 68335 68249.0 1.607677e-09 27 25 0.001137 1.137177e+05 7.429007 1.040712e-06 S000000037 CYC3 - 382 S000000038 CLN3 - -601
1 chrI 71058 71450 71268.0 0.000000e+00 57 27 0.002401 2.400708e+05 7.943327 0.000000e+00 S000000036 CDC19 + 337 S000000037 CYC3 - -1534
2 chrI 229758 229966 229890.0 1.183165e-12 43 45 0.001811 1.811060e+05 12.572212 8.633875e-10 S000000094 PHO11 + -2896 S000000084 FLO1 + -21743
3 chrII 29937 30481 30158.5 0.000000e+00 71 0 0.002990 2.990355e+05 1.252814 0.000000e+00 S000000197 ECM21 - -1639 S000000198 SFT2 + -5193
4 chrII 221245 221577 221472.0 0.000000e+00 86 8 0.003622 3.622120e+05 3.022509 0.000000e+00 S000000101 PDR3 + -846 S000000102 LDB7 - -4117
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
64 chrXV 987325 988055 987562.0 0.000000e+00 224 64 0.009434 9.434360e+05 20.135194 0.000000e+00 S000005875 PUT4 - 0 S000005874 PYK2 - -864
65 chrXV 988153 988367 988290.0 1.295000e-07 24 22 0.001011 1.010824e+05 7.577723 7.532581e-05 S000005875 PUT4 - 0 S000005876 CIN1 + 1423
66 chrXVI 411349 411915 411592.0 0.000000e+00 110 34 0.004633 4.632944e+05 11.167010 0.000000e+00 S000005997 GPI2 + -65 S000005996 GCR1 + 340
67 chrXVI 412048 412354 412177.0 0.000000e+00 75 19 0.003159 3.158826e+05 6.681564 0.000000e+00 S000005996 GCR1 + 0 S000005997 GPI2 + -764
68 chrXVI 855654 856341 855836.0 0.000000e+00 623 12 0.026239 2.623931e+06 4.588356 0.000000e+00 S000006363 KRE6 + 1243 S000006364 GPH1 + 4966

69 rows × 19 columns

Use qbed data, peak data and group names to make a group by peak anndata object.

[14]:
adata_cc = cc.pp.make_Anndata(qbed_data, peak_annotation, ['TYE7', 'TYE7_gcr2ko'], key = 'group', reference = "sacCer3")
adata_cc
100%|██████████| 15/15 [00:00<00:00, 151.88it/s]
[14]:
AnnData object with n_obs × n_vars = 2 × 69
    var: 'Chr', 'Start', 'End', 'Center', 'pvalue', 'Experiment Insertions', 'Reference Insertions', 'Fraction Experiment', 'TPH Experiment', 'Expect insertions', 'pvalue_adj', 'Nearest Refseq1', 'Gene Name1', 'Direction1', 'Distance1', 'Nearest Refseq2', 'Gene Name2', 'Direction2', 'Distance2'

Differential peak analysis would find out the significant binding for each group. In this example, we use fisher exact test to find out.

[15]:
cc.tl.rank_peak_groups(adata_cc, "Index", method = 'fisher_exact', key_added = 'fisher_exact')
100%|██████████| 2/2 [00:00<00:00,  5.13it/s]

Plot the results for differential peak analysis.

Currently, the peaks are ranked by pvalues. It could also be ranked by logfoldchanges by the following codes:

cc.tl.rank_peak_groups(adata_cc,"Index",method = 'fisher_exact',key_added = 'fisher_exact',rankby = 'logfoldchanges')
cc.pl.rank_peak_groups(adata_cc, key = 'fisher_exact',rankby = 'logfoldchanges')
[16]:
cc.pl.rank_peak_groups(adata_cc, key = 'fisher_exact')
../../_images/tutorials_notebooks_yeast_29_0.png

Then, we take a look at the genome for highly differentiated peaks. The colored ones are the insertions for specific cluster and the grey ones are the total insertions information. If we input the backgound file, the grey ones would be the backgound insertions.

[17]:
cc.pl.draw_area("chrVII", 883694, 884861, 4000, peak_data, qbed_data, "sacCer3", adata = adata_cc, font_size=2,
                name = "TYE7", key = "Index", insertionkey = "group", figsize = (30,8), peak_line = 1,
                name_insertion1 = 'TYE7 Insertions', name_density1 = 'TYE7 Insertion Density',
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                bins = 600, example_length = 500, color = "purple", title = "TYE7")
cc.pl.draw_area("chrVII", 883694, 884861, 4000, peak_data, qbed_data, "sacCer3", adata = adata_cc, font_size=2,
                name = "TYE7_gcr2ko", key = "Index", insertionkey = "group", figsize = (30,8), peak_line = 1,
                name_insertion1 = 'TYE7_gcr2ko Insertions', name_density1 = 'TYE7_gcr2ko Insertion Density',
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                bins = 600, example_length = 500, title = "TYE7_gcr2ko")
../../_images/tutorials_notebooks_yeast_31_0.png
../../_images/tutorials_notebooks_yeast_31_1.png
[18]:
cc.pl.draw_area("chrII", 613790, 616171, 1000, peak_data, qbed_data, "sacCer3",adata = adata_cc, font_size = 2,
                name = "TYE7", key = "Index", insertionkey = "group", figsize = (30,8), peak_line = 2,
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'TYE7 Insertions', name_density1 = 'TYE7 Insertion Density',
                bins = 600, example_length = 500, color = "purple", title = "TYE7")
cc.pl.draw_area("chrII", 613790, 616171, 1000, peak_data, qbed_data, "sacCer3", adata = adata_cc, font_size = 2,
                name = "TYE7_gcr2ko", key = "Index", insertionkey = "group", figsize = (30,8), peak_line = 2,
                name_insertion1 = 'TYE7_gcr2ko Insertions', name_density1 = 'TYE7_gcr2ko Insertion Density',
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                bins = 600, example_length = 500, title = "TYE7_gcr2ko")
../../_images/tutorials_notebooks_yeast_32_0.png
../../_images/tutorials_notebooks_yeast_32_1.png

Plot the volcano plot for differential binding sites.

[19]:
cc.pl.volcano_plot(adata_cc, pvalue_cutoff = 0.05, lfc_cutoff = 2, labelright = (5,70), labelleft = (-9,70))
../../_images/tutorials_notebooks_yeast_34_0.png
[20]:
cc.tl.rank_peak_groups_df(adata_cc,'fisher_exact', logfc_min = 3, pval_cutoff = 0.05)
[20]:
names logfoldchanges pvalues pvalues_adj number number_rest total total_rest group
0 chrVII_883857_884257 3.606330 1.927126e-70 1.329717e-68 441 24 5031 3335 TYE7
1 chrX_454688_455338 3.949230 2.108841e-51 7.275501e-50 303 13 5031 3335 TYE7
2 chrII_613940_614505 5.382557 4.503557e-38 7.768636e-37 189 3 5031 3335 TYE7
3 chrV_544670_545587 3.728325 9.272983e-33 1.279672e-31 200 10 5031 3335 TYE7
4 chrVIII_450860_451208 6.446487 2.321509e-28 2.669735e-27 132 1 5031 3335 TYE7
5 chrXVI_411349_411915 4.133628 5.321212e-19 3.671637e-18 106 4 5031 3335 TYE7
6 chrII_614793_615361 4.959104 1.137108e-18 7.132768e-18 94 2 5031 3335 TYE7
7 chrXVI_412048_412354 13.863855 2.628075e-17 1.208914e-16 75 0 5031 3335 TYE7
8 chrXV_160747_161147 3.246710 4.388851e-07 1.164734e-06 43 3 5031 3335 TYE7
9 chrXI_164749_164897 3.724329 4.839595e-04 8.562361e-04 20 1 5031 3335 TYE7
10 chrIV_215708_216063 3.480023 3.447024e-20 2.642718e-19 74 10 3335 5031 TYE7_gcr2ko
11 chrVII_625218_625597 3.177404 7.027647e-14 3.030673e-13 54 9 3335 5031 TYE7_gcr2ko
12 chrII_533304_533666 3.800928 2.132744e-11 7.745227e-11 37 4 3335 5031 TYE7_gcr2ko
13 chrXIV_301779_302150 4.589684 3.215739e-11 1.109430e-10 32 2 3335 5031 TYE7_gcr2ko
14 chrXV_304395_304786 3.399200 5.453564e-10 1.710436e-09 35 5 3335 5031 TYE7_gcr2ko
15 chrIV_1270863_1270994 12.035796 2.517398e-06 6.203588e-06 14 0 3335 5031 TYE7_gcr2ko
16 chrII_455444_455788 4.493130 1.004659e-05 2.310716e-05 15 1 3335 5031 TYE7_gcr2ko
17 chrIV_927099_927255 11.550507 1.005197e-04 1.926627e-04 10 0 3335 5031 TYE7_gcr2ko

This is the heatmap for relative calling cards bindings.

[21]:
cc.pl.heatmap(adata_cc, figsize=(15,1))
../../_images/tutorials_notebooks_yeast_37_0.png
[22]:
cc.tl.rank_peak_groups_df(adata_cc,'fisher_exact')
[22]:
names logfoldchanges pvalues pvalues_adj number number_rest total total_rest group
0 chrVII_883857_884257 3.606330 1.927126e-70 1.329717e-68 441 24 5031 3335 TYE7
1 chrX_454688_455338 3.949230 2.108841e-51 7.275501e-50 303 13 5031 3335 TYE7
2 chrVII_999899_1001006 2.295043 6.436514e-40 1.480398e-38 385 52 5031 3335 TYE7
3 chrII_613940_614505 5.382557 4.503557e-38 7.768636e-37 189 3 5031 3335 TYE7
4 chrV_544670_545587 3.728325 9.272983e-33 1.279672e-31 200 10 5031 3335 TYE7
... ... ... ... ... ... ... ... ... ...
133 chrXV_970959_971080 -0.576370 5.822512e-01 6.180820e-01 4 9 3335 5031 TYE7_gcr2ko
134 chrXV_117912_118160 0.136291 6.403453e-01 6.694519e-01 51 70 3335 5031 TYE7_gcr2ko
135 chrXI_381717_381964 0.330000 6.645344e-01 6.843713e-01 10 12 3335 5031 TYE7_gcr2ko
136 chrVIII_549242_549430 -0.313615 6.755711e-01 6.855059e-01 8 15 3335 5031 TYE7_gcr2ko
137 chrVII_567315_567563 0.123640 8.551552e-01 8.551552e-01 13 18 3335 5031 TYE7_gcr2ko

138 rows × 9 columns

[23]:
cc.tl.rank_peak_groups_df(adata_cc,'fisher_exact', logfc_min = 3, pval_cutoff = 0.05)
[23]:
names logfoldchanges pvalues pvalues_adj number number_rest total total_rest group
0 chrVII_883857_884257 3.606330 1.927126e-70 1.329717e-68 441 24 5031 3335 TYE7
1 chrX_454688_455338 3.949230 2.108841e-51 7.275501e-50 303 13 5031 3335 TYE7
2 chrII_613940_614505 5.382557 4.503557e-38 7.768636e-37 189 3 5031 3335 TYE7
3 chrV_544670_545587 3.728325 9.272983e-33 1.279672e-31 200 10 5031 3335 TYE7
4 chrVIII_450860_451208 6.446487 2.321509e-28 2.669735e-27 132 1 5031 3335 TYE7
5 chrXVI_411349_411915 4.133628 5.321212e-19 3.671637e-18 106 4 5031 3335 TYE7
6 chrII_614793_615361 4.959104 1.137108e-18 7.132768e-18 94 2 5031 3335 TYE7
7 chrXVI_412048_412354 13.863855 2.628075e-17 1.208914e-16 75 0 5031 3335 TYE7
8 chrXV_160747_161147 3.246710 4.388851e-07 1.164734e-06 43 3 5031 3335 TYE7
9 chrXI_164749_164897 3.724329 4.839595e-04 8.562361e-04 20 1 5031 3335 TYE7
10 chrIV_215708_216063 3.480023 3.447024e-20 2.642718e-19 74 10 3335 5031 TYE7_gcr2ko
11 chrVII_625218_625597 3.177404 7.027647e-14 3.030673e-13 54 9 3335 5031 TYE7_gcr2ko
12 chrII_533304_533666 3.800928 2.132744e-11 7.745227e-11 37 4 3335 5031 TYE7_gcr2ko
13 chrXIV_301779_302150 4.589684 3.215739e-11 1.109430e-10 32 2 3335 5031 TYE7_gcr2ko
14 chrXV_304395_304786 3.399200 5.453564e-10 1.710436e-09 35 5 3335 5031 TYE7_gcr2ko
15 chrIV_1270863_1270994 12.035796 2.517398e-06 6.203588e-06 14 0 3335 5031 TYE7_gcr2ko
16 chrII_455444_455788 4.493130 1.004659e-05 2.310716e-05 15 1 3335 5031 TYE7_gcr2ko
17 chrIV_927099_927255 11.550507 1.005197e-04 1.926627e-04 10 0 3335 5031 TYE7_gcr2ko
[24]:
adata_cc.var
[24]:
Chr Start End Center pvalue Experiment Insertions Reference Insertions Fraction Experiment TPH Experiment Expect insertions pvalue_adj Nearest Refseq1 Gene Name1 Direction1 Distance1 Nearest Refseq2 Gene Name2 Direction2 Distance2
name
chrIII_137273_137488 chrIII 137273 137488 137384.0 1.040404e-07 19 18 0.000800 8.002359e+04 5.077267 6.140676e-05 S000000605 PGK1 + 259 S000000604 ADP1 - -401
chrIII_50464_50773 chrIII 50464 50773 50622.0 0.000000e+00 57 17 0.002401 2.400708e+05 4.850752 0.000000e+00 S000000545 GLK1 + 66 S000000548 PDI1 - -244
chrIII_68547_68827 chrIII 68547 68827 68644.0 0.000000e+00 59 33 0.002485 2.484943e+05 8.474990 0.000000e+00 S000000534 BIK1 - 0 S000000535 HIS4 - -215
chrII_221245_221577 chrII 221245 221577 221472.0 0.000000e+00 86 8 0.003622 3.622120e+05 3.022509 0.000000e+00 S000000101 PDR3 + -846 S000000102 LDB7 - -4117
chrII_260413_260575 chrII 260413 260575 260489.0 5.870277e-09 22 18 0.000927 9.265889e+04 5.550644 3.681306e-06 S000000215 IPP1 - -2439 S000000214 HHT1 + -3673
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
chrXV_970959_971080 chrXV 970959 971080 971024.5 1.148182e-11 13 0 0.000548 5.475298e+04 1.298987 7.945218e-09 S000005868 RPA190 + -4979 S000005871 TYE7 - 6115
chrXV_987325_988055 chrXV 987325 988055 987562.0 0.000000e+00 224 64 0.009434 9.434360e+05 20.135194 0.000000e+00 S000005875 PUT4 - 0 S000005874 PYK2 - -864
chrXV_988153_988367 chrXV 988153 988367 988290.0 1.295000e-07 24 22 0.001011 1.010824e+05 7.577723 7.532581e-05 S000005875 PUT4 - 0 S000005876 CIN1 + 1423
chrX_337351_338085 chrX 337351 338085 337757.0 0.000000e+00 738 44 0.031083 3.108285e+06 19.316612 0.000000e+00 S000003588 TDH1 + 187 S000003589 PEP8 + -315
chrX_454688_455338 chrX 454688 455338 454954.0 0.000000e+00 316 23 0.013309 1.330919e+06 10.574593 0.000000e+00 S000003769 TDH2 - -8 S000003771 MET3 + 902

69 rows × 19 columns

[25]:
adata_cc.var[adata_cc.var.index.isin(list(cc.tl.rank_peak_groups_df(adata_cc,'fisher_exact', logfc_min = 3, pval_cutoff = 0.05)['names']))]
[25]:
Chr Start End Center pvalue Experiment Insertions Reference Insertions Fraction Experiment TPH Experiment Expect insertions pvalue_adj Nearest Refseq1 Gene Name1 Direction1 Distance1 Nearest Refseq2 Gene Name2 Direction2 Distance2
name
chrII_455444_455788 chrII 455444 455788 455660.5 2.775558e-15 16 0 0.000674 6.738828e+04 1.252814 2.101830e-12 S000000312 AIM3 + 0 S000000311 IML3 - -915
chrII_533304_533666 chrII 533304 533666 533482.5 0.000000e+00 41 11 0.001727 1.726825e+05 3.780949 0.000000e+00 S000000349 ADH5 + 97 S000000347 SUP45 - -1123
chrII_613940_614505 chrII 613940 614505 614184.0 0.000000e+00 192 41 0.008087 8.086594e+05 11.365356 0.000000e+00 S000000400 PGI1 - -41 S000000402 TAF5 - 1623
chrII_614793_615361 chrII 614793 615361 615087.0 0.000000e+00 96 10 0.004043 4.043297e+05 3.528136 0.000000e+00 S000000402 TAF5 - 767 S000000400 PGI1 - -894
chrIV_1270863_1270994 chrIV 1270863 1270994 1270932.5 2.821892e-10 14 4 0.000590 5.896475e+04 1.926398 1.919604e-07 S000002808 URH1 + 70 S000002807 HPT1 + -131
chrIV_215708_216063 chrIV 215708 216063 215952.0 0.000000e+00 84 26 0.003538 3.537885e+05 7.021585 0.000000e+00 S000002297 RGT2 + -67 S000002296 ARF2 + 467
chrIV_927099_927255 chrIV 927099 927255 927176.0 5.238660e-08 10 2 0.000421 4.211768e+04 1.463199 3.185661e-05 S000002639 COX20 - -190 S000002640 HEM1 + 198
chrVIII_450860_451208 chrVIII 450860 451208 451015.0 0.000000e+00 133 22 0.005602 5.601651e+05 5.946786 0.000000e+00 S000001215 SPC97 + -55 S000001217 ENO2 - 120
chrVII_625218_625597 chrVII 625218 625597 625465.0 0.000000e+00 63 8 0.002653 2.653414e+05 4.569200 0.000000e+00 S000003300 ART5 - 0 S000003302 ROM1 + 2210
chrVII_883857_884257 chrVII 883857 884257 884120.0 0.000000e+00 465 41 0.019585 1.958472e+06 19.292152 0.000000e+00 S000003424 TDH3 - -48 S000003425 PDX1 - 253
chrV_544670_545587 chrV 544670 545587 545228.0 0.000000e+00 210 106 0.008845 8.844712e+05 25.204724 0.000000e+00 S000000978 ECM32 + 0 S000000979 BMH1 + 25
chrXIV_301779_302150 chrXIV 301779 302150 301988.0 4.203859e-12 34 25 0.001432 1.432001e+05 8.725072 2.960033e-09 S000005122 RPS3 + 531 S000005124 RHO5 - -1131
chrXI_164749_164897 chrXI 164749 164897 164815.5 1.333711e-12 21 11 0.000884 8.844712e+04 3.292782 9.558659e-10 S000001635 GPM1 - -365 S000001633 MCR1 + 1648
chrXVI_411349_411915 chrXVI 411349 411915 411592.0 0.000000e+00 110 34 0.004633 4.632944e+05 11.167010 0.000000e+00 S000005997 GPI2 + -65 S000005996 GCR1 + 340
chrXVI_412048_412354 chrXVI 412048 412354 412177.0 0.000000e+00 75 19 0.003159 3.158826e+05 6.681564 0.000000e+00 S000005996 GCR1 + 0 S000005997 GPI2 + -764
chrXV_160747_161147 chrXV 160747 161147 160921.0 0.000000e+00 46 23 0.001937 1.937413e+05 7.876710 0.000000e+00 S000005446 ADH1 - -154 S000005444 PHM7 + 1210
chrXV_304395_304786 chrXV 304395 304786 304559.0 0.000000e+00 40 10 0.001685 1.684707e+05 3.989874 0.000000e+00 S000005372 HTZ1 - -413 S000005371 PLB3 + 564
chrX_454688_455338 chrX 454688 455338 454954.0 0.000000e+00 316 23 0.013309 1.330919e+06 10.574593 0.000000e+00 S000003769 TDH2 - -8 S000003771 MET3 + 902

In some cases, the peaks are divided into two and the gap might be acture binding sites. This often happens in yeast data because insertions can bind anywhere, but it rarely happens in human/mouse data, because it can only bind in TTAA sites.

We could perform a footprint analysis on yeast data.

[26]:
adata_cc.var = cc.tl.footprint(adata_cc.var, qbed_data)
adata_cc.var
[26]:
Chr Start End Center pvalue Experiment Insertions Reference Insertions Fraction Experiment TPH Experiment Expect insertions ... Gene Name1 Direction1 Distance1 Nearest Refseq2 Gene Name2 Direction2 Distance2 Chr_footprint Start_footprint End_footprint
name
chrIII_137273_137488 chrIII 137273 137488 137384.0 1.040404e-07 19 18 0.000800 8.002359e+04 5.077267 ... PGK1 + 259 S000000604 ADP1 - -401 chrIII 137342 137398
chrIII_50464_50773 chrIII 50464 50773 50622.0 0.000000e+00 57 17 0.002401 2.400708e+05 4.850752 ... GLK1 + 66 S000000548 PDI1 - -244 chrIII 50528 50645
chrIII_68547_68827 chrIII 68547 68827 68644.0 0.000000e+00 59 33 0.002485 2.484943e+05 8.474990 ... BIK1 - 0 S000000535 HIS4 - -215 chrIII 68634 68742
chrII_221245_221577 chrII 221245 221577 221472.0 0.000000e+00 86 8 0.003622 3.622120e+05 3.022509 ... PDR3 + -846 S000000102 LDB7 - -4117 chrII 221449 221557
chrII_260413_260575 chrII 260413 260575 260489.0 5.870277e-09 22 18 0.000927 9.265889e+04 5.550644 ... IPP1 - -2439 S000000214 HHT1 + -3673 chrII 260466 260535
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
chrXV_970959_971080 chrXV 970959 971080 971024.5 1.148182e-11 13 0 0.000548 5.475298e+04 1.298987 ... RPA190 + -4979 S000005871 TYE7 - 6115 chrXV 970959 971080
chrXV_987325_988055 chrXV 987325 988055 987562.0 0.000000e+00 224 64 0.009434 9.434360e+05 20.135194 ... PUT4 - 0 S000005874 PYK2 - -864 chrXV 987523 987900
chrXV_988153_988367 chrXV 988153 988367 988290.0 1.295000e-07 24 22 0.001011 1.010824e+05 7.577723 ... PUT4 - 0 S000005876 CIN1 + 1423 chrXV 988214 988319
chrX_337351_338085 chrX 337351 338085 337757.0 0.000000e+00 738 44 0.031083 3.108285e+06 19.316612 ... TDH1 + 187 S000003589 PEP8 + -315 chrX 337611 337820
chrX_454688_455338 chrX 454688 455338 454954.0 0.000000e+00 316 23 0.013309 1.330919e+06 10.574593 ... TDH2 - -8 S000003771 MET3 + 902 chrX 454946 455020

69 rows × 22 columns

Above is the footprint data saved in adata_cc. Here we do it again and perserve it in bed data.

[27]:
footprint_bed = cc.tl.footprint(peak_data, qbed_data, return_bed = True, delete_unfound = True)
footprint_bed[0:10]
[27]:
Chr_footprint Start_footprint End_footprint
0 chrI 68246 68324
1 chrI 71216 71400
2 chrI 229803 229906
3 chrII 30129 30410
4 chrII 221449 221557
5 chrII 260466 260535
6 chrII 455557 455688
7 chrII 533329 533500
8 chrII 614133 614356
9 chrII 615046 615270

Let’s visulize part of the footprint points.

[28]:
cc.pl.draw_area("chrII", 613790, 616171, 1000, footprint_bed, qbed_data, "sacCer3", font_size=2,
                key = "Index", insertionkey = "group", figsize = (30,6), peak_line = 2, bins = 600,
                example_length = 500, color = "purple")
cc.pl.draw_area("chrX", 337680, 337768, 3000, footprint_bed, qbed_data, "sacCer3", font_size=2,
                key = "Index", insertionkey = "group", figsize = (30,7), peak_line = 1, bins = 600,
                example_length = 500, color = "purple")
../../_images/tutorials_notebooks_yeast_47_0.png
../../_images/tutorials_notebooks_yeast_47_1.png

Saved the file if needed.

[29]:
adata_cc.write("yeast.h5ad")