Tutorial: K562 HCT116 SP1 single-cell calling cards data.#

In this tutorial, we will analyze the binding of the transcription factor Sp1 in K562 and HCT116 cell lines. These data are generated using single-cell calling cards (scCCs). The data is from Moudgil et al., Cell. (2020) and can be downloaded from GEO.

We will cover how to call TF peaks using a background file, annotate these peaks, compare them with Chip-seq reference data, and perform a differential peak analysis.

[1]:
import pycallingcards as cc
import numpy as np
import pandas as pd
import scanpy as sc
from matplotlib import pyplot as plt
plt.rcParams['figure.dpi'] = 150

We start by reading the qbed datafile. In this file, each row represents a Sp1-directed insertion and columns indicate the chromosome, start point and end point, reads number, the direction and cell barcode of each insertion. For example, the first row tells us the first insertion is on Chromosome 1 in a TTAA site located at genomic coordinates 30116 to 30120. There are 12 reads supporting this insertion, which maps to the negative strand of the genome. The cell barcode is CCCAATCCATCGGTTA-1. Note that the cell barcode allows us to connect this insertino to scRNA-seq data collected from the same cell.

Use cc.rd.read_qbed(filename) to read your own ccf data.

[2]:
# read in experiement data
HCT116_SP1 = cc.datasets.SP1_K562HCT116_data(data = "HCT116_SP1_qbed")
HCT116_SP1
[2]:
Chr Start End Reads Direction Barcodes
0 chr1 30116 30120 5 - CCCAATCCATCGGTTA-1
1 chr1 34568 34572 3 - CCTTCGAAGGGCTTCC-1
2 chr1 36736 36740 29 + ACGAGCCGTATAGGTA-1
3 chr1 42447 42451 3 - CTCTACGTCGGAGCAA-1
4 chr1 89697 89701 119 - AGCTCTCGTTTGTTTC-1
... ... ... ... ... ... ...
77205 chrY 25518788 25518792 2 + TGGGCGTTCGAACGGA-1
77206 chrY 56987633 56987637 13 + CAGTCCTAGGCACATG-1
77207 chrY 57080855 57080859 17 + CGGAGCTCATCGACGC-1
77208 chrY 57080855 57080859 7 + GTAACGTAGTTACGGG-1
77209 chrY 57080855 57080859 9 + TCAGCAAGTTGAACTC-1

77210 rows × 6 columns

[3]:
# read in backgound data
HCT116_brd4 = cc.datasets.SP1_K562HCT116_data(data="HCT116_brd4_qbed")
HCT116_brd4
[3]:
Chr Start End Reads Direction Barcodes
0 chr1 89697 89701 14 + TCTGAGACAATGGTCT-1
1 chr1 89697 89701 8 + CAGCGACCAAATACAG-1
2 chr1 203932 203936 99 + TTCTCCTTCTACTTAC-1
3 chr1 204063 204067 5 - TGTTCCGGTGTAAGTA-1
4 chr1 204063 204067 7 - CAAGATCTCGACCAGC-1
... ... ... ... ... ... ...
37769 chrY 18037315 18037319 9 - GCAGTTAAGATCTGAA-1
37770 chrY 24036504 24036508 168 + GCAGTTAAGATCTGAA-1
37771 chrY 24036504 24036508 508 + CATATGGCAGCCAGAA-1
37772 chrY 25633622 25633626 13 - GCAGTTAAGATCTGAA-1
37773 chrY 25633622 25633626 32 - CATATGGCAGCCAGAA-1

37774 rows × 6 columns

We first need to call peaks in order to find candidate SP1 binding sites. There are three different methods (CCcaller, MACCs, Blockify) available in Pycallingcards. Here, we will use MACCs to call peaks. The appropriate reference genome for these data is human(‘hg38’). The window_size parameter is the most important parameter for MACCs, it is highly related to the length of a peak. A value of 1000-2000 is recommended for most sequence-specific TFs. step_size is another important parameter and it controls whether two nearby clusters of insertions are called as one peak or split into two peaks. 500-800 is good for step_size. pvalue_cutoffTTAA is the pvalue cutoff for TTAA data and pvalue_cutoffbg is the pvalue cutoff for the background ccf data. Normally, the setting for pvalue_cutoffbg is considerably higher than pvalue_cutoffTTAA. pvalue_cutoffbg is recommended to be 0.1 and pvalue_cutoffTTAA is recommended from 0.001 to 0.05. pseudocounts is advised to be 0.1-1.

[4]:
peak_data_HCT116 = cc.pp.call_peaks(HCT116_SP1, HCT116_brd4, method = "MACCs", reference = "hg38",  window_size = 2000, step_size = 500,
                  pvalue_cutoffTTAA = 0.001, pvalue_cutoffbg = 0.1, lam_win_size = None,  pseudocounts = 0.1, record = True, save = "peak_HCT116.bed")
peak_data_HCT116
For the MACCs method with background, [expdata, background, reference, pvalue_cutoffbg, pvalue_cutoffTTAA, lam_win_size, window_size, step_size, extend, pseudocounts, test_method, min_insertions, record] would be utilized.
100%|██████████| 25/25 [00:43<00:00,  1.75s/it]
[4]:
Chr Start End Center Experiment Insertions Background insertions Reference Insertions pvalue Reference pvalue Background Fraction Experiment TPH Experiment Fraction background TPH background TPH background subtracted pvalue_adj Reference
0 chr1 906689 907160 906957.0 5 0 3 3.099334e-09 1.546531e-04 0.000065 6475.845098 0.000000 0.000000 6475.845098 3.371990e-06
1 chr1 999921 1000324 1000121.0 20 0 1 0.000000e+00 0.000000e+00 0.000259 25903.380391 0.000000 0.000000 25903.380391 0.000000e+00
2 chr1 1156947 1157863 1157660.0 11 0 2 0.000000e+00 1.274899e-09 0.000142 14246.859215 0.000000 0.000000 14246.859215 0.000000e+00
3 chr1 1692740 1693542 1693339.0 6 0 3 5.135270e-11 1.546531e-04 0.000078 7771.014117 0.000000 0.000000 7771.014117 7.604315e-08
4 chr1 1744492 1746808 1746605.0 11 0 7 0.000000e+00 1.274899e-09 0.000142 14246.859215 0.000000 0.000000 14246.859215 0.000000e+00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3003 chrY 1245503 1247177 1245703.0 5 0 7 1.652897e-09 4.678840e-03 0.000065 6475.845098 0.000000 0.000000 6475.845098 2.056405e-06
3004 chrY 1280372 1281989 1281786.0 5 0 3 1.426995e-09 4.678840e-03 0.000065 6475.845098 0.000000 0.000000 6475.845098 1.828173e-06
3005 chrY 1586317 1587733 1587530.0 7 0 8 3.370637e-13 1.546531e-04 0.000091 9066.183137 0.000000 0.000000 9066.183137 6.766326e-10
3006 chrY 2391936 2392440 2392237.0 6 1 2 1.985434e-11 9.958372e-02 0.000078 7771.014117 0.000026 2647.323556 5123.690561 3.329699e-08
3007 chrY 2608793 2610054 2608993.0 8 0 2 2.775558e-15 1.546531e-04 0.000104 10361.352156 0.000000 0.000000 10361.352156 6.730626e-12

3008 rows × 15 columns

In order to tune parameters for peak calling, we advise looking at the data and evaluating the validity of the called peaks. The default settings are recommended, but for some TFs, adjacent peaks may be merged that should not be, or, alternatively, peaks that should be joined may be called separately.

Below, we plot the scCC data in HCT116 cells for a region in chromosome 1. The top track displays the locations of Sp1-directed (red) and background (grey) transpositions and their read counts. Each dot represents an insertion and the height is log(reads+1). The middle track plots the insertion density. The third track represents the reference genes and peaks. Finally, the last track represents peak calls. Below you can see that regions with high densities of insertions are accurately called as Sp1 binding sites.

[5]:
cc.pl.draw_area("chr1",999921,1000324,20000,peak_data_HCT116, HCT116_SP1, "hg38", HCT116_brd4, font_size=2,
                figsize = (30,10), peak_line = 1, save = False, plotsize = [1,1,4], example_length = 1000)
../../_images/tutorials_notebooks_K562HCT116_SP1_9_0.png

We can also visualize our data directly in the WashU Epigenome Browser. This can be useful for overlaying your data with other published datasets. Notice that this link only valid for 24hrs, so please rerun it if you want to use it.

[6]:
qbed = {"SP1":HCT116_SP1, "Brd4": HCT116_brd4}
bed = {"peak":peak_data_HCT116}
cc.pl.WashU_browser_url(qbed = qbed,bed = bed,genome = 'hg38')
All qbed addressed
All bed addressed
Uploading files
Please click the following link to see the data on WashU Epigenome Browser directly.
https://epigenomegateway.wustl.edu/browser/?genome=hg38&hub=https://companion.epigenomegateway.org//task/cb339e5b2d8c774345b7cf372cf4f97c/output//datahub.json

Pycallingcards can be used to visual peak locations acorss the genome to see that the distribution of peaks is unbiased and that all chromosomes are represented.

[7]:
cc.pl.whole_peaks(peak_data_HCT116, reference = "hg38", figsize=(100, 70), height_scale = 1.7)
../../_images/tutorials_notebooks_K562HCT116_SP1_13_0.png

We can then analyze the scCC Sp1 peaks to see if there is an enrichment of ChIP-seq signal at these locations using a reference Chip-seq dataset of SP1 binding in HCT116 from ENCSR000BSF (use the bigWig file ENCFF587ZMX generated by it).

Download the data, if needed, with:

!wget https://www.encodeproject.org/files/ENCFF587ZMX/@@download/ENCFF587ZMX.bigWig

We first calculate the signal of the Chip-seq signal around the peak.

[8]:
mtx_HCT116 = cc.pl.calculate_signal(peak_data_HCT116,
                                    chipseq_signal = "ENCFF587ZMX.bigWig")
100%|██████████| 3008/3008 [00:14<00:00, 200.91it/s]

Visualize it by the plotting the signal values.

[9]:
cc.pl.signal_plot(mtx_HCT116, alpha = 0.05, figsize=(6, 4))
../../_images/tutorials_notebooks_K562HCT116_SP1_18_0.png

Visualized by the plotting the signal heatmap plot.

[10]:
cc.pl.signal_heatmap(mtx_HCT116,pad = 0.035)
../../_images/tutorials_notebooks_K562HCT116_SP1_20_0.png

We can now use HOMER to call motifs. We hope to find the canonical Sp1 motif enriched under the call peaks.

[11]:
cc.tl.call_motif("peak_HCT116.bed",reference ="hg38",save_homer = "Homer/peak_HCT116",
                 homer_path = "/home/juanru/miniconda3/bin/", num_cores=12)
Peak data peak_HCT116.bed is used here.

        Position file = peak_HCT116.bed
        Genome = hg38
        Output Directory = Homer/peak_HCT116
        Fragment size set to 1000
        Using 12 CPUs
        Will not run homer for de novo motifs
        Found mset for "human", will check against vertebrates motifs
        Peak/BED file conversion summary:
                BED/Header formatted lines: 3008
                peakfile formatted lines: 0

        Peak File Statistics:
                Total Peaks: 3008
                Redundant Peak IDs: 0
                Peaks lacking information: 0 (need at least 5 columns per peak)
                Peaks with misformatted coordinates: 0 (should be integer)
                Peaks with misformatted strand: 0 (should be either +/- or 0/1)

        Peak file looks good!

        Background files for 1000 bp fragments found.
        Custom genome sequence directory: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38//

        Extracting sequences from file: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa
        Looking for peak sequences in a single file (/home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa)
        Extracting 334 sequences from chr1
        Extracting 147 sequences from chr10
        Extracting 132 sequences from chr11
        Extracting 179 sequences from chr12
        Extracting 56 sequences from chr13
        Extracting 85 sequences from chr14
        Extracting 83 sequences from chr15
        Extracting 127 sequences from chr16
        Extracting 205 sequences from chr17
        Extracting 54 sequences from chr18
        Extracting 249 sequences from chr19
        Extracting 194 sequences from chr2
        Extracting 71 sequences from chr20
        Extracting 37 sequences from chr21
        Extracting 63 sequences from chr22
        Extracting 143 sequences from chr3
        Extracting 146 sequences from chr4
        Extracting 96 sequences from chr5
        Extracting 159 sequences from chr6
        Extracting 143 sequences from chr7
        Extracting 117 sequences from chr8
        Extracting 136 sequences from chr9
        Extracting 47 sequences from chrX
        Extracting 5 sequences from chrY

        Not removing redundant sequences


        Sequences processed:
                Auto detected maximum sequence length of 1001 bp
                3008 total

        Frequency Bins: 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.6 0.7 0.8
        Freq    Bin     Count
        0.3     2       1
        0.35    3       46
        0.4     4       183
        0.45    5       336
        0.5     6       476
        0.6     7       1116
        0.7     8       706
        0.8     9       142
        10      10      2

        Total sequences set to 50000

        Choosing background that matches in CpG/GC content...
        Bin     # Targets       # Background    Background Weight
        2       1       16      0.976
        3       46      719     0.999
        4       183     2859    1.000
        5       336     5249    1.000
        6       476     7436    1.000
        7       1116    17435   1.000
        8       706     11029   1.000
        9       142     2218    1.000
        10      2       31      1.008
        Assembling sequence file...
        Normalizing lower order oligos using homer2

        Reading input files...
        50000 total sequences read
        Autonormalization: 1-mers (4 total)
                A       22.98%  23.24%  0.989
                C       27.02%  26.76%  1.010
                G       27.02%  26.76%  1.010
                T       22.98%  23.24%  0.989
        Autonormalization: 2-mers (16 total)
                AA      6.71%   6.20%   1.082
                CA      6.45%   7.36%   0.876
                GA      5.97%   6.11%   0.977
                TA      3.85%   3.56%   1.081
                AC      4.73%   4.90%   0.965
                CC      8.95%   8.75%   1.023
                GC      7.37%   7.01%   1.051
                TC      5.97%   6.11%   0.977
                AG      7.19%   7.61%   0.944
                CG      4.43%   3.06%   1.451
                GG      8.95%   8.75%   1.023
                TG      6.45%   7.36%   0.876
                AT      4.36%   4.52%   0.964
                CT      7.19%   7.61%   0.944
                GT      4.73%   4.90%   0.965
                TT      6.71%   6.20%   1.082
        Autonormalization: 3-mers (64 total)
        Normalization weights can be found in file: Homer/peak_HCT116/seq.autonorm.tsv
        Converging on autonormalization solution:
        ...............................................................................
        Final normalization:    Autonormalization: 1-mers (4 total)
                A       22.98%  22.93%  1.002
                C       27.02%  27.07%  0.998
                G       27.02%  27.07%  0.998
                T       22.98%  22.93%  1.002
        Autonormalization: 2-mers (16 total)
                AA      6.71%   6.50%   1.032
                CA      6.45%   6.69%   0.965
                GA      5.97%   5.97%   1.000
                TA      3.85%   3.78%   1.020
                AC      4.73%   4.71%   1.003
                CC      8.95%   8.89%   1.007
                GC      7.37%   7.50%   0.983
                TC      5.97%   5.97%   1.000
                AG      7.19%   7.21%   0.997
                CG      4.43%   4.29%   1.035
                GG      8.95%   8.89%   1.007
                TG      6.45%   6.69%   0.965
                AT      4.36%   4.51%   0.966
                CT      7.19%   7.21%   0.997
                GT      4.73%   4.71%   1.003
                TT      6.71%   6.50%   1.032
        Autonormalization: 3-mers (64 total)
        Finished preparing sequence/group files

        ----------------------------------------------------------
        Known motif enrichment

        Reading input files...
        50000 total sequences read
        440 motifs loaded
        Cache length = 11180
        Using binomial scoring
        Checking enrichment of 440 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================
Finished!

        Preparing HTML output with sequence logos...
                1 of 440 (1e-105) Sp5(Zf)/mES-Sp5.Flag-ChIP-Seq(GSE72989)/Homer
                2 of 440 (1e-97) KLF3(Zf)/MEF-Klf3-ChIP-Seq(GSE44748)/Homer
                3 of 440 (1e-92) Sp1(Zf)/Promoter/Homer
                4 of 440 (1e-87) KLF1(Zf)/HUDEP2-KLF1-CutnRun(GSE136251)/Homer
                5 of 440 (1e-78) Sp2(Zf)/HEK293-Sp2.eGFP-ChIP-Seq(Encode)/Homer
                6 of 440 (1e-77) Klf9(Zf)/GBM-Klf9-ChIP-Seq(GSE62211)/Homer
                7 of 440 (1e-69) KLF6(Zf)/PDAC-KLF6-ChIP-Seq(GSE64557)/Homer
                8 of 440 (1e-68) KLF5(Zf)/LoVo-KLF5-ChIP-Seq(GSE49402)/Homer
                9 of 440 (1e-65) KLF14(Zf)/HEK293-KLF14.GFP-ChIP-Seq(GSE58341)/Homer
                10 of 440 (1e-62) Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)/Homer
                11 of 440 (1e-58) Klf4(Zf)/mES-Klf4-ChIP-Seq(GSE11431)/Homer
                12 of 440 (1e-51) Jun-AP1(bZIP)/K562-cJun-ChIP-Seq(GSE31477)/Homer
                13 of 440 (1e-50) Fosl2(bZIP)/3T3L1-Fosl2-ChIP-Seq(GSE56872)/Homer
                14 of 440 (1e-43) TATA-Box(TBP)/Promoter/Homer
                15 of 440 (1e-42) Fra2(bZIP)/Striatum-Fra2-ChIP-Seq(GSE43429)/Homer
                16 of 440 (1e-41) Fra1(bZIP)/BT549-Fra1-ChIP-Seq(GSE46166)/Homer
                17 of 440 (1e-41) Fos(bZIP)/TSC-Fos-ChIP-Seq(GSE110950)/Homer
                18 of 440 (1e-41) JunB(bZIP)/DendriticCells-Junb-ChIP-Seq(GSE36099)/Homer
                19 of 440 (1e-37) Atf3(bZIP)/GBM-ATF3-ChIP-Seq(GSE33912)/Homer
                20 of 440 (1e-35) KLF10(Zf)/HEK293-KLF10.GFP-ChIP-Seq(GSE58341)/Homer
                21 of 440 (1e-30) BATF(bZIP)/Th17-BATF-ChIP-Seq(GSE39756)/Homer
                22 of 440 (1e-28) AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
                23 of 440 (1e-25) Bach2(bZIP)/OCILy7-Bach2-ChIP-Seq(GSE44420)/Homer
                24 of 440 (1e-22) Pitx1(Homeobox)/Chicken-Pitx1-ChIP-Seq(GSE38910)/Homer
                25 of 440 (1e-20) Elk4(ETS)/Hela-Elk4-ChIP-Seq(GSE31477)/Homer
                26 of 440 (1e-19) E2F4(E2F)/K562-E2F4-ChIP-Seq(GSE31477)/Homer
                27 of 440 (1e-19) Hoxd11(Homeobox)/ChickenMSG-Hoxd11.Flag-ChIP-Seq(GSE86088)/Homer
                28 of 440 (1e-18) Nkx6.1(Homeobox)/Islet-Nkx6.1-ChIP-Seq(GSE40975)/Homer
                29 of 440 (1e-18) Hoxa13(Homeobox)/ChickenMSG-Hoxa13.Flag-ChIP-Seq(GSE86088)/Homer
                30 of 440 (1e-17) WT1(Zf)/Kidney-WT1-ChIP-Seq(GSE90016)/Homer
                31 of 440 (1e-16) Fli1(ETS)/CD8-FLI-ChIP-Seq(GSE20898)/Homer
                32 of 440 (1e-16) Hoxa11(Homeobox)/ChickenMSG-Hoxa11.Flag-ChIP-Seq(GSE86088)/Homer
                33 of 440 (1e-16) Isl1(Homeobox)/Neuron-Isl1-ChIP-Seq(GSE31456)/Homer
                34 of 440 (1e-15) Elk1(ETS)/Hela-Elk1-ChIP-Seq(GSE31477)/Homer
                35 of 440 (1e-15) Egr1(Zf)/K562-Egr1-ChIP-Seq(GSE32465)/Homer
                36 of 440 (1e-15) E2F6(E2F)/Hela-E2F6-ChIP-Seq(GSE31477)/Homer
                37 of 440 (1e-15) ETV4(ETS)/HepG2-ETV4-ChIP-Seq(ENCODE)/Homer
                38 of 440 (1e-15) EKLF(Zf)/Erythrocyte-Klf1-ChIP-Seq(GSE20478)/Homer
                39 of 440 (1e-14) Egr2(Zf)/Thymocytes-Egr2-ChIP-Seq(GSE34254)/Homer
                40 of 440 (1e-13) ZNF467(Zf)/HEK293-ZNF467.GFP-ChIP-Seq(GSE58341)/Homer
                41 of 440 (1e-12) E2F3(E2F)/MEF-E2F3-ChIP-Seq(GSE71376)/Homer
                42 of 440 (1e-11) ELF1(ETS)/Jurkat-ELF1-ChIP-Seq(SRA014231)/Homer
                43 of 440 (1e-11) Nrf2(bZIP)/Lymphoblast-Nrf2-ChIP-Seq(GSE37589)/Homer
                44 of 440 (1e-11) Unknown(Homeobox)/Limb-p300-ChIP-Seq/Homer
                45 of 440 (1e-10) Lhx1(Homeobox)/EmbryoCarcinoma-Lhx1-ChIP-Seq(GSE70957)/Homer
                46 of 440 (1e-10) E2F1(E2F)/Hela-E2F1-ChIP-Seq(GSE22478)/Homer
                47 of 440 (1e-10) En1(Homeobox)/SUM149-EN1-ChIP-Seq(GSE120957)/Homer
                48 of 440 (1e-9) Lhx3(Homeobox)/Neuron-Lhx3-ChIP-Seq(GSE31456)/Homer
                49 of 440 (1e-9) CRX(Homeobox)/Retina-Crx-ChIP-Seq(GSE20012)/Homer
                50 of 440 (1e-8) Zfp281(Zf)/ES-Zfp281-ChIP-Seq(GSE81042)/Homer
                51 of 440 (1e-8) GABPA(ETS)/Jurkat-GABPa-ChIP-Seq(GSE17954)/Homer
                52 of 440 (1e-8) NF-E2(bZIP)/K562-NFE2-ChIP-Seq(GSE31477)/Homer
                53 of 440 (1e-8) Hoxd13(Homeobox)/ChickenMSG-Hoxd13.Flag-ChIP-Seq(GSE86088)/Homer
                54 of 440 (1e-8) HIF-1b(HLH)/T47D-HIF1b-ChIP-Seq(GSE59937)/Homer
                55 of 440 (1e-8) ETV1(ETS)/GIST48-ETV1-ChIP-Seq(GSE22441)/Homer
                56 of 440 (1e-7) LHX9(Homeobox)/Hct116-LHX9.V5-ChIP-Seq(GSE116822)/Homer
                57 of 440 (1e-7) DLX2(Homeobox)/BasalGanglia-Dlx2-ChIP-seq(GSE124936)/Homer
                58 of 440 (1e-7) Bach1(bZIP)/K562-Bach1-ChIP-Seq(GSE31477)/Homer
                59 of 440 (1e-7) DLX1(Homeobox)/BasalGanglia-Dlx1-ChIP-seq(GSE124936)/Homer
                60 of 440 (1e-7) EHF(ETS)/LoVo-EHF-ChIP-Seq(GSE49402)/Homer
                61 of 440 (1e-7) BMYB(HTH)/Hela-BMYB-ChIP-Seq(GSE27030)/Homer
                62 of 440 (1e-7) NFE2L2(bZIP)/HepG2-NFE2L2-ChIP-Seq(Encode)/Homer
                63 of 440 (1e-7) Stat3(Stat)/mES-Stat3-ChIP-Seq(GSE11431)/Homer
                64 of 440 (1e-6) Hoxa9(Homeobox)/ChickenMSG-Hoxa9.Flag-ChIP-Seq(GSE86088)/Homer
                65 of 440 (1e-6) MYB(HTH)/ERMYB-Myb-ChIPSeq(GSE22095)/Homer
                66 of 440 (1e-6) Znf263(Zf)/K562-Znf263-ChIP-Seq(GSE31477)/Homer
                67 of 440 (1e-6) HNF1b(Homeobox)/PDAC-HNF1B-ChIP-Seq(GSE64557)/Homer
                68 of 440 (1e-6) CHR(?)/Hela-CellCycle-Expression/Homer
                69 of 440 (1e-6) CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer
                70 of 440 (1e-6) ETS(ETS)/Promoter/Homer
                71 of 440 (1e-6) ETS1(ETS)/Jurkat-ETS1-ChIP-Seq(GSE17954)/Homer
                72 of 440 (1e-6) ZNF652/HepG2-ZNF652.Flag-ChIP-Seq(Encode)/Homer
                73 of 440 (1e-5) Nanog(Homeobox)/mES-Nanog-ChIP-Seq(GSE11724)/Homer
                74 of 440 (1e-5) Lhx2(Homeobox)/HFSC-Lhx2-ChIP-Seq(GSE48068)/Homer
                75 of 440 (1e-5) Elf4(ETS)/BMDM-Elf4-ChIP-Seq(GSE88699)/Homer
                76 of 440 (1e-5) NFY(CCAAT)/Promoter/Homer
                77 of 440 (1e-5) ELF5(ETS)/T47D-ELF5-ChIP-Seq(GSE30407)/Homer
                78 of 440 (1e-5) AP-2gamma(AP2)/MCF7-TFAP2C-ChIP-Seq(GSE21234)/Homer
                79 of 440 (1e-5) ERG(ETS)/VCaP-ERG-ChIP-Seq(GSE14097)/Homer
                80 of 440 (1e-5) EWS:ERG-fusion(ETS)/CADO_ES1-EWS:ERG-ChIP-Seq(SRA014231)/Homer
                81 of 440 (1e-5) AMYB(HTH)/Testes-AMYB-ChIP-Seq(GSE44588)/Homer
                82 of 440 (1e-5) ELF3(ETS)/PDAC-ELF3-ChIP-Seq(GSE64557)/Homer
                83 of 440 (1e-5) HOXB13(Homeobox)/ProstateTumor-HOXB13-ChIP-Seq(GSE56288)/Homer
                84 of 440 (1e-5) STAT4(Stat)/CD4-Stat4-ChIP-Seq(GSE22104)/Homer
                85 of 440 (1e-5) MafK(bZIP)/C2C12-MafK-ChIP-Seq(GSE36030)/Homer
                86 of 440 (1e-4) SPDEF(ETS)/VCaP-SPDEF-ChIP-Seq(SRA014231)/Homer
                87 of 440 (1e-4) Ets1-distal(ETS)/CD4+-PolII-ChIP-Seq(Barski_et_al.)/Homer
                88 of 440 (1e-4) DLX5(Homeobox)/BasalGanglia-Dlx5-ChIP-seq(GSE124936)/Homer
                89 of 440 (1e-4) Dlx3(Homeobox)/Kerainocytes-Dlx3-ChIP-Seq(GSE89884)/Homer
                90 of 440 (1e-4) PRDM15(Zf)/ESC-Prdm15-ChIP-Seq(GSE73694)/Homer
                91 of 440 (1e-4) Stat3+il21(Stat)/CD4-Stat3-ChIP-Seq(GSE19198)/Homer
                92 of 440 (1e-4) Etv2(ETS)/ES-ER71-ChIP-Seq(GSE59402)/Homer
                93 of 440 (1e-4) HIF2a(bHLH)/785_O-HIF2a-ChIP-Seq(GSE34871)/Homer
                94 of 440 (1e-4) TEAD1(TEAD)/HepG2-TEAD1-ChIP-Seq(Encode)/Homer
                95 of 440 (1e-3) ZNF189(Zf)/HEK293-ZNF189.GFP-ChIP-Seq(GSE58341)/Homer
                96 of 440 (1e-3) STAT5(Stat)/mCD4+-Stat5-ChIP-Seq(GSE12346)/Homer
                97 of 440 (1e-3) AP-2alpha(AP2)/Hela-AP2alpha-ChIP-Seq(GSE31477)/Homer
                98 of 440 (1e-3) ZFX(Zf)/mES-Zfx-ChIP-Seq(GSE11431)/Homer
                99 of 440 (1e-3) NFkB-p65-Rel(RHD)/ThioMac-LPS-Expression(GSE23622)/Homer
                100 of 440 (1e-3) BORIS(Zf)/K562-CTCFL-ChIP-Seq(GSE32465)/Homer
                101 of 440 (1e-3) E2F7(E2F)/Hela-E2F7-ChIP-Seq(GSE32673)/Homer
                102 of 440 (1e-3) EWS:FLI1-fusion(ETS)/SK_N_MC-EWS:FLI1-ChIP-Seq(SRA014231)/Homer
                103 of 440 (1e-3) c-Myc(bHLH)/LNCAP-cMyc-ChIP-Seq(Unpublished)/Homer
                104 of 440 (1e-3) Otx2(Homeobox)/EpiLC-Otx2-ChIP-Seq(GSE56098)/Homer
                105 of 440 (1e-3) TEAD4(TEA)/Tropoblast-Tead4-ChIP-Seq(GSE37350)/Homer
                106 of 440 (1e-2) Npas4(bHLH)/Neuron-Npas4-ChIP-Seq(GSE127793)/Homer
                107 of 440 (1e-2) CArG(MADS)/PUER-Srf-ChIP-Seq(Sullivan_et_al.)/Homer
                108 of 440 (1e-2) ZNF711(Zf)/SHSY5Y-ZNF711-ChIP-Seq(GSE20673)/Homer
                109 of 440 (1e-2) Prop1(Homeobox)/GHFT1-PROP1.biotin-ChIP-Seq(GSE77302)/Homer
                110 of 440 (1e-2) GSC(Homeobox)/FrogEmbryos-GSC-ChIP-Seq(DRA000576)/Homer
                111 of 440 (1e-2) Hnf1(Homeobox)/Liver-Foxa2-Chip-Seq(GSE25694)/Homer
                112 of 440 (1e-2) TRPS1(Zf)/MCF7-TRPS1-ChIP-Seq(GSE107013)/Homer
                113 of 440 (1e-2) Tbr1(T-box)/Cortex-Tbr1-ChIP-Seq(GSE71384)/Homer
                114 of 440 (1e-2) NFAT(RHD)/Jurkat-NFATC1-ChIP-Seq(Jolma_et_al.)/Homer
                115 of 440 (1e-2) CLOCK(bHLH)/Liver-Clock-ChIP-Seq(GSE39860)/Homer
                116 of 440 (1e-2) STAT1(Stat)/HelaS3-STAT1-ChIP-Seq(GSE12782)/Homer
                117 of 440 (1e-2) Barx1(Homeobox)/Stomach-Barx1.3xFlag-ChIP-Seq(GSE69483)/Homer
                118 of 440 (1e-2) NFkB-p65(RHD)/GM12787-p65-ChIP-Seq(GSE19485)/Homer
                119 of 440 (1e-2) CEBP(bZIP)/ThioMac-CEBPb-ChIP-Seq(GSE21512)/Homer
                120 of 440 (1e-2) Arnt:Ahr(bHLH)/MCF7-Arnt-ChIP-Seq(Lo_et_al.)/Homer
                121 of 440 (1e-2) HINFP(Zf)/K562-HINFP.eGFP-ChIP-Seq(Encode)/Homer
                122 of 440 (1e-2) CRE(bZIP)/Promoter/Homer
                123 of 440 (1e-2) E-box(bHLH)/Promoter/Homer
                124 of 440 (1e-2) Hoxd12(Homeobox)/ChickenMSG-Hoxd12.Flag-ChIP-Seq(GSE86088)/Homer
                125 of 440 (1e-2) TEAD(TEA)/Fibroblast-PU.1-ChIP-Seq(Unpublished)/Homer
                126 of 440 (1e-2) ZSCAN22(Zf)/HEK293-ZSCAN22.GFP-ChIP-Seq(GSE58341)/Homer
                127 of 440 (1e-2) IRF:BATF(IRF:bZIP)/pDC-Irf8-ChIP-Seq(GSE66899)/Homer
                128 of 440 (1e-2) CDX4(Homeobox)/ZebrafishEmbryos-Cdx4.Myc-ChIP-Seq(GSE48254)/Homer
        Skipping...
        Job finished - if results look good, please send beer to ..

        Cleaning up tmp files...

In the motif analysis result, SP1 motif and many other family members rank top.

drawing

Do the exact same thing for K562 SP1 data.

[12]:
# read experiment data
K562_SP1 = cc.datasets.SP1_K562HCT116_data(data="K562_SP1_qbed")
K562_SP1
[12]:
Chr Start End Reads Direction Barcodes
0 chr1 16529 16533 163 - GCTCCTAAGTACGTTC-1
1 chr1 29884 29888 10 + CTCACACCAGACGCTC-1
2 chr1 29884 29888 155 + TGGCCAGCACCCATTC-1
3 chr1 29884 29888 285 + GTGGGTCCACGGCCAT-1
4 chr1 29884 29888 7 + CGTCTACTCAACACGT-1
... ... ... ... ... ... ...
327460 chrY 57061562 57061566 6 + CTCATTATCATCATTC-1
327461 chrY 57061562 57061566 67 + TGCGTGGCATTAGGCT-1
327462 chrY 57145084 57145088 2 - ACATACGTCGCGCCAA-1
327463 chrY 57148630 57148634 2 - TATGCCCGTACAGTTC-1
327464 chrY 57183913 57183917 228 - AAACCTGGTCCTGCTT-1

327465 rows × 6 columns

[13]:
# read background data
K562_brd4 =cc.datasets.SP1_K562HCT116_data(data="K562_brd4_qbed")
K562_brd4
[13]:
Chr Start End Reads Direction Barcodes
0 chr1 30238 30242 3 + TTTACTGCATAAAGGT-1
1 chr1 30355 30359 2 - ATCACGAAGAGTAATC-1
2 chr1 30355 30359 70 + TTGAACGCAAATCCGT-1
3 chr1 31101 31105 2 + CCTCAGTCATCAGTAC-1
4 chr1 32116 32120 5 + CTAGTGAAGACAAAGG-1
... ... ... ... ... ... ...
107380 chrY 57080210 57080214 9 - AAGGAGCCAGTATAAG-1
107381 chrY 57087785 57087789 24 - CGAGCCAGTCTCTCTG-1
107382 chrY 57144853 57144857 5 + GAAGCAGTCCCATTTA-1
107383 chrY 57183772 57183776 2 - TCTTTCCTCTTGCCGT-1
107384 chrY 57204853 57204857 369 - ATAACGCAGTTTGCGT-1

107385 rows × 6 columns

[14]:
peak_data_K562 = cc.pp.call_peaks(K562_SP1, K562_brd4, method = "MACCs", reference = "hg38", window_size = 2000, step_size = 500,
                  pvalue_cutoffTTAA = 0.0001,  pvalue_cutoffbg = 0.1, lam_win_size = None,  pseudocounts = 0.1, record = True, save = "peak_k562.bed")
peak_data_K562
For the MACCs method with background, [expdata, background, reference, pvalue_cutoffbg, pvalue_cutoffTTAA, lam_win_size, window_size, step_size, extend, pseudocounts, test_method, min_insertions, record] would be utilized.
100%|██████████| 24/24 [01:34<00:00,  3.96s/it]
[14]:
Chr Start End Center Experiment Insertions Background insertions Reference Insertions pvalue Reference pvalue Background Fraction Experiment TPH Experiment Fraction background TPH background TPH background subtracted pvalue_adj Reference
0 chr1 29684 30087 29884.0 6 0 1 8.878753e-11 1.546531e-04 0.000018 1832.256882 0.000000 0.000000 1832.256882 3.323285e-08
1 chr1 36239 38107 37578.0 24 2 15 0.000000e+00 1.486029e-03 0.000073 7329.027530 0.000019 1862.457513 5466.570017 0.000000e+00
2 chr1 198893 201208 200869.0 28 2 11 0.000000e+00 6.927041e-05 0.000086 8550.532118 0.000019 1862.457513 6688.074605 0.000000e+00
3 chr1 203351 207161 205004.0 92 13 22 0.000000e+00 4.337485e-05 0.000281 28094.605530 0.000121 12105.973832 15988.631698 0.000000e+00
4 chr1 265549 266336 265749.0 5 0 3 3.731359e-08 4.678840e-03 0.000015 1526.880735 0.000000 0.000000 1526.880735 1.056034e-05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9404 chrY 15158250 15158653 15158450.0 11 0 1 0.000000e+00 1.546531e-04 0.000034 3359.137618 0.000000 0.000000 3359.137618 0.000000e+00
9405 chrY 16985442 16985845 16985642.0 5 0 2 1.806731e-09 4.678840e-03 0.000015 1526.880735 0.000000 0.000000 1526.880735 6.060202e-07
9406 chrY 19753311 19753714 19753511.0 33 0 1 0.000000e+00 2.269296e-13 0.000101 10077.412853 0.000000 0.000000 10077.412853 0.000000e+00
9407 chrY 21011133 21011828 21011333.0 5 0 4 2.510448e-09 4.678840e-03 0.000015 1526.880735 0.000000 0.000000 1526.880735 8.296735e-07
9408 chrY 56952574 56957328 56953707.0 40 1 37 0.000000e+00 2.427052e-06 0.000122 12215.045883 0.000009 931.228756 11283.817126 0.000000e+00

9409 rows × 15 columns

[15]:
cc.pl.draw_area("chr10",3048452,3049913,60000,peak_data_K562,K562_SP1, "hg38", K562_brd4 , font_size=2,
                figsize = (30,15),peak_line = 4,save = False,bins =400, plotsize = [1,1,5], example_length = 1000)
../../_images/tutorials_notebooks_K562HCT116_SP1_28_0.png
[16]:
qbed = {"SP1":K562_SP1, "Brd4": K562_brd4}
bed = {"peak":peak_data_K562}
cc.pl.WashU_browser_url(qbed = qbed,bed = bed,genome = 'hg38')
All qbed addressed
All bed addressed
Uploading files
Please click the following link to see the data on WashU Epigenome Browser directly.
https://epigenomegateway.wustl.edu/browser/?genome=hg38&hub=https://companion.epigenomegateway.org//task/db6498ca05ea546a0938a34cd8de001c/output//datahub.json
[17]:
cc.pl.whole_peaks(peak_data_K562, reference = "hg38",figsize=(100, 70),height_scale = 1.7)
../../_images/tutorials_notebooks_K562HCT116_SP1_30_0.png

We can see that SP1 binds much more frequently in K562 than HCT116.

We can then check with reference Chip-seq data of SP1 in K562 from ENCSR372IML (and use the bigWig file ENCFF588UII generated by it).

Download the data if needed:

!wget https://www.encodeproject.org/files/ENCFF588UII/@@download/ENCFF588UII.bigWig
[18]:
mtx_K562 = cc.pl.calculate_signal(peak_data_K562, chipseq_signal = "ENCFF588UII.bigWig")
100%|██████████| 9409/9409 [00:41<00:00, 225.34it/s]
[19]:
cc.pl.signal_plot(mtx_K562, alpha = 0.05, figsize=(6, 4))
../../_images/tutorials_notebooks_K562HCT116_SP1_34_0.png
[20]:
cc.pl.signal_heatmap(mtx_K562,pad =  0.023, belowlength = 100)
../../_images/tutorials_notebooks_K562HCT116_SP1_35_0.png

We can see that calling cards peaks are consistent with Chip-seq data. Peak centers tend to have a higher signal and the signal goes lower as the distance increases.

Call motif to check the peak results.

[21]:
cc.tl.call_motif("peak_k562.bed",reference ="hg38",save_homer = "Homer/peak_k562",
                 homer_path = "/home/juanru/miniconda3/bin/", num_cores=12)
Peak data peak_k562.bed is used here.

        Position file = peak_k562.bed
        Genome = hg38
        Output Directory = Homer/peak_k562
        Fragment size set to 1000
        Using 12 CPUs
        Will not run homer for de novo motifs
        Found mset for "human", will check against vertebrates motifs
        Peak/BED file conversion summary:
                BED/Header formatted lines: 9409
                peakfile formatted lines: 0

        Peak File Statistics:
                Total Peaks: 9409
                Redundant Peak IDs: 2
                Peaks lacking information: 0 (need at least 5 columns per peak)
                Peaks with misformatted coordinates: 0 (should be integer)
                Peaks with misformatted strand: 0 (should be either +/- or 0/1)

        Redunant Peaks found: Remove or rename these or some programs may have trouble...

        2 duplicate peak IDs out of 9409 total peaks
        Background files for 1000 bp fragments found.
        Custom genome sequence directory: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38//

        Extracting sequences from file: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa
        Looking for peak sequences in a single file (/home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa)
        Extracting 1062 sequences from chr1
        Extracting 426 sequences from chr10
        Extracting 475 sequences from chr11
        Extracting 443 sequences from chr12
        Extracting 182 sequences from chr13
        Extracting 227 sequences from chr14
        Extracting 302 sequences from chr15
        Extracting 342 sequences from chr16
        Extracting 409 sequences from chr17
        Extracting 169 sequences from chr18
        Extracting 505 sequences from chr19
        Extracting 660 sequences from chr2
        Extracting 249 sequences from chr20
        Extracting 154 sequences from chr21
        Extracting 186 sequences from chr22
        Extracting 508 sequences from chr3
        Extracting 428 sequences from chr4
        Extracting 408 sequences from chr5
        Extracting 620 sequences from chr6
        Extracting 572 sequences from chr7
        Extracting 375 sequences from chr8
        Extracting 408 sequences from chr9
        Extracting 281 sequences from chrX
        Extracting 18 sequences from chrY

        Not removing redundant sequences


        Sequences processed:
                Auto detected maximum sequence length of 1001 bp
                9409 total

        Frequency Bins: 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.6 0.7 0.8
        Freq    Bin     Count
        0.3     2       52
        0.35    3       412
        0.4     4       1127
        0.45    5       1812
        0.5     6       1961
        0.6     7       2657
        0.7     8       1164
        0.8     9       219
        10      10      5

        Total sequences set to 50000

        Choosing background that matches in CpG/GC content...
        Bin     # Targets       # Background    Background Weight
        2       52      224     1.001
        3       412     1777    1.000
        4       1127    4862    1.000
        5       1812    7817    1.000
        6       1961    8460    1.000
        7       2657    11462   1.000
        8       1164    5022    1.000
        9       219     945     1.000
        10      5       22      0.980
        Assembling sequence file...
        Normalizing lower order oligos using homer2

        Reading input files...
        50000 total sequences read
        Autonormalization: 1-mers (4 total)
                A       25.36%  25.49%  0.995
                C       24.64%  24.51%  1.005
                G       24.64%  24.51%  1.005
                T       25.36%  25.49%  0.995
        Autonormalization: 2-mers (16 total)
                AA      7.80%   7.45%   1.047
                CA      6.85%   7.40%   0.925
                GA      6.01%   6.09%   0.986
                TA      4.70%   4.54%   1.036
                AC      4.92%   4.99%   0.986
                CC      7.54%   7.44%   1.013
                GC      6.17%   5.99%   1.029
                TC      6.01%   6.09%   0.986
                AG      7.26%   7.47%   0.972
                CG      2.98%   2.20%   1.356
                GG      7.54%   7.44%   1.013
                TG      6.85%   7.40%   0.925
                AT      5.37%   5.57%   0.966
                CT      7.26%   7.47%   0.972
                GT      4.92%   4.99%   0.986
                TT      7.80%   7.45%   1.047
        Autonormalization: 3-mers (64 total)
        Normalization weights can be found in file: Homer/peak_k562/seq.autonorm.tsv
        Converging on autonormalization solution:
        ...............................................................................
        Final normalization:    Autonormalization: 1-mers (4 total)
                A       25.36%  25.34%  1.001
                C       24.64%  24.66%  0.999
                G       24.64%  24.66%  0.999
                T       25.36%  25.34%  1.001
        Autonormalization: 2-mers (16 total)
                AA      7.80%   7.68%   1.016
                CA      6.85%   6.98%   0.982
                GA      6.01%   6.00%   1.000
                TA      4.70%   4.67%   1.006
                AC      4.92%   4.88%   1.008
                CC      7.54%   7.52%   1.002
                GC      6.17%   6.25%   0.987
                TC      6.01%   6.00%   1.000
                AG      7.26%   7.24%   1.003
                CG      2.98%   2.92%   1.023
                GG      7.54%   7.52%   1.002
                TG      6.85%   6.98%   0.982
                AT      5.37%   5.53%   0.971
                CT      7.26%   7.24%   1.003
                GT      4.92%   4.88%   1.008
                TT      7.80%   7.68%   1.016
        Autonormalization: 3-mers (64 total)
        Finished preparing sequence/group files

        ----------------------------------------------------------
        Known motif enrichment

        Reading input files...
        50000 total sequences read
        440 motifs loaded
        Cache length = 11180
        Using binomial scoring
        Checking enrichment of 440 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================
Finished!

        Preparing HTML output with sequence logos...
                1 of 440 (1e-169) Sp2(Zf)/HEK293-Sp2.eGFP-ChIP-Seq(Encode)/Homer
                2 of 440 (1e-148) Sp5(Zf)/mES-Sp5.Flag-ChIP-Seq(GSE72989)/Homer
                3 of 440 (1e-137) KLF1(Zf)/HUDEP2-KLF1-CutnRun(GSE136251)/Homer
                4 of 440 (1e-125) KLF6(Zf)/PDAC-KLF6-ChIP-Seq(GSE64557)/Homer
                5 of 440 (1e-124) KLF5(Zf)/LoVo-KLF5-ChIP-Seq(GSE49402)/Homer
                6 of 440 (1e-123) Sp1(Zf)/Promoter/Homer
                7 of 440 (1e-114) KLF3(Zf)/MEF-Klf3-ChIP-Seq(GSE44748)/Homer
                8 of 440 (1e-114) Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)/Homer
                9 of 440 (1e-100) KLF14(Zf)/HEK293-KLF14.GFP-ChIP-Seq(GSE58341)/Homer
                10 of 440 (1e-96) Klf9(Zf)/GBM-Klf9-ChIP-Seq(GSE62211)/Homer
                11 of 440 (1e-74) Jun-AP1(bZIP)/K562-cJun-ChIP-Seq(GSE31477)/Homer
                12 of 440 (1e-67) Gata2(Zf)/K562-GATA2-ChIP-Seq(GSE18829)/Homer
                13 of 440 (1e-65) Fosl2(bZIP)/3T3L1-Fosl2-ChIP-Seq(GSE56872)/Homer
                14 of 440 (1e-61) TATA-Box(TBP)/Promoter/Homer
                15 of 440 (1e-60) Gata4(Zf)/Heart-Gata4-ChIP-Seq(GSE35151)/Homer
                16 of 440 (1e-59) Gata6(Zf)/HUG1N-GATA6-ChIP-Seq(GSE51936)/Homer
                17 of 440 (1e-57) Bach2(bZIP)/OCILy7-Bach2-ChIP-Seq(GSE44420)/Homer
                18 of 440 (1e-52) BMYB(HTH)/Hela-BMYB-ChIP-Seq(GSE27030)/Homer
                19 of 440 (1e-52) GATA3(Zf)/iTreg-Gata3-ChIP-Seq(GSE20898)/Homer
                20 of 440 (1e-48) Fra2(bZIP)/Striatum-Fra2-ChIP-Seq(GSE43429)/Homer
                21 of 440 (1e-47) Klf4(Zf)/mES-Klf4-ChIP-Seq(GSE11431)/Homer
                22 of 440 (1e-46) Gata1(Zf)/K562-GATA1-ChIP-Seq(GSE18829)/Homer
                23 of 440 (1e-45) Fos(bZIP)/TSC-Fos-ChIP-Seq(GSE110950)/Homer
                24 of 440 (1e-45) JunB(bZIP)/DendriticCells-Junb-ChIP-Seq(GSE36099)/Homer
                25 of 440 (1e-43) Fra1(bZIP)/BT549-Fra1-ChIP-Seq(GSE46166)/Homer
                26 of 440 (1e-41) AMYB(HTH)/Testes-AMYB-ChIP-Seq(GSE44588)/Homer
                27 of 440 (1e-39) TRPS1(Zf)/MCF7-TRPS1-ChIP-Seq(GSE107013)/Homer
                28 of 440 (1e-36) MYB(HTH)/ERMYB-Myb-ChIPSeq(GSE22095)/Homer
                29 of 440 (1e-36) KLF10(Zf)/HEK293-KLF10.GFP-ChIP-Seq(GSE58341)/Homer
                30 of 440 (1e-36) Atf3(bZIP)/GBM-ATF3-ChIP-Seq(GSE33912)/Homer
                31 of 440 (1e-35) Bach1(bZIP)/K562-Bach1-ChIP-Seq(GSE31477)/Homer
                32 of 440 (1e-34) BATF(bZIP)/Th17-BATF-ChIP-Seq(GSE39756)/Homer
                33 of 440 (1e-33) NF-E2(bZIP)/K562-NFE2-ChIP-Seq(GSE31477)/Homer
                34 of 440 (1e-33) Pitx1(Homeobox)/Chicken-Pitx1-ChIP-Seq(GSE38910)/Homer
                35 of 440 (1e-32) WT1(Zf)/Kidney-WT1-ChIP-Seq(GSE90016)/Homer
                36 of 440 (1e-32) Elk4(ETS)/Hela-Elk4-ChIP-Seq(GSE31477)/Homer
                37 of 440 (1e-28) Isl1(Homeobox)/Neuron-Isl1-ChIP-Seq(GSE31456)/Homer
                38 of 440 (1e-27) AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
                39 of 440 (1e-27) Nrf2(bZIP)/Lymphoblast-Nrf2-ChIP-Seq(GSE37589)/Homer
                40 of 440 (1e-26) Elk1(ETS)/Hela-Elk1-ChIP-Seq(GSE31477)/Homer
                41 of 440 (1e-24) ZNF467(Zf)/HEK293-ZNF467.GFP-ChIP-Seq(GSE58341)/Homer
                42 of 440 (1e-23) NFE2L2(bZIP)/HepG2-NFE2L2-ChIP-Seq(Encode)/Homer
                43 of 440 (1e-23) E2F6(E2F)/Hela-E2F6-ChIP-Seq(GSE31477)/Homer
                44 of 440 (1e-23) CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer
                45 of 440 (1e-23) ETV4(ETS)/HepG2-ETV4-ChIP-Seq(ENCODE)/Homer
                46 of 440 (1e-23) Fli1(ETS)/CD8-FLI-ChIP-Seq(GSE20898)/Homer
                47 of 440 (1e-23) Egr1(Zf)/K562-Egr1-ChIP-Seq(GSE32465)/Homer
                48 of 440 (1e-22) GATA:SCL(Zf,bHLH)/Ter119-SCL-ChIP-Seq(GSE18720)/Homer
                49 of 440 (1e-21) Egr2(Zf)/Thymocytes-Egr2-ChIP-Seq(GSE34254)/Homer
                50 of 440 (1e-21) Zfp281(Zf)/ES-Zfp281-ChIP-Seq(GSE81042)/Homer
                51 of 440 (1e-20) ETV1(ETS)/GIST48-ETV1-ChIP-Seq(GSE22441)/Homer
                52 of 440 (1e-19) Elf4(ETS)/BMDM-Elf4-ChIP-Seq(GSE88699)/Homer
                53 of 440 (1e-19) Nkx6.1(Homeobox)/Islet-Nkx6.1-ChIP-Seq(GSE40975)/Homer
                54 of 440 (1e-18) ZNF652/HepG2-ZNF652.Flag-ChIP-Seq(Encode)/Homer
                55 of 440 (1e-18) Hoxa13(Homeobox)/ChickenMSG-Hoxa13.Flag-ChIP-Seq(GSE86088)/Homer
                56 of 440 (1e-16) Znf263(Zf)/K562-Znf263-ChIP-Seq(GSE31477)/Homer
                57 of 440 (1e-15) E2F4(E2F)/K562-E2F4-ChIP-Seq(GSE31477)/Homer
                58 of 440 (1e-15) ELF1(ETS)/Jurkat-ELF1-ChIP-Seq(SRA014231)/Homer
                59 of 440 (1e-15) EKLF(Zf)/Erythrocyte-Klf1-ChIP-Seq(GSE20478)/Homer
                60 of 440 (1e-14) GABPA(ETS)/Jurkat-GABPa-ChIP-Seq(GSE17954)/Homer
                61 of 440 (1e-13) CRX(Homeobox)/Retina-Crx-ChIP-Seq(GSE20012)/Homer
                62 of 440 (1e-13) Lhx3(Homeobox)/Neuron-Lhx3-ChIP-Seq(GSE31456)/Homer
                63 of 440 (1e-12) Hoxa11(Homeobox)/ChickenMSG-Hoxa11.Flag-ChIP-Seq(GSE86088)/Homer
                64 of 440 (1e-11) STAT5(Stat)/mCD4+-Stat5-ChIP-Seq(GSE12346)/Homer
                65 of 440 (1e-11) ETS1(ETS)/Jurkat-ETS1-ChIP-Seq(GSE17954)/Homer
                66 of 440 (1e-11) ZNF317(Zf)/HEK293-ZNF317.GFP-ChIP-Seq(GSE58341)/Homer
                67 of 440 (1e-11) En1(Homeobox)/SUM149-EN1-ChIP-Seq(GSE120957)/Homer
                68 of 440 (1e-10) BORIS(Zf)/K562-CTCFL-ChIP-Seq(GSE32465)/Homer
                69 of 440 (1e-10) MITF(bHLH)/MastCells-MITF-ChIP-Seq(GSE48085)/Homer
                70 of 440 (1e-10) RORg(NR)/Liver-Rorc-ChIP-Seq(GSE101115)/Homer
                71 of 440 (1e-10) Hoxd13(Homeobox)/ChickenMSG-Hoxd13.Flag-ChIP-Seq(GSE86088)/Homer
                72 of 440 (1e-10) EHF(ETS)/LoVo-EHF-ChIP-Seq(GSE49402)/Homer
                73 of 440 (1e-10) EWS:FLI1-fusion(ETS)/SK_N_MC-EWS:FLI1-ChIP-Seq(SRA014231)/Homer
                74 of 440 (1e-9) RXR(NR),DR1/3T3L1-RXR-ChIP-Seq(GSE13511)/Homer
                75 of 440 (1e-9) ELF5(ETS)/T47D-ELF5-ChIP-Seq(GSE30407)/Homer
                76 of 440 (1e-9) PU.1(ETS)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
                77 of 440 (1e-9) FOXK2(Forkhead)/U2OS-FOXK2-ChIP-Seq(E-MTAB-2204)/Homer
                78 of 440 (1e-9) Hoxd11(Homeobox)/ChickenMSG-Hoxd11.Flag-ChIP-Seq(GSE86088)/Homer
                79 of 440 (1e-9) PPARa(NR),DR1/Liver-Ppara-ChIP-Seq(GSE47954)/Homer
                80 of 440 (1e-9) PU.1-IRF(ETS:IRF)/Bcell-PU.1-ChIP-Seq(GSE21512)/Homer
                81 of 440 (1e-9) Stat3(Stat)/mES-Stat3-ChIP-Seq(GSE11431)/Homer
                82 of 440 (1e-9) Unknown(Homeobox)/Limb-p300-ChIP-Seq/Homer
                83 of 440 (1e-9) Hoxa9(Homeobox)/ChickenMSG-Hoxa9.Flag-ChIP-Seq(GSE86088)/Homer
                84 of 440 (1e-9) ETS(ETS)/Promoter/Homer
                85 of 440 (1e-8) HIF-1b(HLH)/T47D-HIF1b-ChIP-Seq(GSE59937)/Homer
                86 of 440 (1e-8) Hnf1(Homeobox)/Liver-Foxa2-Chip-Seq(GSE25694)/Homer
                87 of 440 (1e-8) STAT1(Stat)/HelaS3-STAT1-ChIP-Seq(GSE12782)/Homer
                88 of 440 (1e-8) PPARE(NR),DR1/3T3L1-Pparg-ChIP-Seq(GSE13511)/Homer
                89 of 440 (1e-8) Smad4(MAD)/ESC-SMAD4-ChIP-Seq(GSE29422)/Homer
                90 of 440 (1e-8) PR(NR)/T47D-PR-ChIP-Seq(GSE31130)/Homer
                91 of 440 (1e-8) ERG(ETS)/VCaP-ERG-ChIP-Seq(GSE14097)/Homer
                92 of 440 (1e-8) MafK(bZIP)/C2C12-MafK-ChIP-Seq(GSE36030)/Homer
                93 of 440 (1e-8) MafB(bZIP)/BMM-Mafb-ChIP-Seq(GSE75722)/Homer
                94 of 440 (1e-8) FOXP1(Forkhead)/H9-FOXP1-ChIP-Seq(GSE31006)/Homer
                95 of 440 (1e-7) E2F3(E2F)/MEF-E2F3-ChIP-Seq(GSE71376)/Homer
                96 of 440 (1e-7) E2F7(E2F)/Hela-E2F7-ChIP-Seq(GSE32673)/Homer
                97 of 440 (1e-7) Lhx1(Homeobox)/EmbryoCarcinoma-Lhx1-ChIP-Seq(GSE70957)/Homer
                98 of 440 (1e-7) Smad2(MAD)/ES-SMAD2-ChIP-Seq(GSE29422)/Homer
                99 of 440 (1e-7) Nanog(Homeobox)/mES-Nanog-ChIP-Seq(GSE11724)/Homer
                100 of 440 (1e-7) Max(bHLH)/K562-Max-ChIP-Seq(GSE31477)/Homer
                101 of 440 (1e-7) DLX2(Homeobox)/BasalGanglia-Dlx2-ChIP-seq(GSE124936)/Homer
                102 of 440 (1e-7) MafA(bZIP)/Islet-MafA-ChIP-Seq(GSE30298)/Homer
                103 of 440 (1e-7) ZFX(Zf)/mES-Zfx-ChIP-Seq(GSE11431)/Homer
                104 of 440 (1e-7) DLX1(Homeobox)/BasalGanglia-Dlx1-ChIP-seq(GSE124936)/Homer
                105 of 440 (1e-7) E2F1(E2F)/Hela-E2F1-ChIP-Seq(GSE22478)/Homer
                106 of 440 (1e-7) CHR(?)/Hela-CellCycle-Expression/Homer
                107 of 440 (1e-6) Ets1-distal(ETS)/CD4+-PolII-ChIP-Seq(Barski_et_al.)/Homer
                108 of 440 (1e-6) MafF(bZIP)/HepG2-MafF-ChIP-Seq(GSE31477)/Homer
                109 of 440 (1e-6) USF1(bHLH)/GM12878-Usf1-ChIP-Seq(GSE32465)/Homer
                110 of 440 (1e-6) LRF(Zf)/Erythroblasts-ZBTB7A-ChIP-Seq(GSE74977)/Homer
                111 of 440 (1e-6) Foxo3(Forkhead)/U2OS-Foxo3-ChIP-Seq(E-MTAB-2701)/Homer
                112 of 440 (1e-6) SpiB(ETS)/OCILY3-SPIB-ChIP-Seq(GSE56857)/Homer
                113 of 440 (1e-6) p53(p53)/mES-cMyc-ChIP-Seq(GSE11431)/Homer
                114 of 440 (1e-6) Tbox:Smad(T-box,MAD)/ESCd5-Smad2_3-ChIP-Seq(GSE29422)/Homer
                115 of 440 (1e-6) Otx2(Homeobox)/EpiLC-Otx2-ChIP-Seq(GSE56098)/Homer
                116 of 440 (1e-6) MYNN(Zf)/HEK293-MYNN.eGFP-ChIP-Seq(Encode)/Homer
                117 of 440 (1e-6) YY1(Zf)/Promoter/Homer
                118 of 440 (1e-6) SPDEF(ETS)/VCaP-SPDEF-ChIP-Seq(SRA014231)/Homer
                119 of 440 (1e-5) NeuroD1(bHLH)/Islet-NeuroD1-ChIP-Seq(GSE30298)/Homer
                120 of 440 (1e-5) SCL(bHLH)/HPC7-Scl-ChIP-Seq(GSE13511)/Homer
                121 of 440 (1e-5) IRF3(IRF)/BMDM-Irf3-ChIP-Seq(GSE67343)/Homer
                122 of 440 (1e-5) GFY-Staf(?,Zf)/Promoter/Homer
                123 of 440 (1e-5) Foxf1(Forkhead)/Lung-Foxf1-ChIP-Seq(GSE77951)/Homer
                124 of 440 (1e-5) Foxo1(Forkhead)/RAW-Foxo1-ChIP-Seq(Fan_et_al.)/Homer
                125 of 440 (1e-5) CLOCK(bHLH)/Liver-Clock-ChIP-Seq(GSE39860)/Homer
                126 of 440 (1e-5) ETS:E-box(ETS,bHLH)/HPC7-Scl-ChIP-Seq(GSE22178)/Homer
                127 of 440 (1e-5) Atoh1(bHLH)/Cerebellum-Atoh1-ChIP-Seq(GSE22111)/Homer
                128 of 440 (1e-5) Tbr1(T-box)/Cortex-Tbr1-ChIP-Seq(GSE71384)/Homer
                129 of 440 (1e-4) STAT4(Stat)/CD4-Stat4-ChIP-Seq(GSE22104)/Homer
                130 of 440 (1e-4) Sox3(HMG)/NPC-Sox3-ChIP-Seq(GSE33059)/Homer
                131 of 440 (1e-4) c-Myc(bHLH)/LNCAP-cMyc-ChIP-Seq(Unpublished)/Homer
                132 of 440 (1e-4) LHX9(Homeobox)/Hct116-LHX9.V5-ChIP-Seq(GSE116822)/Homer
                133 of 440 (1e-4) ELF3(ETS)/PDAC-ELF3-ChIP-Seq(GSE64557)/Homer
                134 of 440 (1e-4) Stat3+il21(Stat)/CD4-Stat3-ChIP-Seq(GSE19198)/Homer
                135 of 440 (1e-4) DLX5(Homeobox)/BasalGanglia-Dlx5-ChIP-seq(GSE124936)/Homer
                136 of 440 (1e-4) CEBP:AP1(bZIP)/ThioMac-CEBPb-ChIP-Seq(GSE21512)/Homer
                137 of 440 (1e-4) AR-halfsite(NR)/LNCaP-AR-ChIP-Seq(GSE27824)/Homer
                138 of 440 (1e-4) Six1(Homeobox)/Myoblast-Six1-ChIP-Chip(GSE20150)/Homer
                139 of 440 (1e-4) Tcf12(bHLH)/GM12878-Tcf12-ChIP-Seq(GSE32465)/Homer
                140 of 440 (1e-4) CArG(MADS)/PUER-Srf-ChIP-Seq(Sullivan_et_al.)/Homer
                141 of 440 (1e-3) Etv2(ETS)/ES-ER71-ChIP-Seq(GSE59402)/Homer
                142 of 440 (1e-3) Zfp57(Zf)/H1-ZFP57.HA-ChIP-Seq(GSE115387)/Homer
                143 of 440 (1e-3) NFY(CCAAT)/Promoter/Homer
                144 of 440 (1e-3) PU.1:IRF8(ETS:IRF)/pDC-Irf8-ChIP-Seq(GSE66899)/Homer
                145 of 440 (1e-3) NPAS(bHLH)/Liver-NPAS-ChIP-Seq(GSE39860)/Homer
                146 of 440 (1e-3) EWS:ERG-fusion(ETS)/CADO_ES1-EWS:ERG-ChIP-Seq(SRA014231)/Homer
                147 of 440 (1e-3) FOXK1(Forkhead)/HEK293-FOXK1-ChIP-Seq(GSE51673)/Homer
                148 of 440 (1e-3) Zic2(Zf)/ESC-Zic2-ChIP-Seq(SRP197560)/Homer
                149 of 440 (1e-3) MNT(bHLH)/HepG2-MNT-ChIP-Seq(Encode)/Homer
                150 of 440 (1e-3) IRF2(IRF)/Erythroblas-IRF2-ChIP-Seq(GSE36985)/Homer
                151 of 440 (1e-3) Tcf21(bHLH)/ArterySmoothMuscle-Tcf21-ChIP-Seq(GSE61369)/Homer
                152 of 440 (1e-3) BHLHA15(bHLH)/NIH3T3-BHLHB8.HA-ChIP-Seq(GSE119782)/Homer
                153 of 440 (1e-3) Ap4(bHLH)/AML-Tfap4-ChIP-Seq(GSE45738)/Homer
                154 of 440 (1e-3) IRF8(IRF)/BMDM-IRF8-ChIP-Seq(GSE77884)/Homer
                155 of 440 (1e-3) NFAT(RHD)/Jurkat-NFATC1-ChIP-Seq(Jolma_et_al.)/Homer
                156 of 440 (1e-3) Unknown-ESC-element(?)/mES-Nanog-ChIP-Seq(GSE11724)/Homer
                157 of 440 (1e-3) Sox9(HMG)/Limb-SOX9-ChIP-Seq(GSE73225)/Homer
                158 of 440 (1e-3) MyoG(bHLH)/C2C12-MyoG-ChIP-Seq(GSE36024)/Homer
                159 of 440 (1e-3) Six2(Homeobox)/NephronProgenitor-Six2-ChIP-Seq(GSE39837)/Homer
                160 of 440 (1e-3) Lhx2(Homeobox)/HFSC-Lhx2-ChIP-Seq(GSE48068)/Homer
                161 of 440 (1e-3) GSC(Homeobox)/FrogEmbryos-GSC-ChIP-Seq(DRA000576)/Homer
                162 of 440 (1e-3) HNF1b(Homeobox)/PDAC-HNF1B-ChIP-Seq(GSE64557)/Homer
                163 of 440 (1e-3) IRF4(IRF)/GM12878-IRF4-ChIP-Seq(GSE32465)/Homer
                164 of 440 (1e-3) Prop1(Homeobox)/GHFT1-PROP1.biotin-ChIP-Seq(GSE77302)/Homer
                165 of 440 (1e-3) NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer
                166 of 440 (1e-3) BMAL1(bHLH)/Liver-Bmal1-ChIP-Seq(GSE39860)/Homer
                167 of 440 (1e-3) ZNF189(Zf)/HEK293-ZNF189.GFP-ChIP-Seq(GSE58341)/Homer
                168 of 440 (1e-3) EBF1(EBF)/Near-E2A-ChIP-Seq(GSE21512)/Homer
                169 of 440 (1e-3) HIF2a(bHLH)/785_O-HIF2a-ChIP-Seq(GSE34871)/Homer
                170 of 440 (1e-3) E2F(E2F)/Hela-CellCycle-Expression/Homer
                171 of 440 (1e-2) Rfx6(HTH)/Min6b1-Rfx6.HA-ChIP-Seq(GSE62844)/Homer
                172 of 440 (1e-2) EBF2(EBF)/BrownAdipose-EBF2-ChIP-Seq(GSE97114)/Homer
                173 of 440 (1e-2) Zac1(Zf)/Neuro2A-Plagl1-ChIP-Seq(GSE75942)/Homer
                174 of 440 (1e-2) VDR(NR),DR3/GM10855-VDR+vitD-ChIP-Seq(GSE22484)/Homer
                175 of 440 (1e-2) Rbpj1(?)/Panc1-Rbpj1-ChIP-Seq(GSE47459)/Homer
                176 of 440 (1e-2) STAT6(Stat)/Macrophage-Stat6-ChIP-Seq(GSE38377)/Homer
                177 of 440 (1e-2) Ascl1(bHLH)/NeuralTubes-Ascl1-ChIP-Seq(GSE55840)/Homer
                178 of 440 (1e-2) n-Myc(bHLH)/mES-nMyc-ChIP-Seq(GSE11431)/Homer
                179 of 440 (1e-2) MyoD(bHLH)/Myotube-MyoD-ChIP-Seq(GSE21614)/Homer
                180 of 440 (1e-2) FOXA1(Forkhead)/MCF7-FOXA1-ChIP-Seq(GSE26831)/Homer
                181 of 440 (1e-2) Zfp809(Zf)/ES-Zfp809-ChIP-Seq(GSE70799)/Homer
                182 of 440 (1e-2) Sox21(HMG)/ESC-SOX21-ChIP-Seq(GSE110505)/Homer
                183 of 440 (1e-2) CRE(bZIP)/Promoter/Homer
                184 of 440 (1e-2) RUNX1(Runt)/Jurkat-RUNX1-ChIP-Seq(GSE29180)/Homer
                185 of 440 (1e-2) JunD(bZIP)/K562-JunD-ChIP-Seq/Homer
                186 of 440 (1e-2) Foxa3(Forkhead)/Liver-Foxa3-ChIP-Seq(GSE77670)/Homer
                187 of 440 (1e-2) FOXA1(Forkhead)/LNCAP-FOXA1-ChIP-Seq(GSE27824)/Homer
                188 of 440 (1e-2) ZNF711(Zf)/SHSY5Y-ZNF711-ChIP-Seq(GSE20673)/Homer
                189 of 440 (1e-2) TCF4(bHLH)/SHSY5Y-TCF4-ChIP-Seq(GSE96915)/Homer
                190 of 440 (1e-2) PRDM15(Zf)/ESC-Prdm15-ChIP-Seq(GSE73694)/Homer
                191 of 440 (1e-2) Esrrb(NR)/mES-Esrrb-ChIP-Seq(GSE11431)/Homer
                192 of 440 (1e-2) ZNF692(Zf)/HEK293-ZNF692.GFP-ChIP-Seq(GSE58341)/Homer
                193 of 440 (1e-2) PRDM1(Zf)/Hela-PRDM1-ChIP-Seq(GSE31477)/Homer
                194 of 440 (1e-2) Sox10(HMG)/SciaticNerve-Sox3-ChIP-Seq(GSE35132)/Homer
                195 of 440 (1e-2) Zic(Zf)/Cerebellum-ZIC1.2-ChIP-Seq(GSE60731)/Homer
                196 of 440 (1e-2) PRDM14(Zf)/H1-PRDM14-ChIP-Seq(GSE22767)/Homer
                197 of 440 (1e-2) Myf5(bHLH)/GM-Myf5-ChIP-Seq(GSE24852)/Homer
                198 of 440 (1e-2) Barx1(Homeobox)/Stomach-Barx1.3xFlag-ChIP-Seq(GSE69483)/Homer
                199 of 440 (1e-2) IRF1(IRF)/PBMC-IRF1-ChIP-Seq(GSE43036)/Homer
                200 of 440 (1e-2) AP-2gamma(AP2)/MCF7-TFAP2C-ChIP-Seq(GSE21234)/Homer
                201 of 440 (1e-2) bHLHE40(bHLH)/HepG2-BHLHE40-ChIP-Seq(GSE31477)/Homer
                202 of 440 (1e-2) ZBTB18(Zf)/HEK293-ZBTB18.GFP-ChIP-Seq(GSE58341)/Homer
                203 of 440 (1e-2) TR4(NR),DR1/Hela-TR4-ChIP-Seq(GSE24685)/Homer
                204 of 440 (1e-2) Olig2(bHLH)/Neuron-Olig2-ChIP-Seq(GSE30882)/Homer
                205 of 440 (1e-2) HNF4a(NR),DR1/HepG2-HNF4a-ChIP-Seq(GSE25021)/Homer
                206 of 440 (1e-2) Twist2(bHLH)/Myoblast-Twist2.Ty1-ChIP-Seq(GSE127998)/Homer
                207 of 440 (1e-2) FOXM1(Forkhead)/MCF7-FOXM1-ChIP-Seq(GSE72977)/Homer
                208 of 440 (1e-2) RORgt(NR)/EL4-RORgt.Flag-ChIP-Seq(GSE56019)/Homer
                209 of 440 (1e-2) RORgt(NR)/EL4-RORgt.Flag-ChIP-Seq(GSE56019)/Homer
                210 of 440 (1e-2) STAT6(Stat)/CD4-Stat6-ChIP-Seq(GSE22104)/Homer
                211 of 440 (1e-2) RUNX2(Runt)/PCa-RUNX2-ChIP-Seq(GSE33889)/Homer
                212 of 440 (1e-2) Eomes(T-box)/H9-Eomes-ChIP-Seq(GSE26097)/Homer
                213 of 440 (1e-2) Six4(Homeobox)/MCF7-SIX4-ChIP-Seq(Encode)/Homer
                214 of 440 (1e-2) ZNF341(Zf)/EBV-ZNF341-ChIP-Seq(GSE113194)/Homer
                215 of 440 (1e-2) Foxa2(Forkhead)/Liver-Foxa2-ChIP-Seq(GSE25694)/Homer
        Skipping...
        Job finished - if results look good, please send beer to ..

        Cleaning up tmp files...

In the motif analysis result, SP1 motif and many other family members rank top.

drawing

Next we want to identify binding sites that are differentially bound in K562 and Hct-116 cells. This can be challenging as the two samples may have slightly shifted peaks centers at a given genomic region, leading to false positive differential peak calls. To handle this, Pycallingcards first combines the insertions from the two samples and calls peaks on the joint dataset. We do this using bedtools and pybedtools.

[22]:
import pybedtools
peak = cc.rd.combine_qbed([peak_data_HCT116, peak_data_K562])
peak = pybedtools.BedTool.from_dataframe(peak).merge().to_dataframe()
peak_data = peak.rename(columns={"chrom":"Chr", "start":"Start", "end":"End"})
peak_data
[22]:
Chr Start End
0 chr1 29684 30087
1 chr1 36239 38107
2 chr1 198893 201208
3 chr1 203351 207161
4 chr1 265549 266336
... ... ... ...
10445 chrY 15158250 15158653
10446 chrY 16985442 16985845
10447 chrY 19753311 19753714
10448 chrY 21011133 21011828
10449 chrY 56952574 56957328

10450 rows × 3 columns

We can now visualize the peaks called on the joint dataset.

[23]:
cc.pl.draw_area("chr1",999921,1000324,15000,peak_data, HCT116_SP1, "hg38", HCT116_brd4, font_size=2,
                figsize = (30,10),peak_line = 2,save = False,plotsize = [1,1,3], example_length = 1000,
                title = "HCT116_SP1")
cc.pl.draw_area("chr1",999921,1000324,15000,peak_data, K562_SP1, "hg38", K562_brd4, font_size=2,
                figsize = (30,10),peak_line = 2,save = False,plotsize = [1,1,3], example_length = 1000,
                title = "K562_SP1")
../../_images/tutorials_notebooks_K562HCT116_SP1_43_0.png
../../_images/tutorials_notebooks_K562HCT116_SP1_43_1.png
[24]:
cc.pl.draw_area("chr10",3048452,3049913,60000,peak_data, HCT116_SP1, "hg38", HCT116_brd4, font_size=2,
                figsize = (30,14), peak_line = 3,save = False, bins = 200, plotsize = [1,1,5],
                example_length = 1000, title = "HCT116_SP1")
cc.pl.draw_area("chr10",3048452,3049913,60000,peak_data, K562_SP1, "hg38", K562_brd4, font_size=2,
                figsize = (30,14), peak_line = 3,save = False, bins = 200, plotsize = [1,1,5],
                example_length = 1000, title = "K562_SP1")
../../_images/tutorials_notebooks_K562HCT116_SP1_44_0.png
../../_images/tutorials_notebooks_K562HCT116_SP1_44_1.png

The results seem to be good! Congratulations! Now we can annotate the peaks using bedtools.

[25]:
peak_annotation = cc.pp.annotation(peak_data, reference = "hg38")
peak_annotation = cc.pp.combine_annotation(peak_data,peak_annotation)
peak_annotation
In the bedtools method, we would use bedtools in the default path. Set bedtools path by 'bedtools_path' if needed.
[25]:
Chr Start End Nearest Refseq1 Gene Name1 Direction1 Distance1 Nearest Refseq2 Gene Name2 Direction2 Distance2
0 chr1 29684 30087 NR_036051 MIR1302-2 + 279 NR_024540 WASH7P - -315
1 chr1 36239 38107 NR_026818 FAM138A - -159 NR_036051 MIR1302-2 + -5737
2 chr1 198893 201208 NR_026823 FAM138D - 3921 NR_107063 MIR6859-3 - -10936
3 chr1 203351 207161 NR_026823 FAM138D - 0 NR_107063 MIR6859-3 - -15394
4 chr1 265549 266336 NR_026823 FAM138D - -58953 NR_107063 MIR6859-3 - -77592
... ... ... ... ... ... ... ... ... ... ... ...
10445 chrY 15158250 15158653 NM_001206850 NLGN4Y + -314283 NR_046504 NLGN4Y-AS1 - -354218
10446 chrY 16985442 16985845 NR_028083 FAM41AY1 + 515113 NR_002160 FAM224B - 589316
10447 chrY 19753311 19753714 NM_001146706 KDM5D - -8373 NR_045128 TXLNGY + -146142
10448 chrY 21011133 21011828 NM_001039567 RPS4Y2 + -230102 NM_001282471 PRORY - 371146
10449 chrY 56952574 56957328 NM_005840 SPRY3 + 0 NM_001145149 VAMP7 + 110472

10450 rows × 11 columns

Combine the two experiment qbed files to make anndata object.

[26]:
exp_qbed = pd.concat([K562_SP1,HCT116_SP1])
exp_qbed
[26]:
Chr Start End Reads Direction Barcodes
0 chr1 16529 16533 163 - GCTCCTAAGTACGTTC-1
1 chr1 29884 29888 10 + CTCACACCAGACGCTC-1
2 chr1 29884 29888 155 + TGGCCAGCACCCATTC-1
3 chr1 29884 29888 285 + GTGGGTCCACGGCCAT-1
4 chr1 29884 29888 7 + CGTCTACTCAACACGT-1
... ... ... ... ... ... ...
77205 chrY 25518788 25518792 2 + TGGGCGTTCGAACGGA-1
77206 chrY 56987633 56987637 13 + CAGTCCTAGGCACATG-1
77207 chrY 57080855 57080859 17 + CGGAGCTCATCGACGC-1
77208 chrY 57080855 57080859 7 + GTAACGTAGTTACGGG-1
77209 chrY 57080855 57080859 9 + TCAGCAAGTTGAACTC-1

404675 rows × 6 columns

Read the barcode file.

[27]:
barcodes = cc.datasets.SP1_K562HCT116_data(data = "barcodes")
barcodes = barcodes.drop_duplicates(subset=['Index'])
barcodes
[27]:
Index cluster
0 AAACCTGAGAAAGTGG-1 HCT116
1 AAACCTGAGACCGGAT-1 K562
2 AAACCTGAGACTAGAT-1 HCT116
3 AAACCTGAGAGCTTCT-1 HCT116
4 AAACCTGAGAGTACCG-1 HCT116
... ... ...
52206 TTTGTCATCTCCGGTT-1 K562
52207 TTTGTCATCTCGATGA-1 K562
52208 TTTGTCATCTCTAAGG-1 K562
52209 TTTGTCATCTGGAGCC-1 HCT116
52210 TTTGTCATCTTGGGTA-1 HCT116

51079 rows × 2 columns

Now we will connect the peaks (and insertions under the peaks) to the cell barcode data. To do so, we will use the qbed data, peak data and barcodes data to make a cell by peak anndata object.

[28]:
adata_cc = cc.pp.make_Anndata(exp_qbed, peak_annotation, barcodes)
adata_cc
100%|██████████| 24/24 [00:02<00:00, 10.77it/s]
[28]:
AnnData object with n_obs × n_vars = 51079 × 10450
    obs: 'cluster'
    var: 'Chr', 'Start', 'End', 'Nearest Refseq1', 'Gene Name1', 'Direction1', 'Distance1', 'Nearest Refseq2', 'Gene Name2', 'Direction2', 'Distance2'

Although one peak should have many insertions, there is a chance that all the cells from the peak were filtered by the RNA preprocesssing. In this case, we advise to filter the peaks. Additionally, we also recommend to filter cells that have very few insertions.

[29]:
cc.pp.filter_peaks(adata_cc, min_counts = 5)
cc.pp.filter_peaks(adata_cc, min_cells = 5)
adata_cc
[29]:
AnnData object with n_obs × n_vars = 51079 × 10448
    obs: 'cluster'
    var: 'Chr', 'Start', 'End', 'Nearest Refseq1', 'Gene Name1', 'Direction1', 'Distance1', 'Nearest Refseq2', 'Gene Name2', 'Direction2', 'Distance2', 'n_counts', 'n_cells'

Next we can perform differential peak analysis to determine which peaks are cell type specific. In this example, we use the fisher exact test to find peaks enriched in K562 versus Hct116 cells.

[30]:
cc.tl.rank_peak_groups(adata_cc, "cluster", method = 'fisher_exact', key_added = 'fisher_exact')
100%|██████████| 2/2 [00:55<00:00, 27.99s/it]

We can plot the results for differential peak analysis.

[31]:
cc.pl.rank_peak_groups(adata_cc, key = 'fisher_exact')
../../_images/tutorials_notebooks_K562HCT116_SP1_58_0.png

Now let’s visualize some peaks that are differentially bound. The colored ones are the insertions for the cluster of interest (i.e. cell type) and the grey ones are insertions in the rest of the clusters. In this case there are only two clusters, HCT116 and K562. We observe large differences in Sp1 binding in HCT116 and K562 cells.

[32]:
bg_qbed = pd.concat([K562_brd4, HCT116_brd4])
bg_qbed
[32]:
Chr Start End Reads Direction Barcodes
0 chr1 30238 30242 3 + TTTACTGCATAAAGGT-1
1 chr1 30355 30359 2 - ATCACGAAGAGTAATC-1
2 chr1 30355 30359 70 + TTGAACGCAAATCCGT-1
3 chr1 31101 31105 2 + CCTCAGTCATCAGTAC-1
4 chr1 32116 32120 5 + CTAGTGAAGACAAAGG-1
... ... ... ... ... ... ...
37769 chrY 18037315 18037319 9 - GCAGTTAAGATCTGAA-1
37770 chrY 24036504 24036508 168 + GCAGTTAAGATCTGAA-1
37771 chrY 24036504 24036508 508 + CATATGGCAGCCAGAA-1
37772 chrY 25633622 25633626 13 - GCAGTTAAGATCTGAA-1
37773 chrY 25633622 25633626 32 - CATATGGCAGCCAGAA-1

145159 rows × 6 columns

In the tracks above, we see a strong peak on Chr 15 in HCT116 cell (purple) that is not present in K562 cells (red)

[33]:
cc.pl.draw_area("chr7", 143706539, 143718962, 100000, peak_data, exp_qbed, "hg38", adata = adata_cc,
                bins = 250, font_size=2, name = "K562", key = 'cluster', figsize = (30,13),
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'K562 Insertions', name_density1 = 'K562 Insertion Density',
                peak_line = 4, color = "red", plotsize = [1,1,5], title = "chr7_143706539_143718962_K562")
cc.pl.draw_area("chr7",143706539,143718962,100000,peak_data,exp_qbed,"hg38",adata = adata_cc,
                bins = 250, font_size=2, name = "HCT116", key ='cluster', figsize = (30,13),
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'HCT116 Insertions', name_density1 = 'HCT116 Insertion Density',
                peak_line = 4, color = "purple", plotsize = [1,1,5], title = "chr7_143706539_143718962_HCT116")
../../_images/tutorials_notebooks_K562HCT116_SP1_62_0.png
../../_images/tutorials_notebooks_K562HCT116_SP1_62_1.png

In the tracks above we see a number of Sp1 peaks on chr7 that are tightly bound in K562, but not in HCT116.

[34]:
cc.pl.draw_area("chr8", 17799370, 17802353, 100000, peak_data, exp_qbed, "hg38", adata = adata_cc,
                bins = 250, font_size=2, name = "K562", key = 'cluster', figsize = (30,13), peak_line = 3,
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'K562 Insertions', name_density1 = 'K562 Insertion Density',
                color = "red", plotsize = [1,1,6], title = "chr19_9818643_9820060")
cc.pl.draw_area("chr8", 17799370, 17802353, 100000, peak_data, exp_qbed, "hg38", adata = adata_cc,
                bins = 250, font_size=2, name = "HCT116",key ='cluster',figsize = (30,13),peak_line = 3,
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'HCT116 Insertions', name_density1 = 'HCT116 Insertion Density',
                color = "purple", plotsize = [1,1,6], title = "chr19_9818643_9820060")
../../_images/tutorials_notebooks_K562HCT116_SP1_64_0.png
../../_images/tutorials_notebooks_K562HCT116_SP1_64_1.png

Here we find peaks on Chr8 that are bound in HCT116 but not in K562 cells.

Saved the file if needed.

[35]:
adata_cc.write("SP1_qbed.h5ad")