Tutorial: K562 HCT116 SP1 single-cell calling cards data.¶

In this tutorial, we will analyze the binding of the transcription factor Sp1 in K562 and HCT116 cell lines. These data are generated using single-cell calling cards (scCCs). The data is from Moudgil et al., Cell. (2020) and can be downloaded from GEO.

We will cover how to call TF peaks using a background file, annotate these peaks, compare them with Chip-seq reference data, and perform a differential peak analysis.

[1]:

import pycallingcards as cc
import numpy as np
import pandas as pd
import scanpy as sc
from matplotlib import pyplot as plt
plt.rcParams['figure.dpi'] = 150

We start by reading the qbed datafile. In this file, each row represents a Sp1-directed insertion and columns indicate the chromosome, start point and end point, reads number, the direction and cell barcode of each insertion. For example, the first row tells us the first insertion is on Chromosome 1 in a TTAA site located at genomic coordinates 30116 to 30120. There are 12 reads supporting this insertion, which maps to the negative strand of the genome. The cell barcode is CCCAATCCATCGGTTA-1. Note that the cell barcode allows us to connect this insertino to scRNA-seq data collected from the same cell.

Use cc.rd.read_qbed(filename) to read your own ccf data.

[2]:

# read in experiement data
HCT116_SP1 = cc.datasets.SP1_K562HCT116_data(data = "HCT116_SP1_qbed")
HCT116_SP1

[2]:

	Chr	Start	End	Reads	Direction	Barcodes
0	chr1	30116	30120	5	-	CCCAATCCATCGGTTA-1
1	chr1	34568	34572	3	-	CCTTCGAAGGGCTTCC-1
2	chr1	36736	36740	29	+	ACGAGCCGTATAGGTA-1
3	chr1	42447	42451	3	-	CTCTACGTCGGAGCAA-1
4	chr1	89697	89701	119	-	AGCTCTCGTTTGTTTC-1
...	...	...	...	...	...	...
77205	chrY	25518788	25518792	2	+	TGGGCGTTCGAACGGA-1
77206	chrY	56987633	56987637	13	+	CAGTCCTAGGCACATG-1
77207	chrY	57080855	57080859	17	+	CGGAGCTCATCGACGC-1
77208	chrY	57080855	57080859	7	+	GTAACGTAGTTACGGG-1
77209	chrY	57080855	57080859	9	+	TCAGCAAGTTGAACTC-1

77210 rows × 6 columns

[3]:

# read in backgound data
HCT116_brd4 = cc.datasets.SP1_K562HCT116_data(data="HCT116_brd4_qbed")
HCT116_brd4

[3]:

	Chr	Start	End	Reads	Direction	Barcodes
0	chr1	89697	89701	14	+	TCTGAGACAATGGTCT-1
1	chr1	89697	89701	8	+	CAGCGACCAAATACAG-1
2	chr1	203932	203936	99	+	TTCTCCTTCTACTTAC-1
3	chr1	204063	204067	5	-	TGTTCCGGTGTAAGTA-1
4	chr1	204063	204067	7	-	CAAGATCTCGACCAGC-1
...	...	...	...	...	...	...
37769	chrY	18037315	18037319	9	-	GCAGTTAAGATCTGAA-1
37770	chrY	24036504	24036508	168	+	GCAGTTAAGATCTGAA-1
37771	chrY	24036504	24036508	508	+	CATATGGCAGCCAGAA-1
37772	chrY	25633622	25633626	13	-	GCAGTTAAGATCTGAA-1
37773	chrY	25633622	25633626	32	-	CATATGGCAGCCAGAA-1

37774 rows × 6 columns

We first need to call peaks in order to find candidate SP1 binding sites. There are three different methods (CCcaller, MACCs, Blockify) available in Pycallingcards. Here, we will use MACCs to call peaks. The appropriate reference genome for these data is human(‘hg38’). The window_size parameter is the most important parameter for MACCs, it is highly related to the length of a peak. A value of 1000-2000 is recommended for most sequence-specific TFs. step_size is another important parameter and it controls whether two nearby clusters of insertions are called as one peak or split into two peaks. 500-800 is good for step_size. pvalue_cutoffTTAA is the pvalue cutoff for TTAA data and pvalue_cutoffbg is the pvalue cutoff for the background ccf data. Normally, the setting for pvalue_cutoffbg is considerably higher than pvalue_cutoffTTAA. pvalue_cutoffbg is recommended to be 0.1 and pvalue_cutoffTTAA is recommended from 0.001 to 0.05. pseudocounts is advised to be 0.1-1.

[4]:

peak_data_HCT116 = cc.pp.call_peaks(HCT116_SP1, HCT116_brd4, method = "MACCs", reference = "hg38",  window_size = 2000, step_size = 500,
                  pvalue_cutoffTTAA = 0.001, pvalue_cutoffbg = 0.1, lam_win_size = None,  pseudocounts = 0.1, record = True, save = "peak_HCT116.bed")
peak_data_HCT116

For the MACCs method with background, [expdata, background, reference, pvalue_cutoffbg, pvalue_cutoffTTAA, lam_win_size, window_size, step_size, extend, pseudocounts, test_method, min_insertions, record] would be utilized.

100%|██████████| 25/25 [00:43<00:00,  1.75s/it]

[4]:

	Chr	Start	End	Center	Experiment Insertions	Background insertions	Reference Insertions	pvalue Reference	pvalue Background	Fraction Experiment	TPH Experiment	Fraction background	TPH background	TPH background subtracted	pvalue_adj Reference
0	chr1	906689	907160	906957.0	5	0	3	3.099334e-09	1.546531e-04	0.000065	6475.845098	0.000000	0.000000	6475.845098	3.371990e-06
1	chr1	999921	1000324	1000121.0	20	0	1	0.000000e+00	0.000000e+00	0.000259	25903.380391	0.000000	0.000000	25903.380391	0.000000e+00
2	chr1	1156947	1157863	1157660.0	11	0	2	0.000000e+00	1.274899e-09	0.000142	14246.859215	0.000000	0.000000	14246.859215	0.000000e+00
3	chr1	1692740	1693542	1693339.0	6	0	3	5.135270e-11	1.546531e-04	0.000078	7771.014117	0.000000	0.000000	7771.014117	7.604315e-08
4	chr1	1744492	1746808	1746605.0	11	0	7	0.000000e+00	1.274899e-09	0.000142	14246.859215	0.000000	0.000000	14246.859215	0.000000e+00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3003	chrY	1245503	1247177	1245703.0	5	0	7	1.652897e-09	4.678840e-03	0.000065	6475.845098	0.000000	0.000000	6475.845098	2.056405e-06
3004	chrY	1280372	1281989	1281786.0	5	0	3	1.426995e-09	4.678840e-03	0.000065	6475.845098	0.000000	0.000000	6475.845098	1.828173e-06
3005	chrY	1586317	1587733	1587530.0	7	0	8	3.370637e-13	1.546531e-04	0.000091	9066.183137	0.000000	0.000000	9066.183137	6.766326e-10
3006	chrY	2391936	2392440	2392237.0	6	1	2	1.985434e-11	9.958372e-02	0.000078	7771.014117	0.000026	2647.323556	5123.690561	3.329699e-08
3007	chrY	2608793	2610054	2608993.0	8	0	2	2.775558e-15	1.546531e-04	0.000104	10361.352156	0.000000	0.000000	10361.352156	6.730626e-12

3008 rows × 15 columns

In order to tune parameters for peak calling, we advise looking at the data and evaluating the validity of the called peaks. The default settings are recommended, but for some TFs, adjacent peaks may be merged that should not be, or, alternatively, peaks that should be joined may be called separately.

Below, we plot the scCC data in HCT116 cells for a region in chromosome 1. The top track displays the locations of Sp1-directed (red) and background (grey) transpositions and their read counts. Each dot represents an insertion and the height is log(reads+1). The middle track plots the insertion density. The third track represents the reference genes and peaks. Finally, the last track represents peak calls. Below you can see that regions with high densities of insertions are accurately called as Sp1 binding sites.

[5]:

cc.pl.draw_area("chr1",999921,1000324,20000,peak_data_HCT116, HCT116_SP1, "hg38", HCT116_brd4, font_size=2,
                figsize = (30,10), peak_line = 1, save = False, plotsize = [1,1,4], example_length = 1000)

../../_images/tutorials_notebooks_K562HCT116_SP1_9_0.png

We can also visualize our data directly in the WashU Epigenome Browser. This can be useful for overlaying your data with other published datasets. Notice that this link only valid for 24hrs, so please rerun it if you want to use it.

[6]:

qbed = {"SP1":HCT116_SP1, "Brd4": HCT116_brd4}
bed = {"peak":peak_data_HCT116}
cc.pl.WashU_browser_url(qbed = qbed,bed = bed,genome = 'hg38')

All qbed addressed
All bed addressed
Uploading files
Please click the following link to see the data on WashU Epigenome Browser directly.
https://epigenomegateway.wustl.edu/browser/?genome=hg38&hub=https://companion.epigenomegateway.org//task/cb339e5b2d8c774345b7cf372cf4f97c/output//datahub.json

Pycallingcards can be used to visual peak locations acorss the genome to see that the distribution of peaks is unbiased and that all chromosomes are represented.

[7]:

cc.pl.whole_peaks(peak_data_HCT116, reference = "hg38", figsize=(100, 70), height_scale = 1.7)

../../_images/tutorials_notebooks_K562HCT116_SP1_13_0.png

We can then analyze the scCC Sp1 peaks to see if there is an enrichment of ChIP-seq signal at these locations using a reference Chip-seq dataset of SP1 binding in HCT116 from ENCSR000BSF (use the bigWig file ENCFF587ZMX generated by it).

Download the data, if needed, with:

!wget https://www.encodeproject.org/files/ENCFF587ZMX/@@download/ENCFF587ZMX.bigWig

We first calculate the signal of the Chip-seq signal around the peak.

[8]:

mtx_HCT116 = cc.pl.calculate_signal(peak_data_HCT116,
                                    chipseq_signal = "ENCFF587ZMX.bigWig")

100%|██████████| 3008/3008 [00:14<00:00, 200.91it/s]

Visualize it by the plotting the signal values.

[9]:

cc.pl.signal_plot(mtx_HCT116, alpha = 0.05, figsize=(6, 4))

../../_images/tutorials_notebooks_K562HCT116_SP1_18_0.png

Visualized by the plotting the signal heatmap plot.

[10]:

cc.pl.signal_heatmap(mtx_HCT116,pad = 0.035)

../../_images/tutorials_notebooks_K562HCT116_SP1_20_0.png

We can now use HOMER to call motifs. We hope to find the canonical Sp1 motif enriched under the call peaks.

[11]:

cc.tl.call_motif("peak_HCT116.bed",reference ="hg38",save_homer = "Homer/peak_HCT116",
                 homer_path = "/home/juanru/miniconda3/bin/", num_cores=12)

Peak data peak_HCT116.bed is used here.


        Position file = peak_HCT116.bed
        Genome = hg38
        Output Directory = Homer/peak_HCT116
        Fragment size set to 1000
        Using 12 CPUs
        Will not run homer for de novo motifs
        Found mset for "human", will check against vertebrates motifs
        Peak/BED file conversion summary:
                BED/Header formatted lines: 3008
                peakfile formatted lines: 0

        Peak File Statistics:
                Total Peaks: 3008
                Redundant Peak IDs: 0
                Peaks lacking information: 0 (need at least 5 columns per peak)
                Peaks with misformatted coordinates: 0 (should be integer)
                Peaks with misformatted strand: 0 (should be either +/- or 0/1)

        Peak file looks good!

        Background files for 1000 bp fragments found.
        Custom genome sequence directory: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38//

        Extracting sequences from file: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa
        Looking for peak sequences in a single file (/home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa)
        Extracting 334 sequences from chr1
        Extracting 147 sequences from chr10
        Extracting 132 sequences from chr11
        Extracting 179 sequences from chr12
        Extracting 56 sequences from chr13
        Extracting 85 sequences from chr14
        Extracting 83 sequences from chr15
        Extracting 127 sequences from chr16
        Extracting 205 sequences from chr17
        Extracting 54 sequences from chr18
        Extracting 249 sequences from chr19
        Extracting 194 sequences from chr2
        Extracting 71 sequences from chr20
        Extracting 37 sequences from chr21
        Extracting 63 sequences from chr22
        Extracting 143 sequences from chr3
        Extracting 146 sequences from chr4
        Extracting 96 sequences from chr5
        Extracting 159 sequences from chr6
        Extracting 143 sequences from chr7
        Extracting 117 sequences from chr8
        Extracting 136 sequences from chr9
        Extracting 47 sequences from chrX
        Extracting 5 sequences from chrY

        Not removing redundant sequences


        Sequences processed:
                Auto detected maximum sequence length of 1001 bp
                3008 total

        Frequency Bins: 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.6 0.7 0.8
        Freq    Bin     Count
        0.3     2       1
        0.35    3       46
        0.4     4       183
        0.45    5       336
        0.5     6       476
        0.6     7       1116
        0.7     8       706
        0.8     9       142
        10      10      2

        Total sequences set to 50000

        Choosing background that matches in CpG/GC content...
        Bin     # Targets       # Background    Background Weight
        2       1       16      0.976
        3       46      719     0.999
        4       183     2859    1.000
        5       336     5249    1.000
        6       476     7436    1.000
        7       1116    17435   1.000
        8       706     11029   1.000
        9       142     2218    1.000
        10      2       31      1.008
        Assembling sequence file...
        Normalizing lower order oligos using homer2

        Reading input files...
        50000 total sequences read
        Autonormalization: 1-mers (4 total)
                A       22.98%  23.24%  0.989
                C       27.02%  26.76%  1.010
                G       27.02%  26.76%  1.010
                T       22.98%  23.24%  0.989
        Autonormalization: 2-mers (16 total)
                AA      6.71%   6.20%   1.082
                CA      6.45%   7.36%   0.876
                GA      5.97%   6.11%   0.977
                TA      3.85%   3.56%   1.081
                AC      4.73%   4.90%   0.965
                CC      8.95%   8.75%   1.023
                GC      7.37%   7.01%   1.051
                TC      5.97%   6.11%   0.977
                AG      7.19%   7.61%   0.944
                CG      4.43%   3.06%   1.451
                GG      8.95%   8.75%   1.023
                TG      6.45%   7.36%   0.876
                AT      4.36%   4.52%   0.964
                CT      7.19%   7.61%   0.944
                GT      4.73%   4.90%   0.965
                TT      6.71%   6.20%   1.082
        Autonormalization: 3-mers (64 total)
        Normalization weights can be found in file: Homer/peak_HCT116/seq.autonorm.tsv
        Converging on autonormalization solution:
        ...............................................................................
        Final normalization:    Autonormalization: 1-mers (4 total)
                A       22.98%  22.93%  1.002
                C       27.02%  27.07%  0.998
                G       27.02%  27.07%  0.998
                T       22.98%  22.93%  1.002
        Autonormalization: 2-mers (16 total)
                AA      6.71%   6.50%   1.032
                CA      6.45%   6.69%   0.965
                GA      5.97%   5.97%   1.000
                TA      3.85%   3.78%   1.020
                AC      4.73%   4.71%   1.003
                CC      8.95%   8.89%   1.007
                GC      7.37%   7.50%   0.983
                TC      5.97%   5.97%   1.000
                AG      7.19%   7.21%   0.997
                CG      4.43%   4.29%   1.035
                GG      8.95%   8.89%   1.007
                TG      6.45%   6.69%   0.965
                AT      4.36%   4.51%   0.966
                CT      7.19%   7.21%   0.997
                GT      4.73%   4.71%   1.003
                TT      6.71%   6.50%   1.032
        Autonormalization: 3-mers (64 total)
        Finished preparing sequence/group files

        ----------------------------------------------------------
        Known motif enrichment

        Reading input files...
        50000 total sequences read
        440 motifs loaded
        Cache length = 11180
        Using binomial scoring
        Checking enrichment of 440 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================

Finished!


        Preparing HTML output with sequence logos...
                1 of 440 (1e-105) Sp5(Zf)/mES-Sp5.Flag-ChIP-Seq(GSE72989)/Homer
                2 of 440 (1e-97) KLF3(Zf)/MEF-Klf3-ChIP-Seq(GSE44748)/Homer
                3 of 440 (1e-92) Sp1(Zf)/Promoter/Homer
                4 of 440 (1e-87) KLF1(Zf)/HUDEP2-KLF1-CutnRun(GSE136251)/Homer
                5 of 440 (1e-78) Sp2(Zf)/HEK293-Sp2.eGFP-ChIP-Seq(Encode)/Homer
                6 of 440 (1e-77) Klf9(Zf)/GBM-Klf9-ChIP-Seq(GSE62211)/Homer
                7 of 440 (1e-69) KLF6(Zf)/PDAC-KLF6-ChIP-Seq(GSE64557)/Homer
                8 of 440 (1e-68) KLF5(Zf)/LoVo-KLF5-ChIP-Seq(GSE49402)/Homer
                9 of 440 (1e-65) KLF14(Zf)/HEK293-KLF14.GFP-ChIP-Seq(GSE58341)/Homer
                10 of 440 (1e-62) Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)/Homer
                11 of 440 (1e-58) Klf4(Zf)/mES-Klf4-ChIP-Seq(GSE11431)/Homer
                12 of 440 (1e-51) Jun-AP1(bZIP)/K562-cJun-ChIP-Seq(GSE31477)/Homer
                13 of 440 (1e-50) Fosl2(bZIP)/3T3L1-Fosl2-ChIP-Seq(GSE56872)/Homer
                14 of 440 (1e-43) TATA-Box(TBP)/Promoter/Homer
                15 of 440 (1e-42) Fra2(bZIP)/Striatum-Fra2-ChIP-Seq(GSE43429)/Homer
                16 of 440 (1e-41) Fra1(bZIP)/BT549-Fra1-ChIP-Seq(GSE46166)/Homer
                17 of 440 (1e-41) Fos(bZIP)/TSC-Fos-ChIP-Seq(GSE110950)/Homer
                18 of 440 (1e-41) JunB(bZIP)/DendriticCells-Junb-ChIP-Seq(GSE36099)/Homer
                19 of 440 (1e-37) Atf3(bZIP)/GBM-ATF3-ChIP-Seq(GSE33912)/Homer
                20 of 440 (1e-35) KLF10(Zf)/HEK293-KLF10.GFP-ChIP-Seq(GSE58341)/Homer
                21 of 440 (1e-30) BATF(bZIP)/Th17-BATF-ChIP-Seq(GSE39756)/Homer
                22 of 440 (1e-28) AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
                23 of 440 (1e-25) Bach2(bZIP)/OCILy7-Bach2-ChIP-Seq(GSE44420)/Homer
                24 of 440 (1e-22) Pitx1(Homeobox)/Chicken-Pitx1-ChIP-Seq(GSE38910)/Homer
                25 of 440 (1e-20) Elk4(ETS)/Hela-Elk4-ChIP-Seq(GSE31477)/Homer
                26 of 440 (1e-19) E2F4(E2F)/K562-E2F4-ChIP-Seq(GSE31477)/Homer
                27 of 440 (1e-19) Hoxd11(Homeobox)/ChickenMSG-Hoxd11.Flag-ChIP-Seq(GSE86088)/Homer
                28 of 440 (1e-18) Nkx6.1(Homeobox)/Islet-Nkx6.1-ChIP-Seq(GSE40975)/Homer
                29 of 440 (1e-18) Hoxa13(Homeobox)/ChickenMSG-Hoxa13.Flag-ChIP-Seq(GSE86088)/Homer
                30 of 440 (1e-17) WT1(Zf)/Kidney-WT1-ChIP-Seq(GSE90016)/Homer
                31 of 440 (1e-16) Fli1(ETS)/CD8-FLI-ChIP-Seq(GSE20898)/Homer
                32 of 440 (1e-16) Hoxa11(Homeobox)/ChickenMSG-Hoxa11.Flag-ChIP-Seq(GSE86088)/Homer
                33 of 440 (1e-16) Isl1(Homeobox)/Neuron-Isl1-ChIP-Seq(GSE31456)/Homer
                34 of 440 (1e-15) Elk1(ETS)/Hela-Elk1-ChIP-Seq(GSE31477)/Homer
                35 of 440 (1e-15) Egr1(Zf)/K562-Egr1-ChIP-Seq(GSE32465)/Homer
                36 of 440 (1e-15) E2F6(E2F)/Hela-E2F6-ChIP-Seq(GSE31477)/Homer
                37 of 440 (1e-15) ETV4(ETS)/HepG2-ETV4-ChIP-Seq(ENCODE)/Homer
                38 of 440 (1e-15) EKLF(Zf)/Erythrocyte-Klf1-ChIP-Seq(GSE20478)/Homer
                39 of 440 (1e-14) Egr2(Zf)/Thymocytes-Egr2-ChIP-Seq(GSE34254)/Homer
                40 of 440 (1e-13) ZNF467(Zf)/HEK293-ZNF467.GFP-ChIP-Seq(GSE58341)/Homer
                41 of 440 (1e-12) E2F3(E2F)/MEF-E2F3-ChIP-Seq(GSE71376)/Homer
                42 of 440 (1e-11) ELF1(ETS)/Jurkat-ELF1-ChIP-Seq(SRA014231)/Homer
                43 of 440 (1e-11) Nrf2(bZIP)/Lymphoblast-Nrf2-ChIP-Seq(GSE37589)/Homer
                44 of 440 (1e-11) Unknown(Homeobox)/Limb-p300-ChIP-Seq/Homer
                45 of 440 (1e-10) Lhx1(Homeobox)/EmbryoCarcinoma-Lhx1-ChIP-Seq(GSE70957)/Homer
                46 of 440 (1e-10) E2F1(E2F)/Hela-E2F1-ChIP-Seq(GSE22478)/Homer
                47 of 440 (1e-10) En1(Homeobox)/SUM149-EN1-ChIP-Seq(GSE120957)/Homer
                48 of 440 (1e-9) Lhx3(Homeobox)/Neuron-Lhx3-ChIP-Seq(GSE31456)/Homer
                49 of 440 (1e-9) CRX(Homeobox)/Retina-Crx-ChIP-Seq(GSE20012)/Homer
                50 of 440 (1e-8) Zfp281(Zf)/ES-Zfp281-ChIP-Seq(GSE81042)/Homer
                51 of 440 (1e-8) GABPA(ETS)/Jurkat-GABPa-ChIP-Seq(GSE17954)/Homer
                52 of 440 (1e-8) NF-E2(bZIP)/K562-NFE2-ChIP-Seq(GSE31477)/Homer
                53 of 440 (1e-8) Hoxd13(Homeobox)/ChickenMSG-Hoxd13.Flag-ChIP-Seq(GSE86088)/Homer
                54 of 440 (1e-8) HIF-1b(HLH)/T47D-HIF1b-ChIP-Seq(GSE59937)/Homer
                55 of 440 (1e-8) ETV1(ETS)/GIST48-ETV1-ChIP-Seq(GSE22441)/Homer
                56 of 440 (1e-7) LHX9(Homeobox)/Hct116-LHX9.V5-ChIP-Seq(GSE116822)/Homer
                57 of 440 (1e-7) DLX2(Homeobox)/BasalGanglia-Dlx2-ChIP-seq(GSE124936)/Homer
                58 of 440 (1e-7) Bach1(bZIP)/K562-Bach1-ChIP-Seq(GSE31477)/Homer
                59 of 440 (1e-7) DLX1(Homeobox)/BasalGanglia-Dlx1-ChIP-seq(GSE124936)/Homer
                60 of 440 (1e-7) EHF(ETS)/LoVo-EHF-ChIP-Seq(GSE49402)/Homer
                61 of 440 (1e-7) BMYB(HTH)/Hela-BMYB-ChIP-Seq(GSE27030)/Homer
                62 of 440 (1e-7) NFE2L2(bZIP)/HepG2-NFE2L2-ChIP-Seq(Encode)/Homer
                63 of 440 (1e-7) Stat3(Stat)/mES-Stat3-ChIP-Seq(GSE11431)/Homer
                64 of 440 (1e-6) Hoxa9(Homeobox)/ChickenMSG-Hoxa9.Flag-ChIP-Seq(GSE86088)/Homer
                65 of 440 (1e-6) MYB(HTH)/ERMYB-Myb-ChIPSeq(GSE22095)/Homer
                66 of 440 (1e-6) Znf263(Zf)/K562-Znf263-ChIP-Seq(GSE31477)/Homer
                67 of 440 (1e-6) HNF1b(Homeobox)/PDAC-HNF1B-ChIP-Seq(GSE64557)/Homer
                68 of 440 (1e-6) CHR(?)/Hela-CellCycle-Expression/Homer
                69 of 440 (1e-6) CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer
                70 of 440 (1e-6) ETS(ETS)/Promoter/Homer
                71 of 440 (1e-6) ETS1(ETS)/Jurkat-ETS1-ChIP-Seq(GSE17954)/Homer
                72 of 440 (1e-6) ZNF652/HepG2-ZNF652.Flag-ChIP-Seq(Encode)/Homer
                73 of 440 (1e-5) Nanog(Homeobox)/mES-Nanog-ChIP-Seq(GSE11724)/Homer
                74 of 440 (1e-5) Lhx2(Homeobox)/HFSC-Lhx2-ChIP-Seq(GSE48068)/Homer
                75 of 440 (1e-5) Elf4(ETS)/BMDM-Elf4-ChIP-Seq(GSE88699)/Homer
                76 of 440 (1e-5) NFY(CCAAT)/Promoter/Homer
                77 of 440 (1e-5) ELF5(ETS)/T47D-ELF5-ChIP-Seq(GSE30407)/Homer
                78 of 440 (1e-5) AP-2gamma(AP2)/MCF7-TFAP2C-ChIP-Seq(GSE21234)/Homer
                79 of 440 (1e-5) ERG(ETS)/VCaP-ERG-ChIP-Seq(GSE14097)/Homer
                80 of 440 (1e-5) EWS:ERG-fusion(ETS)/CADO_ES1-EWS:ERG-ChIP-Seq(SRA014231)/Homer
                81 of 440 (1e-5) AMYB(HTH)/Testes-AMYB-ChIP-Seq(GSE44588)/Homer
                82 of 440 (1e-5) ELF3(ETS)/PDAC-ELF3-ChIP-Seq(GSE64557)/Homer
                83 of 440 (1e-5) HOXB13(Homeobox)/ProstateTumor-HOXB13-ChIP-Seq(GSE56288)/Homer
                84 of 440 (1e-5) STAT4(Stat)/CD4-Stat4-ChIP-Seq(GSE22104)/Homer
                85 of 440 (1e-5) MafK(bZIP)/C2C12-MafK-ChIP-Seq(GSE36030)/Homer
                86 of 440 (1e-4) SPDEF(ETS)/VCaP-SPDEF-ChIP-Seq(SRA014231)/Homer
                87 of 440 (1e-4) Ets1-distal(ETS)/CD4+-PolII-ChIP-Seq(Barski_et_al.)/Homer
                88 of 440 (1e-4) DLX5(Homeobox)/BasalGanglia-Dlx5-ChIP-seq(GSE124936)/Homer
                89 of 440 (1e-4) Dlx3(Homeobox)/Kerainocytes-Dlx3-ChIP-Seq(GSE89884)/Homer
                90 of 440 (1e-4) PRDM15(Zf)/ESC-Prdm15-ChIP-Seq(GSE73694)/Homer
                91 of 440 (1e-4) Stat3+il21(Stat)/CD4-Stat3-ChIP-Seq(GSE19198)/Homer
                92 of 440 (1e-4) Etv2(ETS)/ES-ER71-ChIP-Seq(GSE59402)/Homer
                93 of 440 (1e-4) HIF2a(bHLH)/785_O-HIF2a-ChIP-Seq(GSE34871)/Homer
                94 of 440 (1e-4) TEAD1(TEAD)/HepG2-TEAD1-ChIP-Seq(Encode)/Homer
                95 of 440 (1e-3) ZNF189(Zf)/HEK293-ZNF189.GFP-ChIP-Seq(GSE58341)/Homer
                96 of 440 (1e-3) STAT5(Stat)/mCD4+-Stat5-ChIP-Seq(GSE12346)/Homer
                97 of 440 (1e-3) AP-2alpha(AP2)/Hela-AP2alpha-ChIP-Seq(GSE31477)/Homer
                98 of 440 (1e-3) ZFX(Zf)/mES-Zfx-ChIP-Seq(GSE11431)/Homer
                99 of 440 (1e-3) NFkB-p65-Rel(RHD)/ThioMac-LPS-Expression(GSE23622)/Homer
                100 of 440 (1e-3) BORIS(Zf)/K562-CTCFL-ChIP-Seq(GSE32465)/Homer
                101 of 440 (1e-3) E2F7(E2F)/Hela-E2F7-ChIP-Seq(GSE32673)/Homer
                102 of 440 (1e-3) EWS:FLI1-fusion(ETS)/SK_N_MC-EWS:FLI1-ChIP-Seq(SRA014231)/Homer
                103 of 440 (1e-3) c-Myc(bHLH)/LNCAP-cMyc-ChIP-Seq(Unpublished)/Homer
                104 of 440 (1e-3) Otx2(Homeobox)/EpiLC-Otx2-ChIP-Seq(GSE56098)/Homer
                105 of 440 (1e-3) TEAD4(TEA)/Tropoblast-Tead4-ChIP-Seq(GSE37350)/Homer
                106 of 440 (1e-2) Npas4(bHLH)/Neuron-Npas4-ChIP-Seq(GSE127793)/Homer
                107 of 440 (1e-2) CArG(MADS)/PUER-Srf-ChIP-Seq(Sullivan_et_al.)/Homer
                108 of 440 (1e-2) ZNF711(Zf)/SHSY5Y-ZNF711-ChIP-Seq(GSE20673)/Homer
                109 of 440 (1e-2) Prop1(Homeobox)/GHFT1-PROP1.biotin-ChIP-Seq(GSE77302)/Homer
                110 of 440 (1e-2) GSC(Homeobox)/FrogEmbryos-GSC-ChIP-Seq(DRA000576)/Homer
                111 of 440 (1e-2) Hnf1(Homeobox)/Liver-Foxa2-Chip-Seq(GSE25694)/Homer
                112 of 440 (1e-2) TRPS1(Zf)/MCF7-TRPS1-ChIP-Seq(GSE107013)/Homer
                113 of 440 (1e-2) Tbr1(T-box)/Cortex-Tbr1-ChIP-Seq(GSE71384)/Homer
                114 of 440 (1e-2) NFAT(RHD)/Jurkat-NFATC1-ChIP-Seq(Jolma_et_al.)/Homer
                115 of 440 (1e-2) CLOCK(bHLH)/Liver-Clock-ChIP-Seq(GSE39860)/Homer
                116 of 440 (1e-2) STAT1(Stat)/HelaS3-STAT1-ChIP-Seq(GSE12782)/Homer
                117 of 440 (1e-2) Barx1(Homeobox)/Stomach-Barx1.3xFlag-ChIP-Seq(GSE69483)/Homer
                118 of 440 (1e-2) NFkB-p65(RHD)/GM12787-p65-ChIP-Seq(GSE19485)/Homer
                119 of 440 (1e-2) CEBP(bZIP)/ThioMac-CEBPb-ChIP-Seq(GSE21512)/Homer
                120 of 440 (1e-2) Arnt:Ahr(bHLH)/MCF7-Arnt-ChIP-Seq(Lo_et_al.)/Homer
                121 of 440 (1e-2) HINFP(Zf)/K562-HINFP.eGFP-ChIP-Seq(Encode)/Homer
                122 of 440 (1e-2) CRE(bZIP)/Promoter/Homer
                123 of 440 (1e-2) E-box(bHLH)/Promoter/Homer
                124 of 440 (1e-2) Hoxd12(Homeobox)/ChickenMSG-Hoxd12.Flag-ChIP-Seq(GSE86088)/Homer
                125 of 440 (1e-2) TEAD(TEA)/Fibroblast-PU.1-ChIP-Seq(Unpublished)/Homer
                126 of 440 (1e-2) ZSCAN22(Zf)/HEK293-ZSCAN22.GFP-ChIP-Seq(GSE58341)/Homer
                127 of 440 (1e-2) IRF:BATF(IRF:bZIP)/pDC-Irf8-ChIP-Seq(GSE66899)/Homer
                128 of 440 (1e-2) CDX4(Homeobox)/ZebrafishEmbryos-Cdx4.Myc-ChIP-Seq(GSE48254)/Homer
        Skipping...
        Job finished - if results look good, please send beer to ..

        Cleaning up tmp files...

In the motif analysis result, SP1 motif and many other family members rank top.

drawing

Do the exact same thing for K562 SP1 data.

[12]:

# read experiment data
K562_SP1 = cc.datasets.SP1_K562HCT116_data(data="K562_SP1_qbed")
K562_SP1

[12]:

	Chr	Start	End	Reads	Direction	Barcodes
0	chr1	16529	16533	163	-	GCTCCTAAGTACGTTC-1
1	chr1	29884	29888	10	+	CTCACACCAGACGCTC-1
2	chr1	29884	29888	155	+	TGGCCAGCACCCATTC-1
3	chr1	29884	29888	285	+	GTGGGTCCACGGCCAT-1
4	chr1	29884	29888	7	+	CGTCTACTCAACACGT-1
...	...	...	...	...	...	...
327460	chrY	57061562	57061566	6	+	CTCATTATCATCATTC-1
327461	chrY	57061562	57061566	67	+	TGCGTGGCATTAGGCT-1
327462	chrY	57145084	57145088	2	-	ACATACGTCGCGCCAA-1
327463	chrY	57148630	57148634	2	-	TATGCCCGTACAGTTC-1
327464	chrY	57183913	57183917	228	-	AAACCTGGTCCTGCTT-1

327465 rows × 6 columns

[13]:

# read background data
K562_brd4 =cc.datasets.SP1_K562HCT116_data(data="K562_brd4_qbed")
K562_brd4

[13]:

	Chr	Start	End	Reads	Direction	Barcodes
0	chr1	30238	30242	3	+	TTTACTGCATAAAGGT-1
1	chr1	30355	30359	2	-	ATCACGAAGAGTAATC-1
2	chr1	30355	30359	70	+	TTGAACGCAAATCCGT-1
3	chr1	31101	31105	2	+	CCTCAGTCATCAGTAC-1
4	chr1	32116	32120	5	+	CTAGTGAAGACAAAGG-1
...	...	...	...	...	...	...
107380	chrY	57080210	57080214	9	-	AAGGAGCCAGTATAAG-1
107381	chrY	57087785	57087789	24	-	CGAGCCAGTCTCTCTG-1
107382	chrY	57144853	57144857	5	+	GAAGCAGTCCCATTTA-1
107383	chrY	57183772	57183776	2	-	TCTTTCCTCTTGCCGT-1
107384	chrY	57204853	57204857	369	-	ATAACGCAGTTTGCGT-1

107385 rows × 6 columns

[14]:

peak_data_K562 = cc.pp.call_peaks(K562_SP1, K562_brd4, method = "MACCs", reference = "hg38", window_size = 2000, step_size = 500,
                  pvalue_cutoffTTAA = 0.0001,  pvalue_cutoffbg = 0.1, lam_win_size = None,  pseudocounts = 0.1, record = True, save = "peak_k562.bed")
peak_data_K562

For the MACCs method with background, [expdata, background, reference, pvalue_cutoffbg, pvalue_cutoffTTAA, lam_win_size, window_size, step_size, extend, pseudocounts, test_method, min_insertions, record] would be utilized.

100%|██████████| 24/24 [01:34<00:00,  3.96s/it]

[14]:

	Chr	Start	End	Center	Experiment Insertions	Background insertions	Reference Insertions	pvalue Reference	pvalue Background	Fraction Experiment	TPH Experiment	Fraction background	TPH background	TPH background subtracted	pvalue_adj Reference
0	chr1	29684	30087	29884.0	6	0	1	8.878753e-11	1.546531e-04	0.000018	1832.256882	0.000000	0.000000	1832.256882	3.323285e-08
1	chr1	36239	38107	37578.0	24	2	15	0.000000e+00	1.486029e-03	0.000073	7329.027530	0.000019	1862.457513	5466.570017	0.000000e+00
2	chr1	198893	201208	200869.0	28	2	11	0.000000e+00	6.927041e-05	0.000086	8550.532118	0.000019	1862.457513	6688.074605	0.000000e+00
3	chr1	203351	207161	205004.0	92	13	22	0.000000e+00	4.337485e-05	0.000281	28094.605530	0.000121	12105.973832	15988.631698	0.000000e+00
4	chr1	265549	266336	265749.0	5	0	3	3.731359e-08	4.678840e-03	0.000015	1526.880735	0.000000	0.000000	1526.880735	1.056034e-05
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9404	chrY	15158250	15158653	15158450.0	11	0	1	0.000000e+00	1.546531e-04	0.000034	3359.137618	0.000000	0.000000	3359.137618	0.000000e+00
9405	chrY	16985442	16985845	16985642.0	5	0	2	1.806731e-09	4.678840e-03	0.000015	1526.880735	0.000000	0.000000	1526.880735	6.060202e-07
9406	chrY	19753311	19753714	19753511.0	33	0	1	0.000000e+00	2.269296e-13	0.000101	10077.412853	0.000000	0.000000	10077.412853	0.000000e+00
9407	chrY	21011133	21011828	21011333.0	5	0	4	2.510448e-09	4.678840e-03	0.000015	1526.880735	0.000000	0.000000	1526.880735	8.296735e-07
9408	chrY	56952574	56957328	56953707.0	40	1	37	0.000000e+00	2.427052e-06	0.000122	12215.045883	0.000009	931.228756	11283.817126	0.000000e+00

9409 rows × 15 columns

[15]:

cc.pl.draw_area("chr10",3048452,3049913,60000,peak_data_K562,K562_SP1, "hg38", K562_brd4 , font_size=2,
                figsize = (30,15),peak_line = 4,save = False,bins =400, plotsize = [1,1,5], example_length = 1000)

../../_images/tutorials_notebooks_K562HCT116_SP1_28_0.png

[16]:

qbed = {"SP1":K562_SP1, "Brd4": K562_brd4}
bed = {"peak":peak_data_K562}
cc.pl.WashU_browser_url(qbed = qbed,bed = bed,genome = 'hg38')

All qbed addressed
All bed addressed
Uploading files
Please click the following link to see the data on WashU Epigenome Browser directly.
https://epigenomegateway.wustl.edu/browser/?genome=hg38&hub=https://companion.epigenomegateway.org//task/db6498ca05ea546a0938a34cd8de001c/output//datahub.json

[17]:

cc.pl.whole_peaks(peak_data_K562, reference = "hg38",figsize=(100, 70),height_scale = 1.7)

../../_images/tutorials_notebooks_K562HCT116_SP1_30_0.png

We can see that SP1 binds much more frequently in K562 than HCT116.

We can then check with reference Chip-seq data of SP1 in K562 from ENCSR372IML (and use the bigWig file ENCFF588UII generated by it）.

Download the data if needed:

!wget https://www.encodeproject.org/files/ENCFF588UII/@@download/ENCFF588UII.bigWig

[18]:

mtx_K562 = cc.pl.calculate_signal(peak_data_K562, chipseq_signal = "ENCFF588UII.bigWig")

100%|██████████| 9409/9409 [00:41<00:00, 225.34it/s]

[19]:

cc.pl.signal_plot(mtx_K562, alpha = 0.05, figsize=(6, 4))

../../_images/tutorials_notebooks_K562HCT116_SP1_34_0.png

[20]:

cc.pl.signal_heatmap(mtx_K562,pad =  0.023, belowlength = 100)

../../_images/tutorials_notebooks_K562HCT116_SP1_35_0.png

We can see that calling cards peaks are consistent with Chip-seq data. Peak centers tend to have a higher signal and the signal goes lower as the distance increases.

Call motif to check the peak results.

[21]:

cc.tl.call_motif("peak_k562.bed",reference ="hg38",save_homer = "Homer/peak_k562",
                 homer_path = "/home/juanru/miniconda3/bin/", num_cores=12)

Peak data peak_k562.bed is used here.


        Position file = peak_k562.bed
        Genome = hg38
        Output Directory = Homer/peak_k562
        Fragment size set to 1000
        Using 12 CPUs
        Will not run homer for de novo motifs
        Found mset for "human", will check against vertebrates motifs
        Peak/BED file conversion summary:
                BED/Header formatted lines: 9409
                peakfile formatted lines: 0

        Peak File Statistics:
                Total Peaks: 9409
                Redundant Peak IDs: 2
                Peaks lacking information: 0 (need at least 5 columns per peak)
                Peaks with misformatted coordinates: 0 (should be integer)
                Peaks with misformatted strand: 0 (should be either +/- or 0/1)

        Redunant Peaks found: Remove or rename these or some programs may have trouble...

        2 duplicate peak IDs out of 9409 total peaks
        Background files for 1000 bp fragments found.
        Custom genome sequence directory: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38//

        Extracting sequences from file: /home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa
        Looking for peak sequences in a single file (/home/juanru/miniconda3/share/homer/.//data/genomes/hg38///genome.fa)
        Extracting 1062 sequences from chr1
        Extracting 426 sequences from chr10
        Extracting 475 sequences from chr11
        Extracting 443 sequences from chr12
        Extracting 182 sequences from chr13
        Extracting 227 sequences from chr14
        Extracting 302 sequences from chr15
        Extracting 342 sequences from chr16
        Extracting 409 sequences from chr17
        Extracting 169 sequences from chr18
        Extracting 505 sequences from chr19
        Extracting 660 sequences from chr2
        Extracting 249 sequences from chr20
        Extracting 154 sequences from chr21
        Extracting 186 sequences from chr22
        Extracting 508 sequences from chr3
        Extracting 428 sequences from chr4
        Extracting 408 sequences from chr5
        Extracting 620 sequences from chr6
        Extracting 572 sequences from chr7
        Extracting 375 sequences from chr8
        Extracting 408 sequences from chr9
        Extracting 281 sequences from chrX
        Extracting 18 sequences from chrY

        Not removing redundant sequences


        Sequences processed:
                Auto detected maximum sequence length of 1001 bp
                9409 total

        Frequency Bins: 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.6 0.7 0.8
        Freq    Bin     Count
        0.3     2       52
        0.35    3       412
        0.4     4       1127
        0.45    5       1812
        0.5     6       1961
        0.6     7       2657
        0.7     8       1164
        0.8     9       219
        10      10      5

        Total sequences set to 50000

        Choosing background that matches in CpG/GC content...
        Bin     # Targets       # Background    Background Weight
        2       52      224     1.001
        3       412     1777    1.000
        4       1127    4862    1.000
        5       1812    7817    1.000
        6       1961    8460    1.000
        7       2657    11462   1.000
        8       1164    5022    1.000
        9       219     945     1.000
        10      5       22      0.980
        Assembling sequence file...
        Normalizing lower order oligos using homer2

        Reading input files...
        50000 total sequences read
        Autonormalization: 1-mers (4 total)
                A       25.36%  25.49%  0.995
                C       24.64%  24.51%  1.005
                G       24.64%  24.51%  1.005
                T       25.36%  25.49%  0.995
        Autonormalization: 2-mers (16 total)
                AA      7.80%   7.45%   1.047
                CA      6.85%   7.40%   0.925
                GA      6.01%   6.09%   0.986
                TA      4.70%   4.54%   1.036
                AC      4.92%   4.99%   0.986
                CC      7.54%   7.44%   1.013
                GC      6.17%   5.99%   1.029
                TC      6.01%   6.09%   0.986
                AG      7.26%   7.47%   0.972
                CG      2.98%   2.20%   1.356
                GG      7.54%   7.44%   1.013
                TG      6.85%   7.40%   0.925
                AT      5.37%   5.57%   0.966
                CT      7.26%   7.47%   0.972
                GT      4.92%   4.99%   0.986
                TT      7.80%   7.45%   1.047
        Autonormalization: 3-mers (64 total)
        Normalization weights can be found in file: Homer/peak_k562/seq.autonorm.tsv
        Converging on autonormalization solution:
        ...............................................................................
        Final normalization:    Autonormalization: 1-mers (4 total)
                A       25.36%  25.34%  1.001
                C       24.64%  24.66%  0.999
                G       24.64%  24.66%  0.999
                T       25.36%  25.34%  1.001
        Autonormalization: 2-mers (16 total)
                AA      7.80%   7.68%   1.016
                CA      6.85%   6.98%   0.982
                GA      6.01%   6.00%   1.000
                TA      4.70%   4.67%   1.006
                AC      4.92%   4.88%   1.008
                CC      7.54%   7.52%   1.002
                GC      6.17%   6.25%   0.987
                TC      6.01%   6.00%   1.000
                AG      7.26%   7.24%   1.003
                CG      2.98%   2.92%   1.023
                GG      7.54%   7.52%   1.002
                TG      6.85%   6.98%   0.982
                AT      5.37%   5.53%   0.971
                CT      7.26%   7.24%   1.003
                GT      4.92%   4.88%   1.008
                TT      7.80%   7.68%   1.016
        Autonormalization: 3-mers (64 total)
        Finished preparing sequence/group files

        ----------------------------------------------------------
        Known motif enrichment

        Reading input files...
        50000 total sequences read
        440 motifs loaded
        Cache length = 11180
        Using binomial scoring
        Checking enrichment of 440 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================

Finished!


        Preparing HTML output with sequence logos...
                1 of 440 (1e-169) Sp2(Zf)/HEK293-Sp2.eGFP-ChIP-Seq(Encode)/Homer
                2 of 440 (1e-148) Sp5(Zf)/mES-Sp5.Flag-ChIP-Seq(GSE72989)/Homer
                3 of 440 (1e-137) KLF1(Zf)/HUDEP2-KLF1-CutnRun(GSE136251)/Homer
                4 of 440 (1e-125) KLF6(Zf)/PDAC-KLF6-ChIP-Seq(GSE64557)/Homer
                5 of 440 (1e-124) KLF5(Zf)/LoVo-KLF5-ChIP-Seq(GSE49402)/Homer
                6 of 440 (1e-123) Sp1(Zf)/Promoter/Homer
                7 of 440 (1e-114) KLF3(Zf)/MEF-Klf3-ChIP-Seq(GSE44748)/Homer
                8 of 440 (1e-114) Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)/Homer
                9 of 440 (1e-100) KLF14(Zf)/HEK293-KLF14.GFP-ChIP-Seq(GSE58341)/Homer
                10 of 440 (1e-96) Klf9(Zf)/GBM-Klf9-ChIP-Seq(GSE62211)/Homer
                11 of 440 (1e-74) Jun-AP1(bZIP)/K562-cJun-ChIP-Seq(GSE31477)/Homer
                12 of 440 (1e-67) Gata2(Zf)/K562-GATA2-ChIP-Seq(GSE18829)/Homer
                13 of 440 (1e-65) Fosl2(bZIP)/3T3L1-Fosl2-ChIP-Seq(GSE56872)/Homer
                14 of 440 (1e-61) TATA-Box(TBP)/Promoter/Homer
                15 of 440 (1e-60) Gata4(Zf)/Heart-Gata4-ChIP-Seq(GSE35151)/Homer
                16 of 440 (1e-59) Gata6(Zf)/HUG1N-GATA6-ChIP-Seq(GSE51936)/Homer
                17 of 440 (1e-57) Bach2(bZIP)/OCILy7-Bach2-ChIP-Seq(GSE44420)/Homer
                18 of 440 (1e-52) BMYB(HTH)/Hela-BMYB-ChIP-Seq(GSE27030)/Homer
                19 of 440 (1e-52) GATA3(Zf)/iTreg-Gata3-ChIP-Seq(GSE20898)/Homer
                20 of 440 (1e-48) Fra2(bZIP)/Striatum-Fra2-ChIP-Seq(GSE43429)/Homer
                21 of 440 (1e-47) Klf4(Zf)/mES-Klf4-ChIP-Seq(GSE11431)/Homer
                22 of 440 (1e-46) Gata1(Zf)/K562-GATA1-ChIP-Seq(GSE18829)/Homer
                23 of 440 (1e-45) Fos(bZIP)/TSC-Fos-ChIP-Seq(GSE110950)/Homer
                24 of 440 (1e-45) JunB(bZIP)/DendriticCells-Junb-ChIP-Seq(GSE36099)/Homer
                25 of 440 (1e-43) Fra1(bZIP)/BT549-Fra1-ChIP-Seq(GSE46166)/Homer
                26 of 440 (1e-41) AMYB(HTH)/Testes-AMYB-ChIP-Seq(GSE44588)/Homer
                27 of 440 (1e-39) TRPS1(Zf)/MCF7-TRPS1-ChIP-Seq(GSE107013)/Homer
                28 of 440 (1e-36) MYB(HTH)/ERMYB-Myb-ChIPSeq(GSE22095)/Homer
                29 of 440 (1e-36) KLF10(Zf)/HEK293-KLF10.GFP-ChIP-Seq(GSE58341)/Homer
                30 of 440 (1e-36) Atf3(bZIP)/GBM-ATF3-ChIP-Seq(GSE33912)/Homer
                31 of 440 (1e-35) Bach1(bZIP)/K562-Bach1-ChIP-Seq(GSE31477)/Homer
                32 of 440 (1e-34) BATF(bZIP)/Th17-BATF-ChIP-Seq(GSE39756)/Homer
                33 of 440 (1e-33) NF-E2(bZIP)/K562-NFE2-ChIP-Seq(GSE31477)/Homer
                34 of 440 (1e-33) Pitx1(Homeobox)/Chicken-Pitx1-ChIP-Seq(GSE38910)/Homer
                35 of 440 (1e-32) WT1(Zf)/Kidney-WT1-ChIP-Seq(GSE90016)/Homer
                36 of 440 (1e-32) Elk4(ETS)/Hela-Elk4-ChIP-Seq(GSE31477)/Homer
                37 of 440 (1e-28) Isl1(Homeobox)/Neuron-Isl1-ChIP-Seq(GSE31456)/Homer
                38 of 440 (1e-27) AP-1(bZIP)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
                39 of 440 (1e-27) Nrf2(bZIP)/Lymphoblast-Nrf2-ChIP-Seq(GSE37589)/Homer
                40 of 440 (1e-26) Elk1(ETS)/Hela-Elk1-ChIP-Seq(GSE31477)/Homer
                41 of 440 (1e-24) ZNF467(Zf)/HEK293-ZNF467.GFP-ChIP-Seq(GSE58341)/Homer
                42 of 440 (1e-23) NFE2L2(bZIP)/HepG2-NFE2L2-ChIP-Seq(Encode)/Homer
                43 of 440 (1e-23) E2F6(E2F)/Hela-E2F6-ChIP-Seq(GSE31477)/Homer
                44 of 440 (1e-23) CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer
                45 of 440 (1e-23) ETV4(ETS)/HepG2-ETV4-ChIP-Seq(ENCODE)/Homer
                46 of 440 (1e-23) Fli1(ETS)/CD8-FLI-ChIP-Seq(GSE20898)/Homer
                47 of 440 (1e-23) Egr1(Zf)/K562-Egr1-ChIP-Seq(GSE32465)/Homer
                48 of 440 (1e-22) GATA:SCL(Zf,bHLH)/Ter119-SCL-ChIP-Seq(GSE18720)/Homer
                49 of 440 (1e-21) Egr2(Zf)/Thymocytes-Egr2-ChIP-Seq(GSE34254)/Homer
                50 of 440 (1e-21) Zfp281(Zf)/ES-Zfp281-ChIP-Seq(GSE81042)/Homer
                51 of 440 (1e-20) ETV1(ETS)/GIST48-ETV1-ChIP-Seq(GSE22441)/Homer
                52 of 440 (1e-19) Elf4(ETS)/BMDM-Elf4-ChIP-Seq(GSE88699)/Homer
                53 of 440 (1e-19) Nkx6.1(Homeobox)/Islet-Nkx6.1-ChIP-Seq(GSE40975)/Homer
                54 of 440 (1e-18) ZNF652/HepG2-ZNF652.Flag-ChIP-Seq(Encode)/Homer
                55 of 440 (1e-18) Hoxa13(Homeobox)/ChickenMSG-Hoxa13.Flag-ChIP-Seq(GSE86088)/Homer
                56 of 440 (1e-16) Znf263(Zf)/K562-Znf263-ChIP-Seq(GSE31477)/Homer
                57 of 440 (1e-15) E2F4(E2F)/K562-E2F4-ChIP-Seq(GSE31477)/Homer
                58 of 440 (1e-15) ELF1(ETS)/Jurkat-ELF1-ChIP-Seq(SRA014231)/Homer
                59 of 440 (1e-15) EKLF(Zf)/Erythrocyte-Klf1-ChIP-Seq(GSE20478)/Homer
                60 of 440 (1e-14) GABPA(ETS)/Jurkat-GABPa-ChIP-Seq(GSE17954)/Homer
                61 of 440 (1e-13) CRX(Homeobox)/Retina-Crx-ChIP-Seq(GSE20012)/Homer
                62 of 440 (1e-13) Lhx3(Homeobox)/Neuron-Lhx3-ChIP-Seq(GSE31456)/Homer
                63 of 440 (1e-12) Hoxa11(Homeobox)/ChickenMSG-Hoxa11.Flag-ChIP-Seq(GSE86088)/Homer
                64 of 440 (1e-11) STAT5(Stat)/mCD4+-Stat5-ChIP-Seq(GSE12346)/Homer
                65 of 440 (1e-11) ETS1(ETS)/Jurkat-ETS1-ChIP-Seq(GSE17954)/Homer
                66 of 440 (1e-11) ZNF317(Zf)/HEK293-ZNF317.GFP-ChIP-Seq(GSE58341)/Homer
                67 of 440 (1e-11) En1(Homeobox)/SUM149-EN1-ChIP-Seq(GSE120957)/Homer
                68 of 440 (1e-10) BORIS(Zf)/K562-CTCFL-ChIP-Seq(GSE32465)/Homer
                69 of 440 (1e-10) MITF(bHLH)/MastCells-MITF-ChIP-Seq(GSE48085)/Homer
                70 of 440 (1e-10) RORg(NR)/Liver-Rorc-ChIP-Seq(GSE101115)/Homer
                71 of 440 (1e-10) Hoxd13(Homeobox)/ChickenMSG-Hoxd13.Flag-ChIP-Seq(GSE86088)/Homer
                72 of 440 (1e-10) EHF(ETS)/LoVo-EHF-ChIP-Seq(GSE49402)/Homer
                73 of 440 (1e-10) EWS:FLI1-fusion(ETS)/SK_N_MC-EWS:FLI1-ChIP-Seq(SRA014231)/Homer
                74 of 440 (1e-9) RXR(NR),DR1/3T3L1-RXR-ChIP-Seq(GSE13511)/Homer
                75 of 440 (1e-9) ELF5(ETS)/T47D-ELF5-ChIP-Seq(GSE30407)/Homer
                76 of 440 (1e-9) PU.1(ETS)/ThioMac-PU.1-ChIP-Seq(GSE21512)/Homer
                77 of 440 (1e-9) FOXK2(Forkhead)/U2OS-FOXK2-ChIP-Seq(E-MTAB-2204)/Homer
                78 of 440 (1e-9) Hoxd11(Homeobox)/ChickenMSG-Hoxd11.Flag-ChIP-Seq(GSE86088)/Homer
                79 of 440 (1e-9) PPARa(NR),DR1/Liver-Ppara-ChIP-Seq(GSE47954)/Homer
                80 of 440 (1e-9) PU.1-IRF(ETS:IRF)/Bcell-PU.1-ChIP-Seq(GSE21512)/Homer
                81 of 440 (1e-9) Stat3(Stat)/mES-Stat3-ChIP-Seq(GSE11431)/Homer
                82 of 440 (1e-9) Unknown(Homeobox)/Limb-p300-ChIP-Seq/Homer
                83 of 440 (1e-9) Hoxa9(Homeobox)/ChickenMSG-Hoxa9.Flag-ChIP-Seq(GSE86088)/Homer
                84 of 440 (1e-9) ETS(ETS)/Promoter/Homer
                85 of 440 (1e-8) HIF-1b(HLH)/T47D-HIF1b-ChIP-Seq(GSE59937)/Homer
                86 of 440 (1e-8) Hnf1(Homeobox)/Liver-Foxa2-Chip-Seq(GSE25694)/Homer
                87 of 440 (1e-8) STAT1(Stat)/HelaS3-STAT1-ChIP-Seq(GSE12782)/Homer
                88 of 440 (1e-8) PPARE(NR),DR1/3T3L1-Pparg-ChIP-Seq(GSE13511)/Homer
                89 of 440 (1e-8) Smad4(MAD)/ESC-SMAD4-ChIP-Seq(GSE29422)/Homer
                90 of 440 (1e-8) PR(NR)/T47D-PR-ChIP-Seq(GSE31130)/Homer
                91 of 440 (1e-8) ERG(ETS)/VCaP-ERG-ChIP-Seq(GSE14097)/Homer
                92 of 440 (1e-8) MafK(bZIP)/C2C12-MafK-ChIP-Seq(GSE36030)/Homer
                93 of 440 (1e-8) MafB(bZIP)/BMM-Mafb-ChIP-Seq(GSE75722)/Homer
                94 of 440 (1e-8) FOXP1(Forkhead)/H9-FOXP1-ChIP-Seq(GSE31006)/Homer
                95 of 440 (1e-7) E2F3(E2F)/MEF-E2F3-ChIP-Seq(GSE71376)/Homer
                96 of 440 (1e-7) E2F7(E2F)/Hela-E2F7-ChIP-Seq(GSE32673)/Homer
                97 of 440 (1e-7) Lhx1(Homeobox)/EmbryoCarcinoma-Lhx1-ChIP-Seq(GSE70957)/Homer
                98 of 440 (1e-7) Smad2(MAD)/ES-SMAD2-ChIP-Seq(GSE29422)/Homer
                99 of 440 (1e-7) Nanog(Homeobox)/mES-Nanog-ChIP-Seq(GSE11724)/Homer
                100 of 440 (1e-7) Max(bHLH)/K562-Max-ChIP-Seq(GSE31477)/Homer
                101 of 440 (1e-7) DLX2(Homeobox)/BasalGanglia-Dlx2-ChIP-seq(GSE124936)/Homer
                102 of 440 (1e-7) MafA(bZIP)/Islet-MafA-ChIP-Seq(GSE30298)/Homer
                103 of 440 (1e-7) ZFX(Zf)/mES-Zfx-ChIP-Seq(GSE11431)/Homer
                104 of 440 (1e-7) DLX1(Homeobox)/BasalGanglia-Dlx1-ChIP-seq(GSE124936)/Homer
                105 of 440 (1e-7) E2F1(E2F)/Hela-E2F1-ChIP-Seq(GSE22478)/Homer
                106 of 440 (1e-7) CHR(?)/Hela-CellCycle-Expression/Homer
                107 of 440 (1e-6) Ets1-distal(ETS)/CD4+-PolII-ChIP-Seq(Barski_et_al.)/Homer
                108 of 440 (1e-6) MafF(bZIP)/HepG2-MafF-ChIP-Seq(GSE31477)/Homer
                109 of 440 (1e-6) USF1(bHLH)/GM12878-Usf1-ChIP-Seq(GSE32465)/Homer
                110 of 440 (1e-6) LRF(Zf)/Erythroblasts-ZBTB7A-ChIP-Seq(GSE74977)/Homer
                111 of 440 (1e-6) Foxo3(Forkhead)/U2OS-Foxo3-ChIP-Seq(E-MTAB-2701)/Homer
                112 of 440 (1e-6) SpiB(ETS)/OCILY3-SPIB-ChIP-Seq(GSE56857)/Homer
                113 of 440 (1e-6) p53(p53)/mES-cMyc-ChIP-Seq(GSE11431)/Homer
                114 of 440 (1e-6) Tbox:Smad(T-box,MAD)/ESCd5-Smad2_3-ChIP-Seq(GSE29422)/Homer
                115 of 440 (1e-6) Otx2(Homeobox)/EpiLC-Otx2-ChIP-Seq(GSE56098)/Homer
                116 of 440 (1e-6) MYNN(Zf)/HEK293-MYNN.eGFP-ChIP-Seq(Encode)/Homer
                117 of 440 (1e-6) YY1(Zf)/Promoter/Homer
                118 of 440 (1e-6) SPDEF(ETS)/VCaP-SPDEF-ChIP-Seq(SRA014231)/Homer
                119 of 440 (1e-5) NeuroD1(bHLH)/Islet-NeuroD1-ChIP-Seq(GSE30298)/Homer
                120 of 440 (1e-5) SCL(bHLH)/HPC7-Scl-ChIP-Seq(GSE13511)/Homer
                121 of 440 (1e-5) IRF3(IRF)/BMDM-Irf3-ChIP-Seq(GSE67343)/Homer
                122 of 440 (1e-5) GFY-Staf(?,Zf)/Promoter/Homer
                123 of 440 (1e-5) Foxf1(Forkhead)/Lung-Foxf1-ChIP-Seq(GSE77951)/Homer
                124 of 440 (1e-5) Foxo1(Forkhead)/RAW-Foxo1-ChIP-Seq(Fan_et_al.)/Homer
                125 of 440 (1e-5) CLOCK(bHLH)/Liver-Clock-ChIP-Seq(GSE39860)/Homer
                126 of 440 (1e-5) ETS:E-box(ETS,bHLH)/HPC7-Scl-ChIP-Seq(GSE22178)/Homer
                127 of 440 (1e-5) Atoh1(bHLH)/Cerebellum-Atoh1-ChIP-Seq(GSE22111)/Homer
                128 of 440 (1e-5) Tbr1(T-box)/Cortex-Tbr1-ChIP-Seq(GSE71384)/Homer
                129 of 440 (1e-4) STAT4(Stat)/CD4-Stat4-ChIP-Seq(GSE22104)/Homer
                130 of 440 (1e-4) Sox3(HMG)/NPC-Sox3-ChIP-Seq(GSE33059)/Homer
                131 of 440 (1e-4) c-Myc(bHLH)/LNCAP-cMyc-ChIP-Seq(Unpublished)/Homer
                132 of 440 (1e-4) LHX9(Homeobox)/Hct116-LHX9.V5-ChIP-Seq(GSE116822)/Homer
                133 of 440 (1e-4) ELF3(ETS)/PDAC-ELF3-ChIP-Seq(GSE64557)/Homer
                134 of 440 (1e-4) Stat3+il21(Stat)/CD4-Stat3-ChIP-Seq(GSE19198)/Homer
                135 of 440 (1e-4) DLX5(Homeobox)/BasalGanglia-Dlx5-ChIP-seq(GSE124936)/Homer
                136 of 440 (1e-4) CEBP:AP1(bZIP)/ThioMac-CEBPb-ChIP-Seq(GSE21512)/Homer
                137 of 440 (1e-4) AR-halfsite(NR)/LNCaP-AR-ChIP-Seq(GSE27824)/Homer
                138 of 440 (1e-4) Six1(Homeobox)/Myoblast-Six1-ChIP-Chip(GSE20150)/Homer
                139 of 440 (1e-4) Tcf12(bHLH)/GM12878-Tcf12-ChIP-Seq(GSE32465)/Homer
                140 of 440 (1e-4) CArG(MADS)/PUER-Srf-ChIP-Seq(Sullivan_et_al.)/Homer
                141 of 440 (1e-3) Etv2(ETS)/ES-ER71-ChIP-Seq(GSE59402)/Homer
                142 of 440 (1e-3) Zfp57(Zf)/H1-ZFP57.HA-ChIP-Seq(GSE115387)/Homer
                143 of 440 (1e-3) NFY(CCAAT)/Promoter/Homer
                144 of 440 (1e-3) PU.1:IRF8(ETS:IRF)/pDC-Irf8-ChIP-Seq(GSE66899)/Homer
                145 of 440 (1e-3) NPAS(bHLH)/Liver-NPAS-ChIP-Seq(GSE39860)/Homer
                146 of 440 (1e-3) EWS:ERG-fusion(ETS)/CADO_ES1-EWS:ERG-ChIP-Seq(SRA014231)/Homer
                147 of 440 (1e-3) FOXK1(Forkhead)/HEK293-FOXK1-ChIP-Seq(GSE51673)/Homer
                148 of 440 (1e-3) Zic2(Zf)/ESC-Zic2-ChIP-Seq(SRP197560)/Homer
                149 of 440 (1e-3) MNT(bHLH)/HepG2-MNT-ChIP-Seq(Encode)/Homer
                150 of 440 (1e-3) IRF2(IRF)/Erythroblas-IRF2-ChIP-Seq(GSE36985)/Homer
                151 of 440 (1e-3) Tcf21(bHLH)/ArterySmoothMuscle-Tcf21-ChIP-Seq(GSE61369)/Homer
                152 of 440 (1e-3) BHLHA15(bHLH)/NIH3T3-BHLHB8.HA-ChIP-Seq(GSE119782)/Homer
                153 of 440 (1e-3) Ap4(bHLH)/AML-Tfap4-ChIP-Seq(GSE45738)/Homer
                154 of 440 (1e-3) IRF8(IRF)/BMDM-IRF8-ChIP-Seq(GSE77884)/Homer
                155 of 440 (1e-3) NFAT(RHD)/Jurkat-NFATC1-ChIP-Seq(Jolma_et_al.)/Homer
                156 of 440 (1e-3) Unknown-ESC-element(?)/mES-Nanog-ChIP-Seq(GSE11724)/Homer
                157 of 440 (1e-3) Sox9(HMG)/Limb-SOX9-ChIP-Seq(GSE73225)/Homer
                158 of 440 (1e-3) MyoG(bHLH)/C2C12-MyoG-ChIP-Seq(GSE36024)/Homer
                159 of 440 (1e-3) Six2(Homeobox)/NephronProgenitor-Six2-ChIP-Seq(GSE39837)/Homer
                160 of 440 (1e-3) Lhx2(Homeobox)/HFSC-Lhx2-ChIP-Seq(GSE48068)/Homer
                161 of 440 (1e-3) GSC(Homeobox)/FrogEmbryos-GSC-ChIP-Seq(DRA000576)/Homer
                162 of 440 (1e-3) HNF1b(Homeobox)/PDAC-HNF1B-ChIP-Seq(GSE64557)/Homer
                163 of 440 (1e-3) IRF4(IRF)/GM12878-IRF4-ChIP-Seq(GSE32465)/Homer
                164 of 440 (1e-3) Prop1(Homeobox)/GHFT1-PROP1.biotin-ChIP-Seq(GSE77302)/Homer
                165 of 440 (1e-3) NeuroG2(bHLH)/Fibroblast-NeuroG2-ChIP-Seq(GSE75910)/Homer
                166 of 440 (1e-3) BMAL1(bHLH)/Liver-Bmal1-ChIP-Seq(GSE39860)/Homer
                167 of 440 (1e-3) ZNF189(Zf)/HEK293-ZNF189.GFP-ChIP-Seq(GSE58341)/Homer
                168 of 440 (1e-3) EBF1(EBF)/Near-E2A-ChIP-Seq(GSE21512)/Homer
                169 of 440 (1e-3) HIF2a(bHLH)/785_O-HIF2a-ChIP-Seq(GSE34871)/Homer
                170 of 440 (1e-3) E2F(E2F)/Hela-CellCycle-Expression/Homer
                171 of 440 (1e-2) Rfx6(HTH)/Min6b1-Rfx6.HA-ChIP-Seq(GSE62844)/Homer
                172 of 440 (1e-2) EBF2(EBF)/BrownAdipose-EBF2-ChIP-Seq(GSE97114)/Homer
                173 of 440 (1e-2) Zac1(Zf)/Neuro2A-Plagl1-ChIP-Seq(GSE75942)/Homer
                174 of 440 (1e-2) VDR(NR),DR3/GM10855-VDR+vitD-ChIP-Seq(GSE22484)/Homer
                175 of 440 (1e-2) Rbpj1(?)/Panc1-Rbpj1-ChIP-Seq(GSE47459)/Homer
                176 of 440 (1e-2) STAT6(Stat)/Macrophage-Stat6-ChIP-Seq(GSE38377)/Homer
                177 of 440 (1e-2) Ascl1(bHLH)/NeuralTubes-Ascl1-ChIP-Seq(GSE55840)/Homer
                178 of 440 (1e-2) n-Myc(bHLH)/mES-nMyc-ChIP-Seq(GSE11431)/Homer
                179 of 440 (1e-2) MyoD(bHLH)/Myotube-MyoD-ChIP-Seq(GSE21614)/Homer
                180 of 440 (1e-2) FOXA1(Forkhead)/MCF7-FOXA1-ChIP-Seq(GSE26831)/Homer
                181 of 440 (1e-2) Zfp809(Zf)/ES-Zfp809-ChIP-Seq(GSE70799)/Homer
                182 of 440 (1e-2) Sox21(HMG)/ESC-SOX21-ChIP-Seq(GSE110505)/Homer
                183 of 440 (1e-2) CRE(bZIP)/Promoter/Homer
                184 of 440 (1e-2) RUNX1(Runt)/Jurkat-RUNX1-ChIP-Seq(GSE29180)/Homer
                185 of 440 (1e-2) JunD(bZIP)/K562-JunD-ChIP-Seq/Homer
                186 of 440 (1e-2) Foxa3(Forkhead)/Liver-Foxa3-ChIP-Seq(GSE77670)/Homer
                187 of 440 (1e-2) FOXA1(Forkhead)/LNCAP-FOXA1-ChIP-Seq(GSE27824)/Homer
                188 of 440 (1e-2) ZNF711(Zf)/SHSY5Y-ZNF711-ChIP-Seq(GSE20673)/Homer
                189 of 440 (1e-2) TCF4(bHLH)/SHSY5Y-TCF4-ChIP-Seq(GSE96915)/Homer
                190 of 440 (1e-2) PRDM15(Zf)/ESC-Prdm15-ChIP-Seq(GSE73694)/Homer
                191 of 440 (1e-2) Esrrb(NR)/mES-Esrrb-ChIP-Seq(GSE11431)/Homer
                192 of 440 (1e-2) ZNF692(Zf)/HEK293-ZNF692.GFP-ChIP-Seq(GSE58341)/Homer
                193 of 440 (1e-2) PRDM1(Zf)/Hela-PRDM1-ChIP-Seq(GSE31477)/Homer
                194 of 440 (1e-2) Sox10(HMG)/SciaticNerve-Sox3-ChIP-Seq(GSE35132)/Homer
                195 of 440 (1e-2) Zic(Zf)/Cerebellum-ZIC1.2-ChIP-Seq(GSE60731)/Homer
                196 of 440 (1e-2) PRDM14(Zf)/H1-PRDM14-ChIP-Seq(GSE22767)/Homer
                197 of 440 (1e-2) Myf5(bHLH)/GM-Myf5-ChIP-Seq(GSE24852)/Homer
                198 of 440 (1e-2) Barx1(Homeobox)/Stomach-Barx1.3xFlag-ChIP-Seq(GSE69483)/Homer
                199 of 440 (1e-2) IRF1(IRF)/PBMC-IRF1-ChIP-Seq(GSE43036)/Homer
                200 of 440 (1e-2) AP-2gamma(AP2)/MCF7-TFAP2C-ChIP-Seq(GSE21234)/Homer
                201 of 440 (1e-2) bHLHE40(bHLH)/HepG2-BHLHE40-ChIP-Seq(GSE31477)/Homer
                202 of 440 (1e-2) ZBTB18(Zf)/HEK293-ZBTB18.GFP-ChIP-Seq(GSE58341)/Homer
                203 of 440 (1e-2) TR4(NR),DR1/Hela-TR4-ChIP-Seq(GSE24685)/Homer
                204 of 440 (1e-2) Olig2(bHLH)/Neuron-Olig2-ChIP-Seq(GSE30882)/Homer
                205 of 440 (1e-2) HNF4a(NR),DR1/HepG2-HNF4a-ChIP-Seq(GSE25021)/Homer
                206 of 440 (1e-2) Twist2(bHLH)/Myoblast-Twist2.Ty1-ChIP-Seq(GSE127998)/Homer
                207 of 440 (1e-2) FOXM1(Forkhead)/MCF7-FOXM1-ChIP-Seq(GSE72977)/Homer
                208 of 440 (1e-2) RORgt(NR)/EL4-RORgt.Flag-ChIP-Seq(GSE56019)/Homer
                209 of 440 (1e-2) RORgt(NR)/EL4-RORgt.Flag-ChIP-Seq(GSE56019)/Homer
                210 of 440 (1e-2) STAT6(Stat)/CD4-Stat6-ChIP-Seq(GSE22104)/Homer
                211 of 440 (1e-2) RUNX2(Runt)/PCa-RUNX2-ChIP-Seq(GSE33889)/Homer
                212 of 440 (1e-2) Eomes(T-box)/H9-Eomes-ChIP-Seq(GSE26097)/Homer
                213 of 440 (1e-2) Six4(Homeobox)/MCF7-SIX4-ChIP-Seq(Encode)/Homer
                214 of 440 (1e-2) ZNF341(Zf)/EBV-ZNF341-ChIP-Seq(GSE113194)/Homer
                215 of 440 (1e-2) Foxa2(Forkhead)/Liver-Foxa2-ChIP-Seq(GSE25694)/Homer
        Skipping...
        Job finished - if results look good, please send beer to ..

        Cleaning up tmp files...

In the motif analysis result, SP1 motif and many other family members rank top.

drawing

Next we want to identify binding sites that are differentially bound in K562 and Hct-116 cells. This can be challenging as the two samples may have slightly shifted peaks centers at a given genomic region, leading to false positive differential peak calls. To handle this, Pycallingcards first combines the insertions from the two samples and calls peaks on the joint dataset. We do this using bedtools and pybedtools.

[22]:

import pybedtools
peak = cc.rd.combine_qbed([peak_data_HCT116, peak_data_K562])
peak = pybedtools.BedTool.from_dataframe(peak).merge().to_dataframe()
peak_data = peak.rename(columns={"chrom":"Chr", "start":"Start", "end":"End"})
peak_data

[22]:

	Chr	Start	End
0	chr1	29684	30087
1	chr1	36239	38107
2	chr1	198893	201208
3	chr1	203351	207161
4	chr1	265549	266336
...	...	...	...
10445	chrY	15158250	15158653
10446	chrY	16985442	16985845
10447	chrY	19753311	19753714
10448	chrY	21011133	21011828
10449	chrY	56952574	56957328

10450 rows × 3 columns

We can now visualize the peaks called on the joint dataset.

[23]:

cc.pl.draw_area("chr1",999921,1000324,15000,peak_data, HCT116_SP1, "hg38", HCT116_brd4, font_size=2,
                figsize = (30,10),peak_line = 2,save = False,plotsize = [1,1,3], example_length = 1000,
                title = "HCT116_SP1")
cc.pl.draw_area("chr1",999921,1000324,15000,peak_data, K562_SP1, "hg38", K562_brd4, font_size=2,
                figsize = (30,10),peak_line = 2,save = False,plotsize = [1,1,3], example_length = 1000,
                title = "K562_SP1")

../../_images/tutorials_notebooks_K562HCT116_SP1_43_0.png

../../_images/tutorials_notebooks_K562HCT116_SP1_43_1.png

[24]:

cc.pl.draw_area("chr10",3048452,3049913,60000,peak_data, HCT116_SP1, "hg38", HCT116_brd4, font_size=2,
                figsize = (30,14), peak_line = 3,save = False, bins = 200, plotsize = [1,1,5],
                example_length = 1000, title = "HCT116_SP1")
cc.pl.draw_area("chr10",3048452,3049913,60000,peak_data, K562_SP1, "hg38", K562_brd4, font_size=2,
                figsize = (30,14), peak_line = 3,save = False, bins = 200, plotsize = [1,1,5],
                example_length = 1000, title = "K562_SP1")

../../_images/tutorials_notebooks_K562HCT116_SP1_44_0.png

../../_images/tutorials_notebooks_K562HCT116_SP1_44_1.png

The results seem to be good! Congratulations! Now we can annotate the peaks using bedtools.

[25]:

peak_annotation = cc.pp.annotation(peak_data, reference = "hg38")
peak_annotation = cc.pp.combine_annotation(peak_data,peak_annotation)
peak_annotation

In the bedtools method, we would use bedtools in the default path. Set bedtools path by 'bedtools_path' if needed.

[25]:

	Chr	Start	End	Nearest Refseq1	Gene Name1	Direction1	Distance1	Nearest Refseq2	Gene Name2	Direction2	Distance2
0	chr1	29684	30087	NR_036051	MIR1302-2	+	279	NR_024540	WASH7P	-	-315
1	chr1	36239	38107	NR_026818	FAM138A	-	-159	NR_036051	MIR1302-2	+	-5737
2	chr1	198893	201208	NR_026823	FAM138D	-	3921	NR_107063	MIR6859-3	-	-10936
3	chr1	203351	207161	NR_026823	FAM138D	-	0	NR_107063	MIR6859-3	-	-15394
4	chr1	265549	266336	NR_026823	FAM138D	-	-58953	NR_107063	MIR6859-3	-	-77592
...	...	...	...	...	...	...	...	...	...	...	...
10445	chrY	15158250	15158653	NM_001206850	NLGN4Y	+	-314283	NR_046504	NLGN4Y-AS1	-	-354218
10446	chrY	16985442	16985845	NR_028083	FAM41AY1	+	515113	NR_002160	FAM224B	-	589316
10447	chrY	19753311	19753714	NM_001146706	KDM5D	-	-8373	NR_045128	TXLNGY	+	-146142
10448	chrY	21011133	21011828	NM_001039567	RPS4Y2	+	-230102	NM_001282471	PRORY	-	371146
10449	chrY	56952574	56957328	NM_005840	SPRY3	+	0	NM_001145149	VAMP7	+	110472

10450 rows × 11 columns

Combine the two experiment qbed files to make anndata object.

[26]:

exp_qbed = pd.concat([K562_SP1,HCT116_SP1])
exp_qbed

[26]:

	Chr	Start	End	Reads	Direction	Barcodes
0	chr1	16529	16533	163	-	GCTCCTAAGTACGTTC-1
1	chr1	29884	29888	10	+	CTCACACCAGACGCTC-1
2	chr1	29884	29888	155	+	TGGCCAGCACCCATTC-1
3	chr1	29884	29888	285	+	GTGGGTCCACGGCCAT-1
4	chr1	29884	29888	7	+	CGTCTACTCAACACGT-1
...	...	...	...	...	...	...
77205	chrY	25518788	25518792	2	+	TGGGCGTTCGAACGGA-1
77206	chrY	56987633	56987637	13	+	CAGTCCTAGGCACATG-1
77207	chrY	57080855	57080859	17	+	CGGAGCTCATCGACGC-1
77208	chrY	57080855	57080859	7	+	GTAACGTAGTTACGGG-1
77209	chrY	57080855	57080859	9	+	TCAGCAAGTTGAACTC-1

404675 rows × 6 columns

Read the barcode file.

[27]:

barcodes = cc.datasets.SP1_K562HCT116_data(data = "barcodes")
barcodes = barcodes.drop_duplicates(subset=['Index'])
barcodes

[27]:

	Index	cluster
0	AAACCTGAGAAAGTGG-1	HCT116
1	AAACCTGAGACCGGAT-1	K562
2	AAACCTGAGACTAGAT-1	HCT116
3	AAACCTGAGAGCTTCT-1	HCT116
4	AAACCTGAGAGTACCG-1	HCT116
...	...	...
52206	TTTGTCATCTCCGGTT-1	K562
52207	TTTGTCATCTCGATGA-1	K562
52208	TTTGTCATCTCTAAGG-1	K562
52209	TTTGTCATCTGGAGCC-1	HCT116
52210	TTTGTCATCTTGGGTA-1	HCT116

51079 rows × 2 columns

Now we will connect the peaks (and insertions under the peaks) to the cell barcode data. To do so, we will use the qbed data, peak data and barcodes data to make a cell by peak anndata object.

[28]:

adata_cc = cc.pp.make_Anndata(exp_qbed, peak_annotation, barcodes)
adata_cc

100%|██████████| 24/24 [00:02<00:00, 10.77it/s]

[28]:

AnnData object with n_obs × n_vars = 51079 × 10450
    obs: 'cluster'
    var: 'Chr', 'Start', 'End', 'Nearest Refseq1', 'Gene Name1', 'Direction1', 'Distance1', 'Nearest Refseq2', 'Gene Name2', 'Direction2', 'Distance2'

Although one peak should have many insertions, there is a chance that all the cells from the peak were filtered by the RNA preprocesssing. In this case, we advise to filter the peaks. Additionally, we also recommend to filter cells that have very few insertions.

[29]:

cc.pp.filter_peaks(adata_cc, min_counts = 5)
cc.pp.filter_peaks(adata_cc, min_cells = 5)
adata_cc

[29]:

AnnData object with n_obs × n_vars = 51079 × 10448
    obs: 'cluster'
    var: 'Chr', 'Start', 'End', 'Nearest Refseq1', 'Gene Name1', 'Direction1', 'Distance1', 'Nearest Refseq2', 'Gene Name2', 'Direction2', 'Distance2', 'n_counts', 'n_cells'

Next we can perform differential peak analysis to determine which peaks are cell type specific. In this example, we use the fisher exact test to find peaks enriched in K562 versus Hct116 cells.

[30]:

cc.tl.rank_peak_groups(adata_cc, "cluster", method = 'fisher_exact', key_added = 'fisher_exact')

100%|██████████| 2/2 [00:55<00:00, 27.99s/it]

We can plot the results for differential peak analysis.

[31]:

cc.pl.rank_peak_groups(adata_cc, key = 'fisher_exact')

../../_images/tutorials_notebooks_K562HCT116_SP1_58_0.png

Now let’s visualize some peaks that are differentially bound. The colored ones are the insertions for the cluster of interest (i.e. cell type) and the grey ones are insertions in the rest of the clusters. In this case there are only two clusters, HCT116 and K562. We observe large differences in Sp1 binding in HCT116 and K562 cells.

[32]:

bg_qbed = pd.concat([K562_brd4, HCT116_brd4])
bg_qbed

[32]:

	Chr	Start	End	Reads	Direction	Barcodes
0	chr1	30238	30242	3	+	TTTACTGCATAAAGGT-1
1	chr1	30355	30359	2	-	ATCACGAAGAGTAATC-1
2	chr1	30355	30359	70	+	TTGAACGCAAATCCGT-1
3	chr1	31101	31105	2	+	CCTCAGTCATCAGTAC-1
4	chr1	32116	32120	5	+	CTAGTGAAGACAAAGG-1
...	...	...	...	...	...	...
37769	chrY	18037315	18037319	9	-	GCAGTTAAGATCTGAA-1
37770	chrY	24036504	24036508	168	+	GCAGTTAAGATCTGAA-1
37771	chrY	24036504	24036508	508	+	CATATGGCAGCCAGAA-1
37772	chrY	25633622	25633626	13	-	GCAGTTAAGATCTGAA-1
37773	chrY	25633622	25633626	32	-	CATATGGCAGCCAGAA-1

145159 rows × 6 columns

In the tracks above, we see a strong peak on Chr 15 in HCT116 cell (purple) that is not present in K562 cells (red)

[33]:

cc.pl.draw_area("chr7", 143706539, 143718962, 100000, peak_data, exp_qbed, "hg38", adata = adata_cc,
                bins = 250, font_size=2, name = "K562", key = 'cluster', figsize = (30,13),
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'K562 Insertions', name_density1 = 'K562 Insertion Density',
                peak_line = 4, color = "red", plotsize = [1,1,5], title = "chr7_143706539_143718962_K562")
cc.pl.draw_area("chr7",143706539,143718962,100000,peak_data,exp_qbed,"hg38",adata = adata_cc,
                bins = 250, font_size=2, name = "HCT116", key ='cluster', figsize = (30,13),
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'HCT116 Insertions', name_density1 = 'HCT116 Insertion Density',
                peak_line = 4, color = "purple", plotsize = [1,1,5], title = "chr7_143706539_143718962_HCT116")

../../_images/tutorials_notebooks_K562HCT116_SP1_62_0.png

../../_images/tutorials_notebooks_K562HCT116_SP1_62_1.png

In the tracks above we see a number of Sp1 peaks on chr7 that are tightly bound in K562, but not in HCT116.

[34]:

cc.pl.draw_area("chr8", 17799370, 17802353, 100000, peak_data, exp_qbed, "hg38", adata = adata_cc,
                bins = 250, font_size=2, name = "K562", key = 'cluster', figsize = (30,13), peak_line = 3,
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'K562 Insertions', name_density1 = 'K562 Insertion Density',
                color = "red", plotsize = [1,1,6], title = "chr19_9818643_9820060")
cc.pl.draw_area("chr8", 17799370, 17802353, 100000, peak_data, exp_qbed, "hg38", adata = adata_cc,
                bins = 250, font_size=2, name = "HCT116",key ='cluster',figsize = (30,13),peak_line = 3,
                name_insertion2 = 'Total Insertions', name_density2 = 'Total Insertion Density',
                name_insertion1 = 'HCT116 Insertions', name_density1 = 'HCT116 Insertion Density',
                color = "purple", plotsize = [1,1,6], title = "chr19_9818643_9820060")

../../_images/tutorials_notebooks_K562HCT116_SP1_64_0.png

../../_images/tutorials_notebooks_K562HCT116_SP1_64_1.png

Here we find peaks on Chr8 that are bound in HCT116 but not in K562 cells.

Saved the file if needed.

[35]:

adata_cc.write("SP1_qbed.h5ad")