The PReMod database describes more than 100,000 computationally predicted transcriptional regulatory modules within the human genome [1]. These modules represent the regulatory potential of 229 transcription factor families and form the first genome-wide, transcription factor-wide collection of predicted regulatory modules for the human genome [2].
The algorithm involves two steps:
(i) Identification and scoring of putative transcription factor binding sites using 481 TRANSFAC 7.2 position weight matrices (PWMs) for vertebrate transcription factors. To this end, each non-coding position of the human genome was evaluated for its similarity to each PWM using a log-likelihood ratio score with a local GC-parameterized third-order Markov background model. The corresponding orthologous positions in the mouse and rat genomes were evaluated in the same way, and a weighted average of the human, mouse, and rat log-likelihood scores at aligned positions (based on a Multiz (Blanchette et al. 2004) genome-wide alignment of these three species) was used to define the matrix score for each genomic position and each PWM.
(ii) Detection of clustered putative binding sites. To assign a "module score" to a given region, the five transcription factors with the highest total hit scores in that region are identified, and a p-value is assigned to the total score observed for the top 1, 2, 3, 4, or 5 factors. The p-value computation takes into account the number of factors involved (1 to 5), their total binding site scores, and the length and GC content of the region under evaluation [2].
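The two steps above can be illustrated with a minimal Python sketch. It is not the PReMod implementation and it makes several simplifying assumptions: the background is a plain zeroth-order base distribution rather than the local GC-parameterized third-order Markov model, the species weights are hypothetical placeholders, and all function names are illustrative. The p-value computation is not reproduced, since its formula is not given here; the sketch stops at the cumulative totals for the top 1 to 5 factors.

```python
import math

BASES = "ACGT"

def llr_score(site, pwm, background):
    """Log-likelihood ratio of `site` under a PWM versus a background model.

    pwm: list of dicts (base -> probability), one dict per motif position.
    background: dict (base -> probability); a zeroth-order simplification
    standing in for the third-order Markov background described in the text.
    """
    return sum(math.log(col[b] / background[b]) for b, col in zip(site, pwm))

def conserved_score(scores, weights):
    """Weighted average of per-species scores at one aligned position."""
    total_weight = sum(weights[species] for species in scores)
    weighted = sum(weights[species] * s for species, s in scores.items())
    return weighted / total_weight

def top_k_totals(hits, max_k=5):
    """Cumulative total score of the top 1..max_k factors in a region.

    hits: iterable of (factor, score) pairs for putative sites in the region.
    Returns (ranked factor names, cumulative totals).
    """
    per_factor = {}
    for factor, score in hits:
        per_factor[factor] = per_factor.get(factor, 0.0) + score
    ranked = sorted(per_factor.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:max_k]
    totals, running = [], 0.0
    for _, total in ranked:
        running += total
        totals.append(running)
    return [factor for factor, _ in ranked], totals

# Toy two-column PWM and a uniform background (made-up numbers).
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {base: 0.25 for base in BASES}

# Hypothetical species weights; the actual weights are not stated in the text.
weights = {"human": 0.5, "mouse": 0.25, "rat": 0.25}

site_score = conserved_score(
    {
        "human": llr_score("AG", pwm, background),
        "mouse": llr_score("AG", pwm, background),
        "rat": llr_score("AT", pwm, background),
    },
    weights,
)

factors, totals = top_k_totals(
    [("FactorA", 1.0), ("FactorB", 3.0), ("FactorA", 1.5), ("FactorC", 0.5)]
)
```

In this sketch, `totals[k - 1]` holds the combined score of the `k` best factors in the region, which is the quantity a significance test along the lines of step (ii) would then evaluate against the region's length and GC content.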
Queries of the database allow a user to specify how much information they want to see. For instance, one can retrieve all information for a given region, a given PWM, a given gene, and so on. Several options are available for textual output or visualization of the data [1].
[2] Mathieu Blanchette*, Alain R. Bataille, Xiaoyu Chen, Christian Poitras, Josée Laganière, Céline Lefèbvre, Geneviève Deblois, Vincent Giguère, Vincent Ferretti, Dominique Bergeron, Benoit Coulombe, and François Robert*