Statistical analysis of over-represented words in human promoter sequences
AUTHOR(S)
Mariño-Ramírez, Leonardo
SOURCE
Oxford University Press
ABSTRACT
The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, we extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from –2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of fewer than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many do not. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
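The word-counting and z-score step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the background model, so the sketch assumes independent bases with frequencies estimated from the input sequences, and a binomial variance that ignores word self-overlap. The function names (`count_words`, `base_frequencies`, `word_z_score`, `upper_tail_p`) are hypothetical.

```python
import math
from collections import Counter

DNA = set("ACGT")

def count_words(sequences, k=8):
    """Count every overlapping k-letter word made only of A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= DNA:  # skip windows containing N or other symbols
                counts[word] += 1
    return counts

def base_frequencies(sequences):
    """Estimate single-base frequencies from the sequences themselves."""
    base_counts = Counter()
    for seq in sequences:
        base_counts.update(b for b in seq.upper() if b in DNA)
    total = sum(base_counts.values())
    return {b: base_counts[b] / total for b in "ACGT"}

def word_z_score(word, observed, n_positions, freqs):
    """z-score of an observed word count.

    ASSUMPTION: bases are independent (zero-order background) and the
    count is treated as binomial over n_positions windows. This ignores
    the correlation between overlapping occurrences of the same word.
    """
    p = math.prod(freqs[b] for b in word)
    expected = n_positions * p
    variance = n_positions * p * (1.0 - p)
    if variance == 0:
        return float("nan")
    return (observed - expected) / math.sqrt(variance)

def upper_tail_p(z):
    """One-sided P-value from the standard normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2))

if __name__ == "__main__":
    # Toy sequences for illustration only; not data from the study.
    pprs = ["ACGTACGTACGTTTGACGTCAACGT", "TTGACGTCATTGACGTCAGGCC"]
    counts = count_words(pprs, k=8)
    freqs = base_frequencies(pprs)
    n_positions = sum(counts.values())
    for word, obs in counts.most_common(3):
        z = word_z_score(word, obs, n_positions, freqs)
        print(word, obs, round(z, 2), upper_tail_p(z))
```

With several thousand 3000-bp regions and 65,536 possible eight-letter words, many words are tested at once and overlapping occurrences are correlated, which is one reason a normal-approximation P-value alone can mislead and additional controls are useful.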
ARTICLE ACCESS
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=373387
Related Documents
- An approach to identify over-represented cis-elements in related sequences
- Long W tracts are over-represented in the Escherichia coli and Haemophilus influenzae genomes.
- Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences
- Statistical analysis of nucleotide sequences.
- Statistical analysis of nucleotide sequences of the hemagglutinin gene of human influenza A viruses.