A large percentage of protein sequences are aperiodic, showing close to average amino acid composition. This subtle mixture of residues dictates the structural properties of proteins and their functional role. However, there is an important group of proteins encompassing regions enriched in one or a few amino acids, the so-called low complexity regions (LCRs) [1∗]. Mainly located within intrinsically disordered regions (IDRs), i.e. regions without permanent secondary or tertiary structure [2,3], LCRs are found in half of eukaryotic proteins where they represent around 25% of the coding sequence [4]. Homorepeats (or polyX) are tracts of a single amino acid that represent an eye-catching family of LCRs [5,6]. Once considered ‘junk’ protein segments without specific function, homorepeats are now supported by a growing body of evidence that underlines their biological relevance [1∗,7]. Indeed, homorepeats exploit the accumulation of specific physicochemical properties in defined regions of proteins to perform very specialised functions in (among others) stress response, development, transcription, organelle biogenesis and transport [6,8]. Homorepeats provide functional versatility to proteins by mediating protein–protein interactions and driving spatial localisation [7,9]. Moreover, their presence, even in essential proteins, facilitates protein divergence and evolvability to rewire interactions [10,11∗]. It has also been shown that proteins containing homorepeats have denser and more diverse interactomes [7], and those containing multiple polyX are more often involved in disease, including neurological disorders and cancer [12]. Moreover, the combination of distinct polyX in the same protein is not a random phenomenon and, in developmental proteins, specific combinations are enriched and their co-evolution can be traced [13].
Although protein length is a factor that needs to be taken into account because it necessarily increases the probability of finding more polyX, these observations underline the role of homorepeats in signalling and regulatory processes.
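As an illustration of the definition given above (a homorepeat is a tract of a single amino acid), the following minimal sketch scans a one-letter protein sequence for such tracts. The function name `find_homorepeats` and the minimum tract length `min_length` are assumptions introduced here for illustration only; the text does not specify a length threshold, and this is not the detection method used in the cited studies.

```python
import re
from typing import List, Tuple


def find_homorepeats(sequence: str, min_length: int = 5) -> List[Tuple[str, int, int]]:
    """Return (residue, start, length) for every run of one residue.

    `min_length` is a hypothetical threshold chosen for illustration;
    published analyses use different cut-offs.
    """
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    # One letter followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [
        (match.group(1), match.start(), len(match.group(0)))
        for match in pattern.finditer(sequence.upper())
    ]


# Toy sequence (not a real protein) containing a polyQ and a polyA tract.
tracts = find_homorepeats("MKQQQQQQLAAAAAGS", min_length=5)
```

Because longer proteins are more likely to contain such tracts by chance, counts obtained this way would need to be normalised by protein length before any comparison between proteins.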
The accumulation of a given physicochemical feature can also have detrimental consequences. Indeed, repeats of certain amino acids, such as cysteine, tyrosine or tryptophan, are rarely found in proteomes, suggesting their inherent toxicity. Moreover, the uncontrolled expansion of poly-glutamine (polyQ) and poly-alanine (polyA) in specific proteins causes a series of rare neurodegenerative and developmental diseases, including Huntington's disease, several ataxias, synpolydactyly syndrome and congenital central hypoventilation syndrome [14, 15, 16]. These pathologies are triggered by the incorporation of additional residues in a previously existing homorepeat, demonstrating the subtle balance between function and toxicity [17]. More recently, polyG aggregates originating from expanded (CGG)n repeats located in 5′-untranslated regions of certain genes have been identified in patients with (among others) neural intranuclear inclusion disease and fragile X tremor/ataxia syndrome [18].
Despite the growing attention to homorepeats, the structural bases of their function and malfunction remain poorly understood, precluding rational intervention for biomedical purposes. Moreover, precise control of the structural determinants of these sequences would pave the way to the design of IDRs with targeted functions in biotechnology [19]. LCRs in general and homorepeats in particular pose fundamental problems for the application of traditional high-resolution structural biology methods. On the one hand, their inherent flexibility precludes the general use of X-ray crystallography and cryo-electron microscopy. On the other hand, the similarity of the chemical environments in repetitive sequences hampers the application of standard Nuclear Magnetic Resonance (NMR) frequency assignment strategies. These limitations have fostered the application of low-resolution methods and computational approaches in order to establish connections between the structure of homorepeats and their biological function [20,21]. Complementary to these methods, computational and genomic approaches have been applied to assess the distribution of homorepeats in proteomes and to evaluate their interactome and evolutionary dynamics [22, 23, 24∗∗].
In this review, we summarise present structural knowledge of the most abundant homorepeats and describe recent developments in the study of the structure/function relationships of this elusive family of LCRs. Owing to the limit on the number of citations allowed, we cannot provide an extensive survey of the field, and only the most recent studies are described throughout this review.