Abstract
Collagens are structural proteins that are predominantly found in the extracellular matrix of multicellular animals, where they are mainly responsible for the stability and structural integrity of various tissues. All collagens contain polypeptide strands (α-chains). There are several types of collagens, some of which differ significantly in form, function, and tissue specificity. Because of their importance in clinical research, they are grouped into subdivisions, the so-called collagen families, and their sequences are often analysed. However, problems arise with highly homologous sequence segments. To increase the accuracy of collagen classification and of the prediction of their functions, taking the structure of these collagens and their expression in different tissues into account could allow a better focus on sequence segments of interest. Here, we analyse collagen families with different levels of conservation. As a result, clusters with high interconnectivity can be found, such as the fibrillar collagens, the COL4 network-forming collagens, and the COL9 FACITs. Furthermore, a large cluster between network-forming, FACIT, and COL28a1 α-chains is formed with COL6a3 as a major hub node. The formation of clusters also shows why it is important to always analyse the α-chains and why structural changes can have a wide range of effects on the body.
1 Introduction
1.1 Collagen
Collagens are structural proteins of the extracellular matrix (ECM) in multicellular animals and, with a mass fraction of ∼30 %, constitute an essential part of the ECM’s structure and maintenance [1–4]. Collagens are found in skin, connective tissue, vascular walls, bone, and cartilage [5–9]. Moreover, the expression of collagens is essential for the formation of the basic structure of tissues during ontogenesis [10]. For example, a somite is a block of condensed mesoderm formed bilaterally along the central axis in vertebrate embryos [11]. Collagens are not only essential for the formation of these blocks but also for the differentiation of the somitic subunits, i.e., dermatome, myotome, and sclerotome [12, 13]. The fusion of myoblasts into multinucleated myofibers during skeletal and cardiac muscle development is another essential function of one of the collagens [14].
Many types of collagens vary widely in their form, function, and tissue specificity. Currently, 28 different types of collagens are known, which can be divided into six different families (Table 1) [15–18]. A classification is based on characteristic properties of the collagen such as its structure, interaction, or site of expression [19]. Due to the variability of characters, various classifications of collagen exist, with none of them being indisputable [20–23]. Thus, Table 1 is only one way of classifying collagens.
Table 1: Classification of the 28 collagens into six different groups based on Gordon & Hahn [15]. Collagens are colour-coded according to their group, as also used further in this article. FACITs, Fibril Associated Collagens with Interrupted Triple helices.
Within the 28 collagens, 44 α-chains can be distinguished. In consideration of the collagen groups in Table 1, these α-chains are characterised by specific processing units that are separated by a repetitive sequence, (Gly-X-Y)n [24], labelled in UniProt as “Chain”. This chain is predominantly involved in structure formation. The first unit, which appears in all α-chains except for the transmembrane ones, is a signal sequence that is important for the translocation of the α-chain to the endoplasmic reticulum and is removed afterwards [25, 26]. In nine fibrillar and two network-forming α-chains (COL1a1, COL1a2, COL2a1, COL3a1, COL4a1, COL4a2, COL5a1, COL5a2, COL11a1, COL11a2, COL27a1), a propeptide unit is located next to the signal unit and/or at the end of all fibrillar α-chains (The UniProt Consortium, 2023). These propeptides are essential for assembling collagen trimers and are cleaved in fibrillar collagens after helix formation [20]. In addition, several residues are altered by post-translational modifications (PTMs, e.g., hydroxylation of proline).
Most of the collagens are homotrimers, i.e., they are built from three identical α-chains [19, 27–31]. However, some heterotrimers consist of two identical α-chains and one different α-chain. The resulting homo- and heterotrimers aggregate with collagen microfibrils in the environment, which are further organised as fibrils [32–34]. These fibrils are further packed into fibres, which are bundled into fascicles [35, 36]. Another characterisation of collagens is based on the abundance of their collagen types and mRNA expression throughout the human body [37, 38]. The mRNA localization of individual collagens is often tissue-specific [37, 39].
The understanding of the sequence and structure of collagens is crucial because of hereditary diseases based on mutations in collagen genes [40], such as osteogenesis imperfecta, Ehlers-Danlos syndrome, and Stickler syndrome.
1.2 AlphaFold
AlphaFold DB is a protein structure database (DB) and prediction tool created by DeepMind and the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI). It uses a deep neural network to predict spatial structures for proteins, treating the task as a graph inference problem [41, 42]. It maximises the number of interactions according to a graph-theoretical model considering the similarity across structures. The advantage of AlphaFold is that it can, in theory, be applied to all known amino acid sequences.
However, this requires prior measurements of 3D structures to validate the predictions, the quality of which is expressed by a confidence value. This value indicates for each position in the amino acid sequence how well the prediction matches the measurement. AlphaFold categorises these confidence values (CV) into four groups concerning their accuracy. These groups are ‘high accuracy’ (CV > 90), ‘generally good backbone prediction’ (70 < CV < 90), ‘low accuracy’ (50 < CV < 70), and ‘ribbon-like appearance’ (CV < 50). AlphaFold suggests that the last group should not be interpreted. Screening of the collagen sequences revealed that the repetitive regions were assigned very low confidence values. These repetitive regions are found mainly in the interior of the sequences, as they are instrumental in the structure of the collagen [43, 44]. However, they can complicate analysis due to their repetitions, as they occur more frequently than the functional regions of the protein. This is in contrast to the boundary regions, which return a high confidence value in most collagens, indicating existing knowledge about the 3D structure of similar regions in other proteins.
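The four confidence groups described above can be expressed as a small lookup. The following is a minimal sketch and not part of the published pipeline; the function names are hypothetical, and because the text leaves the boundary values (exactly 50, 70, and 90) unassigned, they are placed in the lower group here as an assumption.

```python
def confidence_category(cv: float) -> str:
    """Map a per-residue confidence value (0-100) to the four groups named in the text.

    Assumption: boundary values (exactly 50, 70, 90) fall into the lower group.
    """
    if cv > 90:
        return "high accuracy"
    if cv > 70:
        return "generally good backbone prediction"
    if cv > 50:
        return "low accuracy"
    return "ribbon-like appearance"


def category_fractions(confidences):
    """Return the fraction of residues that falls into each confidence group."""
    counts = {}
    for cv in confidences:
        label = confidence_category(cv)
        counts[label] = counts.get(label, 0) + 1
    return {label: n / len(confidences) for label, n in counts.items()}


# Toy values, not taken from any collagen prediction.
print(category_fractions([95.0, 82.5, 61.0, 30.0]))
```

Applied to a whole α-chain, such a summary gives a quick impression of how much of the sequence lies in the group that should not be interpreted.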
1.3 Aim of this study
Considering the importance of collagen for ontogenesis and structure formation in the body [45, 46] and the multitude of collagen-related human diseases [40, 47], it is surprising that there are only a few complete studies regarding the possible alignment of collagens to each other [48]. In this study, the 44 α-chains of all 28 available human collagens from all six families are analysed by bioinformatics techniques. We aim to resolve the following questions by aligning these short high-confidence regions based on their hydrophobicity. 1st: Is it possible to construct informative networks between different α-chains and their collagens using high-confidence regions? 2nd: Are there any comparable features between the resulting networks and the literature?
2 Materials and methods
2.1 Sequence preparation
The amino acid (AA) sequences and confidence values of the human collagen variants are retrieved from the AlphaFold DB, which is based on the AlphaFold version from the “14th Critical Assessment of Protein Structure Prediction” (v2.0, Figure 1) [41, 42]. In the AlphaFold DB, over 200 million protein structure predictions are stored and freely available.
Figure 1: Overview of the predicted aligned error (PAE) of the collagen α-chains of collagens I–XXVIII adapted from AlphaFold [41, 42]. The colour indicates the expected position error when the predicted and true structures are aligned. The darker the colour of green, the lower the alignment error. The dark-shaded boxes were predicted with a local ColabAlphaFold instance.
The α-chains are selected based on their curation state. If this is not available, the reference proteome state is favoured. The sequences and their position confidence values of the collagen α-chains are downloaded as PDB files. An exception is made for COL6a3, COL7a1, and COL12a1, for which longer and reviewed sequences are available in the UniProt database (The UniProt Consortium, 2023). For those α-chains, we download the sequences in FASTA format and predict their confidence values with a local ColabAlphaFold instance (v.1.5.2, https://github.com/YoshitakaMo/localcolabfold). Additionally, we retrieve the PTMs for each α-chain. The IDs (identification numbers) and PTM positions of the chosen α-chains can be found in Supplementary 1.
Since the PDB format is atypical for most sequence analysis tools, it is reformatted into a FASTA format based on the confidence values of each position. For the following processing steps, several custom Python (v3.10.12) scripts are implemented (Supplementary 7). First, all amino acids are converted from the 3-letter code to the 1-letter code based on the IUPAC nomenclature. Then, high-confidence regions (based on their average confidence value) from each α-chain are extracted (Figure 2). This is done separately for different subsequence lengths (from ten AA to 500 AA in steps of ten) and confidence thresholds (from zero to 95 in steps of five). To ensure that parts of each region do not appear twice or more often, after the extraction of each subsequence the confidence values of the particular region of the α-chain are set to −1 × 10⁹. Furthermore, for each subsequence, the reversed complement is computed (marked with “rev” in the sequence name). This is done for three conditions based on the PTM positions of each α-chain: WHOLE SEQUENCE (entire α-chain sequence), WITHOUT SIGNAL (the signal sequence is removed from the α-chain sequence), and ONLY CHAIN (the signal and propeptide sequences are removed from the α-chain sequence).
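The conversion from PDB to FASTA described above can be sketched as follows. This is an illustrative reading, not the script from Supplementary 7: it assumes that the per-residue confidence is stored in the B-factor column of the PDB file (as in AlphaFold DB downloads), reads only the Cα atom of each residue, and maps unknown residue names to “X”. All function names, including the demo helper, are hypothetical.

```python
# Standard IUPAC 3-letter to 1-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_sequence_and_confidence(pdb_lines):
    """Return (sequence, confidences) from the C-alpha ATOM records of a PDB file."""
    sequence, confidences = [], []
    for line in pdb_lines:
        # Fixed PDB columns: atom name 13-16, residue name 18-20, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            sequence.append(THREE_TO_ONE.get(line[17:20], "X"))  # "X" is an assumption
            confidences.append(float(line[60:66]))
    return "".join(sequence), confidences


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with a fixed line width."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}"] + rows)


def demo_atom_line(serial, atom, residue, number, confidence):
    """Hypothetical helper that builds a fixed-width ATOM record for the demo only."""
    return (f"ATOM  {serial:>5} {atom:<4} {residue:>3} A{number:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{confidence:6.2f}")


demo = [
    demo_atom_line(1, "N", "GLY", 1, 91.5),
    demo_atom_line(2, "CA", "GLY", 1, 91.5),
    demo_atom_line(3, "CA", "PRO", 2, 72.25),
    demo_atom_line(4, "CA", "ALA", 3, 48.0),
]
sequence, confidences = read_sequence_and_confidence(demo)
print(to_fasta("demo_chain", sequence))
print(confidences)
```

Keeping the confidence values in a list parallel to the sequence makes the later window extraction a matter of simple slicing.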

Figure 2: Flowchart of data processing for the 44 collagen α-chains. From left to right: the 44 collagen α-chains are collected under three conditions: WHOLE SEQUENCE (entire α-chain sequence), WITHOUT SIGNAL (the signal sequence is removed from the α-chain sequence), and ONLY CHAIN (the signal and propeptide sequences are removed from the α-chain sequence). From these, subsequences are extracted based on lengths (between ten and 500 in steps of ten) and confidence values (between zero and 95 in steps of five). Afterwards, for each set, all sequences are globally aligned to each other and matrices are constructed based on the calculated similarities. In the end, all similarity matrices are visualised in the form of networks. Each node is coloured based on the colours in Table 1.
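The extraction of high-confidence regions in Section 2.1 can be read as a greedy procedure: take the window with the highest average confidence, keep it if the average reaches the threshold, mask its positions with a large negative value, and repeat. The sketch below follows this reading under stated assumptions (greedy order, ties resolved in favour of the leftmost window, non-negative thresholds); the function name and the toy data are hypothetical, and the actual scripts in Supplementary 7 may differ.

```python
MASK = -1e9  # masking value named in the text (-1 x 10^9)


def extract_regions(sequence, confidences, length, threshold):
    """Greedily extract non-overlapping windows whose mean confidence >= threshold.

    Assumes threshold >= 0, so that any window touching a masked position
    (mean far below zero) can never be selected again.
    """
    conf = list(confidences)  # work on a copy so the caller's list is untouched
    n_windows = len(sequence) - length + 1
    regions = []
    while n_windows > 0:
        means = [sum(conf[s:s + length]) / length for s in range(n_windows)]
        best = max(range(n_windows), key=means.__getitem__)  # leftmost best window
        if means[best] < threshold:
            break
        regions.append((best, sequence[best:best + length]))
        conf[best:best + length] = [MASK] * length
    return regions


# Toy example with window length 3 and confidence threshold 70.
toy_sequence = "ABCDEFGHIJ"
toy_confidence = [90, 90, 90, 10, 10, 80, 80, 80, 10, 10]
print(extract_regions(toy_sequence, toy_confidence, length=3, threshold=70))
```

Because masked positions pull every overlapping window far below any non-negative threshold, no residue can appear in two extracted subsequences.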
2.2 Construction of similarity matrices
To create the similarity matrices on which we base our networks, each subsequence in a set is globally aligned to all other subsequences based on their hydrophobicity. The substitution matrix is based on the Eisenberg consensus scale, which combines several hydrophobicity features into one score (e.g., 0.48 kcal/mol for glycine or 0.12 kcal/mol for proline; Supplementary 2). The difference in hydrophobicity for each amino acid pair is taken as the substitution rate and normalised between zero and 20. The rate values are rounded to the next integer. The gap open cost is set to ten and the gap extend cost to 0.5 (EMBOSS default). Additionally, to compare each set with the others, the calculated similarities are also normalised to the length of the aligned subsequences (resulting in a maximum score of 20 if two identical sequences are compared).
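One way to read the scoring scheme above is sketched here. Several points are assumptions rather than statements from the text: the scale values are the Eisenberg consensus values as commonly tabulated (they agree with the two values quoted above, but should be checked against Supplementary 2); a pair scores 20 when the hydrophobicities are identical and 0 at the largest difference; rounding uses Python's `round`; and the alignment is a plain affine-gap global alignment (Gotoh recurrences) that penalises end gaps, which may differ from the EMBOSS settings used by the authors.

```python
from itertools import product

# Eisenberg consensus hydrophobicity values as commonly tabulated (assumption:
# verify against Supplementary 2 before reuse).
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def build_matrix(scale, top=20):
    """Score each residue pair by hydrophobicity difference, scaled to 0..top."""
    span = max(scale.values()) - min(scale.values())
    return {(a, b): round(top * (1 - abs(scale[a] - scale[b]) / span))
            for a, b in product(scale, repeat=2)}


def global_score(a, b, matrix, gap_open=10.0, gap_extend=0.5):
    """Affine-gap global alignment score; a gap of length k costs open + (k-1)*extend."""
    ninf = float("-inf")
    n, m = len(a), len(b)
    M = [[ninf] * (m + 1) for _ in range(n + 1)]  # residue aligned to residue
    X = [[ninf] * (m + 1) for _ in range(n + 1)]  # residue of a aligned to a gap
    Y = [[ninf] * (m + 1) for _ in range(n + 1)]  # residue of b aligned to a gap
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = matrix[(a[i - 1], b[j - 1])] + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


def normalised_similarity(a, b, matrix):
    """Alignment score per position; 20 for two identical non-empty sequences."""
    return global_score(a, b, matrix) / max(len(a), len(b))


matrix = build_matrix(EISENBERG)
print(matrix[("G", "P")], normalised_similarity("GPPGPP", "GPPGPP", matrix))
```

With this normalisation, scores from sets with different subsequence lengths lie on the same 0 to 20 scale, which is what allows a single set of network thresholds to be applied across sets.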
2.3 Compression of similarity matrices for further analysis
Afterward, data compressions on the similarity matrices are performed to better analyse our results. The subsequences are assigned to one of two categories, one containing all α-chains and one containing only the collagen types. For the α-chains (44), the maximum similarity between two α-chains over all their subsequence comparisons is selected. For the collagen types (28), the maximum similarity between the subsequences of all α-chains of the compared collagen types is selected. In a further step, lists are generated from these two compressed datasets listing the number of pairings between each α-chain subset and each collagen subset. For these similarities, score thresholds are set between 12 and 20 in steps of 0.25. This is done for each subsequence length and confidence value.
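The compression step above amounts to taking a maximum over all subsequence comparisons that belong to the same pair of α-chains or collagen types. The sketch below illustrates this with toy scores; the input layout (one entry per subsequence comparison, labelled with its parent α-chains), the function names, and the decision to skip comparisons within the same α-chain or collagen type are assumptions for illustration.

```python
import re


def collagen_type(chain):
    """Derive the collagen type from an alpha-chain name, e.g. COL4a3 -> COL4."""
    return re.match(r"(COL\d+)a\d+$", chain).group(1)


def compress(pair_scores, key=lambda chain: chain):
    """Keep the maximum similarity per pair of groups (alpha-chains by default)."""
    best = {}
    for (chain_a, chain_b), score in pair_scores:
        group_a, group_b = key(chain_a), key(chain_b)
        if group_a == group_b:
            continue  # assumption: comparisons within the same group are skipped
        pair = tuple(sorted((group_a, group_b)))
        best[pair] = max(best.get(pair, float("-inf")), score)
    return best


def count_pairs(best, threshold):
    """Number of pairings whose maximum similarity reaches the threshold."""
    return sum(1 for score in best.values() if score >= threshold)


# Toy scores, one entry per subsequence comparison (values are made up).
toy_scores = [
    (("COL4a1", "COL4a2"), 18.5),
    (("COL4a1", "COL4a2"), 17.0),
    (("COL4a1", "COL1a1"), 15.0),
    (("COL4a2", "COL1a1"), 16.25),
]
chain_level = compress(toy_scores)
type_level = compress(toy_scores, key=collagen_type)
print(chain_level)
print(type_level)
```

The same function serves both categories; only the grouping key changes between the α-chain and the collagen-type view.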
For each reduced similarity matrix, a network (in GML format) is built with the α-chains/collagens as nodes. The similarity scores are used as edge weights between the nodes, and only edges that reach the respective score threshold are kept. The networks are generated with a custom PASCAL program (v1.1.24) and visualised with the general-purpose diagramming program yEd (v3.23.1).
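The network export described above can be illustrated with a few lines that write a thresholded similarity matrix as GML text. This is a Python sketch for illustration only and not the authors' PASCAL program; the function name, the `value` edge attribute, and the choice to keep unconnected nodes in the file are assumptions.

```python
def to_gml(similarity, threshold):
    """Write nodes and all edges with score >= threshold as a GML string."""
    nodes = sorted({name for pair in similarity for name in pair})
    index = {name: i for i, name in enumerate(nodes)}
    lines = ["graph [", "  directed 0"]
    for name in nodes:
        # Nodes without any remaining edge are kept as isolated nodes.
        lines.append(f'  node [ id {index[name]} label "{name}" ]')
    for (a, b), score in sorted(similarity.items()):
        if score >= threshold:
            lines.append(
                f"  edge [ source {index[a]} target {index[b]} value {score} ]")
    lines.append("]")
    return "\n".join(lines)


# Toy similarity values, not taken from the study.
toy_similarity = {("COL4a1", "COL4a2"): 18.5, ("COL1a1", "COL4a1"): 15.0}
gml_text = to_gml(toy_similarity, threshold=17.5)
print(gml_text)
```

A file written this way should open in yEd, where an edge attribute such as `value` could be mapped to line thickness.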
2.4 Generation of literature matrix for validation
To extrapolate the connection between the different α-chains, we construct a reference matrix between all α-chains based on reference databases. The idea behind this is that even if only neighbour links are considered, a larger network should still result. This is done for the Google Scholar (entire text) and PubMed (only title and abstract) databases. For this purpose, a search query is performed between two collagens (Google Scholar) or α-chains (PubMed) for all collagens and α-chains in humans. The number of hits is stored in a matrix. To avoid bias by focusing on specific collagen types in the literature, these values are logarithmized (to the base of 100) and normalised between zero and 20. As with the alignment networks, different score thresholds are used for the networks (between one and 20 in steps of one).
3 Results
Here, we aligned the short high-confidence regions of collagens based on their hydrophobicity. This enabled us to investigate the potential of different collagen α-chains to build informative networks. This should help reveal unknown interactions between different collagen types and make it possible to infer their spatial arrangement through simulations, i.e., within the basement membrane (e.g., basal lamina). Additionally, we compared our networks to the literature to verify their correctness.
3.1 Identification of relevant parameters
Overall, we calculated 192,000 different networks (three sequence types [WHOLE SEQUENCE, WITHOUT SIGNAL, ONLY CHAIN], 50 subsequence lengths, 20 confidence thresholds, 32 network thresholds, and two experiments [α-chains, collagens]). Therefore, to reduce the complexity of the results, we further adjusted the parameters. According to AlphaFold, a confidence threshold of at least 70 is required to have a “good backbone prediction”. In the case of the WHOLE SEQUENCE and WITHOUT SIGNAL sequence types, the longest possible subsequence length for confidence 70 for which all α-chains are present (with at least one sequence each) is 30. Additionally, we get the same parameters when we calculate an optimum over the entire set of matrices (50 subsequence lengths and 20 confidence thresholds) between the number of subsequences and the standard deviation of the similarity scores for the WHOLE SEQUENCE and WITHOUT SIGNAL datasets. For simplicity’s sake, we take the same parameters for the ONLY CHAIN dataset.
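The number of networks stated above follows directly from the size of the parameter grid. As a quick check, the grid can be rebuilt from the ranges given in Section 2.1; the count of 32 network thresholds is taken from the text as stated, and the variable names are illustrative.

```python
sequence_types = ["WHOLE SEQUENCE", "WITHOUT SIGNAL", "ONLY CHAIN"]
subsequence_lengths = range(10, 501, 10)   # ten to 500 AA in steps of ten
confidence_thresholds = range(0, 96, 5)    # zero to 95 in steps of five
network_thresholds = 32                    # count as stated in the text
experiments = ["alpha-chains", "collagens"]

total_networks = (len(sequence_types) * len(subsequence_lengths)
                  * len(confidence_thresholds) * network_thresholds
                  * len(experiments))
print(total_networks)
```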
The high-confidence sequences and results for the WHOLE SEQUENCE, WITHOUT SIGNAL, and ONLY CHAIN datasets can be found in Supplementary 3, 4, and 5, respectively. Table 2 shows, for the three sequence types, at which subsequence lengths and confidence thresholds α-chains are filtered out. For the ONLY CHAIN sequence type, even at a low subsequence length of ten and a confidence threshold of 70, most fibrillar α-chains (COL1a1, COL1a2, COL2a1, COL3a1, COL11a1, COL27a1) are filtered out.
Summary of the excluded α-chains based on subsequence lengths and confidence thresholds. The first row shows all α-chains missing for the next highest subsequence length (40 AA) and the third row for the next highest confidence threshold (75). The second row shows the α-chains missing for our chosen parameters.
Parameters | WHOLE SEQUENCE | WITHOUT SIGNAL | ONLY CHAIN |
---|---|---|---|
Subseq. length 40/confidence 70 | COL9a2, COL9a3 | COL9a2, COL9a3 | COL1a1, COL1a2, COL2a1, COL3a1, COL9a2, COL9a3, COL11a1, COL27a1 |
Subseq. length 30/confidence 70 | – | – | COL1a1, COL1a2, COL2a1, COL3a1, COL11a1, COL27a1 |
Subseq. length 30/confidence 75 | COL9a2, COL9a3 | COL9a2, COL9a3 | COL1a1, COL1a2, COL2a1, COL3a1, COL9a2, COL9a3, COL11a1, COL27a1 |
The remaining score matrices and networks, as well as the rest of the data for the main analysis, can be found on Mendeley Data (doi: 10.17632/9pf5n687z4.3). As mentioned above, we selected the subsequence length of 30 and the confidence threshold of 70 for further analysis. The difference between the WHOLE SEQUENCE and WITHOUT SIGNAL datasets is negligible, since the signal sequence is concise. Therefore, only the WITHOUT SIGNAL and ONLY CHAIN datasets are considered in the Discussion.
3.2 Lists and networks
With the selected subsequence length and confidence threshold, an informative network is obtained in which the thickness of the edges indicates the similarity between α-chains or collagens. For clarity, we further filtered out thin edges (comparatively low similarity) using various threshold values (from 12 to 20 in steps of 0.25) to emphasise local subnetworks. From these, the first (lowest threshold) network with the most subnetworks was chosen. In all three sequence types, this was the network with the threshold of 17.5 (WHOLE SEQUENCE and WITHOUT SIGNAL: nine subnetworks, ONLY CHAIN: eight). As mentioned above, for the network of the ONLY CHAIN type most fibrillar α-chains are filtered out due to their repetitive chain sequences, leading to poor confidence and similarity values. This is not the case for the other two sequence types. The difference between the WHOLE SEQUENCE and WITHOUT SIGNAL datasets is negligible. Therefore, we focus on the network of the WITHOUT SIGNAL type.
As can be seen in the top panel of Figure 3, nine subnetworks are formed: (1) the fibrillar subnetwork (consisting of all fibrillar α-chains, except COL24a1 and COL27a1), (2) the COL4 subnetwork (consisting of all six COL4 α-chains), (3) the COL9 subnetwork (consisting of all three COL9 α-chains), (4) the COL8/10 subnetwork (consisting of the two α-chains of COL8 and the COL10a1 α-chain), (5–8) four pair subnetworks (COL16a1 and COL19a1, COL15a1 and COL18a1, COL13a1 and COL26a1, COL24a1 and COL27a1), and (9) the largest subnetwork consisting of the remaining α-chains. In addition, three transmembrane α-chains (COL17a1, COL23a1, and COL25a1) remain unconnected.
To show the interconnection between the individual subnetworks, it is necessary to lower the threshold to 17.25. With this, the pair subnetworks are incorporated into the larger subnetworks (Figure 3 bottom): COL24/27 is connected to the fibrillar subnetwork through COL1a2, COL13/26 to the COL4 subnetwork through COL4a4, and COL15/18 to COL13/26 through COL13a1. At even lower thresholds (16.75, 17.0), almost all subnetworks are connected. The fibrillar network is connected to the COL4 subnetwork through COL4a3 and COL4a4, to the COL8/10 subnetwork through COL8a2, and to the largest subnetwork through COL6a5 and COL7a1. The subnetworks COL15/18 and COL16/19 now connect to the largest subnetwork through COL6a6. A new pair subnetwork between COL23a1 and COL25a1 is formed. Interestingly, the COL9 subnetwork is very stable.

Figure 3: Top: Network representation of the similarity matrix (sequence type: WITHOUT SIGNAL, subsequence length: 30, confidence: 70, network threshold: 17.5) of the 44 α-chains found in humans. Bottom: From left to right, the network representation of the similarity matrix for the network thresholds 16.75, 17.0, 17.25, 17.5 (selected, same as top), 17.75, and 18.0. Colours: Purple, fibrillar collagens; cyan, associated with fibrils (FACITs); green, network-forming collagens; yellow, transmembrane collagens; orange, endostatin precursor collagens; red, other collagens.
With even lower thresholds, the networks become multidimensional and are difficult to visualise.
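The selection rule used in this section (the lowest threshold whose network has the most subnetworks) can be sketched with a small union-find routine. The sketch assumes that a subnetwork is a connected component with at least two nodes, which matches the way unconnected α-chains are reported separately above; the function names and the toy similarities are hypothetical.

```python
def count_subnetworks(nodes, similarity, threshold):
    """Count connected components with at least two nodes at a given threshold."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the root, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    sizes = {}
    for node in nodes:
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for size in sizes.values() if size >= 2)


def pick_threshold(nodes, similarity, thresholds):
    """Return the lowest threshold that yields the maximum number of subnetworks."""
    counts = {t: count_subnetworks(nodes, similarity, t) for t in thresholds}
    most = max(counts.values())
    return min(t for t, count in counts.items() if count == most)


# Toy network with five nodes; similarities are made up.
toy_nodes = ["A", "B", "C", "D", "E"]
toy_edges = {("A", "B"): 18.0, ("C", "D"): 17.5, ("B", "C"): 17.0, ("D", "E"): 12.0}
toy_thresholds = [16.75, 17.0, 17.25, 17.5, 17.75, 18.0]
print(pick_threshold(toy_nodes, toy_edges, toy_thresholds))
```

In the toy data, lowering the threshold to 17.0 merges the two pairs into one component, which mirrors how the pair subnetworks above are absorbed into larger subnetworks at lower thresholds.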
3.3 Network comparison with PubMed and Google Scholar
Looking at the reference matrices (Supplementary 6), the lowest values (and thus the most hits) between collagens (Google Scholar) or α-chains (PubMed) were on the diagonal, where only single collagens or α-chains were searched for. This is because single keywords always have more hits than the conjunction with other keywords. The overall lowest value was for the collagen COL1 with 0.32 (13,100 hits) or the α-chain COL1a1 with 0.13 (2,264 hits), indicating a high impact in the literature. The second-lowest values were for the cells around the diagonal. These cells mostly represented α-chains of the same or closely related collagens. For example, the lowest values for the six α-chains of COL4 were for each of these chains with itself. None of those values was above 0.42; these α-chains are therefore frequently found together in the literature. This could indicate that they are closely related or interact with each other. The same correlation could be seen in the networks for the alignments, where the α-chains of COL4 always formed a subnetwork with each other, but only with other α-chains at lower thresholds. Similar results could be seen for the fibrillar collagens. In general, the values were around 0.5 or lower. However, exceptions exist due to the scarce literature available for specific α-chains. For example, COL5a3 has values above one or even a value of ten (meaning an occurrence of zero) with COL1a2. The same can also be seen for COL24a1 and COL27a1. The lowest value was 0.72 between COL27a1 and COL1/COL5a1, while most other values were ten. At the same time, our network shows the same pattern by separating the COL24/27a1 pair from the large fibrillar network at higher thresholds. In contrast to the alignment analysis, the PubMed networks tend to form a single large cluster, even at high thresholds (Figure 4). Further analyses were only done for PubMed, as the query limit for Google Scholar was exceeded for α-chains.

Figure 4: Network representation of the reference matrix of the 26/44 α-chains with significant entries found in PubMed. The other 18/44 α-chains had no connections and were not shown. The colours are as follows: Purple: fibrillar collagens, cyan: associated with fibrils (FACITs), green: network-forming collagens, and yellow: transmembrane collagens.
4 Discussion
We aimed to construct informative networks between the different collagen α-chains based on pairwise sequence alignments of high-confidence regions and compared them with the literature. For reasons of complexity, we focus our analysis on collagens only and exclude other binding proteins. Here, we showed that the classification in Table 1 could generally be reconstructed based on the sequences of the α-chains.
4.1 Fibril-forming collagen clusters
It is striking that the propeptides are essential for this analysis. Without them, fibrillar collagens are filtered out due to the poor confidence values of the chain sequences. In a more detailed analysis of our networks, it can be seen that the fibrillar collagens form two distinct clusters. It turned out that COL5a1 is the major hub node in the larger cluster, which can be explained by its role as a regulator for the formation of a uniformly small corneal fibril diameter with COL1 [49]. COL5a1/a2 are associated with several diseases, including Ehlers-Danlos syndrome, which is characterised by very elastic skin, weakened blood vessels, and joint hypermobility [50–52]. COL24a1 and COL27a1 form their network separate from the other fibrillar collagens starting at a threshold of 17.5 (Figure 3). The triple helices of these two α-chains are shorter than those of the other fibrillar collagens [15]. This may be why they do not interact with other collagen α-chains at high thresholds.
4.2 Network-forming collagen clusters
The α-chains of the network-forming collagen COL4 are closely related, which indicates similar functions with few differences. As with the collagens of the fibrillar cluster, the α-chains of COL4 are also essential structure proteins in the body [53]. At a lower threshold of 17.0, a connection between COL4 and COL7 can be seen as mentioned in the literature [54, 55]. A defect of the α-chains COL4a3/a4/a5 can lead to Alport syndrome which affects the basement membranes of the kidney, inner ear, and eye [56–59].
The second cluster is formed by COL8a1, COL8a2, and COL10a1. This can be explained by their high similarity to each other, with COL8a1/a2 having one additional exon compared to COL10a1 [15]. Furthermore, COL8a1 and COL10a1 are associated with age-related macular degeneration [60], whereas mutations in COL8a2 can lead to Fuchs endothelial dystrophy, an impairment of vision [61]. The remaining network-forming collagens are part of the largest network, where COL6a3 is a major hub node.
4.3 Largest cluster
The largest cluster is not as homogeneous as the previous two clusters, consisting of three different types of collagens (network-forming, FACITs, and COL28a1). All α-chains are highly connected to each other. In this cluster, COL6a3 is a major hub node. Generally, COL6 forms the most connections with other collagens. At lower network thresholds (17.0 and lower), the α-chains of COL6 seem to be a major hub node for the whole network (Figure 3). This could be explained by the fact that COL6 is important for the function and stability of the cell membrane by interacting with other collagens such as COL1, COL2, COL4, and COL14 [62, 63]. Such connections can also be seen in our network for different thresholds, at lower thresholds for COL1/2 (17.0) and COL4 (16.75) and for COL14 at even higher thresholds of 18.0 (Supplementary 4).
This network seems to contain important collagens for developmental mechanisms. One disease associated with this is limb-girdle muscular dystrophy. This disease can be caused by mutations of multiple genes encoding proteins within the sarcolemma, cytosol, or nucleus of a myocyte (muscle cell), leading ultimately to membrane instability, a weakness of the dystrophin-associated glycoprotein complex, and defects in muscle repair mechanisms [64, 65]. The muscular weakness is caused by destabilisation of the structural proteins that are supposed to keep the muscle cell intact during contractions; the α-chains of COL6 could be among these proteins [66].
4.4 COL9 cluster
All three COL9 α-chains are relatively short (the longest chain, COL9a1, is only 921 AA long) and highly similar to each other. It is known that defects in these α-chains can lead to Stickler syndrome, a disease characterised by ophthalmic, orofacial, articular, and auditory defects [67]. The range of defects highly suggests that the α-chains of COL9 are expressed in the first and second pharyngeal arch derivatives, which include, for example, the maxilla, mandible, palate, and auditory ossicles [68, 69]. This could also explain why the α-chains of COL9 seem to only connect to other α-chains at low network thresholds (Figure 3).
4.5 Pair clusters
One cluster consisting of only two α-chains is COL15/18. COL15a1 is structurally homologous to COL18a1 [70], which explains their distinct and stable bond. COL18a1 is mostly expressed in the brain and eye and is associated with Knobloch syndrome, which is characterised by eye abnormalities arising during development and by occipital encephalocele [71]. Mouse mutants deficient in COL15a1 showed progressive degeneration in skeletal muscles and susceptibility to muscle damage [70]. Other than that, no discernible feature can be extracted from the network.
For the other two pair clusters COL13/26 and COL16/19 the literature search results in no entries concerning interactions. Therefore, no information could be extracted.
4.6 Comparison with the literature
Our results regarding the similarity matrix are also reflected in the PubMed reference matrices. In particular, local clusters can be detected. COL1a1 forms the main hub node, with the remaining fibrillar collagens having the strongest connection to each other. At the same time, the COL4 and, to a lesser degree, the COL6, COL8/10, and COL9 clusters also emerge as subclusters in the network. However, it should be noted that this is not surprising, since α-chains and collagens of the same type are usually studied together. The same can be said for COL1a1 functioning as the main hub node. COL1 is one of the most abundant structural proteins in vertebrates and is, therefore, often used in conjunction with other α-chains in studies [72]. Lastly, it should be noted that the number of references decreases considerably at higher collagen numbers. For the α-chains not shown, future studies will provide more literature data and may remedy this problem. Overall, the resulting reference network was as expected, but could still be used to validate parts of our alignment network.
With this study, we constructed an informative network and compared it with the literature. We were also able to link this network to specific questions (i.e., regarding diseases and developmental biology).
4.7 Outlook
Since collagens are important structural proteins present in every tissue of multicellular animals, an alteration naturally leads to serious effects on the body [73]. Our analysis showed that the representation with only collagen types is not sufficient to show the linkages, because the α-chains bind differently within a collagen type. Therefore, it is necessary to always examine the α-chains. Moreover, in further studies, it should be possible to verify the overall network and its subnetworks of collagens with immunohistochemistry. Additional analyses could examine the ageing of collagens (regarding chemical modifications) and its effects on networks [74], or extend the alignments to further proteins that interact with the collagens, such as adhesion proteins like fibronectins, laminins, tenascins, or glycoproteins, and, in the far future, to the whole human proteome.
Supplementary Material
Mendeley repository: The datasets generated and analysed during this study along with the code and several supplementary files are available in the Mendeley repository: Wesp, Valentin; Stark, Heiko (2024), “Constructing networks for comparison of collagen types”, Mendeley Data, V3, doi: 10.17632/9pf5n687z4.3.
Acknowledgments
We thank all the members of the Department of Bioinformatics who provided guidance and support during this study. Additionally, we thank A. Berndt for stimulating discussions.
- Research ethics: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission. V.W., L.S., and H.S. conceived the study. V.W. and H.S. supervised and performed the calculations. V.W. and H.S. analysed network data. V.H. and H.S. drafted the manuscript and all figures. J.M.Z.C. contributed developmental and muscular functional aspects. S.S. contributed biochemical and bioinformatics expertise. All authors contributed to the interpretation of the results and revised the manuscript.
- Competing interests: The authors declare no competing interests.
- Research funding: This study benefited from the research experience of HS gained in studies funded by the Center of Interdisciplinary Prevention of Diseases related to Professional Activities (KIP) funded by the Friedrich-Schiller-University Jena and the ‘Berufsgenossenschaft Nahrungsmittel und Gastgewerbe Erfurt (Germany)’ (BGN).
References
1. Huxley-Jones, J, Robertson, DL, Boot-Handford, RP. On the origins of the extracellular matrix in vertebrates. Matrix Biol 2007;26:2–11. https://doi.org/10.1016/j.matbio.2006.09.008.
2. LeBleu, VS, MacDonald, B, Kalluri, R. Structure and function of basement membranes. Exp Biol Med 2007;232:1121–9. https://doi.org/10.3181/0703-mr-72.
3. Patino, MG, Neiders, ME, Andreana, S, Noble, B, Cohen, RE. Collagen: an overview. Implant Dent 2002;11:280–5. https://doi.org/10.1097/00008505-200207000-00014.
4. Salamito, M, Nauroy, P, Ruggiero, F. The collagen superfamily: everything you always wanted to know. In: The collagen superfamily and collagenopathies. Berlin, Germany: Springer; 2021, 8:1–22 pp. https://doi.org/10.1007/978-3-030-67592-9_1.
5. Ahmed, MS, Kamruzzaman, M, Rana, M, Akond, Z, Mollah, M. In silico analyses of human collagen protein function prediction. J Biol Sci 2016;24:55–65. https://doi.org/10.3329/jbs.v24i0.37487.
6. Bornstein, P, Sage, H. Structurally distinct collagen types. Annu Rev Biochem 1980;49:957–1003. https://doi.org/10.1146/annurev.bi.49.070180.004521.
7. Kucharz, EJ. The collagens: biochemistry and pathophysiology. Berlin, Germany: Springer Science & Business Media; 2012.
8. Miller, EJ, Gay, S. Collagen: an overview. Methods Enzymol 1982;82:3–32. https://doi.org/10.1016/0076-6879(82)82058-2.
9. Zhao, C, Xiao, Y, Ling, S, Pei, Y, Ren, J. Structure of collagen. In: Ling, S, editor. Fibrous proteins: design, synthesis, and assembly. Methods in molecular biology. New York, NY: Humana; 2021, 2347:17–25 pp. https://doi.org/10.1007/978-1-0716-1574-4_2.
10. van Der Rest, M, Garrone, R. Collagen family of proteins. Faseb J 1991;5:2814–23. https://doi.org/10.1096/fasebj.5.13.1916105.
11. Tajbakhsh, S, Spörle, R. Somite development: constructing the vertebrate body. Cell 1998;92:9–16. https://doi.org/10.1016/s0092-8674(00)80894-6.
12. Duband, JL, Thiery, JP. Distribution of laminin and collagens during avian neural crest development. Development 1987;101:461–78. https://doi.org/10.1242/dev.101.3.461.
13. Leivo, I, Vaheri, A, Timpl, R, Wartiovaara, J. Appearance and distribution of collagens and laminin in the early mouse embryo. Dev Biol 1980;76:100–14. https://doi.org/10.1016/0012-1606(80)90365-6.
14. Gonçalves, TJM, Boutillon, F, Lefebvre, S, Goffin, V, Iwatsubo, T, Wakabayashi, T, et al. Collagen XXV promotes myoblast fusion during myogenic differentiation and muscle formation. Sci Rep 2019;9:5878. https://doi.org/10.1038/s41598-019-42296-6.
15. Gordon, MK, Hahn, RA. Collagens. Cell Tissue Res 2010;339:247–57. https://doi.org/10.1007/s00441-009-0844-4.
16. Mayne, R, Brewton, RG. New members of the collagen superfamily. Curr Opin Cell Biol 1993;5:883–90. https://doi.org/10.1016/0955-0674(93)90039-s.
17. Maynes, R. Structure and function of collagen types. Orlando, FL: Elsevier; 2012.
18. Vuorio, E, de Crombrugghe, B. The family of collagen genes. Annu Rev Biochem 1990;59:837–72. https://doi.org/10.1146/annurev.biochem.59.1.837.
19. Ricard-Blum, S. The collagen family. Cold Spring Harbor Perspect Biol 2011;3:a004978. https://doi.org/10.1101/cshperspect.a004978.
20. Exposito, JY, Valcourt, U, Cluzel, C, Lethias, C. The fibrillar collagen family. Int J Mol Sci 2010;11:407–26. https://doi.org/10.3390/ijms11020407.
21. Franzke, CW, Tasanen, K, Schumann, H, Bruckner-Tuderman, L. Collagenous transmembrane proteins: collagen XVII as a prototype. Matrix Biol 2003;22:299–309. https://doi.org/10.1016/s0945-053x(03)00051-9.
22. Knupp, C, Squire, JM. Molecular packing in network-forming collagens. Adv Protein Chem 2005;70:375–403. https://doi.org/10.1016/s0065-3233(05)70011-5.
23. Ricard-Blum, S, Ruggiero, F. The collagen superfamily: from the extracellular matrix to the cell membrane. Pathol Biol 2005;53:430–42. https://doi.org/10.1016/j.patbio.2004.12.024.
24. Brodsky, B, Persikov, AV. Molecular structure of the collagen triple helix. Adv Protein Chem 2005;70:301–39. https://doi.org/10.1016/s0065-3233(05)70009-7.
25. Connizzo, BK, Yannascoli, SM, Soslowsky, LJ. Structure–function relationships of postnatal tendon development: a parallel to healing. Matrix Biol 2013;32:106–16. https://doi.org/10.1016/j.matbio.2013.01.007.
26. McAlinden, A, Smith, TA, Sandell, LJ, Ficheux, D, Parry, DA, Hulmes, DJ. α-Helical coiled-coil oligomerization domains are almost ubiquitous in the collagen superfamily. J Biol Chem 2003;278:42200–7. https://doi.org/10.1074/jbc.m302429200.
27. Francomano, CA. Key role for a minor collagen. Nat Genet 1995;9:6–8. https://doi.org/10.1038/ng0195-6.
28. Brinckmann, J. Collagens at a glance. In: Collagen: Primer in Structure, Processing and Assembly. Springer; 2005, 247:1–6 pp. https://doi.org/10.1007/b103817.
29. Gelse, K, Pöschl, E, Aigner, T. Collagens—structure, function, and biosynthesis. Adv Drug Deliv Rev 2003;55:1531–46. https://doi.org/10.1016/j.addr.2003.08.002.
30. Hulmes, D. Collagen diversity, synthesis and assembly. In: Collagen: structure and mechanics. Springer; 2008:15–47 pp. https://doi.org/10.1007/978-0-387-73906-9_2.
31. Linsenmayer, T. Collagen. In: Cell Biology of Extracellular Matrix, 2nd ed. Springer; 1991:7–44 pp. https://doi.org/10.1007/978-1-4615-3770-0_2.
32. Birk, DE, Bruckner, P. Collagen suprastructures. In: Collagen: primer in structure, processing and assembly; 2005:185–205 pp. https://doi.org/10.1007/b103823.
33. Fratzl, P. Collagen: structure and mechanics, an introduction. In: Collagen: Structure and Mechanics. Springer; 2008:1–13 pp. https://doi.org/10.1007/978-0-387-73906-9_1.
34. Goh, KL, Listrat, A, Béchet, D. Hierarchical mechanics of connective tissues: integrating insights from nano to macroscopic studies. J Biomed Nanotechnol 2014;10:2464–507. https://doi.org/10.1166/jbn.2014.1960.
35. Heino, J. The collagen family members as cell adhesion proteins. Bioessays 2007;29:1001–10. https://doi.org/10.1002/bies.20636.
36. Nimni, ME, Harkness, RD. Molecular structure and functions of collagen. In: Collagen, 1st ed. CRC Press; 2018:1–78 pp. https://doi.org/10.1201/9781351070799-1.
37. Kim, MS, Pinto, SM, Getnet, D, Nirujogi, RS, Manda, SS, Chaerkady, R, et al. A draft map of the human proteome. Nature 2014;509:575–81. https://doi.org/10.1038/nature13302.
38. Uhlén, M, Fagerberg, L, Hallström, BM, Lindskog, C, Oksvold, P, Mardinoglu, A, et al. Tissue-based map of the human proteome. Science 2015;347:1260419. https://doi.org/10.1126/science.1260419.
39. Oh, SP, Griffith, CM, Hay, ED, Olsen, BR. Tissue-specific expression of type XII collagen during mouse embryonic development. Dev Dynam 1993;196:37–46. https://doi.org/10.1002/aja.1001960105.
40. Arseni, L, Lombardi, A, Orioli, D. From structure to phenotype: impact of collagen alterations on human health. Int J Mol Sci 2018;19:1407. https://doi.org/10.3390/ijms19051407.
41. Jumper, J, Evans, R, Pritzel, A, Green, T, Figurnov, M, Ronneberger, O, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. https://doi.org/10.1038/s41586-021-03819-2.
42. Varadi, M, Anyango, S, Deshpande, M, Nair, S, Natassia, C, Yordanova, G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 2022;50:D439–44. https://doi.org/10.1093/nar/gkab1061.
43. Brown, JC, Timpl, R. The collagen superfamily. Int Arch Allergy Immunol 1995;107:484–90. https://doi.org/10.1159/000237090.
44. Beck, K, Brodsky, B. Supercoiled protein motifs: the collagen triple-helix and the α-helical coiled coil. J Struct Biol 1998;122:17–29. https://doi.org/10.1006/jsbi.1998.3965.
45. Marro, J, Pfefferli, C, de Preux Charles, AS, Bise, T, Jaźwińska, A. Collagen XII contributes to epicardial and connective tissues in the zebrafish heart during ontogenesis and regeneration. PLoS One 2016;11:e0165497. https://doi.org/10.1371/journal.pone.0165497.
46. Mereness, JA, Mariani, TJ. The critical role of collagen VI in lung development and chronic lung disease. Matrix Biol Plus 2021;10:100058. https://doi.org/10.1016/j.mbplus.2021.100058.
47. Kuivaniemi, H, Tromp, G, Prockop, DJ. Mutations in collagen genes: causes of rare and some common diseases in humans. Faseb J 1991;5:2052–60. https://doi.org/10.1096/fasebj.5.7.2010058.
48. Nassa, M, Anand, P, Jain, A, Chhabra, A, Jaiswal, A, Malhotra, U, et al. Analysis of human collagen sequences. Bioinformation 2012;8:26. https://doi.org/10.6026/97320630008026.
49. Mak, KM, Png, CYM, Lee, DJ. Type V collagen in health, disease, and fibrosis. Anat Rec 2016;299:613–29. https://doi.org/10.1002/ar.23330.
50. Imamura, Y, Scott, IC, Greenspan, DS. The Pro-α3 (V) Collagen Chain: complete primary structure, expression domains in adult and developing tissues, and comparison to the structures and expression domains of the other types V and XI procollagen chains. J Biol Chem 2000;275:8749–59. https://doi.org/10.1074/jbc.275.12.8749.
51. Steinmann, B, Royce, PM, Superti-Furga, A. The Ehlers-Danlos syndrome. In: Connective tissue and its heritable disorders: molecular, genetic, and medical aspects, 1st ed. Springer; 2002, 802:431–523 pp. https://doi.org/10.1002/0471221929.ch9.
52. Malfait, F, Wenstrup, RJ, De Paepe, A. Clinical and genetic aspects of Ehlers-Danlos syndrome, classic type. Genet Med 2010;12:597–605. https://doi.org/10.1097/gim.0b013e3181eed412.
53. Khoshnoodi, J, Pedchenko, V, Hudson, BG. Mammalian collagen IV. Microsc Res Tech 2008;71:357–70. https://doi.org/10.1002/jemt.20564.
54. Roig-Rosello, E, Rousselle, P. The human epidermal basement membrane: a shaped and cell instructive platform that aging slowly alters. Biomolecules 2020;10:1607. https://doi.org/10.3390/biom10121607.
55. Brittingham, R, Uitto, J, Fertala, A. High-affinity binding of the NC1 domain of collagen VII to laminin 5 and collagen IV. Biochem Biophys Res Commun 2006;343:692–9. https://doi.org/10.1016/j.bbrc.2006.03.034.
56. De Gregorio, V, Caparali, EB, Shojaei, A, Ricardo, S, Barua, M. Alport syndrome: clinical spectrum and therapeutic advances. Kidney Med 2023;5:100631. https://doi.org/10.1016/j.xkme.2023.100631.
57. Imafuku, A, Nozu, K, Sawa, N, Hasegawa, E, Hiramatsu, R, Kawada, M, et al. Autosomal dominant form of type IV collagen nephropathy exists among patients with hereditary nephritis difficult to diagnose clinicopathologically. Nephrology 2018;23:940–7. https://doi.org/10.1111/nep.13115.
58. Shulman, C, Liang, E, Kamura, M, Udwan, K, Yao, T, Cattran, D, et al. Type IV collagen variants in CKD: performance of computational predictions for identifying pathogenic variants. Kidney Med 2021;3:257–66. https://doi.org/10.1016/j.xkme.2020.12.007.
59. Deltas, C. Pos-435 next generation sequencing identifies candidate genetic modifiers potentially exacerbating kidney disease in col4a3/a4 heterozygous patients. Kidney Int Rep 2022;7:S194. https://doi.org/10.1016/j.ekir.2022.01.462.
60. Cascella, R, Strafella, C, Caputo, V, Errichiello, V, Zampatti, S, Milano, F, et al. Towards the application of precision medicine in Age-Related Macular Degeneration. Prog Retinal Eye Res 2018;63:132–46. https://doi.org/10.1016/j.preteyeres.2017.11.004.
61. Zhang, J, Patel, DV. The pathophysiology of Fuchs’ endothelial dystrophy–a review of molecular and cellular insights. Exp Eye Res 2015;130:97–105. https://doi.org/10.1016/j.exer.2014.10.023.
62. Cescon, M, Gattazzo, F, Chen, P, Bonaldo, P. Collagen VI at a glance. J Cell Sci 2015;128:3525–31. https://doi.org/10.1242/jcs.169748.
63. Tonelotto, V, Trapani, V, Bretaud, S, Heumüller, SE, Wagener, R, Ruggiero, F, et al. Spatio-temporal expression and distribution of collagen VI during zebrafish development. Sci Rep 2019;9:19851. https://doi.org/10.1038/s41598-019-56445-4.
64. Bushby, KM, Collins, J, Hicks, D. Collagen type VI myopathies. In: Progress in Heritable Soft Connective Tissue Diseases, 1st ed. Springer; 2014, 802:185–99 pp. https://doi.org/10.1007/978-94-007-7893-1_12.
65. Murphy, AP, Straub, V. The classification, natural history and treatment of the limb girdle muscular dystrophies. J Neuromuscul Dis 2015;2:S7–19. https://doi.org/10.3233/jnd-150105.
66. Dowling, P, Gargan, S, Murphy, S, Zweyer, M, Sabir, H, Swandulla, D, et al. The dystrophin node as integrator of cytoskeletal organization, lateral force transmission, fiber stability and cellular signaling in skeletal muscle. Proteomes 2021;9:9. https://doi.org/10.3390/proteomes9010009.
67. Robin, NH, Moran, RT, Ala-Kokko, L. Stickler syndrome. In: GeneReviews®, 1993. Seattle, WA: GeneReviews, University of Washington; 2021.
68. Trainor, PA, Krumlauf, R. Hox genes, neural crest cells and branchial arch patterning. Curr Opin Cell Biol 2001;13:698–705. https://doi.org/10.1016/s0955-0674(00)00273-8.
69. Liu, Q, Gibson, MP, Sun, H, Qin, C. Dentin sialophosphoprotein (DSPP) plays an essential role in the postnatal development and maintenance of mouse mandibular condylar cartilage. J Histochem Cytochem 2013;61:749–58. https://doi.org/10.1369/0022155413502056.
70. Marneros, AG, Olsen, BR. The role of collagen-derived proteolytic fragments in angiogenesis. Matrix Biol 2001;20:337–45. https://doi.org/10.1016/s0945-053x(01)00151-2.
71. Seppinen, L, Pihlajaniemi, T. The multiple functions of collagen XVIII in development and disease. Matrix Biol 2011;30:83–92. https://doi.org/10.1016/j.matbio.2010.11.001.
72. Stover, DA, Verrelli, BC. Comparative vertebrate evolutionary analyses of type I collagen: potential of COL1a1 gene structure and intron variation for common bone-related diseases. Mol Biol Evol 2011;28:533–42. https://doi.org/10.1093/molbev/msq221.
73. Iozzo, RV, Gubbiotti, MA. Extracellular matrix: the driving force of mammalian diseases. Matrix Biol 2018;71:1–9. https://doi.org/10.1016/j.matbio.2018.03.023.
74. Fichtner, M, Schuster, S, Stark, H. Determination of scoring functions for protein damage susceptibility. Biosystems 2020;187:104035. https://doi.org/10.1016/j.biosystems.2019.104035.
© 2024 the author(s), published by De Gruyter, Berlin/Boston
This work is licensed under the Creative Commons Attribution 4.0 International License.
Articles in this issue
- Frontmatter
- Special Section: 18th International Symposium on Integrative Bioinformatics, Zurich, Switzerland, 2024; Guest Editors: Can Türker and Christian Panse
- International symposium on integrative bioinformatics 2024 – editorial
- A roadmap for a middleware as a federation service for integrative data retrieval of agricultural data
- Layout of anatomical structures and blood vessels based on the foundational model of anatomy
- Constructing networks for comparison of collagen types
- Leonhard Med, a trusted research environment for processing sensitive research data
- Exploring animal behaviour multilayer networks in immersive environments – a conceptual framework
- Regular Contribution
- MCMVDRP: a multi-channel multi-view deep learning framework for cancer drug response prediction