Structure-based machine-guided mapping of amyloid sequence space reveals uncharted sequence clusters with higher solubilities - Nature

 
ARTICLE
                  https://doi.org/10.1038/s41467-020-17207-3                 OPEN

                  Structure-based machine-guided mapping of
                  amyloid sequence space reveals uncharted
                  sequence clusters with higher solubilities
                  Nikolaos Louros1,2, Gabriele Orlando1,2, Matthias De Vleeschouwer1,2, Frederic Rousseau1,2 ✉ &
                  Joost Schymkowitz1,2 ✉

                  The amyloid conformation can be adopted by a variety of sequences, but the precise
                  boundaries of amyloid sequence space are still unclear. The currently charted amyloid
                  sequence space is strongly biased towards hydrophobic, beta-sheet prone sequences that
                  form the core of globular proteins and towards Q/N/Y-rich yeast prions. Here, we took advantage
                  of the increasing amount of high-resolution structural information on amyloid cores currently
                  available in the protein databank to implement a machine learning approach, named Cordax
                  (https://cordax.switchlab.org), that explores amyloid sequence beyond its current
                  boundaries. Clustering by t-Distributed Stochastic Neighbour Embedding (t-SNE) shows how our
                  approach resulted in an expansion away from hydrophobic amyloid sequences towards
                  clusters of lower aliphatic content and higher charge, or regions of helical and disordered
                  propensities. These clusters uncouple amyloid propensity from solubility, representing
                  sequence flavours compatible with surface-exposed patches in globular proteins, functional
                  amyloids or sequences associated with liquid-liquid phase transitions.

                  1 Switch Laboratory, VIB Center for Brain and Disease Research, Herestraat 49, 3000 Leuven, Belgium.
                  2 Switch Laboratory, Department of Cellular and Molecular Medicine, KU Leuven, Herestraat 49, 3000 Leuven, Belgium.
                  ✉email: Frederic.Rousseau@kuleuven.vib.be; Joost.Schymkowitz@kuleuven.vib.be

                  NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications                           1

The amyloid cross-β state is a polypeptide conformation that is adopted by 36 proteins or peptides
associated with human protein deposition pathologies1. It also constitutes the structural core of a
growing number of functional amyloids in both bacteria and eukaryotes2,3. Beyond these bona fide
functional and pathological amyloids, it has been demonstrated that many if not most proteins can
adopt an amyloid-like conformation upon unfolding/misfolding4. This has led to the notion that, just
like the α-helix or β-sheet, the amyloid state is a generic polypeptide backbone conformation, but
also that amino acids have different propensities to adopt the amyloid conformation5. Initially, it
was observed that amyloid-like aggregation correlates with hydrophobicity, β-strand propensity, and
(lack of) net charge6. This triggered the development of aggregation prediction algorithms that
essentially evaluate the above biophysical propensities7,8. Others extended this approach by scaling
residue propensities between protein folding and aggregation9,10. These algorithms confirmed the
ubiquity of amyloid-like propensity in natural protein sequences and particularly in globular
proteins, as it was estimated that 15–20% of residues in a typical globular domain are within
aggregation-prone regions (APRs)11,12. These APRs are sequence segments of six to seven amino acids
in length on average and are mostly buried within the protein structure, where they constitute the
hydrophobic core stabilising tertiary protein structure13–15. On the other hand, the increasing
identification of both yeast prions and functional amyloids clearly indicated that amyloid sequence
space is not monolithic and that more polar/less aliphatic sequences represent important alternative
populations of amyloid sequence space3. The limited sensitivity of the above-cited algorithms to
specifically identify these other subpopulations confirmed the underestimated sequence versatility of
the amyloid conformation. Indeed, more recently the role of amyloid-like sequences in proteins
mediating liquid–liquid phase transitions again demonstrates the ubiquity of the amyloid in
biological function and further erodes the image of the amyloid state as a predominantly disease-
and/or toxicity-associated protein conformation16–18. Rather, this suggests that, like globular
protein folding, amyloid assembly is a matter of kinetic and thermodynamic control that can be
evolutionarily tuned by sequence variation and selection.
   Efforts to develop aggregation predictors that can identify a broader spectrum of amyloid
sequences have increased over the years19. Such approaches focused on identifying position-specific
patterns by reference to accumulated experimental data of APRs20–22, or by using energy functions of
cross-beta pairings23. Recently developed meta-predictors produce consensus outputs by combining
previous methods, in an attempt to boost performance24,25. Indirect structure-based methods were
initially developed by considering secondary structure propensities26,27. Complementary studies
extended this notion by suggesting that disease-related amyloids form β-strand-loop-β-strand
motifs28. However, the principle of using structural information to accurately predict
aggregation-prone segments in protein sequences stems from the detailed work of Eisenberg and
co-workers. The 3D-profiling method utilised the crystal structure of the fibril-forming segment
NNQQNY (PDB ID: 1YJO) derived from the Sup35 prion protein, to thread and evaluate sequence fitting
using the Rosetta energy function29. In this work, we build on this principle to develop Cordax, an
exhaustively trained regression model that leverages a substantial library of curated template
structures combined with machine learning. Cordax not only detects APRs in proteins, but also
predicts the structural topology, orientation and overall architecture of the resulting putative
fibril core. To validate the accuracy of our predictions, we designed a screen of 96 newly predicted
APRs and experimentally determined their aggregation properties. Using this approach, we identified
less hydrophobic, polar and charged aggregation-prone sequences that increasingly uncouple solubility
and amyloid propensity, closely resembling characteristics of phase-separation inducers. Clustering
by t-distributed stochastic neighbour embedding reveals the heterogeneous substructure of amyloid
sequence space, consisting of varying clusters corresponding to sequences compatible with globular
structure, functional scaffolding amyloids, N/Q/Y-rich prions, helical peptides and intrinsically
disordered sequences. Together, the structural exploration performed here demonstrates that the field
has now gathered sufficient structural and sequence information to start classifying amyloids
according to different structural and functional niches. Just like for globular proteins in the
1980s, this will allow fine-tuning of both general and context-dependent structural rule learning,
enabling the manipulation and design of amyloid structure and function.

Results
Overall approach of Cordax. We wanted to design a novel structure-based amyloid core sequence
prediction method that (a) leverages all the structural information currently available, and (b)
employs a machine-learning element for optimal prediction performance. To this end, we first built a
curated template library of amyloid core structures as described in the paragraph below. In the vein
of previous prediction methods29, we fixed on the hexapeptide as a unit of prediction. In order to
determine the amyloid propensity of a query hexapeptide, we start by modelling its side chains on all
the available amyloid template structures using the FoldX force field30, which yields a model and an
associated free energy estimate (ΔG, kcal/mol) for each template. These free energies are then fed
into a logistic regression model, which is a simple statistical method relating a binary outcome to
continuous variables. The prediction output of Cordax is twofold: first, there is the prediction from
the logistic regression whether or not the segment is an amyloid core sequence. Second, for the
sequences deemed amyloid core, the most likely amyloid core model is provided. For longer query
sequences, a sliding window approach is adopted. The technical details of the pipeline can be found
in the “Methods” section.

Refinement of fibril structures for machine learning. We isolated 78 short segment fibril core
high-resolution structures from the Protein Data Bank (Supplementary Data 1). Templates were grouped
into seven distinct topological classes out of eight theoretically possible based on their overall
structural properties, as previously proposed by Sawaya et al.31. Briefly, topologies are defined by
whether β-sheets have parallel versus antiparallel orientation, by the orientation of the strand
faces that form the steric zipper (face-to-face versus face-to-back), and finally the orientation of
both sheets towards each other and whether that results in identical or different fibril edges. This
complexity was addressed by generating an ensemble of amyloid cores per structure using crystal
contact information derived from the solved structures. Every template comprises two facing β-sheets,
each composed of five successive β-strands. Since parallel architectures can share more than one
homotypic packing interface, those structures were split into separate individual entries (Fig. 1).
To ensure uniformity, we expanded the number of structural variants by breaking down longer segments
into hexapeptide constituents, thus yielding a library of 179 peptide fragment structures (Fig. 1 and
Supplementary Data 1).
   The amyloid interaction interfaces were analysed in detail following energy refinement by the
FoldX force field30. During this step we identified and rejected 33 imperfect β-packing interfaces
formed by β-strands that contribute fewer than three


[Figure 1. Panel labels: (1) Interface fragmentation (Interface a, Interface b; views rotated by 90°); (2) Hexapeptide fragmentation (Fragment a_1, Fragment a_2, Fragment b_1, Fragment b_2).]

Fig. 1 Processing steps of the peptide fragment library. a Crystal contact information was used to generate fibril cores from isolated PDB structures.
Structures containing multiple packing interfaces were split into individual templates (1), which were in turn split into hexapeptide core fragments (2).
Source data are provided as a Source Data file.
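
As a rough illustration of the hexapeptide fragmentation in step (2), and of the sliding-window scanning that the text describes for longer query sequences, the sketch below enumerates every overlapping six-residue window of a sequence. The function names `hexapeptide_windows` and `scan_sequence`, the one-residue step and the `score_fn`/`threshold` parameters are assumptions made for illustration only; the source does not specify how its fragments overlap, and `score_fn` is a stand-in for whatever scoring model is used.

```python
def hexapeptide_windows(sequence, size=6):
    """Return (start, fragment) pairs for every overlapping window of `size` residues."""
    seq = sequence.strip().upper()
    if len(seq) < size:
        return []
    return [(i, seq[i:i + size]) for i in range(len(seq) - size + 1)]


def scan_sequence(sequence, score_fn, threshold):
    """Score each window with a caller-supplied `score_fn` (hypothetical stand-in
    for a trained model) and keep the windows scoring at or above `threshold`."""
    hits = []
    for start, fragment in hexapeptide_windows(sequence):
        score = score_fn(fragment)
        if score >= threshold:
            hits.append((start, fragment, score))
    return hits


if __name__ == "__main__":
    # Toy input built around the NNQQNY segment mentioned in the text.
    for start, fragment in hexapeptide_windows("GNNQQNY"):
        print(start, fragment)
```

A sequence of length n yields n - 5 windows, so a seven-residue segment contributes two hexapeptide fragments.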

interacting residues, thus reducing the ensemble to 146 structures (Supplementary Data 1). Detailed
analysis of the contributions of various energy components showed that these excluded β-packing
interfaces have inefficient shape complementarity and low overall stability, stemming from a
combination of weak electrostatic contributions, diminished van der Waals interactions and exposure
of hydrophobic residues to the solvent (Fig. 2a).
   Previous work has highlighted that distinct topological layouts can potentially introduce a
stronger tolerance for the integration of protein sequence segments and as a result can generate
several potential type-I errors (false positives)29. To address this issue, we implemented a two-step
cross-threading exploration of putative structural promiscuous traps. In more detail, we extracted a
non-redundant set of hexapeptide sequences from the structural library (73 sequences), which was
subsequently cross-modelled in an all-against-all reiteration process. Using an empirical cut-off
threshold (=5), a total of three structural fragments was initially identified and removed.
Eliminating these structures led to the identification and subsequent elimination of three additional
promiscuous templates, resulting in the final Cordax library, composed of 140 zipper structures
(Fig. 2b and c).

Benchmarking aggregation propensity detection with Cordax. As an initial test of the prediction
accuracy of the regression model, we performed leave-one-out cross-validation on the training
dataset32 and performance metrics were determined on a peptide basis. Due to the extensive size of
the dataset, comparison to other software was performed only with methods supporting multiple
sequence input and a non-binary scoring function, since performances were compared using receiver
operating characteristic (ROC) analysis33. The ROC curves generated highlight that Cordax performance
exceeds that of seven state-of-the-art methods, which we applied using optimised options defined by
the developers7,9,21–24,34. In detail, Cordax performs well above random, as depicted by the highest
total area under the curve (AUC) value of 0.87 (Fig. 3a). Distribution analysis of the scoring values
indicates that the method achieves optimal separation, resulting in minimal scoring overlap between
positive and negative amyloid-forming sequences (Fig. 3b). As previously reported, TANGO showed high
specificity due to the over-representation of unscored values, which is also evident for WALTZ as
well as MetAmyl, which incorporates the latter method in its meta-prediction. The cost of high
specificity is also reflected by the calculated F1 values, as PASTA and TANGO report low recall
values. On the other hand, AGGRESCAN and GAP produce significant overpredictions, as depicted by their
reported false-positive rates (FPR values of 0.54 and 0.76, respectively) (Fig. 3c). The optimal
score thresholding of our method was determined from the ROC curve analysis as the score where
predictions show the highest sensitivity-to-specificity ratio. According to this, Cordax achieves a
well-balanced prediction by reporting with high specificity (86%) more than 7 out of 10
aggregation-prone segments (72%), which is reflected by the highest calculated MCC, AUC and F1 values
compared to other available software (Fig. 3c).
   To further benchmark the method, we tested it against full-length protein sequences. For this we
used a standardised set of 34 annotated amyloidogenic proteins that was previously used for
validation of several earlier aggregation predictors25, following a filtering step for potential
overlaps to


[Figure 2, panel a. Pairwise correlations between FoldX energy terms, given as all / accepted / rejected fragments:
  ΔG vs BackHbonds:             0.221 / 0.206 / 0.211
  ΔG vs SideHbonds:             0.0339 / -0.0205 / -0.306
  ΔG vs VdW:                    0.611 / 0.524 / 0.791
  ΔG vs Electrostatics:         0.559 / 0.526 / 0.365
  ΔG vs SolvPolar:              -0.0299 / 0.158 / -0.538
  ΔG vs SolvHphobic:            0.697 / 0.654 / 0.806
  BackHbonds vs SideHbonds:     0.128 / 0.132 / 0.0738
  BackHbonds vs VdW:            0.213 / 0.173 / 0.215
  BackHbonds vs Electrostatics: -0.197 / -0.248 / -0.342
  BackHbonds vs SolvPolar:      -0.287 / -0.219 / -0.516
  BackHbonds vs SolvHphobic:    0.136 / 0.11 / 0.101
  SideHbonds vs VdW:            0.365 / 0.385 / -0.145
  SideHbonds vs Electrostatics: 0.0391 / -0.0142 / -0.0855
  SideHbonds vs SolvPolar:      -0.549 / -0.56 / 0.0135
  SideHbonds vs SolvHphobic:    -0.0231 / -0.0886 / -0.223
  VdW vs Electrostatics:        0.22 / 0.151 / -0.0543
  VdW vs SolvPolar:             -0.677 / -0.595 / -0.879
  VdW vs SolvHphobic:           0.901 / 0.859 / 0.982
  Electrostatics vs SolvPolar:  -0.0129 / 0.0724 / 0.31
  Electrostatics vs SolvHphobic: 0.217 / 0.169 / -0.0201
  SolvPolar vs SolvHphobic:     -0.416 / -0.263 / -0.804]

[Figure 2, panel b. Starting library (146 structures), 1st cross-thread step, filtered library (143 structures), 2nd cross-thread step, CORDAX library (140 structures). Circular histograms plot selection counts (axis 0 to 15) per PDB code; labelled entries: 1YJO, 3FR1, 6CFH_3, 3FOD, 4XFN, 4W67_2.]

[Figure 2, panel c. WALTZ-DB 2.0 (database of amyloid forming peptides), 1402 sequences; CORDAX library, 140 structures; 196,280 ΔG energies used as training data for the regression model; a sequence input yields a topology model and amyloid scoring.]

Fig. 2 Optimising the Cordax structural library. a Correlation plot of interface energies calculated using FoldX. Top half shows correlation values with
scatter plots indicated at the bottom half. Rejected fragments sharing low shape complementarity (shown in yellow) have correlating weak van der Waals
interfaces, as well as poor solvation energies for hydrophobic side chains compared to the remaining library (indicated in purple). Linear regression lines
are shown with 95% confidence interval (shown in grey shaded areas). b Promiscuity sorting of the structural library performed as a two-step cross-
threading process. Circular histograms highlight three major promiscuous structures (n > 5) which were removed during the primary (PDB ID: 1YJO, 3FR1
and 6CFH_3) and secondary step (PDB ID: 3FOD, 4XFN and 4W67_2). c Schematic representation of Cordax training and the derived pipeline.

[Fig. 3 graphic omitted. Panel a: ROC curves for CORDAX, WALTZ, PASTA 2.0, AGGRESCAN, MetAmyl, TANGO, GAP and MILAMP (x-axis: average false positive rate; y-axis: average true positive rate; colour scale: CORDAX cutoffs; diagonal: random prediction; annotated AUC = 0.87). Panel b: box plots and density plots by classification (Amyloid vs Non-Amyloid) for CORDAX score, PASTA 2.0 energy, WALTZ score, TANGO score, MetAmyl (total hot spot area), AGGRESCAN (Na4vSS), GAP score and MILAMP score.]

Panel c:

                CORDAX   WALTZ   PASTA 2.0   AGGRESCAN   MetAmyl   TANGO   GAP    MILAMP
Accuracy        0.81     0.67    0.76        0.57        0.79      0.73    0.51   0.73
Precision       0.74     0.54    0.74        0.60        0.72      0.82    0.42   0.66
TPR (Recall)    0.72     0.65    0.52        0.85        0.69      0.31    0.94   0.49
FPR             0.14     0.33    0.10        0.54        0.15      0.04    0.74   0.14
MCC             0.57     0.34    0.46        0.29        0.54      0.38    0.25   0.38
F1              0.73     0.60    0.61        0.71        0.70      0.45    0.58   0.56
AUC             0.87     0.73    0.84        0.84        0.78      0.64    0.70   0.76

Fig. 3 Benchmarking of CORDAX. a ROC curve analysis for Cordax and seven other state-of-the-art methods against WALTZ-DB 2.0. For WALTZ, TANGO
and MetAmyl, FPR stops at earlier rates due to minimal scoring variations. b Cordax score distribution compared to other tools. The regression model
achieves better scoring separation for predictions between amyloid-forming (shown in blue) and non-amyloid sequences (shown in red). Distributions are
shown as standard box-plots highlighting median, upper (Q3) and lower (Q1) quartiles. Whiskers indicate the ranges for values within 1.5 interquartile
range (IQR) above and below Q3 and Q1, respectively. Density plots for WALTZ, TANGO, MetAmyl and GAP are scaled due to the overrepresentation of
unscored values or false positives, respectively. c Comparison of performance metrics, indicating the superiority of Cordax over other sequence predictors
(MCC = 0.57, F1 = 0.73 and AUC = 0.87).
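
The threshold-based metrics listed in panel c (accuracy, precision, TPR, FPR, MCC and F1) all follow from a single confusion matrix. The sketch below uses the standard textbook definitions; it is not taken from the Cordax code base, the function names are hypothetical, and the counts in the usage lines are toy values rather than WALTZ-DB 2.0 results. AUC is not included, since it requires the full score distribution rather than one confusion matrix.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate (TPR)
    fpr = fp / (fp + tn)             # false positive rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "tpr": recall,
        "fpr": fpr,
        "mcc": mcc,
        "f1": f1,
    }


def f1_from_precision_recall(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy counts (assumed, not from the paper): 50 amyloid, 50 non-amyloid peptides.
toy = classification_metrics(tp=40, fp=10, tn=40, fn=10)
print({name: round(value, 2) for name, value in toy.items()})

# Consistency check on the reported Cordax precision (0.74) and recall (0.72).
print(round(f1_from_precision_recall(0.74, 0.72), 2))
```

Applied to the precision and recall reported for Cordax in panel c, the harmonic-mean formula gives an F1 that rounds to the tabulated 0.73, which is a quick way to sanity-check individual rows of such a table.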


the training data set. Despite its wide use, this collection suffers from insufficient experimental characterisation of certain large entries (e.g. gelsolin, kerato-epithelin, lactoferrin, amphoterin and others), which has been shown to introduce type-I errors (false positives). This error propensity derives from non-amyloid annotations which primarily correspond to regions of undetermined aggregation propensity, a notion that is highlighted by recent studies, such as in the case of calcitonin35, cystatin-C36 and transthyretin37. In contrast, other proteins have been linked to the formation of β-helical structures and, as a result, contain elongated fragments characterised, yet unverified in their entirety, as amyloidogenic, which can introduce type-II errors (false negatives) when applying predictors of local aggregation propensity38–41. The aforementioned shortcomings are reflected by the low MCC values that are reported for all aggregation predictors (Supplementary Table 1) and the fact that predicted segments were originally considered neutral, but later shown to be aggregation hotspots (Supplementary Fig. 1)35–41.

Designed APR nucleators validate the accuracy of Cordax predictions. In the interest of improving the current description of the familiar amyloidogenic protein dataset, we selected and synthesised a subset of 96 peptides corresponding to strong aggregation prone regions identified in these proteins by Cordax. Apart from prediction strength, the peptide screen was also selectively constructed to ensure broad sequence variability and a wide distribution across the proteins of the dataset, with a preference for longer entries defined by inadequate previous characterisation. Peptide sequences were cross-checked and filtered to exclude overlapping sequences with previously identified amyloid regions and WALTZ-DB (Supplementary Data 2). The remaining selection of 96 peptides was synthesised using standard solid phase synthesis and their amyloid-forming properties were initially examined using Thioflavin-T (Th-T) or pFTAA binding, following rotating incubation for 5 days at room temperature. The binding assays are complementary, as Th-T and pFTAA are oppositely charged molecules, which increases the amyloid identification rate by overcoming cases of dye-specific failure to bind to amyloid surfaces based on charge repulsion. Under these conditions, 66 peptides successfully bound to the specific dyes (Fig. 4a and b) by forming fibrils with typical amyloid morphologies and properties that were verified using transmission electron microscopy (Fig. 4c) and Congo red staining for selected cases (Fig. 4d). As these dyes are known to yield false negatives, in particular for short peptides, all dye-negative peptides were further investigated using electron microscopy. During this scan, we recovered 19 additional sequences that were capable of forming sparse amyloid-like fibrils with shorter lengths (Supplementary Fig. 2). Taking the latter into account, Cordax was able to identify a total of 85 novel nucleation segments (66 dye-positive plus 19 recovered by electron microscopy, out of 96 tested) with unparalleled accuracy (89%), thus providing a rigorously improved description of the protein set to be used for the efficient testing and development of future predictors (Supplementary Fig. 1).

picked up both by Cordax and traditional sequence-based methods are usually found buried within the core of soluble proteins. Contrary to what has been previously reported14,15, however, our regression model also discovered additional nucleating sequences that primarily appear to reside on the surface of protein molecules (Fig. 5b–h) and, as a result, are characterised by high solvent exposure (Fig. 5i and j). Partition coefficients clearly indicate that these exposed peptide segments identified by Cordax are primarily water-soluble sequences, whereas APRs that are predicted by the majority of sequence-based predictors are largely insoluble (Fig. 5k). Sequence distribution analysis indicates that this increased exposure and solubility are complemented by an expected decrease in sequence hydrophobicity (Fig. 5l). More specifically, APRs identified solely by Cordax are relatively enriched in charged or polar side chains (Fig. 5l) and are frequently parts of α-helical or unstructured segments (Fig. 5m). This implies that these regions are in fact conformational switches that may, under fitting misfolding conditions, transiently move towards the formation of β-aggregates. The fact that these sequences are not dictated by typical sequence propensities, such as hydrophobicity or β-structure tendency, explains why sequence-based predictors overlook them.

Cordax infiltrates uncharted areas of amyloid sequence space. To further explore the capabilities of our method, we composed a map of the known amyloid-forming sequence space using t-distributed stochastic neighbour embedding (t-SNE) for dimensionality reduction (Fig. 6a). As input, we used a 20-dimensional parameterisation vector describing all newly identified amyloidogenic peptides merged with the known amyloid-forming hexapeptide sequences in WALTZ-DB, in terms of their basic physicochemical properties and amino acid composition, as well as prediction outputs derived from Cordax and other high specificity predictors. t-SNE mapping pinpointed clear areas of sequence space where Cordax correctly identifies amyloid propensity (purple colour in Fig. 6a), which primarily extend towards regions that remain unpredicted (shown in black) and are set apart from a large base of sequences identified by multiple methods, including Cordax (cyan colour). Clustering analysis (Fig. 6b) performed using physicochemical properties (Figs. 6c–e), secondary structure propensities (Fig. 6f) and side chain size distributions (Fig. 6g, h) shows that this common base of by-now easy-to-predict APRs is characterised by high hydrophobicity, strong β-sheet propensity and a high relative content of aliphatic side chains (cluster 1 in Fig. 6b), still echoing the initial discovery of APRs by these features6. Cordax explores regions adjacent to this with a higher content of shorter side chains (clusters 2 and 5). Notably, amyloid nucleators of this composition are an invaluable resource for amyloid nanomaterial designs with elastin-like properties, are enriched in functional amyloids and have also been linked to ancestral amyloid scaffolds in early life42–45. A similar trend in amino acid composition has also been reported for proteins that form condensates through phase transition, such as TDP-43 and FUS16,18. Low complexity regions (LCRs) that are enriched in short side chains, such as Gly or Ala, have been
Cordax detects highly soluble surface-exposed conformational                    shown to drive phase separation, often as an intermediate event
switches. The expanded amyloidogenic annotation of the protein                  towards fibrillation, particularly in polar LCRs with lower ali-
dataset was supplemented with structural analysis of the newly                  phatic content and strong disorder or α-helical propensities, such
identified aggregation prone regions. Out of 96 peptides designed                as the sequences discovered in cluster 517,46. Further to this,
and experimentally tested, 85 peptides were found to display                    Cordax provides significant advancement by traversing in areas
evident amyloid-forming features, with more than half (55.3%)                   with a higher content of negatively or positively charged regions
being predicted specifically by Cordax, contrary to shared pre-                  (clusters 3, 4, 6 and 7, respectively). Charged residues often act as
dictions with sequence-based tools of high specificity (44.7%)                   gatekeepers that directly disrupt aggregation or modulate it by
(Supplementary Data 2). Pinpointing the location of the identified               flanking APRs within protein sequences47. Based on this premise,
nucleators in parental protein folds (Fig. 5a) revealed that APRs               most sequence-based predictors negatively correlate net charge to
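
As a rough illustration of the sequence properties discussed on this page (hydrophobicity, aliphatic content, short side chains and net charge), the following minimal Python sketch computes such descriptors for a few of the hexapeptides tested here. The Kyte-Doolittle hydropathy scale, the residue groupings and the function name `describe` are assumptions chosen for illustration; the text does not state which scales or exact features were used for the clustering in Fig. 6, so this should not be read as the authors' method.

```python
# Illustrative descriptors only; NOT the feature set used by the authors.
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
ALIPHATIC = set("AILV")  # assumed definition of aliphatic side chains
SHORT = set("GA")        # short side chains; the text names Gly and Ala
# Formal side-chain charges; His is treated as neutral (assumption).
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def describe(peptide: str) -> dict:
    """Return simple composition-based descriptors for one peptide."""
    n = len(peptide)
    return {
        "mean_hydropathy": sum(KD[a] for a in peptide) / n,
        "aliphatic_fraction": sum(a in ALIPHATIC for a in peptide) / n,
        "short_fraction": sum(a in SHORT for a in peptide) / n,
        "net_charge": sum(CHARGE.get(a, 0) for a in peptide),
    }


if __name__ == "__main__":
    # Hexapeptides taken from the set tested in this study.
    for pep in ("IGVVGW", "AEIEIE", "SEFSKK"):
        print(pep, describe(pep))
```

Under these assumptions, a sequence such as AEIEIE combines a substantial aliphatic fraction with a strongly negative net charge, which is the kind of composition that charge-penalising sequence-based predictors would tend to score poorly.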

[Figure (panels a, c and e recoverable from the extraction): bar charts of pFTAA fluorescence intensity (510 nm) and Th-T fluorescence intensity (490 nm) for the tested hexapeptides from Acylphos2, Amphot, ApoC2, ApoAI, α-Syn, β2-micro, Casein and Het-S, annotated with per-peptide p values; electron micrographs of fibrils formed by IFTVQL, STVISY, NDAFVL, IQVYSR, IIYNWQ, VHIVKL, VYVDVL and NFSIRY (scale bars, 200 nm or 500 nm).]
                                                                                                                                                                                                                                                                                                                                                                                               (ASLTIL) Het-S (195 – 200)                                                                           p   < 0.0001                    (ASLTIL) Het-S (195 – 200)                                                 p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                           (MVVEHP) Gelsolin (52 – 57)                                                                              p   > 0.9999                (MVVEHP) Gelsolin (52 – 57)                                                    p = 0.6738
                                                                                                                                                                                                                                                                                                                                                                                              (AYVILK) Gelsolin (94 – 99)                                                                           p   > 0.9999                   (AYVILK) Gelsolin (94 – 99)                                                 p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                    (GWFLGW) Gelsolin (758 – 763)                                                                                   p   = 0.9781         (GWFLGW) Gelsolin (758 – 763)                                                         p = 0.9995
                                                                                                                                                                                                                                                                                                                                                                                     (YDLHYW) Gelsolin (110 – 115)                                                                                  p   < 0.0001          (YDLHYW) Gelsolin (110 – 115)                                                        p = 0.0396

                                                                                                                                                                                                                                                                                                                          YIILYN
                                                                                                                                                                                                                                                                                                                                             ASLTIL
                                                                                                                                                                                                                                                                                                                                                                  FVIEEV
                                                                                                                                                                                                                                                                                                                                                                                         (IFTVQL) Gelsolin (130 – 135)                                                                              p   < 0.0001              (IFTVQL) Gelsolin (130 – 135)                                                    p = 0.0042

                                                                                                                                                                                                                                                                                                        QVFVWV
                                                                                                                                                                                                                                                                                                                                                                                       (FLGYFK) Gelsolin (157 – 162)                                                                                p   > 0.9999            (FLGYFK) Gelsolin (157 – 162)                                                      p = 0.9991
                                                                                                                                                                                                                                                                                                                                                                                      (MSVSLV) Gelsolin (310 – 315)                                                                                 p   < 0.0001           (MSVSLV) Gelsolin (310 – 315)                                                       p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                        (EDCFIL) Gelsolin (329 – 334)                                                                               p   < 0.0001             (EDCFIL) Gelsolin (329 – 334)                                                     p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                       (KIFVWK) Gelsolin (341 – 346)                                                                                p   > 0.9999            (KIFVWK) Gelsolin (341 – 346)                                                      p = 0.0235
                                                                                                                                                                                                                                                                                                                                                                                        (DFITKM) Gelsolin (364 – 369)                                                                               p   > 0.9999             (DFITKM) Gelsolin (364 – 369)                                                     p = 0.9998
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    p   < 0.0001                                                                                               p < 0.0001

                                                                                                                                                                                                                                                                                               200 nm
                                                                                                                                                                                                                                                                                                                                                        200 nm

                                                                                                                                                                                                                                                                                                                                    500 nm
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               (YIILYN) Gelsolin (474 – 479)

                                                                                                                                                                                                                                                                                                                 500 nm
                                                                                                                                                                                                                                                                                                                                                                                          (YIILYN) Gelsolin (474 – 479)
                                                                                                                                                                                                                                                                                                                                                                                        (IIYNWQ) Gelsolin (489 – 494)                                                                               p   < 0.0001             (IIYNWQ) Gelsolin (489 – 494)                                                     p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                         (MIIYKG) Gelsolin (544 – 549)                                                                              p   = 0.9997              (MIIYKG) Gelsolin (544 – 549)                                                    p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                      (NDAFVL) Gelsolin (591 – 596)                                                                                 p   < 0.0001           (NDAFVL) Gelsolin (591 – 596)                                                       p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                     (AAYLWV) Gelsolin (601 – 606)                                                                                  p   < 0.0001          (AAYLWV) Gelsolin (601 – 606)                                                        p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                        (FVIEEV) Gelsolin (679 – 684)                                                                               p   < 0.0001             (FVIEEV) Gelsolin (679 – 684)                                                     p < 0.0001

                                                                                                                                                                                                                                                                                                        ILLLFE
                                                                                                                                                                                                                                                                                                                                                                  AEIEIE
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    p   = 0.1841                                                                                               p < 0.0001

                                                                                                                                                                                                                                                                                                                                             VNFSEF

                                                                                                                                                                                                                                                                                                                          YQLVQL
                                                                                                                                                                                                                                                                                                                                                                                     (QVFVWV) Gelsolin (706 – 711)                                                                                                        (QVFVWV) Gelsolin (706 – 711)
                                                                                                                                                                                                                                                                                                                                                                                                  (GYLTIR) RepA (45 – 50)                                                                           p   > 0.9999                       (GYLTIR) RepA (45 – 50)                                                 p = 0.0002
                                                                                                                                                                                                                                                                                                                                                                                                 (KLFNRD) RepA (75 – 80)                                                                            p   = 0.9998                      (KLFNRD) RepA (75 – 80)                                                  p = 0.0003
                                                                                                                                                                                                                                                                                                                                                                                              (FHVKYR) RepA (97 – 102)                                                                              p   > 0.9999                   (FHVKYR) RepA (97 – 102)                                                    p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                           (MLHKEF) RepA (122 – 127)                                                                                p   = 0.9995                (MLHKEF) RepA (122 – 127)                                                      p = 0.9037
                                                                                                                                                                                                                                                                                                                                                                                            (FYAVRL) RepA (141 – 146)                                                                               p   > 0.9999                 (FYAVRL) RepA (141 – 146)                                                     p = 0.9996

                                                                                                                                                                                                                                                                                                                 500 nm
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    p   > 0.9999                                                                                               p = 0.0261

                                                                                                                                                                                                                                                                                                                                                        2000 nm

                                                                                                                                                                                                                                                                                                                                    200 nm
                                                                                                                                                                                                                                                                                                                                                                                              (SQFIKL) RepA (151 – 156)                                                                                                            (SQFIKL) RepA (151 – 156)

                                                                                                                                                                                                                                                                                               200 nm
                                                                                                                                                                                                                                                                                                                                                                                              (FSFTIA) RepA (216 – 221)                                                                             p   < 0.0001                   (FSFTIA) RepA (216 – 221)                                                   p < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                                              Th-T Control                                                                                                                       pFTAA Control

                                                                                                                                                                                                                                                                                                                                                                                                                                                 Th-T fluorescence intensity (490 nm)

                                                                                                                                                                                                                                                                                                                                                                           f
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       pFTAA fluorescence intensity (510 nm)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               d
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             b

                                                                                                                                                                                                                                                                                                                                                                                                                              0.0
                                                                                                                                                                                                                                                                                                                                                                                                                                    2.5 × 103
                                                                                                                                                                                                                                                                                                                                                                                                                                                        5.0 × 103
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    2 × 104
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               4 × 104
polarised light (CP). Scale bars: 500 μm. Source data are provided as a Source Data file.

[Figure residue. Axis ticks shown: left plot 6 × 10^4, 8 × 10^4; right plot 0, 3 × 10^4, 6 × 10^4, 9 × 10^4. Image panels (BF and CP rows): IQVYSR, AEIEIE, YIILYN, NFSIRY; ILLLFE, NDAFVL, ASLTIL, FVIEEV. Per-peptide p-values from the two plots:]

Peptide   Protein       Residues   p (left plot)   p (right plot)
YQLVLQ    Keratoep      30–35      < 0.0001        < 0.0001
STVISY    Keratoep      77–82      < 0.0001        < 0.0001
VNIELL    Keratoep      163–168    < 0.0001        < 0.0001
IQIHHY    Keratoep      202–207    > 0.9999        < 0.0001
QIIEIE    Keratoep      245–250    < 0.0001        < 0.0001
GVIHYI    Keratoep      362–367    > 0.9999        = 0.0003
LNSVFK    Keratoep      416–421    > 0.9999        = 0.4282
LYHGQT    Keratoep      449–454    > 0.9999        = 0.0006
TLFTMD    Keratoep      490–495    = 0.9997        = 0.0925
VYTVFA    Keratoep      536–541    < 0.0001        < 0.0001
HYYAVA    Lactoferrin   110–115    > 0.9999        < 0.0001
GDVAFI    Lactoferrin   223–228    < 0.0001        < 0.0001
LLFKDS    Lactoferrin   317–322    > 0.9999        = 0.0319
YFTAIQ    Lactoferrin   343–348    < 0.0001        < 0.0001
GWNIPM    Lactoferrin   487–492    = 0.9997        < 0.0001
MDKVER    Lactoferrin   622–627    = 0.9998        = 0.9650
YVAGIT    Lactoferrin   686–691    < 0.0001        < 0.0001
CEFLRK    Lactoferrin   705–710    = 0.9999        = 0.0014
LQVDLG    Medin         289–294    < 0.0001        = 0.0453
MFSEFD    Prolactin     64–69      < 0.0001        < 0.0001
LIVSIL    Prolactin     111–116    = 0.9813        < 0.0001
LYHLVT    Prolactin     123–128    > 0.9999        = 0.9988
MELIVS    Prolactin     158–163    < 0.0001        < 0.0001
LHCLRR    Prolactin     200–205    = 0.9998        = 0.8737
FDKFKH    Myoglobin     44–49      > 0.9999        = 0.9921
QVYYRP    PrP           160–165    > 0.9999        = 0.0130
YFHARG    SAA           53–58      > 0.9999        = 0.0001
LIFMGH    Sup35         263–268    = 0.9990        < 0.0001
MYVSEM    Sup35         350–355    < 0.0001        < 0.0001
VVVNKM    Sup35         403–408    > 0.9999        < 0.0001
SNFLRA    Sup35         428–433    > 0.9999        = 0.0420
IGYNIK    Sup35         434–439    > 0.999         = 0.1168
TDVVFM    Sup35         440–445    < 0.0001        < 0.0001
VDMAMC    Sup35         542–547    < 0.0001        < 0.0001
VHIVKL    Sup35         612–617    > 0.9999        < 0.0001
INFEFS    Ure2p         35–40      < 0.0001        = 0.0445
FKVAIV    Ure2p         126–131    > 0.9999        = 0.0004
GYTLFS    Ure2p         113–118    > 0.9999        = 0.0003
TIFLDF    Ure2p         141–146    < 0.0001        < 0.0001
LSIWES    Ure2p         176–181    < 0.0001        < 0.0001
HLVNKY    Ure2p         187–192    = 0.9999        = 0.9910

NATURE COMMUNICATIONS | (2020)11:3314 | https://doi.org/10.1038/s41467-020-17207-3 | www.nature.com/naturecommunications
                                                                                                                                                                                                                                                                                                                                                                                              (INAWLF) Ure2p (211 – 216)                                                                            p < 0.0001                     (INAWLF) Ure2p (211 – 216)                                                  p   < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                            (LVMELD) Ure2p (270 – 275)                                                                              p < 0.0001                   (LVMELD) Ure2p (270 – 275)                                                    p   < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                                 (IGINIK) Ure2p (323 – 328)                                                                         p > 0.9999                                                                                                 p   = 0.9999
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      (IGINIK) Ure2p (323 – 328)
                                                                                                                                                                                                                                                                                                                                                                                           (EVYKWT) Ure2p (333 – 338)                                                                               p > 0.9999                  (EVYKWT) Ure2p (333 – 338)                                                     p   < 0.0001
                                                                                                                                                                                                                                                                                                                                                                                (SYVLQT) Semenogelin (193 – 198)                                                                                    p = 0.9329       (SYVLQT) Semenogelin (193 – 198)                                                          p   = 0.0036
                                                                                                                                                                                                                                                                                                                                                                                 (LLVYNK) Semenogelin (260 – 265)                                                                                   p > 0.9999                                                                                                 p   = 0.0941
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      (LLVYNK) Semenogelin (260 – 265)
                                                                                                                                                                                                                                                                                                                                                                                (LHYGEN) Semenogelin (298 – 303)                                                                                    p > 0.9999                                                                                                 p   = 0.2828
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     (LHYGEN) Semenogelin (298 – 303)
                                                                                                                                                                                                                                                                                                                                                                                                               Th-T Control                                                                                                                       pFTAA Control

Fig. 4 Amyloid-forming properties of the peptide screen designed by employing Cordax. a and b Measured pFTAA and c and d Th-T fluorescence of synthetic peptides following rotation at 200 μM for 5 days. Data represent mean ± SD (n = 6 independent experiments, statistics: one-way ANOVA with multiple comparison against the vehicle control). e Electron micrographs of amyloid fibrils formed by Th-T or pFTAA-binding peptides. f Suspensions of amyloid fibrils bind Congo red as displayed under bright field illumination (BF) and exhibit the apple-green birefringence typical of amyloids under crossed-

[Fig. 5 graphic. Structure panels, each shown in two views rotated by 180°: Ure2p (PDB: 1HQO), RepA (PDB: 1HKQ), Acylphosphatase-2 (PDB: 1APS), Sup35 (PDB: 4CRN), Prolactin (PDB: 1RW5), Lactoferrin (PDB: 1CB6), Keratoepithelin (PDB: 5NV6). Plot panels i to m compare the Joined and CORDAX sets: i side chain burial (p < 0.0001), j main chain burial (p < 0.0001), k logD values (p < 0.0001), l residue (%) by class (hydrophobic, aromatic, short aliphatic, Pro/Cys, polar, positive charge, negative charge), m residue counts by secondary structure (α-helix, β-strand, coil).]

Fig. 5 Cordax identifies surface-exposed aggregation nucleators spanning residues that are typically considered unconventional for amyloid fibril
formation. a Schematic representation of Cordax-predicted topological models for APRs charted against the cognate native crystal structure of the
amyloidogenic protein Ure2p. b–h Surface representation of folded structures for b Ure2p, c RepA, d Acylphosphatase-2, e Sup35, f Prolactin, g Lactoferrin
and h Kerato-epithelin reveals that aggregation nucleators uniquely identified by Cordax (highlighted in red) are primarily exposed to the surface of
proteins, compared to segments of joint prediction (shown in blue) which are predominantly buried within the hydrophobic core of the native fold. Cordax-
specific predicted APRs produced lower volumetric burial values, calculated using FoldX, for i side chain and j main chain groups indicating that they are
considerably exposed compared to jointly identified nucleators. k Partition coefficients indicate that Cordax-specific APRs are significantly more soluble
than typically predicted sequences, which are primarily hydrophobic and therefore insoluble. Solubility regions (vi very insoluble, i insoluble, n neutral,
s soluble, vs very soluble) are shown as coloured backgrounds⁷². Significant differences were computed using one-way ANOVA with multiple comparisons.
l Surface-exposed Cordax-specific APRs are composed of residues with a 20% increase in polar and charged side chains, at the expense of hydrophobic
residues (n = 219 residues in APRs identified by Cordax, n = 203 residues in APRs from joined predictions). m Secondary structure analysis, using FoldX,
indicates that Cordax identifies several APRs that reside in α-helical or unstructured regions within the native fold, suggesting that amyloidogenic proteins
may harbour a plethora of exposed conformation switches that can act as potential nucleators of amyloid fibril formation under suitable misfolding
conditions (n = 219 residues in APRs identified by Cordax, n = 203 residues in APRs from joined predictions). Violin plots represent the kernel probability
densities of the data with the median, upper and lower quartiles. Source data are provided as a Source Data file.
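
As an illustration of the group comparison mentioned in the caption, the F statistic of a one-way ANOVA can be sketched in a few lines of Python. This is a minimal sketch and not the analysis code used for the figure: the function name one_way_anova_f and the logD values are hypothetical placeholders, and neither the p values nor the multiple-comparison step reported in the figure are computed here.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples.

    Hypothetical helper for illustration; each element of `groups` is one
    sample (for example, the logD values of one class of APRs).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = fmean(x for g in groups for x in g)
    means = [fmean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up logD values for three groups (NOT data from the study).
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(example_groups)
print(f"F = {f_stat:.2f} (df = {df_b}, {df_w})")  # prints: F = 13.00 (df = 2, 6)
```

Turning F into a p value requires the F distribution with the returned degrees of freedom, which a statistics package would normally provide.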

protein aggregation and have increased failure rates when identifying such amyloid-forming stretches. On the other hand, sequences with a high content
of aromatic side chains are relatively easy to identify (clusters 9a and 9b), following several lines of evidence supporting their role in amyloid fibril
formation⁴⁸. Cordax also pushes forward into less well-charted areas of amyloid sequence space, e.g. exploring clusters with high α-helical content
(cluster 10) and an overall low content of aliphatic amino acids (clusters 5, 6, 7, 8 and 9b). These regions also reveal the scope to improve the method.
In particular, the region with high disorder propensity (cluster 11) still contains many false negatives, in spite of the ability of Cordax to pick up a
minority of its sequences. Interestingly, a closer look at the partition coefficients of the known amyloid sequence space reveals that, although Cordax
takes a significant step in the right direction, these APRs remain very hard to identify as they are characterised by even higher solubility values
(Fig. 6i). Similar charting of the amyloid sequence space is achieved by using


[Fig. 6, panels a–i; graphics not reproduced.
a, b t-SNE maps (axes: t-SNE dim1, t-SNE dim2). Prediction legend in a: CORDAX, Non-CORDAX, Joined, Unpredicted. Clustering sorting scheme in b:
    1. Hydrophobic/β-propensity/High aIndex
    2. Short side chains/hydrophobic/High aIndex
    3. Acidic/High aIndex
    4. Basic/High aIndex
    5. Short side chains/hydrophobic/Low aIndex
    6. Acidic/Low aIndex
    7. Basic/Low aIndex
    8. Hydrophobic/High instability/Low aIndex
    9a. Aromatic/High aIndex
    9b. Aromatic/Low aIndex
    10. α-helix propensity
    11. Unstructured propensity
c–h Colour scales: c Hydrophobicity (–1 to 1), d Charge (–3 to 2), e aIndex (0 to 300), f Secondary structure (α-helix, α/β, β-strand, Unstructured),
g (%) Aromatic (0 to 60), h (%) Tiny sc (0 to 100).
i logD values (y axis, –10 to 10) for the Cordax, Joined, non-Cordax and Unpredicted groups, with solubility regions vi, i, n, s and vs.
p values shown in the panel: p < 0.0001 (three comparisons), p = 0.0002, p = 0.027 and p = 0.343.]

Fig. 6 t-SNE 2D-representation of the known experimentally determined amyloidogenic sequence space. a State-of-the-art sequence-based methods
predict amyloid sequences, with (shown in cyan) or without Cordax (shown in yellow), that are grouped together in a major landing cluster and two
islands. Cordax predictions (shown in purple) transgress towards areas of amyloid-forming sequences that remain undetected by most methods (shown in
black). b Clustering of the t-SNE map using basic physicochemical properties and amino acid composition of the amyloid peptides. Each data point is
colour-coded based on the sorting scheme shown in the legend and background areas are used to pinpoint the major clusters of each defined category. The
clustering scheme was defined by characterising the t-SNE map using peptide c hydrophobicity, d net charge, e aliphatic index, f secondary structure
propensity and percentage content of g aromatic or h short residue side chains. i Partition coefficient analysis reveals that APRs identified by Cordax are
primarily soluble sequences compared to the easy-to-identify sequences of joint prediction. On the other hand, APRs that remain hard to detect are
characterised by even higher solubilities. Solubility regions (vi very insoluble, i insoluble, n neutral, s soluble, vs very soluble) are shown as coloured
backgrounds. Significant differences were computed using one-way ANOVA with multiple comparisons. Violin plots represent the kernel probability
densities of the data with the median, upper and lower quartiles.
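
The descriptors used to characterise the map can be approximated directly from sequence. The sketch below is a hedged illustration rather than the pipeline used for the figure. It assumes the standard aliphatic index of Ikai (mole percent of Ala + 2.9 × Val + 3.9 × (Ile + Leu)), a net charge that simply counts Lys/Arg as +1 and Asp/Glu as –1 (ignoring His and the termini), aromatic residues defined as Phe, Trp and Tyr, and tiny side chains defined as Ala, Cys, Gly, Ser and Thr. The function name describe_peptide and these residue groupings are assumptions.

```python
# Assumed residue groupings; the study may have used different definitions.
ALIPHATIC_WEIGHTS = {"A": 1.0, "V": 2.9, "I": 3.9, "L": 3.9}  # Ikai aliphatic index
POSITIVE = set("KR")
NEGATIVE = set("DE")
AROMATIC = set("FWY")
TINY = set("ACGST")


def describe_peptide(peptide):
    """Return simple composition-based descriptors for one peptide sequence."""
    seq = peptide.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)

    def percent(residues):
        # Percentage of positions occupied by any residue in `residues`.
        return 100.0 * sum(aa in residues for aa in seq) / n

    # Weighted sum of the mole percentages of Ala, Val, Ile and Leu.
    aindex = sum(w * 100.0 * seq.count(aa) / n for aa, w in ALIPHATIC_WEIGHTS.items())
    # Integer charge count; no pKa-based titration is attempted.
    net_charge = sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)

    return {
        "aindex": aindex,
        "net_charge": net_charge,
        "pct_aromatic": percent(AROMATIC),
        "pct_tiny": percent(TINY),
    }


# Example hexapeptides used purely for illustration.
for hexapeptide in ("VIGIAQ", "IYQYGG", "IFAEDV"):
    print(hexapeptide, describe_peptide(hexapeptide))
```

Descriptor vectors of this kind, one per peptide, are the sort of input that a dimensionality reduction method would then project into two dimensions.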

uniform manifold approximation and projection (UMAP) for dimensionality reduction (Supplementary Fig. 3a and b), while principal component analysis
(PCA) highlights that Cordax slowly infiltrates the sequence space of higher solubilities (Supplementary Fig. 3c and d). Overall, dimensionality
reduction highlights that structural compatibility can overcome typical sequence propensities as a pivotal driver of aggregation nucleating sequences
and suggests that, under the proper conditions, the boundaries currently considered compatible with protein amyloid-like assembly are potentially far
wider than previously expected.
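
To make the linear projection step concrete, the following minimal sketch extracts the first principal component of a small descriptor table by power iteration on the covariance matrix. It is an illustration only: the function name first_principal_component and the toy data are hypothetical, and t-SNE and UMAP are non-linear methods that are not reproduced here.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    `rows` is a list of equal-length numeric feature vectors, one per peptide.
    Features on very different scales would normally be standardised first.
    """
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two observations")
    d = len(rows[0])

    # Centre each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    # The all-ones start vector fails only if it is orthogonal to the answer.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector has no component along the data variance")
        v = [x / norm for x in w]

    # Project every centred observation onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


# Toy two-feature table (NOT data from the study): points on the line y = 2x.
direction, scores = first_principal_component(
    [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
)
print([round(x, 3) for x in direction])  # prints: [0.447, 0.894]
```

For real descriptor tables an established linear algebra routine is preferable to power iteration, which is used here only to keep the example dependency-free.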


[Fig. 7, panels a–f; graphics not reproduced.
a Prediction (%) (y axis, 0 to 100) of β-strand directionality and topology for CORDAX, 3D Profile and FibPredictor; bar annotations: Parallel-Restricted,
Class 1-Restricted.
b Hexapeptide models with RMSD values: VIGIAQ 1.6 Å, NNQGSN 1.3 Å, KGAIIG 2.7 Å, GVNLVN 1.2 Å, GAIIGL 1.4 Å, AVVTGV 2.1 Å, HGVTTV 2.3 Å, GYLLGS 1.1 Å,
GVTGIA 0.9 Å, IFQINS 2.5 Å, GVTAVA 1.3 Å, AALQSS 1.2 Å.
c Blosum62 score (y axis, –20 to 40) by template mismatches (template match, topology match, topology mismatch); three comparisons annotated p < 0.0001.
d Blosum62 score (y axis, –20 to 20) against model rank (x axis, 0 to 80).
e PDB structure, Cordax model and structural overlay for (IYQYGG) 6G8C, (LNIYQY) 6G8D, (VTQVGF) 6G8E and (IFAEDV) 5TXH.
f Topology prediction (Å):]

    PDB ID    CORDAX    3D Profile    FibPredictor
    6G8C      0.4       2.7           3.8
    6G8D      2.1       2.1           2.4
                                                         6G8E                         1.1                               1.5         Class mismatch

                                                         5TXH                         1.7                         Class mismatch Class mismatch
                                                                                                                                                                                   RMSD : 0.4 Å     RMSD : 2.1 Å      RMSD : 1.1 Å        RMSD : 1.7 Å

Fig. 7 High-precision recognition of amyloid fibril structural architectures using Cordax. a Prediction accuracy comparison of Cordax to the only publicly
available structural predictors, FibPredictor and 3D-profile. For comparison, methods were run against a non-redundant sequence set extracted from
amyloid-forming peptide interfaces (n = 73 templates). b Model topologies, predicted by applying Cordax (shown in orange), strongly superimpose to
matching solved structural layouts of amyloidogenic nucleators (shown in magenta), as indicated by the reported minor RMSD values. c Sequence identity
contribution for template selection during cross-threading analysis of the Cordax structural library. Alignment scores for selected models matching the
template sequences compared to mismatching template selections of similar or different topological layouts (n = 73 templates). d Alignment scores of the
APRs newly identified by Cordax to the sequence of the selected templates, plotted against their corresponding model ranks (n = 96 sequences threaded).
e Structural alignment of Cordax outputs to experimentally determined 3D structures. Models were calculated for three aggregation prone sequences
derived from CsgA curli forming protein (PDB IDs: 6G8C, 6G8D and 6G8E, respectively) and a peptide mutant sequence derived from Aβ amyloid peptide
(PDB ID: 5TXH). Predicted topologies are overlapping representations of the experimentally determined amyloid fibril cores, f as displayed by a direct
comparison to other software. Violin plots represent the kernel probability densities of the data with the median, upper and bottom quartiles. Source data
are provided as a Source Data file.

Cordax predicts the structural layout and topology of fibril cores. Due to the restricted availability of experimentally determined structures not included in the Cordax library, we first analysed the information derived from cross-threading analysis in order to test the performance of the tool in predicting the structural architecture of aggregation prone stretches. Among 73 unique sequences corresponding to the structural library, Cordax was able to accurately assign the correct architecture to 63%, whereas 81% was identified with proper β-strand orientation (parallel/antiparallel) (Fig. 7a, Supplementary Data 3 and 4). In comparison, correct topology allocation by FibPredictor49 was limited to 9.5% of the sequences and assigned β-strand directionality amounted to 32.9%, while introducing an evident preference towards antiparallel architectures (Fig. 7a). Similarly, the 3D-profile method is restricted to linking all potential queries with a class 1 topology, hence was incapable of predicting alternative architectures (Fig. 7a). Structural alignment indicated that even in cases of mismatching selected templates, modelled architectures strongly superimpose to the solved structures (Fig. 7b), suggesting that Cordax identifies the correct topology with high accuracy. A closer look reveals that sequence specificity may be a modulating, yet not determining factor for this selection process. Steric perturbations can be introduced due to restrictions deriving from closely interdigitating side chains within the packed interfaces; therefore, key residue positions can be bound to the overall stability of certain structural topologies and decrease the acceptable sequence space that can accommodate energetically favourable interactions. This is highlighted by the sequence similarity observed between topological matches (Fig. 7c, Supplementary Data 4). On the other hand, topologically different model selections could also be a consequential outcome of amyloid polymorphism. The observed sequence redundancy of the Cordax library illustrates that APRs can form amyloid fibrils with distinct morphological layouts50–52, a notion that is also supported by the common morphological variability of aggregates formed at the level of full-length amyloid-forming proteins53,54. The modulating role of sequence dependency was also evident for the 96-peptide screen. A ranked analysis of the output models indicated that templates with higher alignment scores were not crucial for the topology selection process, although they could often correspond


to the favourable architectures (Fig. 7d), thus highlighting that the structural predictions of Cordax are relatively unbiased in terms of the sequence space composing the structural templates.

The accuracy of the tool was also cross-referenced against experimentally determined structures of fibril cores not included in the structural library. We utilised the recently solved structures of parallel fibril-forming segments derived from the major curli protein CsgA55, as well as an anti-parallel polymorphic APR variant segment derived from the amyloid-β peptide56. Compared to other structural predictors, only Cordax could invariantly predict the correct architecture for every steric zipper as the closest representation of the experimentally determined reference structures (Fig. 7e and f). This performance can only improve as the fragment library expands, so we aim to update it at regular intervals, provided there is a noticeable increase in solved structures in the future.

Discussion
The number of amyloid structures in the protein databank has been steadily increasing over the last two decades. It has now achieved a number (>80) that was reached for globular proteins at the beginning of the 1980s and that then triggered the first developments of template-based modelling methods, including homology-based modelling and threading (or fold recognition), in an attempt to estimate the versatility of individual folds and discover novel folds in a more directed manner. Similarly, we here developed Cordax, an exhaustively trained regression model that leverages a substantial library of curated amyloid template structures combined with machine learning. Cordax uses a logistic regression approach to translate structural compatibility and interaction energies into sequence aggregation propensity and is therefore unconstrained by defined sequence tendencies, such as hydrophobicity or secondary structure preference, that direct most sequence-based predictors. As a result, we discovered unconventional amyloid-like sequences, including sequences with low aliphatic content, high net charge or sequences with low intrinsic structural propensities. Clustering amyloid sequences by t-SNE two-dimensional reduction revealed the substructure of amyloid sequence space. Apart from a large cluster corresponding to sequences found in the hydrophobic core of globular proteins, we also found clusters corresponding to surface-exposed amyloid sequences in globular proteins, small aliphatic functional amyloids, N/Q/Y prions, strongly helical and intrinsically disordered sequences which could be compatible with liquid–liquid phase responsive sequences. Our analysis highlights the discovery of highly soluble, yet amyloid-forming, sequences and suggests that the largest portion of the remaining uncharted amyloid sequence space is hidden in this corner (Fig. 6a and i). Indeed, most archetypal hydrophobic APR sequences have low intrinsic solubility. As a result, low solubility and aggregation propensity are properties that are often wrongly used interchangeably. It is important to differentiate between the initial solubility and aggregation propensity of a peptide, as soluble monomeric sequences can often self-assemble, at later time points, into insoluble amyloid fibrils. The APRs that are newly discovered by Cordax are often highly soluble in their monomeric form, even more than the already known polar APRs from the yeast prions, as they contain many charged and polar residues, yet surprisingly can still assemble into amyloids. Overall, our approach demonstrates that the increasing structural information on amyloids now allows for more fine-grained structural rule learning of the amyloid state.

Recent developments in microcrystal electron diffraction have enabled structural determination from nanocrystals that are not typically suited for traditional X-ray diffraction and have provided significant insights on the polymorphic architectures of amyloid fibrils57. Along these lines, the emergence of cryo-EM has been pivotal in determining features of amyloid fibril polymorphs58, complementing earlier efforts developed using solid-state NMR spectroscopy53,59. Notably, these structures represent snapshots of the kinetic cores of aggregation or end-state morphologies of amyloid fibrils and therefore provide limited information on the underlying aggregation pathways and toxicity-related effects of amyloids. On the other hand, the growing number of high-resolution cryo-EM structures has highlighted the in vivo structural diversity of amyloid fibrils60, whereas steric zippers have been recently used for the development of targeted therapeutics61–63. However, determining the structural layout of amyloid fibrils still remains challenging. Cordax attempts to provide a cost-effective, powerful and complementary computational alternative that can be operated without the scientific expertise required to apply these intricate technical approaches. Apart from its function as an aggregation predictor, the tool is uniquely poised to provide detailed complementary structural information on the putative amyloid fibril architecture of identified APRs. Users can utilise the method to structurally characterise identified APRs by classifying their overall specific topological preferences, including β-strand directionality and key residue positions that are integral parts of the amyloid core. The latter information is imperative for efforts focused on understanding the underlying mechanisms that dictate amyloid-related diseases or the formation of functional amyloids, but can also have an immense impact on the design of applied nanobiomaterials64, targeted amyloid inducers65 or counteragents, following the increased interest in the development of structure-based inhibitors of aggregation61–63.

Methods
Regression model training. In previous work we synthesised and explored the aggregation potential of 940 peptide sequences derived from both functional and pathological amyloid-forming proteins, which were supplemented with additional data on 462 hexapeptides derived from other published sources to develop WALTZ-DB 2.032, the largest public comprehensive repository of experimentally defined amyloidogenic peptides. In total, 1402 hexapeptide sequences from WALTZ-DB were modelled on the 140 backbone structures of the Cordax library, leading to the generation of 196,280 models. The thermodynamic stability of each model (ΔG, kcal mol−1) was calculated using FoldX and fed into a logistic regression model (Fig. 2c). This model was used to distil the aggregation propensity from the free energy values. Towards this end, from the calculated ΔGs, we isolated 50 representative energies using a recursive feature elimination algorithm (using the RFE module of the SciKit-learn python package33 and selecting for the set of templates that maximised the AUC). As a result, each sequence is described with a 50-dimensional vector. Next, the data were transformed in order to be constrained in a scoring range between 0 and 1, using a Min/Max scaling algorithm. The regression model was trained with an L2 penalty and the regularisation parameter C equal to 1. Both scaling of the estimated ΔG and the machine-learning model were developed using the SciKit-learn python package66.

Model pipeline. Cordax receives a protein sequence in FASTA format as input, which is fragmented into hexapeptides using a sliding window process. Sequences are then threaded against the fragment library utilising FoldX and the derived free energies are translated into scoring values for every peptide window. An energetically fitted model is selected as the closest representative of the overall topology of the amyloid fibril core for each predicted window and is provided as output in standard PDB format to the users (Fig. 2c). An amyloidogenic profile is generated by scoring every single residue of the input sequence with the maximum calculated score of the corresponding windows, followed by a binary prediction for every segment. Finally, calculated energies are stored automatically in a growing local database and can be retrieved, thus creating a ‘lazy’ interface that bypasses unnecessary computation for recurring sequence segments or future runs.

Datasets. Performance assessment of Cordax was carried out utilising two individual data sets for peptide and protein aggregation propensity detection. Further validation of the method was performed against an independent subset screen of 96 hexapeptide sequences.

For peptide aggregation propensity, we used a dataset of 1402 non-redundant hexapeptides contained in the WALTZ-DB 2.0 repository32. This database is the
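
The regression model training described in the Methods (recursive feature elimination down to 50 energies, Min/Max scaling, then logistic regression with an L2 penalty and C = 1) can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data and not the authors' code: the random ΔG matrix, the labels, the RFE step size and all variable names are assumptions, and the AUC-driven choice of the template set mentioned in the text is not reproduced. Note that in scikit-learn, C is the inverse of the regularisation strength.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for FoldX output (assumption): one row per hexapeptide,
# one ΔG value per template backbone (140 templates in the Cordax library).
n_peptides, n_templates = 300, 140
delta_g = rng.normal(loc=0.0, scale=2.0, size=(n_peptides, n_templates))

# Hypothetical labels: peptides with a low mean ΔG on the first ten
# templates are marked as amyloid-forming (1), the rest as non-forming (0).
labels = (delta_g[:, :10].mean(axis=1) < 0.0).astype(int)

model = Pipeline([
    # Keep 50 representative energies; step=10 drops ten features per round
    # (the step size is an illustrative choice, not taken from the paper).
    ("select", RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=50, step=10)),
    # Constrain each retained energy to the range 0..1.
    ("scale", MinMaxScaler()),
    # L2 is scikit-learn's default penalty; C=1.0 matches the text.
    ("classify", LogisticRegression(C=1.0, max_iter=1000)),
])
model.fit(delta_g, labels)

# Predicted probability of the amyloid-forming class for every peptide.
propensity = model.predict_proba(delta_g)[:, 1]
n_selected = int(model.named_steps["select"].support_.sum())
```

Wrapping the three steps in a single pipeline keeps the feature selection and the scaling tied to the training data, so the same transformations are reapplied unchanged when new sequences are scored.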

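
The windowing and per-residue scoring steps of the model pipeline can be illustrated with a short, self-contained sketch. The scoring function below is a hypothetical placeholder (Cordax derives window scores from FoldX threading energies passed through the regression model), and the 0.5 threshold, the function names and the in-memory dictionary standing in for the local database are likewise assumptions.

```python
from typing import Callable, Dict, List, Optional

WINDOW = 6  # the pipeline scores hexapeptide windows


def hexapeptide_windows(sequence: str) -> List[str]:
    """Fragment a sequence into overlapping hexapeptides (step of one residue)."""
    return [sequence[i:i + WINDOW] for i in range(len(sequence) - WINDOW + 1)]


def toy_score(peptide: str) -> float:
    """Hypothetical placeholder score in 0..1; NOT the Cordax energy-based score."""
    return sum(residue in "VILFYW" for residue in peptide) / WINDOW


def residue_profile(
    sequence: str,
    score_fn: Callable[[str], float] = toy_score,
    cache: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Give each residue the maximum score of all windows that contain it."""
    if cache is None:
        cache = {}
    profile = [0.0] * len(sequence)
    for start, peptide in enumerate(hexapeptide_windows(sequence)):
        # 'Lazy' lookup: a recurring hexapeptide is scored only once.
        if peptide not in cache:
            cache[peptide] = score_fn(peptide)
        score = cache[peptide]
        for position in range(start, start + WINDOW):
            profile[position] = max(profile[position], score)
    return profile


def binary_calls(profile: List[float], threshold: float = 0.5) -> List[bool]:
    """Turn the per-residue profile into a yes/no call (threshold is assumed)."""
    return [score >= threshold for score in profile]


shared_cache: Dict[str, float] = {}
example_profile = residue_profile("GGGGVILFYWGGGG", cache=shared_cache)
example_calls = binary_calls(example_profile)
```

Passing the same `shared_cache` to later calls reuses previously computed window scores, which mirrors the idea of bypassing repeated computation for recurring segments.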