Structural clustering algorithm
Table of Contents
The clustering procedure is similar to MMseqs2’s clustering but, instead of using sequences, Foldseek’s 3Di alphabet (Extended Data Fig. 1) was used to represent the structures as one-dimensional sequences. The clustering algorithm combines Linclust17 and cascaded MMseqs2 (ref. 42) clustering. The pipeline applies this strategy to allow for efficient clustering of millions of structures. First, protein structures are converted to 3Di sequences and processed according to the Linclust workflow. This includes extracting m k-mers (default m = 300, k = 10) from each sequence and grouping them on the basis of their hash value. The k-mer groups are then used to assign each structure to the longest sequence (representative) within the group. The shared diagonal on which the k-mer is found is also stored for further use in the alignment step.
The pipeline then proceeds with an ungapped alignment algorithm that rescores the structures on the basis of the shared diagonal between members and representatives using 3Di and amino acid information. The sequences that meet the defined alignment criteria, such as E-value, alignment coverage, sequence identity, alignment LDDT43 or TM score44, are clustered using the MMseqs2 clustering module (default using the set-cover algorithm). After this step, the structures that have been assigned already are removed from the set and the remaining representative member hits are aligned using Foldseek’s structural Gotoh–Smith–Waterman algorithm15, and all passing hits are clustered as well. The remaining cluster representatives are successively clustered by three cascaded steps of prefiltering, structural Smith–Waterman alignment and clustering.
Distinguishing homologues from analogues
Structural similarity between two sequences can be attributed to either common evolutionary ancestry (homologues) or convergent evolution (analogues). We investigated the association between cluster members, computed by our pipeline on the basis of structural similarity, and homology relationships using the ECOD database24. ECOD is a hierarchical domain database that describes the evolutionary relationships between pairs of protein domains. Its hierarchical levels from root to leaf are classified as: A-group (same architecture), X-group (possible homology), H-group (homology), T-group (topology) and F-group (sequence similarity). Analogues are expected to occur between members of different X-groups, whereas homologues should be found within the H-group.
For our benchmark, we downloaded the ECOD (F99 v.20230309) PDB database and applied the same MMseqs2 and Foldseek clustering procedure used for the AFDB. We conducted an ECOD cluster purity analysis on all non-singleton clusters by measuring the pairwise cluster member consistency at different hierarchy levels. The analysis revealed high average consistency rates of 99.6%, 98.6%, 97.4%, 96.8% and 72.8% for ECOD’s A-group, X-group, H-group, T-group and F-group, respectively. This indicates an effective clustering of homologous proteins, demonstrating a nearly exclusive distinction between homologues and analogues. The high level of consistency in our clustering is mainly attributed to the stringent E-value of 10−2; when raising it to 10, the consistencies decrease to 69.7%, 55.7%, 53.3%, 51.9% and 36.6%, respectively. A similar result was observed using the MALISAM database45, a single-domain database of analogous protein domains. When clustering the 260 protein structures within the MALISAM database with Foldseek’s default parameters, no clustering of analogues occurs. However, if we increase the E-value threshold, we begin to form clusters containing analogues.
Cluster purity analysis
To assess cluster purity, we followed a two-step approach. First, we calculated the average LDDT and TM score per cluster to assess the structural similarity. For this, we aligned the representative to the cluster members using the structurealign -e INF -a module in Foldseek and reported the alignment LDDT and TM score using –format-output lddt,alntmscore. For each cluster we computed the mean illustrated in Fig. 1c.
Second, we evaluated the Pfam consistency of each cluster by using Pfam labels obtained from UniProtKB. We took into account only the clusters that have at least two sequences with Pfam annotations and we calculated the fraction of correctly covered Pfam domains for all Pfam sequence pairs ignoring self-comparison. We define true positives as a pair of Pfam domains belonging to the same clan. For each pair, we computed the consistency scores by true-positive count divided by the count of Pfams in the reference sequence. Finally, we computed the mean overall pair scores. This approach enabled us to determine the proportion of sequences within a given cluster that shared the same Pfam annotation.
Finally, we also calculated the EC number consistency of each cluster. EC numbers were extracted from UniProtKB. The EC consistency was evaluated similarly to the Pfam consistency but was done four times according to the four classes of the EC number. We considered only the clusters with at least two sequences that have EC annotations. At each class of the EC number, the annotation without any code at the class was ignored. For each pair as the Pfam consistency, the consistency scores were computed by the true-positive count divided by the number of ECs in the sequences in the pair avoiding self-comparison. The scores were finally computed to the mean overall pair scores.
Dark clusters and LCA
To eliminate clusters similar to previously known experimental structures, we conducted a search using Foldseek against the PDB (v.2022-10-14) for each cluster representative, with an E-value threshold of 0.1. We then excluded clusters annotated with Pfam domains by searching the cluster representatives using MMseqs2 with parameters -s 7.5 –max-seqs 100000 -e 0.001 against the Pfam database. Finally, we removed clusters with members annotated with Pfam or TIGRFAM20 annotations in the UniProt/TrEMBL and SwissProt database. To determine the LCA of each cluster, we used the lca module in MMseqs2 (ref. 46) ignoring the two taxa (1) 12,908 unclassified sequences and (2) 28,384 other sequences. We visualized the LCA results using a Sankey plot generated by Pavian47.
Prediction of functions and pockets
We predicted small-molecule-binding sites for representative dark cluster members by adapting a previously described approach9. We used AutoSite to predict pockets48, and selected pockets with an AutoSite empirical composite score of >60 and mean pocket residue pLDDT of >90 for additional analyses. To assign putative function and predict catalytic residues, we used DeepFRI49 to predict enriched GO/EC terms and residue-level saliency weights across available GO/EC categories (BP, CC, EC, MF). Pocket and functional predictions were then visually examined using a web app (Data Availability).
Domain prediction from local alignments
First, we filtered out low-scoring Foldseek hits using an E-value of 10−3 as the threshold. We defined potential domain boundary positions for each protein sequence by clustering start–stop positions (hierarchical clustering, height parameter of 250 to establish clusters). Predicted domains were then linked to others on the basis of structural similarities, retaining the highest scores when duplicates were found. The resulting network was then trimmed excluding connections with E-value higher than 10−5, predicted domains with more than 350 amino acids and connected components with less than 5 nodes. We applied graph-based clustering (walktrap, 6 steps), keeping communities with at least 5 members. Each predicted domain inside the selected communities was annotated using Pfam-A regions mapped to UniProt identifiers (v.35.0), more than 75% of the Pfam domain has to overlap with the predicted domain. We calculated inside each community the frequency of Pfam annotations and defined them on the basis of the highest one. Owing to its size, we decided to keep out of the following analysis one community with 152,959 structures (group ID 1;1, see supplementary files at https://cluster.foldseek.com/). We connected the remaining communities on the basis of the structure similarities, allowing connections with a P < 10−3.
We developed a web server to allow for user-friendly exploration of clusters, their members and related similar clusters. The server was implemented using a REST-based client-server architecture, with a VueJS front-end and a NodeJS back-end. The clustering-related information is accessed through an SQLite database and information related to individual structures through Foldseek compatible databases through a C++-based NodeJS-extension for fast read-in and search. Similar to the Foldseek webserver, we used NGL50 to visualize structures and WebAssembly-based versions of PULCHRA51 to restore full protein structures from our stored C-alpha traces and TM-align for pairwise structure alignments of cluster members to their representatives. To visualize the taxonomic distribution, we implemented Sankey diagrams inspired by Pavian. Clusters can be found through member UniProt accessions, through a Foldseek search to similar clusters or by searching for GO terms. Individual cluster members can be further explored with links to UniProt, the Foldseek webserver and the UniProt3D Atlas52.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.