By: Ofir Ezrielev
Good night, friends. Today we are back in our DeepNightLearners section with a review of a Deep Learning article. This time I've chosen to review: Highly accurate protein structure prediction with AlphaFold.
Reading recommendation: An exciting read, especially for those who are also interested in bioinformatics. The amount of resources allocated to this research, and its results, are fascinating. Beyond the innovations for solving this specific problem, the authors present new general techniques as well.
Clarity of writing: Medium-high
Required math and ML/DL knowledge level to understand the article: Knowledge of Deep Learning is required. Familiarity with bioinformatics terminology is also recommended.
Practical applications: This is the strongest tool available today for predicting protein structure. It is going to be used extensively to push protein research forward, both in pure science and in protein engineering. However, although DeepMind released the model together with the network weights, the weights may not be used commercially.
- Predicting protein 3D structure
Mathematical Tools, Concepts and Marks:
- Graph Neural Networks (GNNs)
Introduction and general explanation:
Proteins are responsible for most of the processes that we call "life". Proteins are created by ribosomes from RNA in a process called translation. The RNA itself is usually transcribed from DNA (except in some viruses).
Proteins are complex because they are composed of 20 different building blocks (the 20 standard amino acids) and can contain anywhere from dozens to thousands of these blocks. There are now over two billion types of proteins that we can distinguish. Levinthal's paradox captures this complexity: given the enormous number of conformations (3D structures) that a protein could adopt, how does it converge to a single conformation (or a few conformations) in such a short time? Apart from the 20 basic blocks, there is also a wide range of other molecules that meet the definition of an amino acid. These additional amino acids can become part of a protein through expansion and reprogramming of the genetic code, either via modifications of amino acids within a protein or via modern attempts to use them in protein engineering.
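To get a feel for the size of the search space behind Levinthal's paradox, here is a back-of-the-envelope calculation. All three constants are illustrative assumptions rather than numbers from the article: 3 possible conformations per amino acid, a chain of 100 amino acids, and a (very generous) sampling rate of 10^13 conformations per second.

```python
# Back-of-the-envelope illustration of Levinthal's paradox.
# All constants below are illustrative assumptions, not measured values.
STATES_PER_RESIDUE = 3          # assumed conformations per amino acid
CHAIN_LENGTH = 100              # assumed protein length
SAMPLES_PER_SECOND = 10**13     # assumed sampling speed
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def conformation_count(states_per_residue, chain_length):
    """Number of conformations if every residue picks its state independently."""
    return states_per_residue ** chain_length


def exhaustive_search_years(n_conformations, samples_per_second):
    """Whole years needed to visit every conformation exactly once."""
    return n_conformations // (samples_per_second * SECONDS_PER_YEAR)


total = conformation_count(STATES_PER_RESIDUE, CHAIN_LENGTH)
years = exhaustive_search_years(total, SAMPLES_PER_SECOND)

# Report orders of magnitude only; exact integers are not the point here.
print(f"conformations: roughly 10^{len(str(total)) - 1}")
print(f"years to enumerate them all: roughly 10^{len(str(years)) - 1}")
```

Even under these friendly assumptions, an exhaustive search would take vastly longer than the age of the universe, while real proteins fold in a fraction of a second to minutes.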
Because proteins are so important, we want to understand them for a variety of goals. For example:
- to understand biological mechanisms
- to develop medications and vaccines
- for various industrial uses
To understand how a specific protein works, we must figure out its structure. As Levinthal's paradox suggests, it is hard to determine a protein's 3D structure from its amino acid sequence alone. There are physical (experimental) methods for determining protein structure, but they demand significant time and resources (for new proteins they can sometimes take years), are expensive, and do not always succeed. Therefore, there is a need for computational methods that can predict protein structure - a task that over the years has turned out to be very complex.
Google's DeepMind has once again demonstrated its contribution to basic research, introducing a solution to this problem that significantly surpasses other academic solutions (full disclosure: the research lab where I am doing my master's degree works on this problem, among other topics).
The article in brief:
The AlphaFold2 system (which the authors simply call AlphaFold, so we will too) is very complex and contains many new architectures that were built especially for it. We will gradually decompose the system into its base components. AlphaFold includes 5 different models which work together as an ensemble.
AlphaFold maintains two different representations which are updated in parallel:
- MSA (multiple sequence alignment) representation - We align "our" protein with similar proteins so that the similarities between them are properly captured by maximizing an alignment score. In an MSA, every row represents a different protein, and every column represents the same position across the aligned sequences. A column gets a higher score when it has fewer gaps or fewer different amino acids in it. The algorithm uses a sequence dataset for this alignment.
- Pair representation - a structure is represented in a latent way through a matrix that contains the distances between every amino acid pair in the protein. In such a representation, one should validate that the distances obey the rules of a proper 3D system (such as the triangle inequality).
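The two raw ingredients described above can be illustrated with a small sketch. This is a toy under stated assumptions, not what AlphaFold or real alignment tools compute: `column_score` is a hypothetical conservation score that simply rewards columns with few gaps and few distinct amino acids, and the coordinates are made-up points used to build a distance matrix and check the triangle inequality.

```python
import math
from collections import Counter
from itertools import permutations

GAP = "-"


def column_score(column):
    """Toy score: fraction of rows that hold the column's most common amino acid.

    Gaps never count as a match, so more gaps or more variety lowers the score.
    """
    residues = [aa for aa in column if aa != GAP]
    if not residues:
        return 0.0
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count / len(column)


def msa_column_scores(msa):
    """msa: list of equal-length aligned strings; the first row is 'our' protein."""
    return [column_score(column) for column in zip(*msa)]


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def obeys_triangle_inequality(dist, tol=1e-9):
    """True if d(i, j) <= d(i, k) + d(k, j) for every triple of distinct indices."""
    n = len(dist)
    return all(
        dist[i][j] <= dist[i][k] + dist[k][j] + tol
        for i, j, k in permutations(range(n), 3)
    )


# Made-up alignment: column 4 is mostly gaps, column 2 has two amino acids.
msa = ["MKV-L", "MKVAL", "MRV-L"]
scores = msa_column_scores(msa)

# Made-up coordinates, so the resulting distances are valid by construction.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
valid = distance_matrix(coords)

# A hand-written matrix that no set of 3D points could produce.
invalid = [[0.0, 1.0, 10.0], [1.0, 0.0, 1.0], [10.0, 1.0, 0.0]]
```

Distances computed from real coordinates always pass the check, while a predicted matrix (like `invalid`) may not, which is why such a representation needs validating.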
Both representations are processed as follows:
- They are passed through 48 Evoformer layers
- The MSA row that represents "our" protein and the pair representation are passed through 8 blocks of the structure module
- The output structure goes through fine-tuning to get the final structure
- Finally, after the representations have passed through the Evoformer and the structure module, they are recycled: a reversed "skip-connection" adds them back to the pre-Evoformer representations. This recycling is repeated 3 times.
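The control flow of the steps above can be sketched as follows. This is only a schematic under loud assumptions: the representations are flat lists of numbers, `toy_evoformer_block` and `toy_structure_block` are hypothetical placeholders that do no real work, the final fine-tuning step is left out, and "3 times" is read here as one initial pass followed by 3 recycling passes.

```python
NUM_EVOFORMER_BLOCKS = 48
NUM_STRUCTURE_BLOCKS = 8
NUM_RECYCLES = 3  # assumption: 1 initial pass + 3 recycling passes


def add(a, b):
    """Element-wise sum, standing in for the 'reversed skip-connection'."""
    return [x + y for x, y in zip(a, b)]


def forward(msa_in, pair_in, evoformer_block, structure_block):
    """Run the Evoformer stack, then the structure module, and recycle the outputs."""
    recycled_msa = [0.0] * len(msa_in)
    recycled_pair = [0.0] * len(pair_in)
    structure = None
    for _ in range(1 + NUM_RECYCLES):
        # Recycling: previous outputs are added to the pre-Evoformer inputs.
        msa = add(msa_in, recycled_msa)
        pair = add(pair_in, recycled_pair)
        for _ in range(NUM_EVOFORMER_BLOCKS):
            msa, pair = evoformer_block(msa, pair)
        single = msa  # toy stand-in for the MSA row of "our" protein
        for _ in range(NUM_STRUCTURE_BLOCKS):
            single, structure = structure_block(single, pair, structure)
        recycled_msa, recycled_pair = msa, pair
    return structure


# Hypothetical no-op blocks that only count how often they are called.
calls = {"evoformer": 0, "structure": 0}


def toy_evoformer_block(msa, pair):
    calls["evoformer"] += 1
    return msa, pair


def toy_structure_block(single, pair, structure):
    calls["structure"] += 1
    return single, list(single)


result = forward([1.0, 2.0], [0.5], toy_evoformer_block, toy_structure_block)
print(calls)
```

With identity blocks the only visible effect is the recycling itself: each pass adds the previous output to the original input, so the toy values grow with every cycle.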
The Evoformer layers are composed of several sub-layers with skip-connections between them.
The MSA representation passes through the following layers:
The pair representation is handled as follows:
- The outer product of the MSA representation with itself, averaged over the sequences and taken after the transition layer, is added to the pair representation
- The result is updated over triplets of amino acids in a process called "triangle multiplicative update" - updating one edge of a triangle based on its two other edges
- The edges of these triangles are then updated using self-attention
- Then it goes through another update with a transition layer, similarly to the MSA representation
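The first two steps of this list, feeding MSA information into the pair representation and the triangle multiplicative update, can be sketched in a heavily simplified form. The assumptions are significant: every feature is a single scalar instead of a vector of channels, and the learned projections, gating and normalization of the real layers are omitted entirely. The function names are illustrative, not taken from the released code.

```python
def outer_product_mean(msa):
    """msa[s][i]: scalar feature of residue i in sequence s.

    Returns an n_res x n_res matrix: the outer product of each row with itself,
    averaged over all sequences.
    """
    n_seq, n_res = len(msa), len(msa[0])
    return [
        [sum(row[i] * row[j] for row in msa) / n_seq for j in range(n_res)]
        for i in range(n_res)
    ]


def triangle_multiplicative_update(pair):
    """Update edge (i, j) from the two other edges (i, k) and (j, k) of every triangle.

    The update is added back to the input, i.e. a skip-connection.
    """
    n = len(pair)
    update = [
        [sum(pair[i][k] * pair[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
    return [[pair[i][j] + update[i][j] for j in range(n)] for i in range(n)]


# Toy inputs: 2 sequences x 2 residues, and a 3 x 3 pair matrix.
toy_msa = [[1.0, 2.0], [3.0, 4.0]]
msa_contribution = outer_product_mean(toy_msa)

toy_pair = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
updated_pair = triangle_multiplicative_update(toy_pair)
```

Note how the edge between residues 0 and 1 changes only through the third residue: its update is the product of edges (0, 2) and (1, 2), which is the "triangle" in the name.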
The Structure Module translates and rotates the amino acids in the protein. These updates are based on neighboring amino-acid triangles of nodes and edges in the protein graph. The MSA row representing "our" protein is updated using invariant point attention and a skip-connection around it, based on the pair representation and the amino acid triplets.
The result is used to update the rotations and translations of the amino acid triplets. The pair representation is not updated in this module.
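To make "updating rotations and shifts" concrete, here is a minimal sketch of a rigid frame (a rotation matrix plus a translation vector) and of composing it with an update. It assumes nothing about how AlphaFold predicts the updates; the helper names and the example values are hypothetical, and the rotation is restricted to the z axis to keep the code short.

```python
import math


def rotation_z(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matvec(m, v):
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)] for r in range(3)]


def apply_frame(frame, point):
    """Map a point from the frame's local coordinates to global ones: R @ x + t."""
    rotation, translation = frame
    rotated = matvec(rotation, point)
    return [rotated[i] + translation[i] for i in range(3)]


def compose(frame, update):
    """Apply `update` inside the local coordinates of `frame`.

    (R, t) composed with (Ru, tu) gives (R @ Ru, R @ tu + t).
    """
    rotation, _ = frame
    update_rotation, update_translation = update
    return (matmul(rotation, update_rotation), apply_frame(frame, update_translation))


IDENTITY = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])

# Hypothetical update: rotate 90 degrees about z, then shift by 1 along x.
update = (rotation_z(math.pi / 2), [1.0, 0.0, 0.0])
frame = compose(IDENTITY, update)

# Two atoms given in local coordinates; a rigid move keeps their distance intact.
atom_a, atom_b = [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]
moved_a, moved_b = apply_frame(frame, atom_a), apply_frame(frame, atom_b)
```

Because only rotations and translations are applied, the internal geometry of each rigid group is preserved; only its position and orientation relative to the rest of the protein change.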
AlphaFold includes several types of loss functions, which together make up the overall system loss:
- The final pair representation is discretized and scored with a cross-entropy loss.
- The initial MSA representation is processed similarly to BERT, with random masking: the network must fill in the masked positions, and a similar cross-entropy loss scores its guesses.
- An additional loss is calculated on the local distances between the alpha carbons of the main amino-acid chain.
- During the final fine-tuning process, a loss scores violations of structural constraints.
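The first two loss terms in the list above share the same shape: a cross-entropy between a predicted probability distribution and a discrete target. The sketch below shows that shape and how terms can be combined into one number. The bin edges, the 4-letter alphabet, the probabilities and the weights are all hypothetical values chosen for illustration; the article does not state the actual ones.

```python
import math


def distance_to_bin(distance, bin_edges):
    """Index of the first bin whose upper edge is >= distance; the last bin is open-ended."""
    for index, edge in enumerate(bin_edges):
        if distance <= edge:
            return index
    return len(bin_edges)


def cross_entropy(probabilities, true_index, eps=1e-12):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_index] + eps)


def distogram_loss(predicted_bins, true_distances, bin_edges):
    """Mean cross-entropy over residue pairs, after discretizing the true distances."""
    losses = [
        cross_entropy(predicted_bins[pair], distance_to_bin(distance, bin_edges))
        for pair, distance in true_distances.items()
    ]
    return sum(losses) / len(losses)


def masked_msa_loss(predicted_residues, true_residues, alphabet):
    """Mean cross-entropy over the masked MSA positions only (BERT-style)."""
    losses = [
        cross_entropy(probs, alphabet.index(true_residues[position]))
        for position, probs in predicted_residues.items()
    ]
    return sum(losses) / len(losses)


def total_loss(terms, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * value for name, value in terms.items())


BIN_EDGES = [4.0, 8.0, 12.0]   # hypothetical edges in angstroms -> 4 bins
ALPHABET = "ACDE"              # toy alphabet; real proteins use 20 letters
WEIGHTS = {"distogram": 1.0, "masked_msa": 1.0}  # hypothetical weights

terms = {
    "distogram": distogram_loss({(0, 1): [0.1, 0.7, 0.1, 0.1]}, {(0, 1): 6.0}, BIN_EDGES),
    "masked_msa": masked_msa_loss({2: [0.25, 0.25, 0.4, 0.1]}, {2: "D"}, ALPHABET),
}
loss = total_loss(terms, WEIGHTS)
```

The remaining terms (local alpha-carbon distances, structural violations) would enter `terms` in the same way, each with its own weight.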
The main dataset for the algorithm is the Protein Data Bank (PDB), which contains experimentally determined protein structures. This data was enriched with protein sequence data (for the MSA) from the Big Fantastic Database (BFD), which includes over 2.2 billion protein sequences from over 66 million protein families. As a further enrichment, AlphaFold uses its own 3D structure predictions, where its confidence is high, as additional training data (semi-supervised learning).
The algorithm was tested in the CASP14 competition, which is considered the gold standard in the field of protein structure prediction. Every two years, the community is challenged to predict the structures of proteins whose structures are about to be determined experimentally within a few months. Hence, it acts as a blind test set, with no possibility of cheating.
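The idea of reusing confident predictions as training data can be sketched as a simple filter. The `PredictedStructure` class, the scalar confidence in [0, 1] and the 0.9 threshold are all hypothetical; the article does not say how confidence is measured or where the cut-off lies.

```python
from dataclasses import dataclass


@dataclass
class PredictedStructure:
    sequence: str
    confidence: float  # hypothetical scalar confidence in [0, 1]


def select_confident(predictions, threshold):
    """Keep only the predictions the model itself is confident about."""
    return [p for p in predictions if p.confidence >= threshold]


CONFIDENCE_THRESHOLD = 0.9  # hypothetical cut-off

predictions = [
    PredictedStructure("MKVAL", 0.95),
    PredictedStructure("MRVGL", 0.40),
    PredictedStructure("MKIAL", 0.91),
]

# These would be added to the experimentally determined structures for training.
extra_training_data = select_confident(predictions, CONFIDENCE_THRESHOLD)
print([p.sequence for p in extra_training_data])
```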
The article achievements:
AlphaFold is currently the strongest tool for predicting protein 3D structure, leaving all the competition far behind.
The capabilities it has demonstrated are expected to dramatically advance the field of protein research, in terms of protein function, disease research, medications, vaccines and more. DeepMind has released both the model and the weights for free use in academic research.
However, AlphaFold depends heavily on the amount of data it has on similar proteins (the MSA). The researchers state that for proteins with fewer than 30 similar but different proteins, the prediction quality drops significantly. As a result, its generalization capabilities are limited, and it cannot handle random proteins well. Furthermore, it still cannot handle proteins containing amino acids other than the 20 standard ones, and therefore it cannot yet be used for research that involves additional amino acids (genetic code expansion).
Currently, the field of protein structure prediction is going through a dramatic upheaval, with the realization that research laboratories can't compete with the amount of resources available to companies such as DeepMind. The domain is in turmoil, and future CASP competitions may even be canceled.
Written by Ofir Ezrielev.
Ofir is a data scientist at Dell EMC and an MSc student at BGU in bioinformatics and DS. He enjoys finding new solutions to odd and unique problems, and he has several patents in the making.