Computer taught to intuitively predict chemical properties of molecules
Scientists from MIPT's Research Center for Molecular Mechanisms of Aging and Age-Related Diseases together with Inria research center, Grenoble, France have developed a software package called Knodle to determine an atom's hybridization, bond orders and functional groups' annotation in molecules. The program streamlines one of the stages of developing new drugs. A paper on the new development has been published in the Journal of Chemical Information and Modeling.
Imagine that you were to develop a new drug. Designing a drug with predetermined properties is called drug-design. Once a drug has entered the human body, it needs to take effect on the cause of a disease. On a molecular level this is a malfunction of some proteins and their encoding genes. In drug-design these are called targets. If a drug is antiviral, it must somehow prevent the incorporation of viral DNA into human DNA. In this case the target is viral protein. The structure of the incorporating protein is known, and we also even know which area is the most important – the active site. If we insert a molecular "plug" then the viral protein will not be able to incorporate itself into the human genome and the virus will die. It boils down to this: you find the "plug" – you have your drug.
But how can we find the molecules required? Researchers use an enormous database of substances for this. There are special programs capable of finding a needle in a haystack; they use quantum chemistry approximations to predict the place and force of attraction between a molecular "plug" and a protein. However, databases only store the shape of a substance; information about atom and bond states is also needed for an accurate prediction. Determining these states is what Knodle does. With the help of the new technology, the search area can be reduced from hundreds of thousands to just a hundred. These one hundred can then be tested to find drugs such as Reltagravir – which has actively been used for HIV prevention since 2011.
From science lessons at school everyone is used to seeing organic substances as letters with sticks (substance structure), knowing that in actual fact there are no sticks. Every stick is a bond between electrons which obeys the laws of quantum chemistry. In the case of one simple molecule, like the one in the diagram, the experienced chemist intuitively knows the hybridizations of every atom (the number of neighboring atoms which it is connected to) and after a few hours looking at reference books, he or she can reestablish all the bonds. They can do this because they have seen hundreds and hundreds of similar substances and know that if oxygen is "sticking out like this", it almost certainly has a double bond. In their research, Maria Kadukova, an MIPT student, and Sergei Grudinin, a researcher from Inria research center located in Grenoble, France, decided to pass on this intuition to a computer by using machine learning.
Compare "A solid hollow object with a handle, opening at the top and an elongation at the side, at the end of which there is another opening" and "A vessel for the preparation of tea". Both of them describe a teapot rather well, but the latter is simpler and more believable. The same is true for machine learning, the best algorithm for learning is the simplest. This is why the researchers chose to use a nonlinear support vector machines (SVM), a method which has proven itself in recognizing handwritten text and images. On the input it was given the positions of neighboring atoms and on the output collected hybridization.
Good learning needs a lot of examples and the scientists did this using 7605 substances with known structures and atom states. "This is the key advantage of the program we have developed, learning from a larger database gives better predictions. Knodle is now one step ahead of similar programs: it has a margin of error of 3.9%, while for the closest competitor this figure is 4.7%", explains Maria Kadukova. And that is not the only benefit. The software package can easily be modified for a specific problem. For example, Knodle does not currently work with substances containing metals, because those kind of substances are rather rare. But if it turns out that a drug for Alzheimer's is much more effective if it has metal, the only thing needed to adapt the program is a database with metallic substances. We are now left to wonder what new drug will be found to treat a previously incurable disease.