Over the past decade, computational materials discovery has grown continually, fueled by advances in computing power and discovery algorithms. Unfortunately, most discovery algorithms are limited in the space they can explore because of the computational cost of the quantum mechanical simulations required to relax a candidate structure and evaluate its energy. Past studies have bypassed many of these simulations by using machine learning models to predict a material's stability. However, the arrangement of atoms in structures produced by discovery algorithms often deviates from that of the training structures, leading to poor predictions.
To give an analogy, imagine you have been employed at a fruit farm and tasked with determining the sweetness of the fruit. During training for the job, you are handed a basket of fruits, each picked after reaching perfect ripeness, along with a sweetness rating for each one. You intently taste each piece of fruit and study its sweetness score. On your first day, you go to the farm, pick a basket of fruit, and begin your work. You select a banana that looks far greener than any banana you tried during training. You take a bite and are shocked by the bitterness; you discard the fruit and give it a poor sweetness score. This cycle repeats, and every fruit you analyze receives a sweetness score far worse than any fruit from your training. But since you were never told that the relevant metric was the sweetness of the fruit once ripened, you report the poor scores.
The above analogy (illustrated in Fig. 1) is essentially how we train machine learning models to predict the thermodynamic stability of materials. Currently, we train the models on structures that have been allowed to relax (ripen) and teach them to predict the structures' formation energy (sweetness). Then, when the models are used in crystal structure prediction algorithms, we ask them to predict the formation energies of unrelaxed structures. Much like taking a bite of unripe fruit and reporting it to be less sweet, the machine learning models predict the structures to be higher in formation energy than their relaxed state, leading to significant prediction errors. To some extent, it seems wrong to even call these errors, because the model is making a prediction on the structure as presented, and as presented, the structure is higher in energy.
Changing the question we ask our model from "What is the formation energy of this structure?" to "What will the formation energy of this structure be once it is relaxed?" requires changing, or augmenting, how we present the potential energy landscape (PEL) to the ML model. Asking the correct question requires representing the PEL as a step function: every structure within a local basin maps to the energy of that basin's minimum.
Our approach to representing the PEL as a step function is essentially to take a relaxed structure and make it look as if it were produced by a material discovery algorithm. We do this by measuring how far atoms in structures produced by a discovery algorithm move during relaxation, then perturbing the atomic coordinates of every relaxed structure in our training set according to that displacement distribution. We then map both the perturbed and relaxed structures to the relaxed formation energy and train our ML model on this augmented dataset.
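The augmentation step can be sketched in a few lines. The following is a minimal illustration, not the paper's actual code: the function name, the use of Gaussian noise, and the `sigma` value are assumptions; in practice the noise scale would be fit to the atomic displacements observed during real relaxations.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_structure(relaxed_coords, relaxed_energy, n_perturbed=5, sigma=0.2):
    """Generate perturbed copies of a relaxed structure, all labeled with the
    relaxed formation energy. `sigma` (here in the same units as the
    coordinates) stands in for the observed relaxation-displacement scale."""
    # Keep the relaxed structure itself, also labeled with the relaxed energy.
    samples = [(relaxed_coords.copy(), relaxed_energy)]
    for _ in range(n_perturbed):
        # Displace every atomic coordinate by random noise, but keep the label
        # pinned to the relaxed energy -- this is what makes the PEL a step
        # function from the model's point of view.
        noise = rng.normal(scale=sigma, size=relaxed_coords.shape)
        samples.append((relaxed_coords + noise, relaxed_energy))
    return samples

# Toy example: a 3-atom structure with Cartesian coordinates (Angstroms).
coords = np.array([[0.0, 0.0, 0.0],
                   [1.5, 0.0, 0.0],
                   [0.0, 1.5, 0.0]])
augmented = augment_structure(coords, relaxed_energy=-1.23)
print(len(augmented))  # 6 samples: 1 relaxed + 5 perturbed, all labeled -1.23
```

Every (structure, label) pair produced this way shares the same target energy, so a model trained on the augmented set learns to predict the basin minimum rather than the energy of the exact configuration it is shown.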
Our results show that this data augmentation method is a viable way to predict the relaxed formation energy of unrelaxed structures. To summarize our findings: when training on only relaxed structures, there was a strong anti-correlation between relaxed and unrelaxed predictions, whereas training on the augmented dataset produced a strong correlation between the two. Training on the augmented dataset also yielded a three-fold reduction in the prediction MAE on unrelaxed structures compared to a model trained only on relaxed structures. Finally, when used to screen materials for thermodynamic stability, the model trained on the augmented dataset was five times more efficient than the model trained on only relaxed structures.
To conclude this blog, I will elaborate a bit more on the concluding sentence of the paper, which reads: "While there likely exist more advanced augmentation techniques, this work showed the surprising effectiveness of a relatively simple method of augmentations that outperformed the current state of the art in formation energy prediction of unrelaxed structures." As scientists, we sometimes tend to make things more complex than necessary. However, the paper shows that a simple approach can be sufficient. That said, much improvement can be made to the data augmentation method described here. I hope this work motivates others to develop better augmentation techniques to enable the study of more complex materials using material discovery algorithms.