Council for the Advancement of Science Writing

Unearthing new materials with the aid of machine learning

By Kara Manke

Alán Aspuru-Guzik is building facial recognition software—for molecules. 

Aspuru-Guzik, a professor of chemistry at Harvard University, uses computers to explore chemical space—the near-infinite array of molecules that can be created by joining tens or hundreds of atoms together into different shapes. With the help of quantum chemistry calculations, he and his team sift through millions of virtual molecules in search of promising new materials for solar energy generation and storage.

And they are applying machine learning to speed up the process.

“Machine learning is basically a rebranding or a renaming of the old field of artificial intelligence,” said Aspuru-Guzik. “They have successfully used it to filter spam every day, to do very intelligent translations and image recognition … so it’s not too hard to think that machine learning could have a very big impact in chemistry.”

Finding a molecule in a haystack

Out of the approximately 10^60 possible molecules that can be formed by linking tens or hundreds of atoms together, how do chemists identify which ones could be useful as new solar cell materials or pharmaceutical drugs?

One approach is to make individual molecules in the lab, and then test to see what they can do. But theoretical chemists like Aspuru-Guzik prefer to simulate molecules in a computer and use quantum calculations to estimate their properties. Once the computer has identified promising candidates, experimentalists can focus their energy on making only those.

“You can make a molecule maybe every ten weeks with your hands, but you can compute it in seconds,” said Aspuru-Guzik at CASW’s New Horizons in Science briefing during the ScienceWriters2014 conference in Columbus, Ohio, on Oct. 19. “We want to use computers more and more because making [molecules] is so hard.”

As part of the Harvard Clean Energy Project, for instance, Aspuru-Guzik and his team used quantum chemistry calculations to screen 2.3 million molecules for organic solar cell devices and found 35,000 highly efficient candidates.

But though it is faster and cheaper to simulate a molecule before making it in real life, quantum chemical calculations are still costly in terms of computing time. Aspuru-Guzik estimates that it took over 20,000 CPU years to screen the 2.3 million molecules—that is, it would take a single computer 20,000 years to run all of these calculations.

With the help of the IBM World Community Grid, which relies on donated computer time from a network of volunteers, these calculations were completed in three years. But to accelerate the search, Aspuru-Guzik is using machine learning to rapidly recognize some of the most promising molecules.
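
To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted above (20,000 CPU years, 2.3 million molecules, three years on the grid); the hours-per-year constant is an assumption, and because the 20,000 figure is described as a lower bound, the results are ballpark values rather than measurements.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the article.
HOURS_PER_YEAR = 365.25 * 24   # assumption: an average calendar year

cpu_years = 20_000             # total compute ("over 20,000 CPU years")
molecules = 2_300_000          # molecules screened
wall_clock_years = 3           # time the volunteer grid actually needed

# Average cost of one quantum chemistry calculation.
cpu_hours_per_molecule = cpu_years * HOURS_PER_YEAR / molecules

# How many processors would have to run nonstop to finish in three years.
average_parallel_cpus = cpu_years / wall_clock_years

print(f"~{cpu_hours_per_molecule:.0f} CPU hours per molecule")   # ~76
print(f"~{average_parallel_cpus:.0f} CPUs busy on average")      # ~6667
```

In other words, each molecule costs on the order of three days of processor time, which is why a shortcut that answers in seconds is attractive.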

Teaching a robot to speak chemistry

A molecule is built of atoms that have been linked together to form different shapes: some of the more common geometries include pentagons, hexagons, and long chains. And just like a puzzle piece or a key, a molecule’s shape usually dictates what it can do.  

Chemists know the language of molecular shape: they can look at how a group of atoms is arranged in a molecule—say, six carbon atoms joined in a ring—and guess what properties that molecule will have. Of course, as molecules become larger and more complex, this becomes harder and harder for humans to do.

Aspuru-Guzik is teaching computers how to read this molecular language. To educate the computer, he says, a molecular structure is converted into a so-called fingerprint, and each molecular fingerprint is associated with a particular property—say, the solar cell efficiency of that molecule. After being given thousands of molecular fingerprints, the computer builds a model to relate molecular structure to molecular properties.

“So if I give you a new fingerprint—and I ask ‘what is its solar cell efficiency?’—instead of evaluating the quantum chemistry calculation, I can use the [machine learning] model, which will really quickly tell me the answer,” he said in an interview.
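
The idea can be sketched in a few lines of Python. This is a toy illustration only, not the group's actual method: the article does not say which fingerprints or which learning algorithm are used. Here a fingerprint is assumed to be a set of hypothetical structural features, the model is a simple similarity-weighted nearest-neighbor average, and the efficiency numbers are invented for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Shared features divided by all distinct features (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(fingerprint, training_set, k=2):
    """Similarity-weighted average over the k most similar known molecules."""
    scored = sorted(
        ((tanimoto(fingerprint, fp), value) for fp, value in training_set),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_weight = sum(sim for sim, _ in scored)
    if total_weight == 0:
        # Nothing similar is known: fall back to the plain training mean.
        return sum(value for _, value in training_set) / len(training_set)
    return sum(sim * value for sim, value in scored) / total_weight


# Made-up training data: (fingerprint, efficiency in percent).
# Feature names and values are hypothetical, chosen only for illustration.
TRAINING = [
    (frozenset({"ring6", "S", "chain"}), 8.0),
    (frozenset({"ring6", "ring5", "S"}), 9.0),
    (frozenset({"chain", "O"}), 2.0),
    (frozenset({"chain", "O", "N"}), 3.0),
]

new_molecule = frozenset({"ring6", "ring5", "S", "chain"})
estimate = predict(new_molecule, TRAINING)
print(f"Estimated efficiency: {estimate:.1f}%")   # 8.5% with this toy data
```

The new molecule shares most of its features with the two ring-containing examples, so the estimate lands between their values without any quantum calculation being run.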

This approach isn’t new. Since the 1990s, scientists have employed machine learning to anticipate a molecule’s pharmaceutical activity based on its structure. And more recently, a similar method has been implemented to predict the crystal structures of novel materials.

But Aspuru-Guzik and his group are the first to apply this method to the search for cheaper and more robust materials for organic solar cells, flow batteries (akin to fuel cells) and LEDs. So far their models have been successful: with machine learning, they predicted top candidates for organic solar cell materials.

“In some sense we have increased our throughput by millions because we taught a little machine-learning robot to actually sift through molecules like your camera sifts through faces,” Aspuru-Guzik said in his presentation. “And when I get good molecules from that model, then I can throw them at quantum chemistry.”
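
The two-stage workflow he describes, a fast filter followed by a slow but more trustworthy calculation, might look roughly like the sketch below. Everything in it is a stand-in: the molecule names, the scores, and both scoring functions are hypothetical placeholders for a trained model and a real quantum chemistry run.

```python
def screen(candidates, cheap_score, expensive_score, keep=2):
    """Rank every candidate with the cheap model, then run the costly
    calculation only on the top `keep` of them."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    return {name: expensive_score(name) for name in shortlist}


# Made-up quick guesses from a (hypothetical) machine-learning model.
FAST_GUESS = {"mol_a": 0.2, "mol_b": 0.9, "mol_c": 0.7, "mol_d": 0.1}
expensive_calls = []   # records how often the slow path is taken


def cheap_score(name):
    return FAST_GUESS[name]


def expensive_score(name):
    expensive_calls.append(name)
    return FAST_GUESS[name]   # placeholder for a slow quantum calculation


results = screen(FAST_GUESS, cheap_score, expensive_score)
print(list(results))          # ['mol_b', 'mol_c']
print(len(expensive_calls))   # 2 slow calculations instead of 4
```

The saving scales with the size of the library: the slow calculation runs only on the shortlist, however many candidates the fast model has to look at first.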

Sailing the chemical seas

Aspuru-Guzik compares the current efforts in materials discovery to early sea exploration. “When the Polynesians were exploring, you had a huge ocean with very few islands,” he said in an interview. “Molecular space is infinite. We are using machine learning and quantum chemistry to really quickly explore it, and I’m pretty sure we will find those islands.” Plus, he said, those who use such techniques first will reap the biggest returns.

Image credit: Rafa Gomez-Bombarelli, Timothy Hirzel, Jorge Aguilera-Iparraguirre - Aspuru-Guzik Group, Harvard University


Kara Manke is a Ph.D. student studying Physical Chemistry at the Massachusetts Institute of Technology. In her graduate work, she uses high-frequency sound waves to study the properties of complex materials, including viscous liquids and nanostructured composites, under the advisement of Keith Nelson. She loves to write both fiction and non-fiction, and in 2014 was awarded a AAAS Mass Media Fellowship to work for National Public Radio. She has also served as a lead organizer for the Communicating Science Conferences, a series of science communication workshops for graduate students.