July 24, 2024

Markus Buehler on knowledge graphs for scientific discovery, isomorphic mappings, hypothesis generation, and graph reasoning (AC Ep54)

“If you read 1,000 papers and build a powerful representation, humans can interrogate, mine, ask questions, and even get the system to generate new hypotheses.”

– Markus Buehler

About Markus Buehler

Markus Buehler is the Jerry McAfee (1940) Professor in Engineering at the Massachusetts Institute of Technology (MIT) and Principal Investigator of MIT's Laboratory for Atomistic and Molecular Mechanics (LAMM). He has published over 450 articles with almost 50,000 citations and serves on the editorial boards of numerous journals, including PLoS ONE and Nanotechnology. His many awards include the Presidential Early Career Award for Scientists and Engineers (PECASE) and the National Science Foundation CAREER Award. In addition, he is a composer and has worked on two-way translation between material structure and music.

Wikipedia Profile: Markus J. Buehler
Google Scholar Page: Markus J. Buehler
LinkedIn: Markus J. Buehler
MIT Page: Markus J. Buehler

What you will learn

  • Accelerating scientific discovery with generative knowledge extraction
  • Understanding ontological knowledge graphs and their creation
  • Transforming information into knowledge through AI systems
  • The significance of ontological representations in various domains
  • Visualizing knowledge graphs for human interpretation
  • Utilizing isomorphic mapping to connect disparate concepts
  • Enhancing human-AI collaboration for faster scientific breakthroughs

Episode Resources

Transcript

Ross Dawson: Marcus, it is fantastic to have you on the show.

Markus Buehler: Thanks for having me.

Ross: So you sent me a paper titled Accelerating Scientific Discovery with Generative Knowledge Extraction, Graph-Based Representation, and Multimodal Intelligent Graph Reasoning, and it totally blew my mind. So I want to use this opportunity to unpack it to a degree. It's an 85-page paper, so obviously I won't be able to get into all the detail, but I want to unpack the concepts, because I think they're extraordinarily relevant, not just for accelerating scientific discovery, but across almost any thinking domain. It's very rich and very promising, and very close to my interests. So let's start off: essentially, you've taken a thousand papers and been able to distill those into ontological knowledge graphs. Could you please explain what ontological knowledge graphs are and how they are created?

Markus: Sure, yeah. The idea behind this sort of graph representation is really changing information into knowledge. What that means is that we're trying to take bits and pieces of information — concept A, concept B: a flower, a composite, a car — and in these graph representations we try to connect them, to understand how a car, a flower, and a composite are related. Traditionally, we would create these knowledge graphs manually: essentially, we would create categories of what kinds of items we want to describe and what the relationships might be, and then we would build those relationships into a graph representation by hand. We've done this for a couple of decades, actually; I think the first paper was 10 to 20 years ago. Back in the day, we did this manually — understanding a certain scientific area, we would build graph representations of the knowledge that connect information, to understand structurally what's going on.

And now, of course, in the paper — and we'll probably talk more about this — we have been able to do this using generative AI technologies. This allows us to, as you said, build these knowledge graphs for a thousand papers or more, and do it in an automatic way. We don't have to manually read the papers, understand them, and then build the knowledge graph; we can have AI systems build these graphs for us. And this, of course, is a whole different level of scale that we can now access.
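To make this concrete, here is a minimal sketch of what such automated extraction might look like, assuming a hypothetical chat() helper that wraps whatever LLM is used; the prompt and function names are illustrative, not the paper's actual code:

```python
# Minimal sketch: extract concept triples from a chunk of paper text with an
# LLM, then accumulate them into a growing knowledge graph (networkx).
import json
import networkx as nx

def chat(prompt: str) -> str:
    """Stand-in for whatever LLM API is used (OpenAI, Claude, a local model)."""
    raise NotImplementedError

TRIPLE_PROMPT = (
    "Read the text below. Return JSON: a list of [subject, relation, object] "
    "triples naming the key concepts and how they relate, using short noun "
    "phrases.\n\nText: {chunk}"
)

def extract_triples(chunk: str) -> list[tuple[str, str, str]]:
    # Ask the model for structured output and parse it.
    raw = chat(TRIPLE_PROMPT.format(chunk=chunk))
    return [tuple(t) for t in json.loads(raw)]

def build_graph(chunks: list[str]) -> nx.DiGraph:
    # One edge per extracted relationship; repeated concepts become hubs.
    g = nx.DiGraph()
    for chunk in chunks:
        for subj, rel, obj in extract_triples(chunk):
            g.add_edge(subj.lower().strip(), obj.lower().strip(), relation=rel)
    return g
```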

Ross: So there is an important word there, ontological. So what’s the importance of that?

Markus: Yeah, so when we think about concepts — let's say we look at biological materials — a lot of them are made from proteins. Proteins are made of amino acids, and there are certain rules by which you put amino acids together, which in turn are encoded by DNA. Depending on the pattern you have in the DNA, and then in the protein sequence, you're going to get different protein structures, which have different functions. Certain types of sequences will give you a helical structure like a telephone cord; others will give you a brick-like, sheet-type configuration. What I've just talked about are really relationships between concepts — amino acids, proteins, and their properties. These are ontologies that basically describe how things work. And the things, as you alluded to, could be anything: scientific concepts, business concepts, social science concepts — really a lot of different types of things.

So these ontological representations allow us to understand how these different common building blocks are related, and how these relationships ultimately lead to certain properties. In these models, what we are really trying to understand is how properties emerge from the building blocks and their relationships. For example, looking at biological materials and proteins, there are certain patterns that are really important — maybe a single mutation in a protein. We have all heard of genetic diseases: a single point mutation in a protein made of thousands of amino acids can create a disease, whereas a million other mutations do not create a disease at all, right? Those are the kinds of things we're trying to understand. And usually in science we like to build models of such things: we'd like to understand what's important, what we can ignore, what kinds of relationships are really critical, and how to model them. So in an ontological representation, we look at data and try to build a representation of how the system works. These things can be pretty complex for real-life systems, because there are a lot of nuances in how the world ultimately works.

Ross: So these are conceptual relationships. And in this case, they are distilled from the text of the papers into an essentially emergent structure that shows the relationships between these concepts.

Markus: Yes. Let's say you read a paper as a human. You can look at the paper as a collection of words, and when you read it, you make sense of the words, the sentences, the abstract, the paragraphs. What you do in your brain, essentially, is build relationships — you build an ontological concept map of what's going on. You have information, which is the words and the sentences, and the way you understand the paper is by creating a representation in your mind of how the words relate in a sentence, how the sentences relate, and how individual words across multiple sentences relate. And you understand: "Oh okay, there's a specific nuance in the way this word is used, which really allows me to understand the entire abstract or the entire paper" — there's a detail that matters, just like in the amino acid sequences. Those relationships really are what you take in when you read. And AI systems actually work in a similar way, especially transformer-based architectures: they quite literally take information, which is tokens, and internally build a representation map of how these tokens are related — which is a graph representation. That is what we call knowledge: knowledge is how information is related.

There's a reference in the paper to discussions of similar ideas from well before AI, about how information is certainly important, but knowledge is really the relationship between pieces of information. That is what we try to distill out in science: to understand the relationships and then translate them across different areas — which we'll probably talk about later on as well, using isomorphisms, but I won't go there quite yet. So information to knowledge is really what we do as humans and what AI systems do internally. We're building these knowledge graph representations from scientific papers, text, patents — information out there — to make it accessible to a human.

When you think about a large language model, it builds graphs internally in what we call the embedding space, or hidden space, which we do not understand: it's very high-dimensional vectors and tensors, all related, in which graphs are being constructed. But if I look at it, or you look at it, we have no clue what the model is really thinking internally. Building knowledge graphs using human language, or numbers and numerical symbols, allows us to actually trace knowledge and information in a way humans can understand — it's at the same level of abstraction as a scientific paper. Except, if you read a paper as a human, you take a couple of hours to read and understand it. If you want to read a thousand papers — I would say by the time you read the tenth paper, you have probably already forgotten the first one, right? So if we can automate this process and build these connections between bits and pieces of information in the papers, we can suddenly have AI systems help us actually read a thousand papers and build a really powerful representation, which humans can then interrogate: mine it, ask questions about it, get the AI system to give us new hypotheses, or anticipate behaviors, and so on. There are lots of different things we can do once we have these knowledge graphs, and again, they are human-readable. We can ask the AI system questions, it can tell us the answer, and we can actually trace how the model thought about giving the answer. That is a very powerful form of interpretability, which is often quite important, especially for human-AI collaboration.

Ross: Yep, absolutely. So one of the ways in which these are accessible is through visualizations of the knowledge graphs. These basically collapse the intense multi-dimensionality of the embedding structure — the concepts and their relationships from the AI's mapping — down to two dimensions, for a visual that humans can see. So without going into too much depth: what is the process of pulling that into a visualization that can be useful for inspiring people, or helping them understand relationships between concepts?

Markus: Yeah, that's a very good question. Maybe I can walk through a little bit of the process by which these are constructed from the original raw data. Our data here are scientific papers — about a thousand of them, as you said; we've done it for more than that now. It's really a limitation of the computer you have, but let's say you have a thousand papers, like we did in the paper. We have AI systems read these papers and do a distillation process, in which we ask the AI system not only to read each paper but to extract useful information: What's in the paper? What are the properties, the key findings, the numerical insights, the quantitative results, the design decisions the scientists have made, and so on?

So we get a very detailed, nuanced summary of the paper in chunks of text. We do this for chunks — we don't look at the entire paper at once, but at sections of the paper, and for each section we get this very detailed, distilled understanding of what's being described. The reason we do this is quite similar to what people do in chain-of-thought reasoning, multi-step reasoning, or ReAct reasoning, where you don't just ask your AI model a single question, but deliberately ask multiple questions: give me this answer, critique this answer, give me another angle, summarize it this way, and so on. This gives a very detailed description of what is in the paper. And that forms the information — what's in each section of the paper.

Now we build graphs of each section. These are called local graphs. They basically say: take this distilled summary and build an ontological representation — tell me what the key concepts are and what the relationships between them are. They might be a protein and an amino acid, right? So the paper might talk about how amino acids are combined to form proteins, that these proteins are very strong, and that because of this strength they have been used to build a composite, which the authors talk about maybe applying in an airplane or in a coating. That's where that graph ends. The next section might talk about the synthesis process. It might say: I'm making these proteins out of these amino acids, but I'm using this particular chemical process to make them, and this organism to grow the individual proteins and then purify them, with this machinery, this chemical process, and so on. So you go through the different steps of how to make it. And the third part of the paper might talk about the theory behind it — maybe the authors have developed a model of how this can be done, maybe a molecular simulation, and so on. These are all separate graphs. Ontological graphs describe concepts and relationships, like we talked about earlier. And we do this for all thousand papers, and for all the sections in those thousand papers.

So now there are tens of thousands, hundreds of thousands of these small graphs in storage. What I do next is combine them, because there are overlapping relationships — there are transitive properties here. If one section talks about a protein and how it's made, another section talks about the protein and how it's applied, and another talks about the protein and maybe its toxicity or biomedical applications, then when you connect these in a graph, the nodes are going to be connected in many different ways. The node "protein" might be connected to manufacturing and all its processes; there might be a connection between protein and toxicity; and so on. You can see this graph being constructed: similar nodes are combined into a single node. And you mentioned embeddings — we can use natural language processing to handle terminology. In one section it's called "protein," in another section it's "proteins," and in another it might be called "amino acid groups" or something — I'm making up some of the ways scientists describe things. Embedding models allow us to recognize that these are actually identical or similar entities. So we group them together: instead of calling them protein, or proteins, or groups of amino acids — they're actually the same thing — they're all called "protein."

This step allows us to combine different concepts that would have been different nodes into one single node. That helps make the graph simpler, more compact, and more accurate, because different scientists, different groups, different people are going to use different terminologies, and embedding models allow us to combine them. We do this combination process for every single small graph and merge them into one gigantic graph. And because these concepts go through this process of distilling information into similar terms and combining them, there's a consistency here, and you create very deep connectivity across things. Proteins occur in hundreds of papers, manufacturing occurs in hundreds of papers, and they have different relationships — they're not just connected in one way, they're connected in many different ways. Manufacturing might relate to a chemical process A, B, or C; that process might be used in a lot of different contexts; it might be used to make proteins. But the same chemical process might also be used to make a polymer coating or a paint. When you read the individual papers, you don't see that, because you just read each paper on its own. But because the AI system has read these thousand papers, created these graphs, and connected them, when you look at the whole graph you can actually see the connections. And so, you're asking about visualization — this graph is obviously very big; you can't really look at it on a screen, because of the tens of thousands of nodes, connections, and edges.
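Here is a minimal sketch of that merging step, assuming the graph lives in networkx and using a sentence-transformers embedding model; the model choice and similarity threshold are illustrative assumptions:

```python
# Minimal sketch of the merging step: embed node labels, fold together labels
# whose embeddings are nearly identical ("protein", "proteins", ...), and
# relink every edge onto the canonical node.
import networkx as nx
from sentence_transformers import SentenceTransformer, util

def merge_similar_nodes(g: nx.DiGraph, threshold: float = 0.9) -> nx.DiGraph:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    labels = list(g.nodes)
    emb = model.encode(labels, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)  # pairwise cosine similarity matrix

    canonical: dict[str, str] = {}
    for i, a in enumerate(labels):
        if a in canonical:
            continue  # already folded into an earlier label
        canonical[a] = a
        for j in range(i + 1, len(labels)):
            b = labels[j]
            if b not in canonical and float(sim[i, j]) >= threshold:
                canonical[b] = a  # treat b as the same concept as a

    merged = nx.DiGraph()
    for u, v, data in g.edges(data=True):
        merged.add_edge(canonical[u], canonical[v], **data)
    return merged
```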

So what we do is usually look at a subgraph. Let's say we want to look at one concept, like graphene, or protein, or health, or some manufacturing process. We can click on that node and see what its neighbors are. So we create a subgraph representation — say, one node and all its first, second, and third nearest neighbors — and that gives us a graph we can fit on a screen. Or I can look at two concepts. I can ask: I have a concept like graphene and a concept like sustainability — or graphene and music, something totally, weirdly different — can you identify a connection between them? We can have an algorithm look for the path that connects graphene with music, or graphene with sustainability, or graphene with whatever you pick. Of course, if there is no connection, the model will tell us there's no connection, but we can use the embedding model again and say: if you don't find "music" in the graph, find me the closest node that relates to music. Then we'll find that node and actually identify a path between them.
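A minimal sketch of these two query modes, again assuming a networkx graph; the nearest_node fallback stands in for the embedding lookup just described:

```python
# Sketch of the two query modes: a k-hop neighborhood around one concept, and
# the shortest path between two concepts, with an optional embedding-based
# fallback when a term is not literally in the graph.
import networkx as nx

def neighborhood(g: nx.Graph, node: str, hops: int = 3) -> nx.Graph:
    """Subgraph of `node` plus everything within `hops` edges of it."""
    return nx.ego_graph(g, node, radius=hops)

def concept_path(g: nx.Graph, source: str, target: str,
                 nearest_node=None) -> list[str]:
    """Shortest path between two concepts. `nearest_node(g, term)` is an
    optional embedding lookup used when a term is missing from the graph."""
    if source not in g and nearest_node is not None:
        source = nearest_node(g, source)
    if target not in g and nearest_node is not None:
        target = nearest_node(g, target)
    return nx.shortest_path(g, source, target)  # raises if no path exists
```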

Ross: I was just going to say, this takes us to the concept of isomorphic mapping — taking areas of the graph that are structurally similar and being able to find those. So part of it, I suppose, is being able to identify where there is structural similarity; the alternative is to take one area with a particular structure and then map it against another domain.

Markus: Right, yeah, that's a great point. There are sort of two ways in which we could connect this. First I should take a step back on why we are interested in connecting different things. That's what science and technology are about: you want a solution to a problem. Let's say you want to build a very scratch-resistant paint or coating, and you want to use graphene. The question, as an engineer or scientist, is: how do I do this? Well, I want to understand how graphene is connected to scratch-resistant coatings. So you can find a path between these concepts, like I described: you basically look at the graph and find the relevant nodes. If the graph has this connection, you can build that out, and then you can read the graph, essentially, and it will tell you how to get from graphene to scratch-resistant coatings — and it tells you a lot. You can then develop technologies out of that, or you can feed it into an AI system, which we call graph reasoning: you say, instead of answering directly, look at the graph — look at the subgraph of how graphene and scratch-resistant coatings are related, look at this entire path, the relationships, and maybe even the source papers this came from — and then answer the question. This gives the model a very detailed substrate to think about.

But now to your question: what if there's no connection? What if you have different graphs that are not actually connected? For example, music and music theory versus materials science or philosophy — they might have no connections, or very few. So you really can't figure out how to get from point A to point B, because there isn't any paper that talks about the relationship directly, or even multiple papers you could use to build a path between them. That's where we use isomorphic mapping. We basically look at graph structures. Say I'm interested in how materials become resilient or tough: we identify structures in the graph that essentially describe resilience in materials. Those are groups of structures that talk about how this happens mechanistically — you have to build a composite, you have to build a particular pattern, you have to do a certain type of processing step. That gives us a graph structure that tells us how to achieve this in engineering, with materials. And now we ask: can you find a similar kind of structure in music? Either find an equivalent of a topic you might be interested in, or just look at the entire graph and see what has similar patterns.

So in the paper, we've done this as an experiment. I have a graph structure that I've identified in materials; can you identify similar structures — or identical structures, since if we truly do isomorphic mapping, it has to be topologically identical — in the music space? Then we can look at them and see: here's the same graph topology in materials science and in music. If you look just at the graph — the nodes and the edges — they look exactly the same. What's different is what's within the nodes and the edges. In the materials domain, the nodes and edges include things like atoms, microstructure, forces and stresses, and so on. In music, they're going to talk about scales, tonality, maybe the composer's name — musical concepts, essentially. So what you can now do is say: I have an isomorphic mapping, the same graph structure in materials and in music.

Which we have discovered — yes, exactly — discovered from the data. And this can all be done algorithmically, through computational processes. So now what I can do is look at what's inside the nodes and the edges, and that's really the interesting part. It's not just that there's a similar structure, which is already interesting; I can actually ask: what do the nodes and the edges mean in materials and in music, and how do these pictures relate to one another? Of course, we can analyze this as humans, or we can give these two graphs to an AI system and ask, like we've done in the paper: look at these two graphs, these isomorphic mappings between these concepts; interpret this for me; tell me how the relationships could be explained — and do it in a table, so we give some structure to the thinking. We basically ask the model, in multiple steps, first to make a table of the nodes and the edges. Then we say: add a column to the table that describes the relationship, and maybe add another column with an interpretation of the relationship — how would I understand these graphs, what they mean in materials and in music, and what the translation means? And this is something that can be fully automated using these AI systems.
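As a minimal sketch of the isomorphism search, networkx's subgraph matcher can play the role of the algorithm described here; the function and variable names are illustrative:

```python
# Sketch of the isomorphism search: test whether a motif from the materials
# graph appears, topologically identical, inside the music graph, and if so
# return the node-to-node correspondence for an LLM (or human) to interpret.
import networkx as nx
from networkx.algorithms import isomorphism

def find_isomorphic_mapping(materials_motif: nx.Graph,
                            music_graph: nx.Graph) -> dict | None:
    """Return {materials_node: music_node} if the motif occurs as an induced
    subgraph of the music graph, else None."""
    matcher = isomorphism.GraphMatcher(music_graph, materials_motif)
    if matcher.subgraph_is_isomorphic():
        # matcher.mapping maps music nodes -> motif nodes; invert it
        return {motif: music for music, motif in matcher.mapping.items()}
    return None
```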

Ross: So in the paper you used Beethoven's Ninth, mapping that against some material structures, I believe. In doing that exercise, what was most striking to you? What did you see from that mapping of Beethoven's Ninth to the materials domain that surprised you or gave you insight?

Markus: Yes. First of all, as I mentioned earlier, we had looked at these types of relationships, especially between music and materials, in previous work, with more pen-and-paper methods — analyzing the structural hierarchies and the relationships. And that's very human-biased: you basically do the analysis based on what you anticipate or know or understand. That's limiting, especially across multiple domains, so we'd like to have an automatic process. So the first surprising thing was that we found very similar graph structures that actually could be isomorphically mapped between the music graph and the materials graph. When we saw those structures — and you can see this in the paper, and probably pull it up in the video — the graphs really do look exactly identical topologically. That's exactly what the algorithm is asked to do: discover these isomorphic mappings. Then you can look at what's in the nodes and the edges and ask how they could be related. There's a table in the paper — I'm actually trying to pull it up right now to go through some of the examples. I'm looking at Figure 8 in the paper, for those of you who want to follow along; it shows the two graph structures. You can see that they are topologically identical, meaning they can be isomorphically mapped: every node and every edge can be connected to the corresponding one in the other graph, in the other domain.

Then you can look at the specific content in the nodes. In the materials world, the nodes have things like adhesive force, beam, failure, characteristic length dimensions, structural features, buckling behavior, and so on. In the music world, we have things like tonality, the composer — Beethoven — F major, C major, different scales, and so on. And the question we then looked at — and this was the surprising part — was: this is great, and I can interpret this, but my interpretation is going to be biased by my human understanding of the system. So why don't I let an AI system look at these nodes and edge labels in the two different systems and explain how they're related? In the paper you can read through the specific analysis, but the AI system will actually try to explain how they could potentially be connected. I found that quite surprising — first of all, that the model could give us an answer at all. Of course, the model was very eager to answer; generally, that's what they like to do.

But the answers were actually quite rich in the way they understood the topics. That's part of what transformer models do very well: they are very good at connecting and translating insights and ideas from one domain to another. For example, if you say, write me a poem about spider silk in the style of Shakespeare, it will do a pretty good job. Similarly, here we're giving an example of how things look in one domain and in another, via the graph structure, and the model is tasked with explaining these relationships — they're very good at interpolating between different domains. Now, the graph is really critical here, because the graph gives the model something to think about. Without the graph, if I asked the model directly, tell me the relationship between music and materials, it would give me an answer, but it's not very rich, not very nuanced — it's going to be a more generic answer. If you give it the graph as a substrate to think about — and, as we talked about earlier, a lot of work goes into building this graph using the AI tools — the model has a much deeper substrate to think about, and the answer is actually much more intelligent. I think that was really the surprising part: yes, we can automate this; it creates these really interesting isomorphic mappings; and it can actually tell us something about these mappings between the domains that is free of the bias I have — because I might be an expert in one field but not the other, or more familiar with one or the other, and so my answer would be biased by whatever I know, as would anyone else's.

Ross: So it's a massive topic, but just to begin to answer it: what's the value for the scientist — the materials scientist — who says, okay, I'm going to map Beethoven's Ninth against this, and gets these nice illustrations of the relationships between the concepts? What do the scientists do with that?

Markus: Yes. Well, one thing we're quite interested in in science is expanding the horizon of what we can build, understand, or hypothesize. So one clear use is: if I understand how these are related, I can potentially get a new hypothesis out of it. I can say, if this is what resilience looks like in music, I can look at a different part of the musical graph and say: use the analogy you've already developed, but now go to the next step. To give a specific example: take Beethoven's symphony, look at a part of it that is particularly interesting musically, and say, here's the part of the graph that describes the construction of this particular part of the music — can you tell me how I could use this as a material design principle? What would that mean? The AI model, because it does in-context learning, understands the previous graphs and the previous answers — it's like a chat interaction — so it will then say: here's my extension of what this graph would look like if I took this new part of the music graph and applied it to materials. This is where the novelty comes in, because this new part of the music did not yet have an analogy in materials — that's something I'm asking the model to create for me.

So now that creates a new graph — a hypothetical graph that does not yet exist in materials and engineering science, but exists in music. I can use this understanding of how relationships work to move into that zone. This could actually be done mathematically as well: as we discussed earlier, embeddings give us a way to understand what things mean in the abstraction of a vector space, and I can look at graph representations and at how relationships in a graph correspond to changes in the vector. If you imagine going from one node to another node, there's a relationship, and that can be expressed as a vector in this high-dimensional embedding space. So I can formalize this mathematically: if I have an extension of the graph in the way music has, every connection between one node and another is a vector, and if a node does not exist in materials or engineering, I can still compute what it would be. I can solve the inverse problem and say: this node does not yet exist in engineering and science, but it exists in music — what would it be if it were there in materials? And so you get a new node in the graph that isn't there yet — and it's not absent because it can't exist fundamentally; it's just that we haven't discovered it yet. So you're beginning to discover new relationships in materials that are totally inspired by music. You use this isomorphism as a foundation, a footing: here's a real connection, I understand what the relationships look like, and there's a mathematical basis for it. And now I can do what we'd call a Taylor expansion, a series expansion — or an extrapolation, if you wish. If I'm comfortable with this connection, if it's rigorous and mathematically sound, I can extrapolate from it and go a little beyond the edges: build new nodes, build new connections, and see what that would look like. So you can make new scientific discoveries, or do a lot of other things.

And the beautiful thing about this — and I'll stop in a second — is that you can do it using mathematical methods like embeddings, but you can also do it with human language. That's the beauty of language models: they provide the connection between a very abstract representation of information and knowledge, and something we can talk about. Instead of formulating the math, writing down all the equations, building the embedding problem, and solving the inverse problem, I can simply ask the question: tell me what the extension would be if I followed the same topology as in music but built new nodes. That's the interesting part, and it makes this much more tractable and flexible, because you and I, interacting with this AI model, can be very creative. The human mind can ask questions in our own language — we don't have to use just math; we can look at it, we can see it. But we can also use math. So it's the best of both worlds, I think.
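A minimal sketch of that inverse problem in embedding space; the dictionaries of node embeddings, the anchor node, and the threshold are all illustrative assumptions:

```python
# Sketch of the "inverse problem": treat an edge as a displacement vector
# between node embeddings, transplant that displacement from the music graph
# onto a materials node, and look for the nearest known materials concept --
# or flag a gap, a candidate for a not-yet-discovered node.
import numpy as np

def propose_node(music_emb: dict[str, np.ndarray],
                 mat_emb: dict[str, np.ndarray],
                 music_edge: tuple[str, str],
                 materials_anchor: str,
                 gap_threshold: float = 0.5):
    a, b = music_edge
    offset = music_emb[b] - music_emb[a]          # the relation as a vector
    target = mat_emb[materials_anchor] + offset   # where the new node "should" sit

    def cos(u: np.ndarray, v: np.ndarray) -> float:
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    best = max(mat_emb, key=lambda n: cos(mat_emb[n], target))
    if cos(mat_emb[best], target) < gap_threshold:
        return None, target  # nothing close exists: a hypothesis to explore
    return best, target
```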

Ross: So in a way, another phrase we could use is idea space. We think of an idea space, and by doing these mappings we can start to find the parts of the idea space that are not yet represented — and, as you say, we can actually discover what is in those idea spaces, found by, for example, these isomorphic mappings. So in a structured way we can pull out some of the latent ideas and innovations, discovered as structures. This comes back to, I suppose, extending my previous questions around humans plus AI. Clearly your work is not trying to replace scientists; it is an amplification of scientists and how they work — among other things, through effective ideation and hypothesis generation. So — and this is a massive question, and I think an area for a lot more research — what are some of the configurations of human scientists and the kinds of AI structures you are building? How do they work together to accelerate scientific discovery as fast as possible? And what capabilities do scientists need to use these models and structures effectively, to discover things faster?

Markus: Yes, great question. It's really inspired by our MIT campus here, which is a connected campus, as we call it. One of the reasons MIT has been so successful in science and engineering innovation is that we have a lot of connections — random connections that can form. There's a walkway we call the Infinite Corridor, the main artery of connections here; all the departments — math, chemistry, engineering — and all the buildings are connected, physically and socially. If I walk through the Infinite Corridor, I might run into scientists X, Y, and Z, and we might go to the coffee shop and have a conversation; our students might meet. Those random connections spur a lot of stimulation, discussion, and discovery. That connection is really what makes discovery and innovation possible: you go and have a crazy idea, and you explore it. So the idea behind all of this work is: how can we formalize this and supercharge it? Instead of just us humans — a thousand faculty, ten thousand students — what if we had AI that could help us make these connections faster, with more data? That drove this, a little bit. So, let's say you have a new hypothesis, a crazy new idea, generated by analyzing musical structures and extrapolating into materials. Maybe the AI system tells me I could make this amazing electrically conductive spider-silk-based fiber that might be used in a new computer chip. It's a weird idea, but I can dig deeper into it. So we have a human-AI collaboration coming up with this interesting new idea, this new computer chip design or new material. And now I have to make it, right? I'm going to go to the lab and try to build this material. This is really important, because the AI system is extrapolating, thinking about what might be possible, but we now need grounding — in physics, in experimentation, or in other theories. Part of that can be automated as well: I can use multi-agent AI, which we've been using quite heavily in some of our work for that purpose. You can say: create a new idea, a new design — but now test it out. Run a simulation, write some code, do an experiment, or check the literature to see whether this has been explored before. Those are the steps we would typically take as humans, as scientists — my students and I would do this — and we can automate part of it, but at some point, at least today, we're going to have to supervise the process, and probably still go to the lab ourselves, make the material, and get feedback.

One of the interesting things now is this: say you begin to build the material, as we're doing right now for one of the materials designed in the paper, a mycelium composite. We're going into the lab and building it, and my student is actually following the recipe the AI generated. We're identifying the shortcomings — some things are missing; the AI system has not really thought about every single part of it. So we either go back to the AI and say: I've tried this, but I'm missing a temperature, I'm missing a processing step — tell me what I should do — or we decide on our own. That's a decision point: am I going to do conventional science on my own in the lab? Am I going to involve the AI again? Or do I bring in the AI at the very end, when I have made the material, figured out the differences, and can say: here's my design, here's a picture of it? You can decide what you want to do, but typically you need feedback from the world. And this is part of what we think generative AI needs to do: build better world models. They need to understand much more about physics, especially as they extrapolate. This is similar to what science conventionally does, of course. If I'm not using AI at all — I'm sitting here with my computer on my lap and I come up with a new idea or a new product — I'm going to try to build it, do an analysis of the market, of feasibility, of cost, and collect new data.

We're doing the same thing here, except that the ideas come from AI and the process comes from AI — but it can follow a similar path. And the good thing is that certain steps along the way are going to be much faster. For example, a few years ago, if I had a new protein design, I did not know what it looked like; now, with AlphaFold, I can pretty quickly get an idea of what the protein will look like. I can then run a molecular dynamics simulation of the protein, and actually make it in a lab using high-throughput processing. There are a lot of advances coming together that are individually small, but when you put them together in this whole ecosystem of generative AI and high-throughput science and analysis, they become extremely powerful. And this is part of what the graph structure is about, too — we started with the graph structure, and it's about the connections. Science is all about connections, right? You get an idea, you bounce it off physics and say it's possible — or, as an engineer, it looks impossible, but maybe I can find a way of making it work somehow. Then you figure out: if I tweak the conditions, I can actually achieve the goal. You get this feedback and you bounce ideas around. You can do this manually, or you can automate it. Multi-agent AI is how we're doing this in a lot of our work: trying to automate this process beyond the ideation, beyond the discovery we talked about in this paper. We take those outputs and put them into agentic AI with very deep capabilities, from generative modeling to physics to experimentation, perhaps even robotic experimentation. I think that's how we can ultimately really accelerate science — which the title of the paper alludes to: accelerating how we make discoveries, how we can test whether discoveries or hypotheses hold, and how they can be falsified or verified. And all of this provides new information, which needs to be connected, like in a graph, to other pieces of information — and that provides knowledge.
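A minimal sketch of such a propose-critique-test loop, reusing the chat() stand-in from earlier; the agent roles, prompts, and loop are illustrative, not the paper's actual multi-agent framework:

```python
# Sketch of a propose/critique/test loop: each "agent" is a role-specific
# prompt around the same LLM stand-in, and the proposal is refined in rounds.
def chat(system: str, message: str) -> str:
    """Stand-in for an LLM call with a system role."""
    raise NotImplementedError

def hypothesis_loop(graph_context: str, rounds: int = 3) -> str:
    idea = chat("You are a materials scientist.",
                f"Given this knowledge-graph path, propose a new material:\n"
                f"{graph_context}")
    for _ in range(rounds):
        critique = chat("You are a skeptical reviewer.",
                        f"List physical or practical flaws in:\n{idea}")
        plan = chat("You are an experimentalist.",
                    f"Sketch a simulation or lab test for:\n{idea}")
        idea = chat("You are a materials scientist.",
                    f"Revise the proposal.\nProposal: {idea}\n"
                    f"Critique: {critique}\nTest plan: {plan}")
    return idea  # a human still vets this before anyone goes to the lab
```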

And ultimately, this gives us theories. I want to make one more point about isomorphisms. Isomorphisms are actually very deep in their meaning, because they are ultimately the way we can understand generalization. Knowledge might be isolated in one domain or another; you find these graphs very rich, but they're not connected across domains — there's no literal connection between music and economics and materials, perhaps. Isomorphisms, however, provide a structural substrate to say: here's something that's universal. In conventional science we call this a foundational theory — relativity, say, or quantum mechanics. We're far away from discovering those with AI, of course, but maybe one day we can get there. Generically, those kinds of theories are universally applicable, and these kinds of things, we believe, we can formalize mathematically using isomorphisms. That framework, I think, is going to lead us into a future where we can make scientific discoveries in the sense of generalizable knowledge — knowledge that is not just true in one domain but true in many different domains. And that's ultimately what science is about: figuring out how to discover something that's true in many different settings and unifies different phenomena across many different areas of observation.

Ross: So not ambitious at all.

Markus: Well, yes, it is ambitious. But you have to envision where you want to go, of course, and it's going to be a long road to get there. The vision, though, is very clear.

Ross: Yeah. Well, as we sort of round out: there are not as many as I would like, but there are still quite a few others like yourself looking at generative AI for scientific advancement at this conceptual, structural level. Things like AlphaFold and so on are very specific; here we're looking at AI tools that assist us with, for example, ideation, hypothesis generation, coming up with novel methodologies — a whole array of cognitive roles for generative AI in augmented science. So what you're doing is ambitious, and there are others doing related work. I should mention as well that you have shared the code for everything you've done under an Apache 2.0 license, so this is all out there to be used. But what is it going to take for your work, and that of your colleagues around the world doing related work, to be adopted by the scientific community, so that we see the fastest possible acceleration of scientific discovery?

Markus: Yeah, good question. One thing you mentioned is open source: we open-sourced all the code and the methodology, and of course the paper has lots of detail in it as well. The hope is exactly that. It will take adaptation to specific problems. This was a methods paper, really — about the methodology, the idea, with a couple of examples of how we apply it. What it will take, and what we're doing in some of the follow-up work now, is to say: here are very specific use cases — scientific problems, engineering problems for which we have not yet found a solution. How can we use this methodology to actually find one? The proof, as in many things in life, is in solving the actual problem, doing something in the real world rather than in the abstract. So I think it will take a couple of successful cases where people have shown that this ideation, or hypothesis generation, or scientific discovery can be done and solves an actual, real problem. Ultimately, that's what people care about — not the abstract ability of AI to do this, but making lives better, creating new economic opportunities, and things like that. So it will take adaptation to many, I hope, useful use cases. And there are bottlenecks. One is compute: this is a pretty expensive computational undertaking. You have to mine these papers, and you have to have access to them — at a university, we have libraries with access, but if you work in a different space, you might or might not have access to the raw data, the information slash knowledge, for building these connections. And even if you have that, there's a lot of compute involved, so there's a limit on how fast you can do it, and how well.

We're using a lot of language models and multimodal language models to do this — and how good are they? When we build these ontological graphs, we actually let the AI systems build them from scratch, without imposing any structure on them. That's the whole point: we want to discover these structures natively from the data. Instead of us creating an ontological framework and letting the model work within it, we let the model discover it. But that's where a lot of the frontier models reach their limits. We're pushing the envelope, asking: the best AI models today — Claude 3.5, GPT-4, GPT-4o, and maybe open-source models that can be used as well — how well are they able to do this? And there are limits there. The same goes for graph reasoning: once we have the graph, how good are the models at actually understanding graph structures? So we're really pushing the limits of what today's AI models can do. Now, the good news is that models keep becoming faster — GPT-4o is much faster than GPT-4 before it — and leaner, and they're going to have better capabilities. When these models become just 10 to 20% better, with a slightly better ability to comprehend, reason, and logically connect, that improvement is going to be supercharged by the graph, because we have an emergent system of interactions among multiple pieces of information and knowledge. In multi-agent AI, we'll have multiple AIs talking to each other and communicating; if every one of them is just slightly better, the collective sum is going to be emergently much better. That's something we find as we follow the evolution of the AI systems we use as the backbone: any small advance in that field has a huge impact on the science we can do with them. So there are these bottlenecks, and opportunities. But I also want to say, for the folks working on foundational AI models, I think this is really encouraging: here's a use case. I've had lots of discussions with people who work on the really fundamental aspects of creating LLMs and multimodal LLMs, and they can see that this is an avenue where their models are going to be extremely useful.

It's also exciting for them to see, and they might benefit from the shortcomings identified today to make better models that address things like graph reasoning abilities, or the ability to handle very long contexts — context length is a limitation. So there are a couple of detailed technological issues that are bottlenecks. But there's definitely a plug-and-play nature to this, which I think is very attractive: you can substitute models — we did this in the paper with the Claude models and the OpenAI models — and compare them, and they all communicate with each other because they all work through human language, natural language. So they are compatible. That's the beauty of using natural language as the way of communicating between AI agents and humans: they can be combined. If tomorrow GPT-5 comes out, I can plug in GPT-5 and do graph reasoning with GPT-5, and presumably that's going to be much more impressive. This is, I think, how we can leapfrog a lot of shortcomings: we can plug in more capable models — or less capable models if you have a compute limitation. Say you want to run this on your phone: no problem. We've been using the Phi-3 model quite a bit — Microsoft has a very good, very small model there, with about 3.8 billion parameters, which works really well for some use cases, and I can run it on my cell phone. Speculative decoding is another avenue we've implemented in some of our inference frameworks, such as mistral.rs, where we can use models of different quality and size to do the inference step. There are a couple of practical considerations like this that are going to be useful in the future. Even if we have a PhD-level model one day, we're still going to want to run it on a phone, or on an autonomous robot in the lab that needs to understand how to do an experiment — and that robot might not be able to run GPT-5 or GPT-6. So there are definitely trade-offs, and lots of engineering involved. As you can tell from this discussion, there's a lot of engineering in making it work in real situations — and lots of really cool things for PhD students, engineers, and scientists to explore, and ultimately to create, hopefully, good products that people can use and benefit from.
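As a minimal sketch of that plug-and-play property: because everything is exchanged as natural language, swapping the backbone model is just swapping a function. The OpenAI client call is one real example backend; the prompt wording and helper names are illustrative:

```python
# Sketch of the plug-and-play point: agents exchange plain natural language,
# so the backing model reduces to a swappable function with one signature.
from typing import Callable

Backend = Callable[[str], str]  # prompt in, text out

def make_openai_backend(model: str = "gpt-4o") -> Backend:
    from openai import OpenAI
    client = OpenAI()
    def run(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    return run

def graph_reasoning(subgraph_text: str, question: str, llm: Backend) -> str:
    # The subgraph is serialized to text and handed to whichever model is
    # plugged in -- a frontier model today, a small on-device model tomorrow.
    return llm("Use only this knowledge-graph context to answer.\n"
               f"Context:\n{subgraph_text}\n\nQuestion: {question}")
```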

Ross: Yeah — and part of the point is that you're already getting extremely interesting results with a relatively easy user interface, and what you're pointing to is what it will take to get to the next generations of this. I think it's an adoption issue as much as a technological issue. But essentially, we are already on the threshold of what is potentially quite an extraordinary acceleration of scientific discovery. So thank you so much for your time, for sharing your insights so clearly, and for your incredible work. It's very exciting to see.

Markus: Thank you, Ross.
