Name: Unlocking the Promise of Foundation Models in Pathology for AI-Driven Drug Discovery & Development
Uploaded: 2024-11-06T17:32:17.011Z
Duration: 1 h 8 s
Description: Unlocking the Promise of Foundation Models in Pathology for AI-Driven Drug Discovery & Development

Transcript for "Unlocking the Promise of Foundation Models in Pathology for AI-Driven Drug Discovery & Development": On unlocking the promise of foundation models in pathology for AI-driven drug discovery and development. Thank you so much for being here. I'm Ashley Faber, director of product marketing at Proscia, and it is my pleasure to welcome you and to moderate today's webinar. We have a great presentation and discussion for you today, and we're going to get started very shortly. But first of all, I'd like to introduce our all-star speakers who are with us today. First, we have Juliana Ianni, who is the VP of AI research and development here at Proscia. She earned a PhD in biomedical engineering from Vanderbilt University, where she developed methods that enable faster and safer MRI, from fast image reconstruction techniques to predicting patient-tailored RF parameters for high-field scanners. At Proscia, Juliana leads a team of engineers and scientists developing AI systems to help laboratories and research organizations realize the transformative power of digital pathology. Welcome, Juliana. Also here from Proscia is our senior AI scientist, Corey Chivers. Corey holds a PhD in computational biology from McGill University and has over 10 years of experience developing, deploying, and evaluating safe and effective AI for medical applications. Corey has over 30 peer-reviewed publications and really embodies the spirit of the scientist-engineer, with a relentless focus on creating proven solutions that measurably improve the lives of patients and clinicians alike. Welcome, Corey. And we're thrilled to be joined by Zelda Mariet, who is a cofounder and principal research scientist at Bioptimus. Zelda obtained her PhD in computer science at MIT, developing mathematical models of diversity for machine learning robustness and reliability.
She then joined Google DeepMind, where she also applied her work on robustness to protein engineering, targeting COVID-19 treatments in particular. Since earlier this year, Zelda has been co-leading the scientific efforts of Bioptimus to build multiscale foundation models of biology. Welcome, Zelda. Alright. So let's jump into an overview of what we'll cover today. First, Juliana will walk through the growth of computational pathology in recent years and how it's driving breakthroughs in precision medicine. Then Zelda will speak to the power of foundation models and how they are advancing computational pathology, as well as provide a deep dive into the Bioptimus pathology foundation model, H-optimus-0. Then Corey will discuss some practical strategies for leveraging foundation models and provide a live demonstration of Proscia's Concentriq Embeddings solution. And just a quick word on Concentriq Embeddings: it brings a collection of foundation models, including H-optimus-0, to the fingertips of data scientists and developers using our Concentriq software platform, where an enterprise's pathology data is stored, enriched, and analyzed from discovery to clinical trial. We will then spend a few minutes discussing real-world use cases where data science and AI teams are using foundation models for pathology AI development. And we will leave time at the end for some Q&A. So please enter your questions in the Q&A box throughout the presentation, and we will try to get to all of them at the end. I'm going to turn things over to our speakers in just a few moments, but first, I wanted to provide some quick background on Proscia for those who are not familiar with us. Our platform, data, and AI solutions are fueling the development and use of novel therapies and diagnostics. Our Concentriq enterprise pathology platform supports pathology everywhere it is practiced, from discovery to diagnostics, from biopharmas and CROs to hospitals and reference labs.
And wherever it is utilized, it serves as a central hub powering scientific, AI-enabled workflows, unifying people, data, and applications. The Concentriq platform consists of two product lines. Concentriq LS, which is a new name for our Concentriq for Research product, powers life sciences R&D and is used by 14 of the top 20 pharmaceutical companies. And our Concentriq AP and Concentriq AP-Dx products are used by leading hospitals, health systems, and reference labs in the clinical space, with the latter receiving FDA 510(k) clearance for primary diagnosis. And with that, I'll hand things over to Juliana. Alright. Thanks, Ashley. Before we jump into the power of foundation models and practical strategies to leverage them for pathology AI development, I thought it would be helpful to take a step back and provide some background on why all of this AI development really matters. What's been abundantly clear in recent years is that life sciences organizations, like so many businesses, have made massive efforts to leverage data and AI-driven strategies that make every stage of drug discovery and development faster, smarter, and more efficient. We've all heard this to some degree. Right? What's not as well recognized is the huge role that computational pathology is playing within these strategies. Computational pathology has proven to be one of the most promising avenues to unlock value, with the ability to analyze hundreds of thousands of cells per slide with unprecedented detail and speed. This not only enhances R&D processes, but, as senior leadership at AstraZeneca, for example, has noted, it underpins the development of the next generation of cancer therapies and diagnostics. Computational pathology accomplishes everything life sciences organizations are going after when implementing AI strategies. It improves R&D efficiency. It supports decision making. It fuels scientific innovation, and ultimately, it brings new medicines to patients faster.
And what we're seeing is that life sciences organizations are developing their own novel pathology AI models to drive these R&D efficiencies. By building these models internally, drug developers can expedite AI development cycles, retain proprietary data, and strengthen their competitive advantage. And they can do this because most life sciences companies now have the data science and AI resources to actually see these strategies through. I'll mention a few examples. GSK is developing proprietary algorithms to capture more quantitative information from digitized slides of cancerous tissue, with the goal of gaining insights that improve their probability of success in clinical development. Another example: in their recent AI Day, BioNTech talked about histology AI as one of 5 core strategies for creating an AI-first immunotherapy platform. AstraZeneca has also talked about their ability to cut pathologists' image analysis time by over 30% with their proprietary AI. It's clear that these novel algorithms are driving efficiencies and supporting decision making in therapeutic R&D. Now, beyond driving workflow efficiencies, the rise of precision medicine approaches and the rapid growth of next-generation therapeutics have elevated computational pathology to a strategic enabler. What you see here is all the genomic biomarker-driven drug approvals in the US and EU between April 2019 and April 2021 alone. This is just an example of the rapid growth in this area. Patients need timely access to these advanced diagnostic tools for these treatments to actually reach and benefit them. And now, with the explosive growth of antibody-drug conjugates, or ADCs, and other next-generation therapies, and there are over 100 of these in clinical trials today, diagnostics are all the more relevant. That's because, while highly efficacious, ADCs bear high toxicity potential if used in patients who don't express the related biomarker.
So it's more important than ever that biomarker status is determined quickly and with a high degree of accuracy. And this is why computational pathology is really driving a paradigm shift in how precision medicines are delivered and facilitating the development of next-gen therapies. For example, AstraZeneca highlights computational pathology as one of 3 pillars core to their oncology R&D strategy. And they've put it into practice with their announcement that a computational pathology-based biomarker was successfully used in clinical trials to predict whether patients with non-small cell lung cancer would benefit from a treatment co-developed with Daiichi. And, announced earlier this month, they are working with our friends at Owkin to develop an algorithm that processes digitized slides to detect a biomarker to prescreen breast cancer patients. Also, this past June, Johnson and Johnson Innovation published their research on AI that detects a bladder cancer biomarker using digitized H&E tissue samples. They were able to demonstrate the economic impact of AI-based biomarker detection in both the clinical and drug development settings. And most importantly, they believe this type of algorithm will provide rapid, actionable insights into a patient's disease that would help clinicians make more informed care decisions that result in improved patient outcomes. So there's a lot more going on in this field, and I didn't highlight it all here, but I hope this gives you an idea of the real impact computational pathology is making. And I should note that the growth of computational pathology isn't just happening in the pharma industry. It's also been reflected in the scientific literature, representing academic and clinical advancements as well as the developments that we've seen in pharma.
I'm sure these numbers from PubMed are underestimates, but nonetheless, over 9,000 papers related to computational pathology and AI have been published in the last 5 years and over 12,500 in the last 10 years. So more than 70% of the publications in this field came out in just the past 5 years. These numbers highlight the exponential growth trend of computational pathology. What has fueled this trend? Digital pathology platforms have made considerable progress over the past 2 decades, laying the foundation for computational pathology to make a real clinical impact and providing life sciences organizations with the tools they need to analyze images at scale. There is no way to build or use AI if all of your slides are sitting in file cabinets or on hard drives on people's desks. Enterprise digital pathology platforms like Proscia's Concentriq create the accessible and rich data foundation that's critical for AI development. Beyond digitization, a bunch of things have come together to support computational pathology's progress: abundant computational resources, public datasets, commercially available datasets like Proscia's real-world data cohorts, as well as tremendous advances in AI and computer vision. All these things have come together to put computational pathology at this paradigm-shift moment. So we know AI-driven pathology is capable of unlocking the next generation of precision medicine therapies. Now it's time to build, and foundation models are a pivotal technology helping everyone do that. I'll hand it over to Zelda to talk about how. Thank you very much, Juliana. I'm very excited to be here, and indeed, let me talk through a little bit how foundation models are actually impacting digital histopathology, as well as what foundation models really are.
And I'm discussing this as a cofounding member of Bioptimus, a French startup that came out of stealth in February 2024, where our goal is to build foundation models that go from the protein to the cell to the tissue, thus our current focus on histopathology, to unlock accelerations in science and discovery. So as Juliana just said, we have access to these incredible amounts of data now with all these whole slide images. And we also are working with models, foundation models. If you're thinking of something like ChatGPT, that's actually the scale of model we're working with, although obviously not the same model. So these are models with something like over a billion parameters, maybe several billion parameters. That is an incredible size. And to be able to power that, we also need to work with an amount of data that's comparable. And it turns out that digital histopathology is perfect for this, because even a single slide contains a wealth of information that, for people who work in computer vision but not histopathology, is almost unheard of. A whole slide image is typically going to have something like 100,000, even 500,000 pixels across. The amount of structure and information that's contained in this is a gold mine for our computer vision models. The problem is that typically, when you actually have access to an H&E slide, you don't have access to annotations. You have the slide, you have all of your pixels, and maybe, if you're lucky, someone drew a circle around it showing where the interesting tissue was. But you're not going to have any of the things that are typically required for traditional machine learning, the kind that started happening in the early aughts and continued until a couple of years ago, which would be actual clinical annotations. Maybe someone wrote down the actual presence of a biomarker, or described the type of tumor that's being exhibited or not.
This is comparatively going to be something like less than 1%, less than 0.1% of the data compared to the amount of unlabeled histopathology slides that we have access to. And foundation models are actually going to take advantage of this and find a way to learn structure and representations for whole slide images without requiring labeled data. So let me talk you through a little bit how this works before I talk about H-optimus-0, the foundation model that's being released on the Proscia platform. Typically, when someone talks about foundation models or large language models, really, what they're actually talking about is 2 things that have been combined. The first thing is going to be the foundation model itself. It's this incredibly large, multi-billion-parameter model that's typically trained without annotations. I'll talk about how in a second. And because, as Juliana said, training these models is so expensive in terms of hardware, in terms of compute, in terms of time, in terms of engineering skill, this is typically something you're going to do once, maybe twice. The amount of time and resources that goes into it means it's really not scalable to have more than one. So you're really going to try and make the most out of it once you have it. And the way you do this is by adding a second component, the predictive model, which is essentially added on top of your foundation model. And you can really think of it as something very simple. It can be something as simple as a linear regression model or even logistic regression, something really small. And it needs to be really small because this is where the annotations come in. Now that you have your large foundation model and you actually want to use it for something downstream, you're going to take a dataset with annotations, typically a very small one because of the constraints we work with.
And you're going to use this dataset to train your smaller, let's say, linear regression model based on the outputs of your foundation model. So how does this actually work in practice? Well, if you're going to train a foundation model and you don't have labels, that really means that we're going to have to find a trick to actually generate the labels ourselves. And there are 2 major ways in which people do this in practice. The first one is typically called contrastive learning. And, really, here, your label is going to be "this is the same thing" versus "this is not the same thing." For example, if you start with a histology slide, you're going to extract 2 parts of that slide, 2 patches, and then your model has to learn to say whether or not those 2 patches came from the same tissue. If you grabbed them from the same slide, they do; if you sampled them from 2 different organs, for example, they don't. And just with that, you've artificially created labels that are meaningful, because you're learning something about the biology of the image that you're working with. And the second approach that people typically use, which you might have also heard of as masked modeling, is generative learning. And here, it's almost even easier because you don't even need to create a label. Your slide, your patch, your tile, your subset of the image is the label. You take your slide, you hide parts of it, you erase them, and then you say, okay, I want my model to reconstruct what was missing. And that, of course, is also going to teach the model something really fundamental about the structure, about the biology of the tissue, about what you can expect to see, but also about what types of reconstructions are not biologically likely.
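To make the two label-free tricks described above concrete, here is a minimal Python sketch, assuming toy "slides" stored as lists of flat pixel patches. The helper names `make_contrastive_pair` and `mask_patch` are hypothetical and only illustrate how training signals can be generated without annotations; real pipelines work on image tensors and use far more elaborate sampling and augmentation.

```python
import random


def make_contrastive_pair(slides, rng):
    """Return (patch_a, patch_b, label); label is 1 if both patches share a slide."""
    names = sorted(slides)
    same = rng.random() < 0.5
    slide_a = rng.choice(names)
    if same:
        slide_b = slide_a
    else:
        slide_b = rng.choice([name for name in names if name != slide_a])
    patch_a = rng.choice(slides[slide_a])
    patch_b = rng.choice(slides[slide_b])
    return patch_a, patch_b, int(same)


def mask_patch(patch, mask_fraction, rng):
    """Hide a random subset of pixels; the untouched patch is the target."""
    n_hidden = int(len(patch) * mask_fraction)
    hidden = set(rng.sample(range(len(patch)), n_hidden))
    masked = [None if i in hidden else value for i, value in enumerate(patch)]
    return masked, patch  # (model input, reconstruction label)


rng = random.Random(0)

# Toy data (assumption): each slide is a list of patches, each patch a flat
# list of pixel intensities.
slides = {
    "slide_lung": [[10, 12, 11, 13], [9, 12, 10, 14]],
    "slide_colon": [[200, 190, 210, 205], [198, 192, 207, 201]],
}

patch_a, patch_b, label = make_contrastive_pair(slides, rng)
masked, target = mask_patch(slides["slide_lung"][0], 0.5, rng)
```

In both cases the "label" comes for free from the data itself: slide membership in the contrastive case, and the hidden pixels in the masked case.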
And without any kind of label, just by erasing parts of the image and telling the model to fill in the blanks, you're hopefully going to encode within the model some very important representations. And these representations, these embeddings, can then be used downstream. What does that actually look like? So now we turn to the second part of the model, which is, quote, unquote, maybe the easier part of the model, where you take your small, let's say, linear regression component. Let's say you want to predict the cancer subtype for a slide. You're going to take the slide. You're going to pass it through your foundation model. The output of that is going to be an embedding, or feature representation. And typically, although not always, that representation is going to be frozen. So you take your slide, you get the output, and that's going to be your current representation of the data. And that representation is going to be so much smaller. Right? Because we started with something that was 100,000 pixels by 100,000 pixels. And, hopefully, now we have a representation that's actually captured the biologically meaningful parts of this input in a much smaller space, which means that you can apply a much smaller model, such as linear regression, and train it to predict whatever annotations you have access to, for example, cancer subtypes. And, again, this part of the process is where you're going to actually take your foundation model and really reuse it across whatever task you care about. If you want to do cancer subtyping, you're going to fine-tune, which is essentially training the smaller predictive model, on that kind of task. If you are interested in biomarkers, you're going to do that specific kind of task as well. But you're always reusing the core component of the foundation model and only paying the cost of training the very small subcomponents.
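The frozen-embedding workflow described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the actual training setup: `frozen_foundation_model` is a hypothetical stand-in that maps a tile to a two-number embedding, and the small predictive model is a logistic regression trained with simple stochastic gradient descent on a handful of made-up labeled tiles.

```python
import math


def frozen_foundation_model(tile):
    """Hypothetical stand-in for a frozen encoder: tile -> small embedding."""
    mean = sum(tile) / len(tile)
    spread = max(tile) - min(tile)
    return [mean / 255.0, spread / 255.0]


def train_logistic_probe(embeddings, labels, lr=0.5, epochs=500):
    """Train only the small predictive head; the encoder is never updated."""
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(weights, bias, embedding):
    z = sum(w * xi for w, xi in zip(weights, embedding)) + bias
    return 1 if z > 0 else 0


# Tiny annotated dataset (made up): label 0 = dark tiles, label 1 = bright tiles.
dark_tiles = [[40 + i, 50 + i, 60 + i, 45 + i] for i in range(10)]
bright_tiles = [[180 + i, 200 + i, 190 + i, 210 + i] for i in range(10)]
tiles = dark_tiles + bright_tiles
labels = [0] * len(dark_tiles) + [1] * len(bright_tiles)

# Step 1: embed every tile once with the frozen model.
embeddings = [frozen_foundation_model(tile) for tile in tiles]

# Step 2: fit the small head on the embeddings and the few labels available.
weights, bias = train_logistic_probe(embeddings, labels)
predictions = [predict(weights, bias, e) for e in embeddings]
```

Because the embeddings are computed once and reused, switching to a different downstream task only means retraining the small head on a different set of labels.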
And, again, Juliana had covered some of this, but this has been an incredibly exciting topic over the past couple of years. I'm only actually focusing on papers that came out over the past 2 or 3 years. And we've seen incredible work on building histopathology foundation models that do incredibly well on a variety of tasks, and that's extremely exciting because it can really speed up all of the research and development, even on the clinical side, that's being empowered by all of this. And I, of course, would like to talk about our own foundation model, H-optimus-0, which is a foundation model trained on over 600,000 slides from over 200,000 cancer patients, which is an incredible amount of diversity for a slide dataset, and which we then convert into 273 million, actually a little bit more than that, slide tiles. And with this amount of data, we are able to train a type of vision transformer called the ViT giant/14. It has over a billion parameters. When I was talking earlier about the representation, or the features: when we take this 100,000 by 100,000 slide image, we turn it into a representation where each patch is going to be 1,536 numbers, essentially. So quite a compression. And if you're in the know about the types of details that we're working with, we have 40 transformer encoder blocks as well. And this was, at the time it was released, the largest open-source foundation model for histopathology. It is available completely open source on Hugging Face. You can import it in three or four lines of code, depending on your formatting, and it is now also available in Proscia's Concentriq platform, which is really exciting, and we'll be hearing more about that in a second.
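As a rough back-of-the-envelope check on the "quite a compression" remark, the arithmetic can be written out as follows. The 100,000-pixel slide side and the 1,536-number embedding come from the talk; the 224 by 224 tile size is the patch size quoted later in the demo, and the three RGB channels are an assumption. Real slides are mostly background, so the number of tissue tiles actually embedded is usually much smaller than the full grid counted here.

```python
import math

SLIDE_SIDE_PX = 100_000   # slide side length quoted in the talk
TILE_SIDE_PX = 224        # patch size quoted later in the demo
CHANNELS = 3              # assumption: RGB image
EMBEDDING_DIM = 1_536     # numbers per patch embedding quoted in the talk

# Tiles needed to cover the full slide grid, ignoring background filtering.
tiles_per_side = math.ceil(SLIDE_SIDE_PX / TILE_SIDE_PX)
n_tiles = tiles_per_side ** 2

# Raw pixel values per tile versus numbers in its embedding.
values_per_tile = TILE_SIDE_PX ** 2 * CHANNELS
compression_per_tile = values_per_tile / EMBEDDING_DIM

# Whole-slide totals under the same assumptions.
raw_values = SLIDE_SIDE_PX ** 2 * CHANNELS
embedded_values = n_tiles * EMBEDDING_DIM

print(f"tiles per slide (full grid): {n_tiles:,}")
print(f"values per tile: {values_per_tile:,} -> {EMBEDDING_DIM:,}")
print(f"per-tile compression: {compression_per_tile:.0f}x")
print(f"whole slide: {raw_values:,} -> {embedded_values:,} values")
```

Under these assumptions each tile shrinks by a factor of about 98, which is what makes it practical to train small downstream models on an ordinary machine.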
Just maybe to set the scene for why you might care about using this model, these are the types of benchmarks that have been run on our side to understand the performance of H-optimus-0 in contrast to other models that are already available. We've looked at tile-level benchmarks, where your input is really a subset of the whole slide and you're going to look at things like cancer subtyping, as well as much more complicated slide-level tasks, which typically are going to require a downstream model that's a bit more powerful than something simple like linear regression. And, of course, all of these models perform very well, but we're very excited to see that H-optimus-0 is typically defining the state of the art on these results. I don't want to go into too much detail on these eval tasks, but these are really exciting, and we're very keen to see how the field can work with this model and actually use it downstream in applications. And maybe the last thing I want to talk about a little bit before handing it over to Corey is something that we are particularly excited about in terms of working with embeddings from foundation models: not only building foundation models for 1 modality, not just for histopathology, but actually learning to connect the different levels of biology to each other. And so there's the HEST benchmark, which was released by the Mahmood Lab at Harvard, where they look at reconstructing gene expression starting from the slide. And this is, for a variety of reasons, incredibly interesting, in part because the technology for getting this gene information is much more recent, and so much less data is available.
And what we see is that H-optimus-0, our foundation model trained only on histology slides, is able to predict gene expression quite well working only from the slide, allowing us essentially to reconstruct this expensive information that's not typically available to practitioners. This is, again, not our benchmark. I'll talk a little bit later about other experiments that people have run with our model, but this is something that we are particularly interested in seeing. And to be able to do this, you need to actually be able to access and work with the embeddings that are provided by our models. And with this, I will hand it over to Corey. That's great. Thank you, Zelda. Yeah. So, as you heard from Zelda, this clearly has transformative potential, for all the reasons we've just seen. But even with that potential in mind, there are still a number of barriers that typically stand in the way of realizing the full value of these foundation models. And that's especially the case when working with whole slide images in histopathology. From the experience of our own internal AI team at Proscia, and from discussions we've had with many other data science and AI teams in the life sciences sector, we found that these barriers can really be bucketed into 3 main categories. First, there are inefficient workflows that can result from trying to work with these things. Second, there are expensive compute requirements involved in actually running inference on these models. Finally, there are limited options for unifying the array of foundation models that exist out in the field and will continue to come on the scene. Just going through those 1 by 1, starting with the inefficient workflows. As data scientists and developers, we want to perform our entire AI development workflow without ever having to really deal with managing whole slide image storage, manipulation, and data transfer.
We really want to be able to kick off our AI research and development process right from where the images already live. And that's not only to save steps, but to prevent potential errors that can happen when data is moved around. And we want to be in this final, smaller world that Zelda was mentioning, and we want to start there rather than starting all the way back with the complicated requirements involved in dealing with large slides. We certainly don't want to have to wrangle disparate libraries and SDKs just to open files from various scanner vendors, never mind trying to standardize the various image formats and metadata schemes that are likely to exist in the diverse datasets that we have available for AI development. And we really don't want to be spending our time building, managing, and continuously maintaining data pipelines, further diverting our focus from the innovation and discovery at hand. Next, we'd also like to avoid having to provision, orchestrate, and maintain expensive GPU clusters. These are the specialized hardware on which these models tend to run. That's not only because it adds another layer of complexity and resource demands that can slow us down as scientists and researchers, but also because it can be incredibly expensive, and it can be a large capital outlay just to get started on one of these projects, which can really block progress before it even begins. The same goes for storage costs. Those get driven up by having data duplicated and stored in multiple different locations. It can really slow down our progress when our work is reliant on specialized engineering and IT personnel that we have to depend on before we can even start experimenting and quickly prototyping new solutions. Finally, we really need to have the freedom to experiment with new foundation models as they emerge, and they're certainly emerging at a rapid clip, as we've heard.
And we know that some models may excel more at certain tasks than others, and there's also emerging evidence that downstream model performance can actually be enhanced by ensembling multiple models together. So the value of being able to easily leverage these multiple foundation models together really can't be overstated here. There's been a lot of recent evidence, and I'm going to go through a little bit of it. Breen and colleagues at the University of Leeds conducted a rigorous single-task validation of histopathology models. Specifically, this was ovarian cancer subtyping. Similar to the benchmarking results that Zelda mentioned earlier, H-optimus-0 was the best-performing model on average across all the validations, and it's definitely one of the reasons why we're really excited to have H-optimus-0 included in our foundation model collection with Concentriq Embeddings, which we're going to show you in a minute. And similarly, Campanella and colleagues from Mount Sinai and Memorial Sloan Kettering earlier this year compared 8 foundation models on 9 disease detection and 11 biomarker prediction tasks and found that all the foundation models performed fairly well, with no model emerging as a clear winner. So which model performs best is definitely dataset and task dependent. So we want to have a strategy where we're able to actually access multiple models, quickly iterate and test all of these models on a given task, and leverage those foundation models to further enhance the downstream task performance.
When it comes to taking multiple models together, Neidlinger and colleagues recently demonstrated, in a study benchmarking 10 histopathology foundation models on data from lung, colorectal, gastric, and breast cancers, that creating an ensemble of complementary foundation models outperformed the highest-performing foundation model in about two-thirds of the tasks that they looked at. So there's definitely a lot of power to be derived from being able to work with multiple models. So this is another reason, we think, to consider a strategy that allows teams to leverage multiple foundation models for a given model development project and to harness the individual strengths of the various models to improve downstream performance. All of this is leading into why we built Concentriq Embeddings. Concentriq is, as you've heard from Ashley, our platform where life sciences organizations manage their pathology image data and conduct research studies from discovery to clinical trial to support the development of new therapies. Concentriq Embeddings allows scientists and AI developers to transform their whole slide images into numerical representations, also known as embeddings, as we've heard, that are ideal for AI development. So by leveraging the power of foundation models directly within Concentriq, where the data lives, Concentriq Embeddings allows organizations to build AI at least 13 times faster than with previous methods, all while saving millions of dollars through the data science and computational efficiencies that can be gained. We offer multiple foundation models in our collection, including H-optimus-0, so you can choose the best one for your specific R&D needs, from biomarker discovery to companion diagnostic development to trial endpoint assessments and beyond.
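Two common ways to combine several foundation models can be sketched as follows. This is a generic illustration with made-up numbers, not necessarily the ensembling method used in the studies cited above; the function names are hypothetical. The first helper joins embeddings of the same tile from different models before training one downstream head, and the second averages the class probabilities produced by separate per-model heads.

```python
def concatenate_embeddings(per_model_embeddings):
    """Feature-level ensemble: join each model's embedding of the same tile."""
    return [value for embedding in per_model_embeddings for value in embedding]


def average_probabilities(per_model_probs):
    """Prediction-level ensemble: average class probabilities across models."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]


# Made-up embeddings of one tile from two hypothetical foundation models.
embedding_model_a = [0.1, 0.7, 0.3]
embedding_model_b = [0.9, 0.2]
joint_embedding = concatenate_embeddings([embedding_model_a, embedding_model_b])

# Made-up class probabilities from two per-model downstream heads.
probs_model_a = [0.6, 0.4]
probs_model_b = [0.2, 0.8]
ensemble_probs = average_probabilities([probs_model_a, probs_model_b])
ensemble_class = max(range(len(ensemble_probs)), key=lambda c: ensemble_probs[c])
```

In this toy case the two heads disagree, and the averaged prediction follows the more confident one; whether either strategy helps on a real task is something to check empirically per dataset.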
So I'll now walk you through the workflow of Concentriq Embeddings and talk a little bit about an internal study we performed to demonstrate the value of what we've developed here. And then I'll walk through an actual demo of how to use the product. So at a high level, in Concentriq, users select 1 or more images or entire repositories, which is Concentriq's concept of a collection of images, and submit an API request with a chosen foundation model. Like I said, there are several foundation models included, including H-optimus-0, and we're adding more models as we go along. Once the embeddings request is submitted, tile embeddings are generated quickly, and you'll receive a safetensors file containing embeddings and metadata, ready for download. From there, embeddings are ready to be used to build downstream models. So this puts you, I like to think of it, sort of on third base from a model development perspective. You're in that right-hand side of the diagram that Zelda was showing earlier. And from the early evaluations that we did with some partners and with some of our own internal testing, we were able to identify 3 main areas of value that this product, Concentriq Embeddings, really unlocks. First is that teams can create more AI models in less time. That's definitely the main one. Building, iterating, and refining AI models takes only hours rather than weeks by eliminating all the operational steps that I talked about before, like manual data transfers, external processing, pipeline management, image format standardization, all of that gnarly stuff that we previously had to deal with. And secondly, Concentriq's horizontal scalability means you can process multiple models simultaneously without increasing overall processing time.
And in our own internal testing, this allowed us to build, train, and evaluate 80 breast cancer biomarker prediction models in under 24 hours. And we did that all using nothing more than a consumer-grade laptop. So we're really just making model development using foundation models very accessible by taking all of these difficult upstream things off the plate of the researcher and AI developer. You can also optimize AI model performance by using multiple foundation models, as I mentioned before. There's often no single best foundation model for a given pathology application, and performance can vary across datasets and use cases. That's why we think that maintaining a flexible strategy is crucial. With Concentriq Embeddings, your team can easily test and compare multiple foundation models, finding the best fit for your specific applications. Additionally, you can combine the strengths of different foundation models using ensemble techniques, again, as I mentioned earlier, leading to potentially more accurate and more robust predictions. Finally, from a cost-effectiveness perspective, Concentriq Embeddings allows you and your team to focus on scientific discovery and innovation while significantly reducing infrastructure costs, especially those related to data handling and storage. And this frees up valuable resources that can be reinvested to drive impactful AI transformation across your R&D life cycle. And with that as a setup, let's move on to the fun stuff, from my perspective, of looking at a demo, and we'll see how Concentriq Embeddings actually works in action and go through a little bit of a hello-world example. I think I can share. Hopefully, my screen is visible, and I want to first just draw your attention to this public open-source repository, which we released in concert with the core product of Concentriq Embeddings.
And this is a set of Python tools which allow you to interact with the core REST API that makes up Concentriq Embeddings. It handles a lot of the work of calling a REST API, downloading data, caching data, and managing your workflow, and I'll show an example of how that works. So it takes something like the REST API I mentioned and puts it into Pythonic terms, which we think is sort of the lingua franca of your typical AI developer, including myself, so you can interact with the embeddings product. I'm going to go through an example, and we have several examples here in the Notebooks folder showing a couple of use cases, starting with the simplest, which is just clustering a dataset. We're going to look at this example, and I'll go over to one here where I have outputs, so we can actually see what outputs are produced when we go through this code. And hopefully, you'll be able to see just how easy it is to get started and go from zero to the point of having embeddings in hand, where you can start doing model development or analysis. Okay. So, just to get us oriented here, I wanted to quickly highlight Concentriq LS, which is where we have our data stored. And so if I sign in here, I'm going to show you the dataset we're going to work with. Here we're working with this IMPRESS dataset. This is a collection of breast cancer images, both IHC- and H&E-stained, and they're all organized into a repository within Concentriq LS. Here, it's repository 1918, which is the number we're going to reference when we call our API. So all of the data is in one place, and we're going to be able to make a single call to the platform and get back all of our embeddings across all of the slides within this repository.
So, we make that as easy as possible by providing this client within the repository on GitHub which I mentioned. We just provide our credentials to the platform and instantiate a client, and here we can decide whether we want to use CPU or GPU. That refers to what we're using on our local machine, or the machine where we're doing our AI development. The embeddings themselves are actually computed in the cloud using a GPU cluster, and this setting simply defines how we want to interact with the embeddings once we have them in hand and we're doing downstream development. So here, I'm going to use my laptop with just a CPU. And actually getting the embeddings is quite a simple process. It's a call to the embed repos method of this client, and we just define the model we want to use and the microns per pixel (MPP) at which we want to generate the embeddings. And we get back a ticket, which is the ID we need to fetch the results once the job is complete. This job here, at 1 MPP for 252 images, is kind of a go-get-a-coffee-and-come-back job in terms of timing, something in the 15-minute range to get this many images computed at this resolution. The higher the resolution you compute, the longer the run time, but we also scale horizontally, so you do get your results back relatively quickly regardless of size. So here, when I came back, I fetched the embeddings using the get embeddings method, and you can see that we got the 252 images with the H-optimus-0 model, and that model uses a patch size of 224 by 224. What you get back is a dataset which contains all of the images within that repo, all embedded. And so we go image by image here. I'm showing the first image and what is contained in this dataset that you get back. You get information about the images themselves and then some information about how the image was tiled into patches.
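As a rough illustration of the tiling mentioned here, the sketch below computes how many rows and columns a grid of 224 by 224 patches needs for a given image size, and how much padding on the bottom and right edges makes the image evenly divisible by the patch size. The function name `tile_grid`, its parameters, and the example dimensions are hypothetical and are not part of the Concentriq Embeddings API; the actual service may compute its grid differently.

```python
import math


def tile_grid(width_px, height_px, patch_size=224):
    """Return grid shape and bottom/right padding for square patches.

    Hypothetical helper for illustration only. Dimensions are in pixels
    at whatever resolution (MPP) the embeddings were requested.
    """
    # Round up so that partial patches at the edges are still covered.
    cols = math.ceil(width_px / patch_size)
    rows = math.ceil(height_px / patch_size)
    # Padding is whatever is needed to reach a whole number of patches.
    pad_right = cols * patch_size - width_px
    pad_bottom = rows * patch_size - height_px
    return {
        "rows": rows,
        "cols": cols,
        "pad_right": pad_right,
        "pad_bottom": pad_bottom,
    }


# Example with made-up dimensions: a 1000 x 500 pixel image.
grid = tile_grid(width_px=1000, height_px=500)
print(grid)
```

With these made-up dimensions, the image needs 3 rows and 5 columns of patches, with 120 pixels of padding on the right and 172 on the bottom.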
So you get the number of rows and columns in the grid, as well as whatever padding was added at the bottom and the rightmost side of the image to make it all divisible by the patch size. And you get back a signed URL which allows you to download the actual safetensors file, which the Python code takes care of for you. It just goes and downloads it, puts it into a local path, and caches it for you locally to be interacted with. And that is it. At that point, you are up and running. You are in that right-hand corner of Zelda's diagram. You now have a collection of embeddings. Here I'm showing just the first image, the upper left corner, the (0, 0) tile in the grid. Here's the embedding. It's a torch tensor, one dimension of size 1536, and that is the embedding dimension of the H-optimus-0 foundation model. From here, we can also use some utilities which we provide to fetch thumbnails for visualizing your dataset. It's a very similar pattern. You submit a job, you get a ticket, and then you fetch the results. Here I just go ahead and fetch those, and I'm going to link them to the core object that I'm working with, which contains all my data. And from there, I can go on to do really any downstream task. We can start running classification models, whole slide models, anything you'd like to do. Here, I've just got a hello world kind of example of visualizing the embeddings by clustering them. I won't go through every step in this process, but it's all pretty much data science 101 kind of stuff. We're going to map the tiles down onto a 2D projection so we can get a sense of the structure that emerges when we use the foundation model embedding. And then we can go ahead and cluster them, here I'm just using nClusters, and start visualizing some core concepts that the model is discovering in this dataset. So here you can see the result.
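The projection and clustering step described above can be sketched without any platform access. The talk does not say which projection or clustering methods the notebook uses, so this example assumes PCA for the 2D projection and a plain k-means with farthest-point initialisation, both written in NumPy. It runs on synthetic vectors of size 1536 standing in for real tile embeddings, and all function names are illustrative.

```python
import numpy as np


def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def kmeans(x, k, n_iter=20):
    """Plain Lloyd's k-means; returns one integer cluster label per row."""
    # Farthest-point initialisation: start from the first row, then keep
    # adding the row that is farthest from every center chosen so far.
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        # Squared distance from every row to every center: shape (n, k).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


# Synthetic stand-in for tile embeddings: two well-separated groups of
# 100 vectors each, using the 1536 dimensions mentioned in the talk.
rng = np.random.default_rng(0)
dim = 1536
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, dim))
group_b = rng.normal(loc=3.0, scale=1.0, size=(100, dim))
embeddings = np.vstack([group_a, group_b])

coords = pca_2d(embeddings)       # (200, 2) points for a scatter plot
labels = kmeans(embeddings, k=2)  # cluster id per tile
print(coords.shape, np.bincount(labels))
```

In practice the `embeddings` array would be built by stacking the per-tile tensors from the downloaded files, and `coords` coloured by `labels` would give the kind of scatter plot shown in the demo.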
The clustering puts similar morphologies and staining patterns together, just as a result of those things yielding similar embedding vectors. So here you can see this is clearly a big adipose cluster, and other small pathologies are going to cluster together. So that's the start-to-finish process of getting embeddings at your fingertips and being able to just get started without having to muck around with data pipelines, GPUs, all of that stuff that previously was a headache for scientists looking to develop using foundation models. So I'm going to stop sharing and, I think, bring it back to Ashley. Yes. Let me pull up the slides again here. One second. I think, Zelda, we were just going to speak a little bit about what you see in the Concentriq Embeddings solution, being able to leverage that to really make the H-optimus-0 foundation model more available to pathologists and scientists? Yeah. Of course. I mean, I can only agree with everything that Corey just said. Having access to embeddings that are precomputed, or that are going to be computed for you, where you don't have to go through the effort of understanding all of these different SDKs and working with all of these different libraries: as someone who has done that, being able to not worry about it and to outsource it is 90% of the development time. 90% of the development time is not doing the data science. It's not necessarily analyzing these results. It's really just figuring out how you connect all of the different pieces until you actually get your embedding that you can then do analysis on. And so, yeah. I mean, I work on a team that actually builds the embeddings, and for us, it's going to be useful. And I can't even imagine how it's going to be for people who don't necessarily want to think too hard about what it means to generate these embeddings and just want to be able to immediately get up and running and start using them.
That's phenomenal. That's going to cut down development time by orders of magnitude, honestly. And also, to address something that Corey has been talking about in terms of ensembles: as someone who used to do research on ensembles, I can also agree that being able to combine and iterate and think about how these models can complement each other is an incredibly powerful tool, and it's so simple to use, ensembling, I mean. Being able to access it so easily is also scientifically incredibly useful. That's great. Thank you so much, Zelda, and great presentations from everyone. I think we're just going to have a few minutes here to close out on some real-world use cases using foundation models in life sciences R&D and maybe even beyond that. So, Zelda, if you just want to take a minute to speak to that. Sure. Of course. So, and I think this is made obvious as well by this partnership with Proscia, building foundation models is complicated. It takes a long time. It takes a lot of resources. Being able to streamline that, to pay the upfront cost by outsourcing it, having someone else build the models and then having access to the embeddings through a different platform, is something that I believe is going to be extremely mainstream and is already in the process of becoming mainstream. I don't want to talk in too much detail about some of the collaborations we have ongoing, because I can't point you to any public benchmarks yet. But we definitely are in this use case where we're working with people who say, well, we don't want to have to worry about how to get the embeddings. We want to use them downstream. Please give us your model, and we'll see if it works. And indeed, and this was discussed a little bit as well by Corey, I think, we've seen a lot of academics already start to benchmark our model.
We've had over 10,000 downloads on Hugging Face. The Breen et al. paper has shown that we are state of the art on ovarian cancer subtyping. And I think, more generally, this is worth saying: the Breen et al. paper actually states that foundation models for histopathology have reached a level where they're going to be an integral part of research and development for diagnosis and for clinical trials. And then, again, I mentioned this before, but it's so dear to my heart that I do want to end on this: seeing how we can have these models connect to different scales and different modalities, connecting H&E slides to something that's a completely different modality, right, like gene counts and gene expression in a cell, showing that we have this ability. And, again, we didn't benchmark this. This is really what we would call maybe an emergent property of foundation models. Seeing that we can do this in H-optimus-0, seeing that there are other labs that are excited to try this out and use it for downstream use cases, that's what we're hoping for, and that's what we're seeing. Thank you so much. And, Juliana, I'll pass it to you for any closing thoughts as well on some real-life use cases. Yeah. Thanks. And, yeah, it's really cool to hear what folks are able to do with such a powerful model and to hear your perspective, Zelda. Our Concentriq Embeddings solution was launched only a few weeks ago, and we really can't wait to see what our users are building with this technology. Feedback from our early adopter users of Concentriq Embeddings tells us that 90% or more of the model development our users are doing with AI can be accelerated with Concentriq Embeddings. And this acceleration means that users will be able to mine their existing data for research insights and conduct in silico studies that would have been impractical or even cost- and time-prohibitive before.
Data scientists and developers can build a wide range of supervised and unsupervised models with impact across the R&D life cycle, including models for drug discovery, clinical trial optimization, and patient stratification. And these are just a few examples. So we're really excited to see what our users build. Great. Thank you so much. And thank you, Juliana, Zelda, and Corey, for great presentations and discussion. Now we'll open it up for questions from our audience. We only have about 5 minutes left, but we'll try to get through as many as we can. As I mentioned before, please put your questions in the Q&A box, and we will get to as many as possible in the next few minutes. I do see we have a few already. So let's start with a question here for Zelda. Does Bioptimus have plans to develop next generations of H-optimus-0, and how do you expect that they might be different from the current version? I don't know how much you can share here, but any light you can shine on future iterations? Of course. Well, I think the answer is in the fact that we call it H-optimus-0 and not just H-optimus. We indeed plan on having at least one and probably several follow-ups. There are some very typical things that people will do to improve on a model. There are some architectural differences that we're investigating that I perhaps won't be able to talk about too much, but then there are some very straightforward approaches that we know work well for foundation models, and that's scaling. So we're going to be looking in particular at scaling the amount of data that these models are trained on as well as the scale of the model itself. This is a tried and true method that has worked across a variety of modalities, and we've seen it ourselves benchmarking smaller sizes of H-optimus-0 as well, so we know that we're going to get more than enough bang for our buck in that sense. Great. Thank you. And maybe one for Juliana here.
How did you select the foundation models that are part of the Concentriq Embeddings collection today? Anything you can share on the criteria that go into that? Sure. Yeah. The initial foundation models in Concentriq Embeddings were chosen based on their widespread use, the trust in them within the AI and pathology communities, and their availability as open source models, all of which H-optimus-0 covered and more. And then we're also thinking about the range of models we have and always looking to ensure that we meet diverse needs, for example, by having both vision and vision-language models in the collection. AI development in the life sciences covers a wide range of use cases, and model performance can vary significantly depending on dataset and application. We also really want to hear from our users about what models you're using or interested in using. So please let us know what models you'd like to see added so we can include that feedback in our prioritization. Great. Thank you. Okay. We have one. Zelda, you may be best placed to answer this one. Kind of an age-old question in the world of foundation models, it seems. What is more important for building foundation models: data quantity or diversity? So, yes, definitely an age-old question, and not even just for foundation models. I mean, a lot of my thesis was based on this, so I'm more than happy to expand on it at length, but I will try to be brief. Honestly, the answer is both. We do know that just scaling up the amount of data has a huge influence on the performance of models. But it doesn't make sense to scale up in a modality or in an area where you already have enough information.
Once you've squeezed out all of the information you can get out of a specific tissue or specific disease type or category of population, just adding more can actually even hurt in some respects, although, admittedly, it is difficult to reach that threshold with the size of models we have currently. So you need diversity, right? It's really one of these cases where you need both. You need to increase the amount of data, but you want to increase it by adding diversity, in terms of the way the experimental data was acquired, the type of tissue, anything you can think of. Anything that can inject diversity into the representation and the kind of structure that the model will learn is going to be the defining factor in how well your model actually works on downstream tasks. And that's really the key part, because when you're building a foundation model, you can't imagine all of the different uses that people are going to have downstream. And so to have something that's actually representative of use cases that we can't even imagine today, we really need to have access to as diverse a dataset as possible. Great. Thank you. And, Corey, I think you can probably take this one. I'm working with datasets with slides from multiple scanners. Does Concentriq Embeddings work with all formats? I think you covered this a little bit, but maybe you can elaborate on that. Yeah. That's a great question. So, because Concentriq Embeddings starts from Concentriq LS, the platform, and that platform is compatible with all the major scanner vendors, it's totally unified. So it's scanner agnostic from the perspective of somebody using the API. So any brightfield image that you can load into Concentriq LS, you can use Concentriq Embeddings for. And that has just been, yeah.
From an internal user's perspective, since we developed this for ourselves as scientists, it just removed such a huge pain point of having to try to unify datasets coming from multiple scanner vendors, file types, all of that stuff. All of that is removed, and it's compatible with the vast majority of the major scanners and WSI file types. Great. Thank you. And, unfortunately, I think that's all the time we have for questions. So thank you, everyone, for your thoughtful questions. If we did not get to your question, we will send you an answer over email, so look out for that. And let's just close out today. I just want to thank everyone again for joining this webinar. I hope you enjoyed the presentation. We will be sending a recording to your inbox as well as the inbox of anyone who registered for this session. We encourage you to tell anyone who you think might be interested in this content about it. They can sign up to view it on demand using the same link where you registered, and we very much encourage you to check out the resources section, where we have case studies, Git repository links, and lots more information about both Concentriq Embeddings and H-optimus-0. Thank you again, and have a great day, everyone.