Building unbiased AI

Artificial intelligence can be a powerful tool—but without careful supervision, it can contribute to ongoing issues with equity of care and embed harmful biases. The technology, which often crunches data to draw out patterns and insights, could be a key to early detection of medical conditions, identifying patients at risk of deterioration and recommending care plans. But recent research suggests AI and predictive algorithms can be less accurate for vulnerable populations and could exacerbate existing health disparities.

Attention to potential bias in AI is a growing area of focus, with individual hospitals, the federal government and international agencies ringing alarm bells. The Federal Trade Commission in the spring advised companies not to implement AI tools that could unintentionally result in discrimination. The Health and Human Services Department’s Agency for Healthcare Research and Quality earlier this year issued a request for information on algorithms that could introduce racial or ethnic bias into care delivery. And the World Health Organization in June published its first report on ethical considerations for AI in health.

A first step to addressing bias is to be aware of the potential for it to creep into algorithms in ways that can worsen disparities—and to stay vigilant about such concerns, said Dr. Edmondo Robinson, senior vice president and chief digital officer at Tampa, Florida-based Moffitt Cancer Center. “You always want to stay aware of the possibility,” Robinson said.

In a survey from consulting firm KPMG, half of healthcare executives cited potential bias as one of the greatest risks of AI adoption, ranking it second behind concerns about privacy violations at 54%.

Steps to audit deployed AI algorithms

Inventory. Create a list of algorithms used at your organization and task an executive with stewarding and continuously updating it (a minimal sketch of one such registry follows the source note below).

Screen. Assess the inputs and outputs of each algorithm and whether they’re susceptible to or demonstrate bias, paying particular attention to whether proxies the algorithm uses could introduce bias.

Retrain. If the organization identifies bias in an algorithm, figure out a way to improve it, possibly by retraining it with more data or predicting a slightly different outcome.

Prevent. Set up checks and balances to mitigate bias in future algorithms before they’re deployed, and regularly audit live algorithms to ensure they’re working as expected.

Source: The Center for Applied Artificial Intelligence at the University of Chicago Booth School of Business’ Algorithmic Bias Playbook.
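For organizations starting from the inventory step, the registry can be as simple as a structured list with a named steward and a date of the last bias screen. The Python sketch below is a hypothetical illustration of that idea; the class, field names and example entry are assumptions made for this article, not part of the playbook itself.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class DeployedAlgorithm:
    """One entry in an organization's algorithm inventory (illustrative fields only)."""
    name: str
    steward: str                        # executive responsible for keeping the entry current
    predicted_outcome: str              # the variable the model actually predicts
    intended_purpose: str               # what the organization wants the model to do
    inputs: list = field(default_factory=list)
    last_bias_screen: Optional[date] = None
    bias_found: Optional[bool] = None   # None means the algorithm hasn't been screened yet

def needs_screening(algo: DeployedAlgorithm, max_age_days: int = 180) -> bool:
    """Flag algorithms that were never screened or whose last screen is stale."""
    if algo.last_bias_screen is None:
        return True
    return (date.today() - algo.last_bias_screen).days > max_age_days

# Hypothetical inventory entry
inventory = [
    DeployedAlgorithm(
        name="care-management risk score",
        steward="chief digital officer",
        predicted_outcome="future healthcare cost",
        intended_purpose="identify members with unmet health needs",
        inputs=["claims history", "prior utilization"],
    ),
]

for algo in inventory:
    if needs_screening(algo):
        print(f"Screen needed: {algo.name} "
              f"(predicts '{algo.predicted_outcome}'; intended to {algo.intended_purpose})")
```

Recording the predicted outcome next to the intended purpose makes the screening step easier later on, since a mismatch between the two is exactly the kind of proxy problem the playbook warns about.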

But it needs to be top-of-mind for everyone, experts say, since widely deployed algorithms inform care decisions for thousands—if not hundreds of thousands—of patients. “AI has a huge potential—if it’s done right,” said Satish Gattadahalli, director of digital health and informatics in advisory firm Grant Thornton’s public sector business. As with the rest of medicine, that means taking the steps necessary to ensure a commitment to “do no harm.” That “needs to be baked into the strategy from the get-go,” he said. Here are a handful of approaches health systems and payers are trying out to cut down the likelihood of bias unintentionally creeping in at each stage of AI development.

Work from a diverse dataset

AI requires a massive amount of data. And not just any data, but data that’s reflective of the patients a hospital will be treating. To create an AI tool, developers feed a system reams of training data, from which it learns to identify features and draw out patterns. But if that data lacks information on some populations, such as racial minorities or patients of low socioeconomic status, the insights the AI pinpoints might not apply to those patient groups.

That lack of diversity is one of the core problems driving bias in AI, the Government Accountability Office wrote in a report released last year, since it could result in tools that are less safe and effective for some patient populations. That’s been a particular concern in dermatology, where researchers have said many AI tools designed to detect skin cancer were primarily trained on images of light-skinned patients.

It’s a challenge Meharry Medical College, a historically Black medical school based in Nashville, Tennessee, plans to tackle through its new School of Applied Computational Sciences, which opened this year. The school, which also offers degrees in data science and houses faculty research in the field, is building a central repository of data from its patients. The data lake so far includes data from electronic health records, but down the line will incorporate genomics, wearable sensors and social determinants.

Many healthcare data repositories mainly include data on white patients, who tend to use healthcare services more frequently than other groups and account for the majority of healthcare spending. Because Meharry treats many Black and Latino patients, executives hope to make such datasets more diverse. “Organizations like ours that treat a large number of African American and Hispanic patients need to be involved in this kind of work,” said Dr. James Hildreth, president and CEO of Meharry. “If algorithms are going to be applied to the treatment plans or care plans of a diverse population, they should include data that’s accumulated from a diverse set of patients.”

Meharry isn’t just setting up a data lake for internal research. It’s also partnering with outside organizations, such as the Patient-Centered Outcomes Research Institute, with which it will share data for research, and it’s part of HCA Healthcare’s COVID-19 research consortium, through which Nashville-based HCA shares data on COVID-19 patients treated at its facilities with a group of universities.

Having diverse data also helps validate AI: researchers can pull out data on specific subgroups to test an algorithm and ensure that it works for all populations.
One researcher at the School of Applied Computational Sciences is developing an algorithm to predict which COVID-19 patients are at risk for readmission and long-term symptoms, said Ashutosh Singhal, director of medical research, development and strategy at the school. The researcher will build the algorithm with data from the HCA consortium, but plans to validate it with data on underserved patient populations from Meharry. “We’ll be relying on these algorithms in the future a lot,” Singhal said, so it’s important to ensure they work for all patients, not just some populations.

He urged organizations that know they have gaps in their data to partner with collaborators like Meharry, or to tap into national consortiums like the National COVID Cohort Collaborative at the National Institutes of Health, which is corralling electronic health record data from multiple healthcare organizations into a central database.
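One concrete way to use that kind of diverse validation data is to report a model’s accuracy separately for each patient subgroup rather than only in aggregate, so gaps become visible before deployment. The sketch below is a minimal, hypothetical example using pandas and scikit-learn; the column names and the commented-out file are assumptions, not any of the projects described above.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def performance_by_subgroup(df: pd.DataFrame, group_col: str,
                            label_col: str = "readmitted",
                            score_col: str = "predicted_risk") -> pd.DataFrame:
    """Compute AUC separately for each subgroup so accuracy gaps are visible."""
    rows = []
    for group, subset in df.groupby(group_col):
        if subset[label_col].nunique() < 2:
            continue  # AUC is undefined when a subgroup has only one outcome class
        rows.append({
            group_col: group,
            "n": len(subset),
            "auc": roc_auc_score(subset[label_col], subset[score_col]),
        })
    return pd.DataFrame(rows).sort_values("auc")

# Hypothetical usage against a held-out validation set with model scores attached:
# validation = pd.read_csv("held_out_patients.csv")
# print(performance_by_subgroup(validation, group_col="race_ethnicity"))
```

A report like this is only as good as the validation data behind it, which is why partnerships with organizations that treat underrepresented populations matter: a subgroup that barely appears in the data can’t be meaningfully checked.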

Why AI is difficult to review

Identifying patients at high risk for stroke is a challenge—but solving it, paired with appropriate interventions and treatments, could help physicians prevent strokes. There’s a ton of data to parse through, which is why researchers at New York City-based Montefiore Medical Center are using machine learning to study which clinical, demographic and social determinants of health variables are most associated with stroke, with the goal of using the findings to inform development of new tools that assess the risk of recurrent stroke.

Building a predictive model without AI would be difficult, given the number of variables the researchers wanted to include, said Dr. Charles Esenwa, a researcher and neurologist working on the project. “That was the reason we experimented with machine learning,” said Esenwa, who’s also director of Montefiore’s Center for Comprehensive Stroke and an assistant professor at the Albert Einstein College of Medicine.

Despite its benefits, that ability to ingest massive amounts of data can also pose challenges. Even the researchers developing a machine-learning algorithm might not know which variables the algorithm pays the most attention to or how different variables are weighted, since the algorithm can’t describe its decision-making process. That means that, unlike with other types of software, it’s not always clear how an AI algorithm reaches its conclusions—the process is often hidden in what experts call a “black box.” To add another layer of complication, some AI tools continually adapt in response to new data, changing how they make decisions over time.

“With traditional advanced analytics, you know all the variables upfront and you tune (the inputs) over time,” said Jason Joseph, chief digital and information officer at Spectrum Health in Grand Rapids, Michigan. “With AI you do not give it the formula. You don’t tell it what the variables are. You just give it a whole bunch of data … and it figures out what the variables are.”
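One generic way to peek inside a “black box” model of this kind—not specific to the Montefiore project—is permutation importance: shuffle one input at a time and measure how much the model’s accuracy drops. The sketch below uses scikit-learn on synthetic stand-in data; in practice the features would be the clinical, demographic and social-determinant variables described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for clinical, demographic and social-determinant variables.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the test score drops;
# large drops point to the variables the model leans on most.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: mean importance {result.importances_mean[i]:.3f}")
```

Checks like this don’t fully open the black box, but they give reviewers a rough map of which inputs drive a model’s output, which is a starting point for asking whether any of them act as proxies for race, income or access to care.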

Make sure AI helps patients

Dr. Steven Lin in 2019 founded the Healthcare AI Applied Research Team at Stanford Medicine in California. He’s now executive director of the program, a research group abbreviated as HEA3RT that studies how to translate AI research into actual care delivery and operations. That includes deploying AI at Stanford Health Care, as well as working with outside companies that bring AI tools and ideas to the group. HEA3RT’s team of physicians, quality improvement staff and implementation scientists then test the AI and figure out the best way to implement and scale it.

“There’s a ton of amazing work that’s happening at the basic science level,” where scientists are building increasingly accurate algorithms for healthcare using AI and machine learning, said Lin, who’s also family medicine service chief at Stanford Health Care. “But very few of those innovations are actually benefiting patients, providers and health systems on the front lines.”

He said it’s important to think about equity issues from the get-go, when first considering a use case and gathering data. There have been cases where a project is brought to HEA3RT, and it becomes clear the algorithm wasn’t trained on a diverse patient population—at which point the group has to pinpoint a broader dataset to continue training the algorithm on.

Dr. Steven Lin, a family medicine doctor, founded the Healthcare AI Applied Research Team at Stanford Medicine in 2019 to study how to translate AI research into care delivery and operations.

“Think about equity issues at the very, very beginning,” Lin said. “If you think about this once the technology is fully fleshed out … it’s often very difficult to go back and ‘tweak’ something.”

One of the principles Lin said he follows to tackle potential bias at the start of a project is ensuring there are diverse stakeholders weighing in on the design of the AI tool, as well as on how it’s deployed. That means including developers with diverse backgrounds, along with the perspectives of those who will be affected by an AI rollout—like clinicians and patients.

HEA3RT recently worked on a project testing an AI chatbot that could collect a patient’s medical history before an appointment. While some patients responded well to the chatbot, others said they wouldn’t feel as comfortable giving sensitive health data to a machine, according to Lin. Generally, younger and healthier patients tend to be more comfortable conversing with a chatbot than older patients with multiple or more complex chronic conditions, he added. If a chatbot like this were rolled out to patients, it would also be important to make sure it could interact with patients who aren’t fluent in English.

To ensure ethical considerations like equity are thought about from the start, Mount Sinai Health System in New York City is building an AI ethics framework led by bioethics experts. Bioethicists have researched health disparities and bias for decades, said Thomas Fuchs, dean of AI and human health at the Icahn School of Medicine at Mount Sinai. The framework will use the WHO’s ethics and governance report as a foundation. “AI brings new challenges,” Fuchs said. “But very often, it also falls into categories that have already been addressed by previous ethics approaches in medicine.”

Pinpoint the right outcome to predict

Independence Blue Cross, a health insurer in Philadelphia, develops most of its AI tools in-house, so it’s important to be aware of the potential for bias from start to finish, said Aaron Smith-McLallen, the payer’s director of data science and healthcare analytics. Since 2019, Independence Blue Cross has been working with the Center for Applied AI at the University of Chicago Booth School of Business. The center provides free feedback and support to healthcare providers, payers and technology companies interested in auditing specific algorithms or setting up processes to identify and mitigate algorithmic bias.

Working with the Center for Applied AI has helped data scientists at Independence Blue Cross systematize how they think about bias and where to add checks and balances, such as tracking what types of patients an algorithm tends to flag and whether that matches expectations, as well as what the implications of a false positive or false negative could be.

As developers move through the stages of creating an algorithm, it’s essential to continuously ask “why are we doing this?” Smith-McLallen said. That answer should inform what outcome an algorithm predicts. Many of the algorithms used at Independence Blue Cross flag members who could benefit from outreach or care management; to get to that outcome, the algorithms predict which members are at risk for poor health outcomes. Carefully thinking through what outcome an algorithm predicts has been a major takeaway from the Center for Applied AI’s work with healthcare organizations.
Algorithms that use proxies, or variables that approximate other outcomes, to reach their conclusions are at high risk of unintentionally introducing bias, said Dr. Ziad Obermeyer, an associate professor in health policy and management at the University of California at Berkeley and head of health and AI research at the Center for Applied AI.

The center launched in 2019 in the wake of a study, co-authored by Obermeyer, which found that a widely used algorithm for population health management—a predictive model that doesn’t use AI—dramatically underestimated the health needs of the sickest Black patients, assigning healthier white patients the same risk score as Black patients with poorer lab results. The algorithm flagged patients who could benefit from additional care-management services, but rather than predicting patients’ future health conditions, it predicted how much patients would cost the hospital. That created a disparity, since Black patients generally use healthcare services at lower rates than white patients.

Developers need to be “very, very careful and deliberate about choosing the exact variable that they’re predicting with an algorithm,” Obermeyer said. It’s not always possible to predict exactly what an organization wants, especially with problems as complex as medical care. But keeping track of the information an organization would ideally want from an algorithm, what the algorithm is actually doing, and how those two things compare can help ensure the algorithm matches the “strategic purpose,” if not the exact variable.

Another common challenge is failing to acknowledge the various root causes that contribute to a predicted outcome. Many algorithms predict “no-shows” in primary care, which staff might use to double-book appointments, Obermeyer said as an example. Some of those patients are likely voluntary no-shows who cancel appointments because their symptoms go away, but others struggle to get to the clinic because they lack transportation or can’t get time off work. “When an algorithm is just predicting who’s going to no-show, it’s confusing those two things,” Obermeyer said.

Once a health system has an AI tool, even one that’s validated and accurate, the work isn’t done. Executives have to think critically about how to actually deploy the tool into care and use the insights the AI draws out. For an algorithm predicting no-shows, for example, developers might create a way to tease apart voluntary and involuntary no-shows and handle the two situations differently.
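A simple audit in the spirit of that finding is to look at the patients an algorithm actually flags and ask whether, above the same score threshold, one group is measurably sicker than another—which would suggest the proxy (such as predicted cost) is understating that group’s need. The sketch below is a hypothetical illustration; the column names and threshold are assumptions, not the study’s code.

```python
import pandas as pd

def proxy_gap_report(df: pd.DataFrame,
                     score_col: str = "predicted_cost_risk",
                     need_col: str = "chronic_condition_count",
                     group_col: str = "race",
                     flag_quantile: float = 0.97) -> pd.DataFrame:
    """Among patients the algorithm flags, compare measured health need across groups.

    If flagged patients in one group carry more chronic conditions than flagged
    patients in another, the proxy outcome is likely understating that group's need.
    """
    threshold = df[score_col].quantile(flag_quantile)
    flagged = df[df[score_col] >= threshold]
    return flagged.groupby(group_col).agg(
        n_flagged=(need_col, "size"),
        mean_health_need=(need_col, "mean"),
    )

# Hypothetical usage against a scored member file:
# members = pd.read_csv("scored_members.csv")
# print(proxy_gap_report(members))
```

The point of a check like this isn’t to produce a verdict but to surface the question developers should be asking anyway: does the variable the algorithm predicts actually track the thing the organization cares about for every group of patients?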

Defining AI

Algorithm: A sequence of instructions that a computer program follows to solve a particular problem, such as calculating a risk score. A predictive algorithm is used to predict an outcome, rather than to describe or diagnose what’s already occurred.

Model: “Algorithm” and “model” are often used interchangeably in casual conversation but have different meanings. A model is a tool built to analyze data, using statistical algorithms tailored to the question a researcher or developer is trying to answer.

Predictive modeling: A statistical technique used to predict future behavior by analyzing historical and current data.

Artificial intelligence: The development of computer programs that analyze data, recognize patterns and ultimately learn to perform tasks that would typically require a human being. Machine learning is a subset of AI.

Black box AI: An AI system that can’t explain how it crunched data and analyzed information to reach its conclusion. That contrasts with explainable AI, an emerging field that tries to add techniques to AI so a system can outline its decisions in a way human users can interpret and understand.

Locked algorithm: An algorithm that uses a static decision-making process that doesn’t change with each use; it changes only if updated by a developer. An unlocked algorithm uses AI to evolve, learn and change how it makes decisions over time as it ingests more data.

Source: Modern Healthcare reporting

In one case study, physicians at UCSF Health in San Francisco implemented an AI algorithm designed to predict no-shows—but only to determine which patients could benefit from targeted outreach. They didn’t double-book appointments. Ideally, that active outreach would support patients facing challenges accessing care, rather than leaving them with delayed or rushed visits as a result of double-booking. The physicians acknowledged in the paper that double-booking patients would likely have maximized schedules, and while targeted outreach reduced no-shows, it didn’t eliminate them. But it can be “worth taking the risk,” said Obermeyer, who wasn’t involved with the case study, if it serves to disrupt, rather than reinforce, existing disparities.

That’s the kind of impact executives need to think about before deploying an AI algorithm, no matter how accurate, into care delivery. And AI isn’t a one-time implementation. Healthcare organizations need to constantly monitor AI—particularly if the tool isn’t a locked algorithm and is evolving, learning and changing the way it makes decisions over time—documenting the decisions it makes and refining it when it isn’t working as expected.

After AI tools are deployed, it’s important to continuously check them for fairness, said Robinson at Moffitt Cancer Center. That can involve building a fairness check directly into an AI tool or separately evaluating its outputs. “The challenge is that you have to define (what fairness is),” he said. That varies by use case. In some instances, it could be important to check whether the AI makes the same determination regardless of demographics like race or gender—as in the algorithm studied by the Center for Applied AI, where similarly sick Black and white patients were assigned different risk scores.

Monitoring the demographics of the patients an algorithm flags is particularly useful with a “black box” AI algorithm, where it’s not clear how the algorithm makes individual decisions and the tool only shares an outcome or recommendation. If a tool is consistently directing more white patients than Black patients into certain programs, it might be time to ask why.

“It really depends on the use case,” Robinson said, but an overarching theme is explainability—investigating how an AI arrives at an outcome, rather than just accepting it. “It gets harder the more complex the algorithms are,” he said. But “we (need) an understanding, or some kind of way to explain, why we are where we are.”
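As a closing, hypothetical illustration of the kind of ongoing fairness check Robinson describes, a team could log every recommendation a deployed tool makes and periodically compare flag rates across demographic groups, treating large gaps as a prompt for human review rather than an automatic verdict. The column names and alert threshold below are assumptions made for this sketch.

```python
import pandas as pd

def flag_rate_review(decisions: pd.DataFrame,
                     group_col: str = "race_ethnicity",
                     flag_col: str = "referred_to_program",  # expected to be 0/1 or boolean
                     alert_ratio: float = 1.5) -> pd.DataFrame:
    """Compare how often each demographic group is flagged by a live algorithm.

    Marks groups whose flag rate differs from the overall rate by more than
    `alert_ratio` in either direction so a human can investigate why.
    """
    overall = decisions[flag_col].mean()
    report = decisions.groupby(group_col)[flag_col].agg(["mean", "size"])
    report.columns = ["flag_rate", "n"]
    report["needs_review"] = (
        (report["flag_rate"] > overall * alert_ratio)
        | (report["flag_rate"] < overall / alert_ratio)
    )
    return report

# Hypothetical usage against a log of the deployed tool's decisions:
# log = pd.read_parquet("algorithm_decision_log.parquet")
# print(flag_rate_review(log))
```

A gap flagged by a check like this isn’t proof of bias on its own—it’s the starting point for the kind of explanation Robinson says organizations need before accepting an algorithm’s output.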