How Can Doctors Be Sure A Self-Taught Computer Is Making The Right Diagnosis?

Apr 1, 2019
Originally published on April 1, 2019 7:29 pm

Some computer scientists are enthralled by programs that can teach themselves how to perform tasks, such as reading X-rays.

Many of these programs are called "black box" models because the scientists themselves don't know how they make their decisions. Already these black boxes are moving from the lab toward doctors' offices.

The technology has great allure, because computers could take over routine tasks and perform them as well as doctors do, possibly better. But as scientists work to develop these black boxes, they are also mindful of the pitfalls.

Pranav Rajpurkar, a computer science graduate student at Stanford University, got hooked on this idea after he discovered how easy it was to create these models.

One weekend in 2017, the National Institutes of Health made more than 100,000 chest X-rays publicly available, each tagged with the condition the person had been diagnosed with. Rajpurkar texted a lab mate and suggested they build a quick-and-dirty algorithm that could use the data to teach itself to diagnose the conditions linked to the X-rays.

The algorithm had no guidance about what to look for. Its job was to teach itself by searching for patterns, using a technique called deep learning.

"We ran a model overnight and the next morning I woke up and found that the algorithm was already doing really well," Rajpurkar says. "And that got me really excited about the opportunities, and the ease with which AI is able to do these tasks."

Fast forward to February of this year, and he and his colleagues have already moved far beyond that point. He leads me to a sun-filled room in the William Gates (yes, that Bill Gates) Computer Science Building.

His colleagues are looking at a prototype of a new program to diagnose tuberculosis among HIV-positive patients in South Africa. The scientists hope this program will help fill an urgent medical need. TB is common in South Africa, and doctors are in short supply.

The scientists lean into the screen, which displays a chest X-ray and the patient's basic lab results and highlights the part of the X-ray that the algorithm is focusing on.

The scientists start scrolling through examples, making guesses of their own and seeing how well the algorithm is performing.

Stanford radiologist Matthew Lungren, who is the main medical adviser for this project, joins in. He readily admits he is not great at identifying TB on an X-ray. "We just don't see any TB here" in the heart of Silicon Valley, he explains.

True to his warning, he misdiagnoses the first two cases he sees.

Rajpurkar says the algorithm itself is far from perfect, too. It gets the diagnosis right 75 percent of the time. But doctors in South Africa are correct 62 percent of the time, he says, so it's an improvement. The usual benchmark for TB diagnosis is a sputum test, which is also prone to error.

"The ultimate thought from our group is that if we can combine the best of what humans offer in their diagnostic work and the best of what these models can offer, I think you're going to have a better level of health care for everybody," Lungren says.

But he is well aware that it's easy to be fooled by a computer program, so he sees part of his job as a clinician to curb some of the engineering enthusiasm. "The Silicon Valley culture is great for innovation but it's not got a great track record for safety," he says. "And so our job as clinicians is to guard against the possibility of getting ahead of ourselves and allowing these things to be in a place where they could cause harm."

For example, a program that has taught itself using data from one group of patients may give erroneous results if used on patients from another region — or even from another hospital.

One way the Stanford team is trying to avoid pitfalls like that is by sharing their data so other people can critique the work.

Some of the most cogent analysis has come from John Zech, a medical resident at the California Pacific Medical Center in San Francisco, who is training to be a radiologist.

Zech and his medical school colleagues discovered that the Stanford algorithm to diagnose disease from X-rays sometimes "cheated." Instead of just scoring the image for medically important details, it considered other elements of the scan, including information from around the edge of the image that showed the type of machine that took the X-ray.

(Photo: The laptop displays clinical information, a chest X-ray and a heat map that indicates where the algorithm is focusing its attention. Richard Harris/NPR)

When the algorithm noticed that a portable X-ray machine had been used, it boosted its score toward a finding of pneumonia.

Zech realized that X-rays taken with portable machines in hospital rooms were much more likely to show pneumonia than those taken in doctors' offices. That's hardly surprising, considering that pneumonia is more common among hospitalized people than among people who are able to visit their doctor's office.

"It was being a good machine-learning model and it was aggressively using all available information baked into the image to make its recommendations," Zech says. But that shortcut wasn't actually identifying signs of lung disease, as its inventors intended.

Technologists will need to move forward carefully, to make sure they are getting rid of these biases as well as they can. "I'm interested in doing work in the field," Zech says, "but I don't think it's going to be straightforward."

Diagnosing disease is far more than an image-recognition exercise, he says. Radiologists dig into a person's medical history and talk to referring doctors at times. "Medical diagnosis is hard," he says. And he predicts it will be a long time before computers will compete with humans.

Zech was able to unearth the problems related to the Stanford algorithm because the computer model provides its human handlers with additional hints by highlighting which parts of the X-ray it is emphasizing in its analysis. That's how Zech came to notice that the algorithm was studying information along the edges of the image rather than the picture of the lung itself.

That added feature means it is not a pure black-box model, but "maybe like a very shady box," he says.
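
Heat maps like the ones Zech relied on can be produced in several ways. The sketch below shows one generic technique, occlusion sensitivity, purely as an illustration; it is not the Stanford team's method, and the model and image variables are assumed to come from a setup like the earlier training sketch.

```python
# A generic occlusion-sensitivity heat map, offered only to illustrate how a
# model's "focus" can be surfaced; not the Stanford team's actual approach.
# `model` and `image` are assumed to come from a setup like the earlier sketch.
import torch

def occlusion_heatmap(model, image, target_class, patch=32, stride=16):
    """image: tensor of shape (1, C, H, W). Returns a coarse grid in which
    larger values mark regions the model relied on most for target_class."""
    model.eval()
    with torch.no_grad():
        baseline = model(image).softmax(dim=1)[0, target_class].item()

    _, _, height, width = image.shape
    ys = range(0, height - patch + 1, stride)
    xs = range(0, width - patch + 1, stride)
    heat = torch.zeros(len(ys), len(xs))

    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            occluded = image.clone()
            occluded[:, :, y:y + patch, x:x + patch] = 0.0  # blank out one patch
            with torch.no_grad():
                score = model(occluded).softmax(dim=1)[0, target_class].item()
            heat[i, j] = baseline - score  # a big drop means this patch mattered

    return heat
```

If the strongest regions of such a map fall along the border of the film, where machine markings live, rather than over the lungs, that is exactly the warning sign Zech describes.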

Black-box algorithms are the favored approach to this new combination of medicine and computers, but "it's not clear you really need a black box for any of it," says Cynthia Rudin, a computer scientist at Duke University.

"I've worked on many predictive modeling problems," she says, "and I've never seen a high-stakes decision where you couldn't come up with an equally accurate model with something that's transparent, something that's interpretable."

Black-box models do have some advantages: A program made with a secret sauce is harder to copy and therefore better for companies developing proprietary products.

As the Stanford graduate students' experience shows, black boxes are also much easier to develop.

But Rudin says that especially for medical decisions that could have life or death consequences, it is worth putting in the extra time and effort to have a program built from the ground up based on real clinical knowledge, so humans can see how it is reaching its conclusions.

(Photo: Dr. Matthew Lungren (left) and Pranav Rajpurkar attend a lab meeting where colleagues are testing an algorithm for tuberculosis diagnosis. Richard Harris/NPR)

She is pushing back against a trend in the field, which is to add an "explanation model" algorithm that runs alongside the black-box algorithm to provide clues about what the black box is doing. "These explanation models can be very dangerous," she says. "They can give you a false sense of security for a model that is not that great."

Bad black-box models have already been put to use. One designed to identify criminals likely to offend again turned out to be using racial cues rather than data about human psychology and behavior, she notes.

"Clinicians are right to be suspicious of these models, given all the other problems we've had with proprietary models," Rudin says.

"The right question to ask is, 'When is a black box OK?' " says Nigam Shah, who specializes in biomedical informatics at Stanford.

Shah developed an algorithm that could scan medical records for people who had just been admitted to the hospital, to identify those most likely to die soon. It wasn't very accurate, but it didn't need to be — it flagged some of the most severe cases and referred them to doctors to see whether they were candidates for palliative care. He likens it to a Google search, in which you care only about the top results being on target.
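
Shah's framing maps onto a standard evaluation idea: judge the algorithm only on the precision of its top-ranked picks. The sketch below is a hypothetical illustration of that idea, not his actual system; the patient IDs, scores and threshold are invented for the example.

```python
# Hypothetical illustration of the "Google search" framing: judge the algorithm
# only on whether its highest-risk patients were truly appropriate referrals.
# All names and numbers here are invented, not Shah's data.
def precision_at_top_k(risk_scores, needed_palliative_care, k):
    """risk_scores: {patient_id: predicted risk of dying soon}
    needed_palliative_care: set of patient_ids a clinician would have referred."""
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    top_k = ranked[:k]
    hits = sum(1 for patient in top_k if patient in needed_palliative_care)
    return hits / k

# Example: the model is mediocre overall, but its top picks are mostly on
# target, which is all this use case requires.
scores = {"pt_01": 0.91, "pt_02": 0.85, "pt_03": 0.40, "pt_04": 0.12, "pt_05": 0.08}
referrals = {"pt_01", "pt_02", "pt_04"}
print(precision_at_top_k(scores, referrals, k=2))  # 1.0: both top picks were right
```

On this framing, a model can be inaccurate across the whole patient population and still be useful, so long as the cases it pushes to the top of the list hold up.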

Shah sees no problem using a black box in this case — even an inaccurate one. It performed the task it was intended to.

While the algorithm worked technically, Stanford palliative care physician Stephanie Harman says it ended up being more confusing than helpful in selecting patients for her service, because people in most need of this service aren't necessarily those closest to death.

If you insist on an algorithm that's explainable, Shah says, you need to ask: explainable to whom? "Physicians use things that they don't understand how they work all the time," he says. "For the majority of the drugs, we have no idea how they work."

In his view, what really matters is whether an algorithm gets enough testing along the way to assure doctors and federal regulators that it is dependable and suitable for its intended use. And it is equally important to avoid misuse of an algorithm, for example if a health insurer tried to use Shah's death-forecasting algorithm to make decisions about whether to pay for medical care.

"I firmly believe that we should be thinking about algorithms differently," Shah says. "We need to worry more about the cost of the action that will be taken, who will take that action" and a host of related questions that determine its value in medical care. He says that matters a lot more than whether the algorithm is a black box.

You can contact NPR science correspondent Richard Harris at rharris@npr.org.

Copyright 2019 NPR. To see more, visit https://www.npr.org.

AUDIE CORNISH, HOST:

We're taking a look at artificial intelligence - its benefits, its limits and the ethical questions it raises in this month's All Tech Considered.

(SOUNDBITE OF MUSIC)

CORNISH: As artificial intelligence becomes more sophisticated, it allows computer programs to perform tasks that, at one time, only people could do, like reading X-rays. Many of these programs are called black-box models; that's because even the scientists who created them do not know how they make decisions. NPR's Richard Harris reports on the promise and the pitfalls of applying AI to medical care.

RICHARD HARRIS, BYLINE: If you want to glimpse the brave new world of artificial intelligence programs that are taking on life and death medical judgments, there's no better place than the Stanford University campus.

Hey.

PRANAV RAJPURKAR: Richard?

HARRIS: Yes.

RAJPURKAR: Pranav. Nice to meet you. How's it going?

HARRIS: Nice to meet you. Great.

Pranav Rajpurkar is still a graduate student but clearly a rising star in this world. He's developing computer programs that can learn how to diagnose lung disease. He basically gives his computer algorithm a big pile of data and lets it go to town on its own.

RAJPURKAR: Here's what a chest X-ray looks like, and here's the corresponding diseases in that chest X-ray. And then you just feed it hundreds of thousands of these, and then it starts to be able to automatically learn the pattern from the image itself to the different pathologies.

HARRIS: One weekend, they got a huge download of X-ray data that had just been released by the National Institutes of Health. He and his colleagues set up a machine-learning algorithm and let it run overnight. Lo and behold, by morning, the algorithm had taught itself to diagnose 14 different lung diseases with pretty good accuracy.

RAJPURKAR: And that got me really excited about the opportunities and the ease with which AI is able to do these tests.

HARRIS: Fast-forward to February of this year, and he and his colleagues have already moved far beyond that point. He leads me to a sun-filled room in the William Gates computer building.

RAJPURKAR: This is our lab.

HARRIS: The team is looking at a prototype of a new program which can diagnose tuberculosis among HIV-positive patients from South Africa, a country that has a shortage of doctors for that task. They are checking out how well the program actually performs.

UNIDENTIFIED PERSON #1: Can we go through a few of these?

RAJPURKAR: What is your guess on this one?

HARRIS: The computer scientists, including Amir Kiani and medical student Chloe O'Connell, lean into the screen, which shows a chest X-ray. There's also an image showing where the program is focusing its attention, along with basic lab results.

CHLOE O'CONNELL: Oh, this is a great-looking chest X-ray.

HARRIS: O'Connell doesn't see any white areas in the lung. The algorithm says the patient is unlikely to have tuberculosis, and she agrees. They then click a button to see how the patient was actually diagnosed at the time of the X-ray. No TB.

O'CONNELL: Yay.

UNIDENTIFIED PERSON #2: OK, next case.

O'CONNELL: Hey. How are you?

MATTHEW LUNGREN: How's it going, guys?

HARRIS: Matt Lungren, a Stanford radiologist who's the main medical advisor for this project, comes in. He admits, first off, TB is not his strong suit.

LUNGREN: Usually, I'm exactly the opposite of the truth on TB.

(LAUGHTER)

LUNGREN: We just don't see any TB here, so that's the issue. OK. What do you got?

HARRIS: The film pops up. The algorithm says it's a likely case of TB. Lungren mulls it for a while before deciding to trust the algorithm's finding.

LUNGREN: Yeah, I'm going to go. I'll see.

HARRIS: Someone clicks a button, and the actual diagnosis pops up. Oops - not TB.

LUNGREN: Oh. We were wrong. I don't know. It's like I said, every time, just go the opposite...

(LAUGHTER)

HARRIS: The hospital diagnosis was based on the standard method - a sputum test, not an X-ray. The Stanford algorithm agrees with that call about 75 percent of the time. Rajpurkar says that's not too bad.

RAJPURKAR: And we also know that radiologists we have measured in South Africa get it right 62 percent of the time.

LUNGREN: And I get it right 50 percent.

(LAUGHTER)

HARRIS: Well, you're 0-for-1 right now - just saying.

LUNGREN: Exactly. Thank you for reminding me.

(LAUGHTER)

HARRIS: It's certainly beguiling to think a computer could do this better than a doctor. But will doctors trust an algorithm if they can't see for themselves how it reached its conclusion? John Zech has his doubts. He's training to be a radiologist. We sit down in a coffee shop patio in San Francisco, where he's doing a residency. Zech and his colleagues got interested in this project and dissected some of the pneumonia studies from the Stanford lab.

JOHN ZECH: Going to show you a few examples here.

HARRIS: He pulls up an X-ray on his iPad and notes that the lung has a big white spot, indicative of pneumonia. But the software indicates that it doesn't consider the white spot important in reaching its diagnosis.

ZECH: So if that's the case, like, what is it using?

HARRIS: Zech says sometimes the algorithm homes in on irrelevant information, like the type of X-ray machine. Those used in hospital rooms were much more likely to show pneumonia compared with those used in doctor's offices. That's hardly surprising since, if you have pneumonia, you're much more likely to be in a hospital.

ZECH: It was clear to us that it wasn't just looking for pneumonia in the lung, which is what you'd like such a model to do. It was - you know, it was being a good machine-learning model, and it was aggressively using all available information baked into the image to make its recommendations.

HARRIS: To put it bluntly, it was cheating. The algorithm also doesn't have access to a lot of information real-life doctors use when making tricky diagnoses, such as a patient's medical history, which he plunges into to sort out difficult cases.

ZECH: Medical diagnosis is hard. There's a lot of room. I want this technology to try to help me make those decisions.

HARRIS: Despite hype that this technology is just around the corner, Zech expects it will be a long time before a black-box algorithm can replace these human judgments.

CYNTHIA RUDIN: It's not clear that you really need a black box for any of it.

HARRIS: Cynthia Rudin is a computer scientist at Duke University who is a bit worried about where the field is heading at the moment.

RUDIN: I've worked on so many different predictive modeling problems, and I've never seen a high-stakes decision where you couldn't come up with an equally accurate model with something that's transparent, something that's interpretable.

HARRIS: Is it just that people are in love with this technology of black boxes, or are there other reasons why they want to employ them?

RUDIN: It's both.

HARRIS: It is just plain cool that you can give a bunch of data to a computer and it can train itself. It's also the case that it's easier to make a proprietary, commercially valuable product if it uses some sort of secret sauce nobody knows how to replicate.

RUDIN: And also, the black-box modeling software is much easier to use.

HARRIS: But Rudin says, especially for medical decisions that could have life or death consequences, it's worth putting in the extra time and effort to have a program built from the ground up, based on real clinical knowledge, so humans can decide whether to trust it or not. Pranav Rajpurkar and his colleagues at Stanford are acutely aware of this issue about black-box algorithms.

RAJPURKAR: The first thing I think about is not about convincing others but about convincing myself that this is, in fact, going to be useful for patients, and that's a question we think about every day and try to tackle every day.

HARRIS: One approach they've taken is they've added features so the algorithm not only comes up with an answer but also says how confident its human overlords should be in that result.

RAJPURKAR: In some way, a humble algorithm, but more importantly, an algorithm that's conscious of what it knows and what it doesn't know.

HARRIS: Perhaps most important - his team is not making a commercial product in secret. Instead, they're freely sharing their software and results so others can pick it apart and help the whole field move forward with more confidence. Richard Harris, NPR News.

(SOUNDBITE OF ANDREAS VOLLENWEIDER'S "STELLA") Transcript provided by NPR, Copyright NPR.