Looking for Equipoise

Controversy regarding potential treatments and vaccines for COVID-19 has brought arguments about randomized trial methods into the public square.  Disagreements about interim analyses and early stopping of vaccine trials have moved from the methods sections of medical journals to the editorial section of the Wall Street Journal.   I never imagined a rowdy public debate about the more sensitive Pocock stopping rule versus the more patient O’Brien-Fleming rule.  A randomized trial nerd like me should be heartened that anyone even cares about the difference.  I imagine it’s how Canadians feel during Winter Olympic years when the rest of the world actually watches curling.  But the debates have grown so heated that choice of a statistical stopping rule has become a test of political allegiance.  And that drove me to some historical reading about the concept of equipoise.
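
Before turning to that history, here is a minimal sketch of what separates those two stopping rules, for readers who have not met them.  The constants below are commonly tabulated values for a design with five equally spaced looks and two-sided alpha of 0.05; a real monitoring plan would derive its boundaries with group-sequential software rather than borrowing textbook numbers.

```python
# Rough sketch, not any trial's actual monitoring plan: compare the interim
# z-score hurdles implied by the Pocock and O'Brien-Fleming rules.
# Constants are commonly tabulated values for 5 equally spaced looks and
# two-sided alpha = 0.05; real analyses derive them with group-sequential software.
import numpy as np

K = 5              # number of planned analyses (interim looks plus the final one)
c_pocock = 2.413   # Pocock: the same nominal boundary at every look
c_obf = 2.040      # O'Brien-Fleming: boundary at the final look

for k in range(1, K + 1):
    information_fraction = k / K
    z_pocock = c_pocock
    z_obf = c_obf / np.sqrt(information_fraction)   # steep early, relaxed later
    print(f"look {k}: Pocock needs |z| >= {z_pocock:.2f}, "
          f"O'Brien-Fleming needs |z| >= {z_obf:.2f}")
```

The O'Brien-Fleming hurdles start out so high that a trial almost never stops at the first look, which is what makes it the more patient rule; Pocock applies the same, lower hurdle from the very first analysis.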

As originally understood, the ethical requirement for equipoise in randomized trials applied to the individual clinician and the individual patient.  Each clinician was expected to place their duty to an individual patient over any regard for public health or scientific knowledge.  By that understanding, a clinician would only recommend or participate in a randomized trial if they had absolutely no preference.  If they preferred one of the treatments being compared, they were obligated to recommend that treatment and recommend against joining a randomized trial.    

In 1987, Benjamin Freedman suggested an alternative concept of “clinical equipoise”, applied at the community level rather than to the individual patient.  A clinician might have a general preference for one treatment over another.  Or the clinician might believe that one treatment would be preferred for a specific patient.  But that clinician might still suggest that their patient participate in a randomized trial if the overall clinical or scientific community was uncertain about the best choice. 

Freedman’s discussion of equipoise offers some useful words for navigating current controversies regarding COVID-19 clinical trials.  He suggested that a clinician could ethically participate in a randomized trial or recommend participation to their patients “if their less-favored treatment is preferred by colleagues whom they consider to be responsible and competent.”  A clinician is not expected to be ignorant or indifferent (“I have no idea what’s best for you.”).  Instead, they are expected to be honest regarding uncertainty (“I believe A is probably better, but many experts I respect believe that B is best.”).  Freedman proposed that a randomized trial evaluating some new treatment is ethically appropriate if success of that treatment “appears to be a reasonable gamble.”  That language accurately communicates both the hope and uncertainty central to randomized trials of new treatments.  It also communicates that odds of success can change with time.  A gamble that appeared reasonable when a randomized trial started may become less reasonable as new evidence – from within the trial or outside of it – accumulates.  Reasonable investigators must periodically reconsider their assessments of whether the original gamble remains reasonable.  That’s where those stopping rules come in.

The concept of clinical equipoise also helps to frame our MHRN randomized trials of new mental health services or programs.  We are often studying treatments or programs that we have developed or adapted, so we are neither ignorant nor indifferent.  We certainly care about the trial outcome, and we usually hope the new treatment we are testing will be found helpful.  But we do randomized trials because our beliefs and hopes are not evidence.

Greg Simon

Judging Which Model Performs Best

Projections of the course of the COVID-19 pandemic prompted vigorous arguments about which disease model performed the best.  As competing models delivered divergent predictions of morbidity and mortality, arguments in the epidemiology and medical twitterverse grew as rowdy as the crowd in that male model walk-off from the movie Zoolander.  Given the rowdy disagreement among experts, how can we evaluate the accuracy of competing models yielding divergent predictions?  With all respect to the memory of David Bowie (the judge of that movie modeling competition), isn’t there some objective way of judging model performance?  I think that question leads to a key distinction between types of mathematical models.

Most models predicting individual-level events (such as our MHRN models predicting suicidal behavior) follow an empirical or inductive approach.  An inductive model begins with “big data”, usually including a large number of events and a large number of potential predictors.  The data then tell us which predictors are useful and how much weight each should be given.  Theory or judgment may be involved in assembling the original data, but the data then make the key decisions.  Regardless of our opinions about what predictors might matter, the data dictate what predictors actually matter.
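
To make “the data make the key decisions” concrete, here is a minimal sketch with simulated data and an L1-penalized logistic regression standing in for our records data and our (far more elaborate) real modeling pipeline: the model is handed fifty candidate predictors and keeps only the ones the data support.

```python
# Toy illustration of the inductive approach: hand the model many candidate
# predictors and let a penalized regression decide which ones to keep.
# Everything here is simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_people, n_candidates = 20000, 50
X = rng.normal(size=(n_people, n_candidates))
true_beta = np.zeros(n_candidates)
true_beta[:3] = [0.8, -0.5, 0.3]                  # only three predictors truly matter
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta - 2))))

# L1 penalty: coefficients for uninformative predictors are shrunk to exactly zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.005, max_iter=1000)
model.fit(X, y)
kept = np.flatnonzero(model.coef_[0])
print("predictors the data decided to keep:", kept)  # almost always the first three
```

The point is not this particular estimator; it is that no one told the model in advance which columns mattered.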

In contrast, models predicting population-level change (including many competing models of COVID-19 morbidity and mortality) often follow a mechanistic or deductive approach.  A deductive model assumes a mechanism of the underlying process, such as the susceptible-infected-recovered model of infectious disease epidemics.  We then attempt to estimate key rates or probabilities in that mechanistic model, such as the now-famous reproduction number or R0 for COVID-19.  “Mass action” models apply those rates or probabilities to groups, while “agent-based” models apply them to simulated individuals.  In either case, though, an underlying structure or mechanism is assumed.  And key rates or probabilities are usually estimated from multiple sources – often involving at least some expert opinion or interpretation.
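
A toy “mass action” example makes the contrast clear: the susceptible-infected-recovered mechanism is assumed up front, and the output hangs on a handful of assumed parameters such as R0.  The values below are purely illustrative, not calibrated COVID-19 estimates.

```python
# Minimal susceptible-infected-recovered (SIR) simulation: assume the mechanism,
# plug in an assumed R0, and see what the machinery projects.
# Parameter values are illustrative only.
def sir_peak(r0, recovery_days=10.0, population=1_000_000, days=730):
    """Daily Euler steps of a simple SIR model; returns peak concurrent infections."""
    gamma = 1.0 / recovery_days      # recovery rate per day
    beta = r0 * gamma                # transmission rate implied by the assumed R0
    s, i, r = population - 1.0, 1.0, 0.0
    peak = i
    for _ in range(days):
        new_infections = beta * s * i / population
        recoveries = gamma * i
        s, i, r = s - new_infections, i + new_infections - recoveries, r + recoveries
        peak = max(peak, i)
    return peak

for r0 in (1.5, 2.0, 2.5):
    print(f"assumed R0 = {r0}: projected peak infections = {sir_peak(r0):,.0f}")
```

Even this toy version shows why the arguments get heated: modest changes in the assumed R0 move the projected peak by a large multiple, and those assumptions often rest on expert judgment rather than data.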

Judging the performance of empirical or inductive prediction models follows a standard path.  At a minimum, we randomly divide our original data into one portion for developing a model and a separate portion for testing or validating it.  Before using a prediction model to inform practice or policy, we would often test how well it travels – by testing or validating it in data from a later time or a different place.  So far, our MHRN models predicting individual-level suicidal behavior have held up well in all of those tests.
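
For readers who want to see the minimal version of that path, here is a sketch with simulated data and hypothetical predictor names; the real work also tests models in later time periods and other health systems rather than stopping at a single random split.

```python
# Bare-bones version of the standard path: split the data, develop on one piece,
# and check discrimination on the held-out piece.  The data and column names are
# made up for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 50000
df = pd.DataFrame({"prior_attempt": rng.binomial(1, 0.1, n),
                   "phq9_item9": rng.integers(0, 4, n)})
risk = 1 / (1 + np.exp(-(-4 + 1.5 * df["prior_attempt"] + 0.5 * df["phq9_item9"])))
df["event"] = rng.binomial(1, risk.to_numpy())

develop, validate = train_test_split(df, test_size=0.3, random_state=0)
predictors = ["prior_attempt", "phq9_item9"]
model = LogisticRegression().fit(develop[predictors], develop["event"])
predicted = model.predict_proba(validate[predictors])[:, 1]
print("validation AUC:", round(roc_auc_score(validate["event"], predicted), 3))
```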

That empirical validation process is usually neither feasible nor reasonable with mechanistic or deductive models, especially in the case of an emerging pandemic.  If our observations are countries rather than people, we lack the sample size to divide the world into a model development sample and a validation sample.  And it makes no sense to validate COVID-19 predictions based on how well they travel across time or place.  We already know that key factors driving the pandemic vary widely over time and place.  We could wait until the fall to see which COVID-19 model made the best predictions for the summer, but that answer will arrive too late to be useful.  Because we lack the data to judge model performance, the competition can resemble a beauty contest.

I am usually skeptical of mechanistic or deductive models.  Assumed mechanisms are often too simple.  In April, some reassuring models used Farr’s Law to predict that COVID-19 would disappear as quickly as it erupted.  Unfortunately, COVID-19 didn’t follow that law.  Even when presumed mechanisms are correct, estimates of key rates or probabilities in mechanistic models often depend on expert opinion rather than data.  Small differences in those estimates can lead to marked differences in final results.  In predicting the future of the COVID-19 pandemic, small differences in expectations regarding reproduction number or case fatality rate lead to dramatic differences in expected morbidity and mortality.  When we have the necessary data, I’d rather remove mechanistic assumptions and expert opinion estimates from the equation.

But we sometimes lack the data necessary to develop empirical or inductive models – especially when predicting the future of an evolving epidemic.  So we will have to live with uncertainty – and with vigorous arguments about the performance of competing models.  Rather than trying to judge which COVID-19 model performs best, I’ll stick to what I know I need to do:  avoid crowds (especially indoors), wash my hands, and wear my mask!

Greg Simon

Incredible research is sometimes not credible

The controversy around COVID-19 research from the Surgisphere database certainly got my attention.  Not because I have any expertise regarding treatments for COVID-19.  But because I wondered how that controversy would affect trust in research like ours using “big data” from health records.

In case you didn’t follow the story, here’s my brief summary: Surgisphere is a data analytics company that claims to gather health records data from 169 hospitals across the world.  In early May, Surgisphere researchers and academic collaborators published a paper in NEJM showing that drugs used to treat cardiovascular disease did not increase severity of COVID-19.  That drew some attention.  In late May, a second paper in the Lancet reported that hydroxychloroquine treatment of COVID-19 had no benefit and could increase risk of abnormal heart rhythms and death.  Given the scientific/political controversy around hydroxychloroquine, that finding drew lots of attention.  Intense scrutiny across the academic internet and twitterverse revealed that many of the numbers in both the Lancet paper and the earlier NEJM paper didn’t add up.  After several days of controversy and “expressions of concern”, Surgisphere’s academic collaborators retracted both articles, reporting that they were not able to access the original data to examine discrepancies and replicate the main results.

Subsequent discussions questioned why journal reviewers and editors didn’t recognize the signs that Surgisphere data were not credible.  Spotting the discrepancies, however, would have required detailed knowledge regarding the original health system records and the processes for translating those records data into research-ready datasets.  I suspect most reviewers and editors rely more on the reputation of the research team than on detailed inspection of the technical processes.

Our MHRN research also gathers records data from hundreds – or even thousands – of facilities across 15 health systems.  We double-check the quality of those data constantly, looking for oddities that might indicate some technical problem in health system databases or some error in our processes.  We can’t expect that journal reviewers and editors will triple-check all of our double-checking. 

I hope MHRN has established a reputation for trustworthiness, but I’m ambivalent about the scientific community relying too much on individual or institutional reputation.  I do believe that the quality of MHRN’s prior work is relevant when evaluating both the integrity of MHRN data and the capability of MHRN researchers.  Past performance does give some indication of future performance.  But relying on reputation to assess new research will lead to the same voices being amplified.  And those loudest voices will tend to be older white men (insert my picture here).

I’d prefer to rely on transparency rather than reputation, but there are limits to what mental health researchers can share.  Our MHRN projects certainly share detailed information about the health systems that contribute data to our research.  The methods we use to process original health system records into research-ready databases are well-documented and widely imitated.  Our individual research projects publish the detailed computer code that connects those research databases to our published results.  But we often cannot share patient-level data for others to replicate our findings.  Our research nearly always considers sensitive topics like mental health diagnoses, substance use, and suicidal behavior.  The more we learn about risk of re-identifying individual patients, the more careful we get about sharing sensitive data.  At some point, people who rely on our research findings will need to trust the veracity of our data and the integrity of our analyses.

We can, however, be transparent regarding shortcomings of our data and mistakes in our interpretation.  I can remember some of our own “incredible” research that could have turned into Surgisphere-style embarrassments.  For example: We found an alarmingly high rate of death by suicide in the first few weeks after people lost health insurance coverage.  But then we discovered we were using a database that registered months of complete insurance coverage – so people were counted as not insured during the month they died.  I’m grateful we didn’t publish that dramatic news.  Another example: We found that a disturbingly high number of people seen in emergency departments for a suicide attempt had another visit with a diagnosis of self-harm during the next few days.  That was alarming and disturbing.  But then we discovered that the bad news was actually good news.  Most of those visits that looked like repeat suicide attempts turned out to be timely follow-up visits after a suicide attempt.  Another embarrassing error avoided.

While we try to publicize those lessons, the audience for those stories is narrow.  There is no Journal of Silly Mistakes We Almost Made, but there ought to be.  If that journal existed, research groups could establish their credibility by publishing detailed accounts of their mistakes and near-mistakes.  As I’ve often said in our research team meetings:  If we’re trying anything new, we are bound to get some things wrong.  Let’s try to find our mistakes before other people point them out to us.

Greg Simon

Delivering on the Real Promise of Virtual Mental Healthcare

Many of our MHRN investigators were early boosters of telehealth and virtual mental health care.  Beginning over 20 years ago, research in our health systems demonstrated the effectiveness and value of telehealth follow-up care for depression and bipolar disorder, telephone cognitive-behavioral or behavioral activation psychotherapy, depression follow-up by online messaging, online peer support for people with bipolar disorder, and telephone or email support for online mindfulness-based psychotherapy.

Even in our MHRN health systems, however, actual uptake of telehealth and virtual care lagged far behind the research.  Reimbursement policies were partly to blame; telephone visits and online messaging were usually not “billable” services.  But economic barriers were only part of the problem.  Even after video visits were permitted as “billable” substitutes for in-person visits, they accounted for only 5 to 10 percent of mental health visits for either psychotherapy or medication management.  Only a few of our members seemed to prefer video visits, and our clinicians didn’t seem to be promoting them.

Then the COVID-19 pandemic changed everything almost overnight.  At Kaiser Permanente Washington, virtual visits accounted for fewer than 10% of all mental health visits during the week of March 9th.  That increased to nearly 60% the following week and over 95% the week after that.  I’ve certainly never seen anything change that quickly in mental health care.  Discussing this with our colleague Rinad Beidas from U Penn, I asked if the implementation science literature recognizes an implementation strategy called “My Hair is on Fire!”  Whatever we call it, it certainly worked in this case.

My anecdotal experience was that the benefits of telehealth or virtual care included the expected and the unexpected.  As expected, even my patients who had been reluctant to schedule video visits enjoyed the convenience of avoiding a trip to our clinic through Seattle traffic (or the way Seattle traffic used to be).  Also as expected, there were no barriers to serving patients in Eastern Washington where psychiatric care is scarce to nonexistent.  It was as if the Cascade Mountains had disappeared.  One unexpected bonus was meeting some of the beloved pets I’d only heard about during in-person visits.  And I could sometimes see the signs of those positive activities we mental health clinicians try to encourage, like guitars and artwork hanging on the walls.  It’s good to ask, “Have you been making any art?”.  But it’s even better to ask, “Could you show me some of that art you’ve been making?”

As is often the case in health care, the possible negative consequences of this transformation showed up more in data than in anecdotes.  My clinic hours still seemed busy, but the data showed that our overall number of mental health visits had definitely decreased.  Virtual visits had not replaced all of the in-person visits.  That’s concerning, since we certainly don’t expect that the COVID-19 pandemic has decreased the need for mental health care.  So we’re now digging deeper into those data to understand who might be left behind in the sudden transition to virtual care.  Some of our questions:  Is the decrease in overall visits greater in some racial or ethnic groups?  Are we now seeing fewer people with less severe symptoms or problems – or are the people with more severe problems more likely to be left behind?

As our research group was starting to investigate these questions, I got a message from one of our clinical leaders asking about a new use for our suicide risk prediction models.  They were also thinking about people with the greatest need being left behind.  And they were thinking about remedies.  Before the COVID-19 outbreak, we were testing use of suicide risk prediction scores in our clinics, prompting our clinicians to assess and address risk of self-harm during visits.  But some people at high risk might not be making visits, even telephone or video visits, during these pandemic times.  So our clinical leaders proposed reaching out to our members at highest risk rather than waiting for them to appear (virtually) in our clinics. 

Collaborating with our clinical leaders on this new outreach program prompted me to look again at our series of studies on telehealth and virtual care.  Those studies weren’t about simply offering telehealth as an option.  Persistent outreach was a central element of each of those programs. Telehealth is about more than avoiding Seattle traffic or Coronavirus infection.  Done right, telehealth and virtual care enable a fundamental shift from passive response to active outreach.

Outreach is especially important in these chaotic times.  During March and April, video and telephone visits were an urgent work-around for doing the same work we’ve always done.  Now in May, we’re designing and implementing the new work that these times call for.

Greg Simon

Pragmatic Trials for Common Clinical Questions: Now More than Ever

Adrian Hernandez, Rich Platt, and I recently published a Perspective in the New England Journal of Medicine about the pressing need for pragmatic clinical trials to answer common clinical questions.  We started writing that piece last summer, long before any hint of the COVID-19 pandemic.  But the need for high-quality evidence to address common clinical decisions is now more urgent than we could have imagined.

Leaving aside heated debates regarding the effectiveness of hydroxychloroquine or azithromycin, we can point to other practical questions regarding use of common treatments by people at risk for COVID-19.  Laboratory studies suggest that ibuprofen could increase virus binding sites.  Should we avoid ibuprofen and recommend acetaminophen for anyone with fever and respiratory symptoms?   Acetaminophen toxicity is not benign.  Laboratory studies suggest that ACE inhibitors, among the most common medications for hypertension, could also increase virus binding sites.  Should we recommend against ACE inhibitors for the 20% of older Americans now using them daily?  Stopping or changing medication for hypertension certainly has risks.  Laboratory studies can raise those questions, but we need clinical trials to answer them with any certainty.

Pragmatic or real-world clinical trials are usually the best method for answering those practical questions.  Pragmatic trials are embedded in everyday practice, involve typical patients and typical clinicians, and study the way treatments work under real-world conditions.  Compared to traditional clinical trials, done in specialized research centers under highly controlled conditions, pragmatic trials are both more efficient (we can get answers faster and cheaper) and more generalizable (the answers apply to real-world practice). 

Pragmatic trials are especially helpful when alternative interventions could have different balances of benefits and risks for different people – like ibuprofen and acetaminophen for reducing fever.  Laboratory studies – or even highly controlled traditional clinical trials – can’t sort out how that balance plays out in the real world.

In our perspective piece, we pointed out financial, regulatory, and logistical barriers to faster and more efficient pragmatic clinical trials.  But the most important barrier is cultural.  It’s unsettling to acknowledge our lack of evidence to guide common and consequential clinical decisions.  Clinicians want to inspire hope and confidence.  Patients and families making everyday decisions about healthcare might be dismayed to learn that we lack clear answers to important clinical questions.  We all must do the best we can with whatever evidence we have, but we should certainly not be satisfied with current knowledge.  If we hope to activate our entire health care system to generate better evidence, we’ll probably need to provoke more discomfort with the quality of evidence we have now.

Inadequate evidence can also lead to endless conflict.  My colleague Michael Von Korff used to use the term “German Argument” to describe people preferring to argue about a question when the answer is readily available to those willing to look.  Michael was fully entitled to use that expression, since his last name starts with “Von.”  Germany, however, now stands out for success in mitigating the impact of the COVID-19 pandemic.  Even if Michael’s ethnic joke no longer applies, the practice of arguing rather than examining evidence is widespread.  The best way to end those arguments is to say “I really don’t know the answer.  How could we find out as quickly as possible?” 

Greg Simon

“H1-H0, H1-H0” is a song we seldom hear in the real world

Arne Beck and I were recently revising the description of one of our Mental Health Research Network projects. We really tried to use the traditional scientific format, specifying H1 (our hypothesis) and H0 (the null hypothesis). But our research just didn’t fit into that mold. We eventually gave up and just used a plain-language description of our question: For women at risk for relapse of depression in pregnancy, how do the benefits and costs of a peer coaching program compare to those of coaching from traditional clinicians? 


Our real-world research often doesn’t fit into that H1-H0 format. We aim to answer practical questions of interest to patients, clinicians, and health systems. Those practical questions typically involve estimation (How much?), classification (For whom?), comparison (How much more or less?) or interaction (How much more or less for whom?). None of those are yes/no questions. While we certainly care whether any patterns or differences we observe might be due to chance, a p-value for rejecting a null hypothesis does not answer the practical questions we hope to address.


When we talk about our research with patients, clinicians, and health system leaders, we never hear questions expressed in terms of H1 or H0. I imagine starting a presentation to health system leaders with a slide showing H1 and H0—and then hearing them all break into that song from the Walt Disney classic Snow White. Out in the real world of patients, clinicians, and health system leaders, “H1-H0” most likely stands for that catchy tune sung by the Seven Dwarfs.

As someone who tries to do practical research, my dissatisfaction with the H1-H0 format is about more than just language or appearance. There are concrete reasons why that orientation doesn’t fit with the research our network does.

Pragmatic or real-world research often involves multiple outcomes and competing priorities.  For example, we hope our Suicide Prevention Outreach Trial will show that vigorous outreach reduces risk of suicidal behavior. But we already know some people will be upset or offended by our outreach. Our task is to accurately estimate and report both the beneficial effects on risk of suicidal behavior and the proportion of people who object or feel harmed.  We may have opinions about the relative importance of those competing effects, but different people will value those outcomes differently. It’s not a yes/no question with a single answer. Our research aims to answer questions about “How much?”. Each user of our research has to consider “How important to me or the people I serve?”   

The H1-H0 approach becomes even less helpful as sample sizes grow very large.  For example, our work on prediction of suicide risk used records data for millions of patients. With a sample that large, we can resoundingly reject hundreds of null hypotheses while learning nothing that’s useful to patients or clinicians. In fact, our analytic methods are designed to “shrink” or suppress most of those “statistically significant” findings. We hope to create a useful tool that can reliably and accurately identify people at highest risk of self-harm. For that task, too many “significant” p-values are really just a distraction.
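
A back-of-the-envelope example (numbers made up) shows how easily that happens: with a million records per group, a difference of one hundredth of a standard deviation, far too small to matter to any patient or clinician, still earns a spectacular p-value.

```python
# Made-up numbers, just to show the arithmetic: huge samples make trivial
# differences "statistically significant."
from scipy import stats

n_per_group = 1_000_000
mean_difference = 0.01                        # 1% of a standard deviation
standard_error = (2 / n_per_group) ** 0.5     # SE of a difference in means, SD = 1
z = mean_difference / standard_error
p = 2 * stats.norm.sf(z)
print(f"z = {z:.2f}, two-sided p = {p:.1e}")  # about z = 7.1, p = 1.5e-12
```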

I’m very fond of how the Patient-Centered Outcomes Research Institute describes the central questions of patient-centered research: What are my options? What are the potential benefits and harms of those options? Those patient-centered questions focus on choices that are actually available and the range of outcomes (both positive and negative) that people care about. I think those are the questions our research should aim to answer, and they can’t be reduced to H1-H0 terms.

I am certainly not arguing that real-world research should be less rigorous than traditional hypothesis-testing or “explanatory” research. Pragmatic or real-world researchers are absolutely obligated to clearly specify research questions, outcome measures, and analytic methods—and declare or register those things before the research starts. Presentations of results should stick to what was registered, and include clear presentations of confidence limits or precision. In my book, that’s actually more rigorous than just reporting a p-value for rejecting H0. There’s no conflict between being rigorous and being practical or relevant.

To be fair, I must admit that the Seven Dwarfs song is more real-world than I gave it credit for. We often misremember the lyrics “Hi-Ho, Hi-Ho, it’s off to work we go.” But the Seven Dwarfs were actually singing “Hi-Ho, Hi-Ho, it’s home from work we go.” That’s the sort of person-centered song we might actually hear in the real world.

Greg Simon

Read Marsha Linehan’s Book!

If the Mental Health Research Network had a book club, we’d start with Marsha Linehan’s memoir, Building a Life Worth Living.

Marsha is the creator of Dialectical Behavior Therapy or DBT, a treatment approach once seen as heretical that’s now the standard of care for people at risk of self-harm or suicide. Her memoir is much more than an academic autobiography, describing her groundbreaking research at the University of Washington. It is her personal story of recovery and spiritual evolution. The history of DBT doesn’t begin with a post-doctoral fellowship or a research grant for a pilot study. Instead, it begins with Marsha’s descent into severe depression and relentless suicidal ideation, leading to a two-year inpatient psychiatric stay as “one of the most disturbed patients in the hospital.”

Marsha’s remarkable story pushed me to think about all sorts of questions regarding mental health research and mental health treatment: Why does our clinical research often focus on reconfirming the modest benefits of treatments that are so disappointing? How could the scientific peer review system welcome true innovation rather than comfortable confirmation? Do mental health clinicians’ traditional “boundaries” really serve to protect patients from harm – or more to protect clinicians from upset or inconvenience? How central are spirituality and religion to recovery from mental illness? And – where would we be today if Marsha Linehan had chosen a traditional religious order over psychology graduate school?

For me, the book’s most valuable lesson was understanding the dialectical center of DBT. Dialectical thinking – holding two seemingly opposite ideas at the same time – is central to Marsha’s treatment approach and her life story. Following the first rule of good writing (“Show, don’t tell”), Marsha generously describes her own intellectual and spiritual journey to dialectically embracing the tension between radical acceptance and hunger for change. Her message to her clients is “I accept you as you are; how you feel and what you do make perfect sense.” And her message is also “There is a more effective way to be who you are and feel what you feel. Wouldn’t you like to learn about it?” Both are completely true, at the very same time. The tension between acceptance and change is not a problem to be solved but a creative space to inhabit – or even dance inside of. Marsha also has some important things to say about dancing!

Marsha reveals her dialectical approach most clearly in describing her own mental health treatment. She endured over-medication, forcible restraint, and weeks spent in a seclusion room. Her descriptions of those traumas are vivid, but they include not the slightest tinge of blame or resentment. Instead, she gracefully expresses gratitude and compassion for her caregivers, knowing they were doing the best they could with the knowledge and skills they had at the time. That is truly radical acceptance. At the same time, Marsha was passionate for change. She vowed that “I would get myself out of hell – and that once I did, I would find a way to get others out of hell, too.” It seems that the mental health care system is Marsha’s last client. And I think she is practicing a little DBT on mental health clinicians like me – compassionately accepting all of our failings and flailings while showing us a better way.

Greg Simon

Let’s not join the Chickens**t Club!

Long before he became famous or infamous (depending on your politics) as FBI Director, James Comey served as US Attorney for the Southern District of New York.  That’s the office responsible for prosecuting most major financial crimes in the US.  Jesse Eisinger’s book, The Chickens**t Club, recounts a speech Comey made to his staff after assuming that high-profile post.  He asked which of his prosecutors had never lost a case at trial, and many proudly raised their hands.  Comey then said, “You are all members of what we like to call The Chickens**t Club.”  By that he meant:  You are too timid to take a case to trial unless you already know you will win.

I worry that our clinical trials too often follow the same pattern as those white-collar criminal trials.  When we evaluate new treatments or programs, we may only pursue the trials likely to give the answer we hope for.  That might mean testing an intervention only slightly different from one already proven effective.  Or testing a treatment in an environment where it’s almost certain to succeed.  Our wishes come through in our language.  Trials showing that a new treatment is superior are “positive”, while trials finding no advantage for a new treatment or program are “negative.”

Our preference for trials with “positive” results reflects how researchers are rewarded or reinforced.  We know that positive trials are more likely to be published than so-called negative trials.  Grant review panels prefer positive research that yields good news and gives the appearance of continuous progress.

Looking back on my career, I can see some trials that might have put me in The Chickens**t Club.  For example, we probably didn’t need to do several clinical trials of collaborative care for depression in one health system before taking that idea to scale.

But here’s where the analogy between investigators and prosecutors does not apply:  A prosecutor in a criminal trial is supposed to take sides.  An investigator in a clinical trial shouldn’t have a strong preference for one result or the other.  In fact, clinical investigators have the greatest obligation to pursue trials when the outcome is uncertain.  “Equipoise” is the term to describe that balanced position.

Greg Simon

From 17 years to 17 months (or maybe 14)

Most papers and presentations about improving the quality of health care begin by lamenting the apocryphal 17-year delay from research to implementation.   The original evidence for that much-repeated 17-year statistic is pretty thin: a single table published in a relatively obscure journal.  But the 17-year statistic is frequently cited because it quantifies a widely recognized problem.  Long delays in the implementation of research evidence are the norm throughout health care.

Given that history, I’ve been especially proud of our MHRN health systems for their rapid and energetic implementation of suicide prevention programs.  This month, Kaiser Permanente Washington started implementing MHRN-developed suicide risk prediction models.  Therapists and psychiatrists at our Capitol Hill Mental Health clinic in Seattle now see pre-visit alerts in the electronic health record for patients who are at high risk of a suicide attempt over the next 90 days.  Implementation started only 17 months – not 17 years – after we published evidence that those models accurately predict short-term risk of suicide attempt or suicide death.

I certainly have experience developing other evidence-based interventions that are still waiting for wide implementation – some for longer than 17 years.  For those interventions, our published papers, PowerPoint presentations, and impressive p-values never prompted much action.  Our experience with suicide risk prediction models has been almost the opposite: health system leaders are pushing for faster implementation.  For months we’ve received frequent queries from regional and national Kaiser Permanente leaders: “When are you rolling out those risk prediction scores in our clinics?” 

In the middle of celebrating Kaiser Permanente’s rapid implementation of risk prediction models, I learned that our colleagues at HealthPartners were three months ahead of us.  I hadn’t realized that health plan care managers at HealthPartners have been using MHRN risk prediction scores to identify people at high risk for suicidal behavior since July.  In true Minnesota fashion, they just did the right thing with no boasting or fanfare.  Reducing the publication-to-implementation gap from 17 years to 14 months doesn’t have quite the same poetic symmetry as reducing it to 17 months.  But I’ll take faster progress over poetic symmetry.

Why did we see so little delay in our health systems’ implementation of suicide risk prediction models and other suicide prevention programs?  We didn’t need to do any marketing to create demand for better suicide prevention.  Health system leaders were clear about priorities and problems needing solutions.  The specific demand regarding risk prediction models was clear:  The questionnaires providers were using to identify suicide risk were better than nothing, but far short of satisfactory.  Knowing the limitations of the tools they were using, clinicians and health system leaders asked:  Can you build us something better?  We weren’t telling health system leaders what we thought they needed.  We were trying to build what they told us they needed.

When I now hear researchers or intervention developers lament the 17-year delay, I ask myself how that complaint might sound in some other industry.  I imagine giving this report to a board of directors: “We developed and tested an innovative product that our customers really should want.  We’ve spent 17 years telling them about it, but our market share is still trivial.  What’s wrong with our customers?”  I doubt the board of directors would say, “Let’s keep doing the same thing!”  Instead, they’d probably ask, “What are our customers trying to tell us about what they need?”

Greg Simon

Outreach is Meant for the People Left Outside!

Several years ago, Evette Ludman and I undertook a focus group study to learn about early dropout from psychotherapy.  We invited health system members who had attended a first therapy visit for depression and then did not return.  Only about one-third of people we invited agreed to speak with us, but that’s a pretty good success rate for focus group recruitment.

We soon learned, however, that the one-third of people who joined our focus group were not the people we needed to hear from.  Many were veterans of long-term psychotherapy who had returned for a single “refresher” visit.  Some had been seeing therapists in private practice (using other insurance) and scheduled a visit with a new therapist to explore other options.  None were people starting treatment for depression who gave up after a single visit.  Our first focus group turned into a hypothetical discussion of why some other people might give up on therapy after just one visit. 

In retrospect, we should have realized that we wouldn’t learn much about giving up on psychotherapy from people who volunteer to join a focus group about psychotherapy.  People living with depression who are frustrated or discouraged about treatment don’t tend to become motivated research volunteers.  We probably should have published something about that experience, but I’m still waiting for someone to establish The Journal of Instructive Failure.

That instructive failure, however, did shape our subsequent research about outreach to increase  engagement in mental health treatment.  Outreach and engagement interventions have been a major focus of our research, but we don’t study engagement interventions among people who are already engaged.  We aim to reach people who are disconnected, discouraged, and convinced that treatment has nothing to offer.  Volunteering to participate in research to increase engagement in treatment should probably make someone ineligible for research on that topic.  For example:  If we hope to learn whether outreach to people at risk can reduce suicide attempts, we certainly shouldn’t limit our research to people who volunteer for a study of outreach to prevent suicide attempt.

If we hope to find those who have been lost, we’ll have to look outside of the bright light under the lamppost.  So our studies of outreach or engagement interventions follow a “randomized encouragement” design.  We identify people who appear to need services but are not receiving them.  We randomly assign some people to receive extra outreach, such as messages and phone calls to offer support and problem-solve barriers to getting mental health care.  The rest continue to receive their usual care. 

That real-world research design answers the question we care about:  Among people who appear to have unmet need, will implementing an outreach intervention increase engagement in treatment – and ultimately lead to better outcomes?  That’s the design of our MHRN Suicide Prevention Outreach Trial, testing two outreach interventions for people at risk of suicidal behavior.  And it’s the design of our MHRN pilot study of Automated Outreach to Prevent Depression Treatment Dropout, testing systematic outreach to people who appear to have discontinued medication or dropped out of psychotherapy.

That real-world randomized encouragement design does impose some requirements, but I think they are features rather than bugs.  First, we must be able to identify people with unmet need before they ask us for help.  That’s been a central focus of our MHRN research, including our recent research on predicting suicidal behavior.  Second, we must be able to use health system records to assess any impact or benefit.  Relying on traditional research interviews or surveys would take us back to the problem of assessing outreach or engagement among people who volunteer to participate in research interviews or surveys.  Third, any benefit of an outreach or engagement intervention is diluted by absence of benefit in those who do not participate.  But that diluted effect is the true effect, if what we care about is the real-world effect of an outreach program.
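
A tiny worked example (all numbers hypothetical) makes the dilution arithmetic concrete: if 40 percent of the outreach arm actually engages, and engagement cuts the event rate by 30 percent among those who engage, the program-level effect a health system would actually see is a 12 percent reduction.

```python
# Hypothetical numbers only -- the point is the arithmetic of dilution, not the
# estimates themselves.
baseline_rate = 0.05        # assumed event rate under usual care
uptake = 0.40               # assumed share of the outreach arm that engages
effect_if_engaged = 0.30    # assumed relative reduction among people who engage

outreach_arm_rate = (uptake * baseline_rate * (1 - effect_if_engaged)
                     + (1 - uptake) * baseline_rate)
diluted_effect = 1 - outreach_arm_rate / baseline_rate
print(f"real-world (intention-to-treat) relative reduction: {diluted_effect:.0%}")
```

That 12 percent is smaller than the 30 percent, but it is the number that describes what the program actually does for the population it is offered to.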

Outreach interventions might seem to work well right under the lamppost, but that’s not where people get lost or left out.

Greg Simon