MHRN Blog World Cup Edition: What Soccer Referees Know about Causal Inference

When Nico Lodeiro falls down in the penalty area, I hold my breath waiting for the referee’s call.  Was it really a foul, or just Nico simulating one?  The stakes are high.  If the referee calls it a foul, it’s a penalty kick and a likely goal for my Seattle Sounders.  If she calls it a simulation or a dive, then it’s a yellow card warning for Nico.  After the call, we all watch the slow-motion replay.  I used to be surprised at how often the refs got it right, until a referee friend of mine explained what they are looking for.

It goes back to Newton’s first law from high school physics: a body in motion will remain in motion unless acted on by an external force.  If Nico is actually shoved, then his speed or direction changes.  If instead he just throws himself to the ground, then his overall momentum doesn’t change.  If he throws his shoulders in one direction to simulate a foul, then his arms or his hips have to move in the other direction.  With no external force, his center of gravity continues on the same path.  That’s how the experienced referee distinguishes a foul from a dive.
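
For the physics-minded reader, the referee’s cue is just conservation of momentum: with no external force there is no impulse, so the velocity of the center of mass cannot change,

\[
\vec{F}_{\text{ext}} = 0 \;\Longrightarrow\; \frac{d\vec{p}}{dt} = 0 \;\Longrightarrow\; \vec{v}_{\text{cm}} = \text{constant},
\]

while a genuine shove delivers an impulse \(\Delta\vec{p} = \int \vec{F}_{\text{ext}}\,dt \neq 0\) that visibly bends the path of the center of mass.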

When we use health records data to infer causation (e.g., did a specific treatment make things better or worse?), we also depend on Newton’s first law.  We believe that an individual’s mental health path will maintain the same momentum unless it is acted on by some external force.  That force could be a positive or negative effect of treatment, or a positive or negative life event.  We try to predict an individual’s mental health path or trajectory so we can detect a change in speed or direction.
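
For readers who like to see the idea in code, here is a minimal, purely illustrative sketch in Python (the weekly symptom scores and the two-standard-deviation threshold are made up for this example and are not our published methods): fit a straight-line trend to the pre-treatment observations, project it forward, and flag post-treatment observations that stray from the projected path.

```python
# Minimal sketch of "detecting a change in speed or direction":
# fit a linear trend to pre-treatment scores, project it forward,
# and flag post-treatment observations that leave the projected path.
# All numbers are hypothetical; the 2-SD rule is an illustrative choice.
import numpy as np

pre_weeks = np.arange(8)                                  # weeks before treatment
pre_scores = np.array([18, 17, 17, 16, 16, 15, 15, 14])   # hypothetical symptom scores
post_weeks = np.arange(8, 12)                             # weeks after treatment starts
post_scores = np.array([13, 11, 9, 8])                    # hypothetical post-treatment scores

# "Momentum": the straight-line path the scores were already on.
slope, intercept = np.polyfit(pre_weeks, pre_scores, 1)
predicted = slope * post_weeks + intercept

# "External force": observed scores that leave the predicted path
# by more than the pre-treatment scatter would allow.
residual_sd = np.std(pre_scores - (slope * pre_weeks + intercept), ddof=2)
flagged = np.abs(post_scores - predicted) > 2 * residual_sd

for week, obs, pred, flag in zip(post_weeks, post_scores, predicted, flagged):
    note = "possible external force" if flag else "within the expected path"
    print(f"week {week}: observed {obs}, predicted {pred:.1f} ({note})")
```

Real analyses are far more elaborate, but the logic is the same: the projected path plays the role of Nico’s momentum, and a large, consistent deviation is the hint of an external force.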

But mental health is more complicated than high school physics.  Any individual’s center of gravity is harder to pin down.  Our measures of mental health speed and direction are much less accurate.  Mental health paths are often not straight lines.  And many forces often act at the same time.  So we would be foolish to infer causation from a single case.  For any individual, our calls about causal inference will never be as accurate as a good soccer referee’s.  But we hope to make better calls by improving the accuracy of our measurement and by aggregating data across many, many cases.

We’ve been surprised by some of our recent efforts to predict mental health paths or trajectories.  Risk of suicide attempt or suicide death after a mental health visit is surprisingly predictable.  So we expect to be able to detect the effects of external forces, like positive or negative effects of treatments, on risk of suicide attempt.  In contrast, the probability of significant improvement after starting psychotherapy for depression is surprisingly unpredictable.  If we hope to detect the effects of more or less effective psychotherapy, we’ll first have to make better predictions.  Going back to the soccer story, we would need a clearer view of exactly how Nico was moving before he fell.

Greg Simon

Suicide Risk Prediction Models: I Like the Warning Light, but I’ll Keep My Hands on the Wheel

Our recent paper about predicting risk of suicidal behavior following mental health visits prompted questions from clinicians and health system leaders about the practical utility of risk predictions.  Our clinical partners asked, “Are machine learning algorithms accurate enough to replace clinicians’ judgment?”  My answer was, “No, but they are accurate enough to direct clinicians’ attention.”

Our 12-year-old Subaru has been passed down to my son, and we bought a 2018 model.  The new car has a “blind spot” warning system built in.  When it senses another vehicle in my likely blind spot, a warning light appears on the outside rearview mirror.  If I start to merge in that direction anyway, the light starts blinking and a warning bell chimes.  I like this feature a lot.  Even if I already know about the other car behind and to my right, the warning light isn’t too annoying or distracting.  The warning light may go on when there isn’t a car in my blind spot (a false positive).  Or it may not light up when there is a car in that dangerous area (a false negative).  But the warning system doesn’t fall asleep or get distracted by something up ahead.  It keeps its “eye” on one thing, all the time.

I hope that machine learning-based suicide risk prediction scores can work the same way.  A high-risk score would alert me and my clinical colleagues to a potentially unsafe situation that we might not have noticed.  If we’re already aware of the risk, the extra notice doesn’t hurt.  If we’re not aware of the risk, then we’ll be prompted to pay attention and ask some additional questions.  There will be false positives and false negatives.  But risk prediction scores don’t get distracted or forget relevant facts.

We are clear that suicide risk prediction models are not the mental health equivalent of a driverless car.  We anticipate that alerts or warnings based on calculated risk scores would always be delivered to actual humans.  Those humans might include receptionists who would notice when a high-risk patient cancels an appointment, nurses who would call when a high-risk patient fails to attend a scheduled visit, or treating clinicians who would be alerted before a visit that a more detailed clinical assessment is probably indicated.  A calculated risk score alone would not be enough to prompt any “automated” clinical action – especially a clinical action that might be intrusive or coercive.  I hope and expect that our risk prediction methods will improve.  But I am skeptical they will ever replace a human driver.
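
To make that concrete, here is a toy sketch of the intended division of labor (the threshold, roles, and messages are invented for illustration and are not code from any MHRN system): a calculated risk score can only put a message in front of a person; it cannot trigger a clinical action on its own.

```python
# Toy sketch of "warning light, hands on the wheel": a calculated risk
# score only routes a message to a human; it never triggers a clinical
# action by itself.  The threshold, roles, and wording are invented.
from dataclasses import dataclass
from typing import Optional

HIGH_RISK_THRESHOLD = 0.95   # assumed cutoff, e.g. top 5% of predicted risk


@dataclass
class Alert:
    patient_id: str
    risk_percentile: float
    recipient_role: str       # receptionist, nurse, or treating clinician
    message: str


def route_alert(patient_id: str, risk_percentile: float,
                visit_status: str) -> Optional[Alert]:
    """Turn a high risk score into a message for a human reviewer."""
    if risk_percentile < HIGH_RISK_THRESHOLD:
        return None                      # no alert; usual care continues
    if visit_status == "cancelled":
        return Alert(patient_id, risk_percentile, "receptionist",
                     "High-risk patient cancelled; please offer to reschedule.")
    if visit_status == "no_show":
        return Alert(patient_id, risk_percentile, "nurse",
                     "High-risk patient missed a visit; please call to check in.")
    return Alert(patient_id, risk_percentile, "treating clinician",
                 "Risk score is high; consider a more detailed assessment.")


# The recipient, not the model, decides what (if anything) happens next.
print(route_alert("example-patient", 0.97, "no_show"))
```

What the sketch leaves out is the point: there is no branch that orders, restricts, or intervenes without a person in the loop.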

My new car’s blind spot warning system does not have control of the steering wheel, the accelerator, or the brake pedal.  If the warning is a false positive, I can choose to ignore or override it.  But that blinking light and warning chime will get my attention.  If that alarm goes off, I’ll take a close look behind me before deciding that it’s actually safe to change lanes.

Greg Simon