Thursday, October 10, 2013

Do not use unstructured interviews!!!

Warning to anyone who uses unstructured interviews!!!!

I just read a paper with a strong warning against using unstructured interviews.  They tested the method as a screening tool rather than as a formative testing tool, but I think the conclusions and insights apply across the board. 

The paper is by Jason Dana at Yale and Robyn Dawes (recently deceased) and Nathaniel Peterson at Carnegie Mellon and is published in the most recent issue of Judgment and Decision Making.  They reviewed the use of unstructured interviews in domains like job hiring, clinical diagnosis (doctor interviewing patients), and university admissions.  Their data collection looked at students making predictions of other students’ future GPA based on just biographical information or with biographical information plus an interview. 

In each of these situations, the authors report that the interviews degraded predictive performance.  Doctors were less able to diagnose a patient when they combined an interview of the patient with the medical record compared to the medical record alone.  Hiring and admissions decisions predicted future success better when they were based on the application/resume alone than when an interview was added.

The reasons they found are actually not surprising. One real problem with unstructured interviews is that we ask different questions of each candidate/patient/applicant.  This means we are comparing apples to oranges when putting two candidates against each other.  A second major problem is confirmation bias.  For candidates where we have an initial positive impression, we ask questions where we know the answer will be positive or avoid questions that could have a negative answer.  For those where our initial impression is negative we do the reverse.  A third problem is that we think the answers to our questions are more predictive of future success than they really are.  We ask about things that really don’t matter and give points to the preferred candidate and deduct them from the others.

What is troubling about these results is that we have a false sense of confidence.  Even though unstructured interviews are completely unpredictive and even degrade prediction in most cases, we think they are helpful so we put real value in them.  We feel more confident in our selections. 

The most concerning condition, although one that probably shouldn’t surprise us, is one where they explicitly told participants that the interviewee was just repeating random answers that they were instructed to give.  The participant knew in advance that the interview was garbage.  And yet still they used the results of the interview to make their decision, had greater confidence in that decision, and even reported that the interview was helpful. 

How could the interview be helpful if the candidate was spouting pre-arranged answers?  Motivated reasoning rearing its ugly head.  As we have seen before, sports fans are more likely to bet on their favorite team, even after being told that the odds-maker significantly biased the betting line against their team.  When told that debaters were assigned to present the case opposite to what they really believe, observers still report that they think the debater believes what he is saying – in direct contradiction to what they were just told.  Our brains don’t seem to have an “ignore and forget” function.  Even when we know information is false, we can’t prevent ourselves from using it when making subsequent decisions. 

Do you find this as scary as I do?

Dispositional attitude and user testing

Here is another great reason to read journals outside your main focus area.  I just read an article from the Journal of Personality and Social Psychology that has great implications for human factors testing.  First let me tell you a little about the paper and then I will bring it home to HF.

There are some things that just about everyone likes (maybe chocolate, love, or the smell of freshly cut grass).  There are also some things that just about everyone dislikes (poverty, cruelty, missing the last piece of a 5000 piece jigsaw puzzle).  Then there are things that some people like and others dislike (smoking, roller coasters, sausage pizza).  But here is a different kind of question: are there some people, if you average across the board of everything, who come out as generally positive about things?  And other people, if you average across the board of everything, who come out as generally negative about things?  And is this a reliable, valid, useful metric?

That is what a paper by Justin Hepler at the University of Illinois and Dolores Albarracin at UPenn wanted to know.  And psychometrically, this is a really solid paper.  They took great pains to develop a metric and test its convergent validity, discriminant validity, predictive validity, and reliability.  Here is what I found to be the most useful takeaways for human factors.

First, yes: there is a general tendency towards dispositional attitude.  They started with 100 items and had people rate them on a 7-point Likert scale from extremely unfavorable to extremely favorable.  They included a wide variety of items, from “abortion on demand” to “mullets.”  Then they were able to winnow it down to a 16-item list (without the political hot buttons) that reliably predicts someone’s general dispositional attitude.  The final list includes some items that are generally considered strongly negative, mildly negative, mildly positive, and strongly positive (check out the paper for the full list).  Using this instrument, they were able to predict people’s dispositional tendency towards new items not on the list.  Not for a single item, because obviously even a positively disposed person might still hate paying taxes or eating sausage pizza.  And a negatively disposed person might still like chocolate or World Cup soccer.  But on average it can predict whether you are a fuddy duddy or a Pollyanna. 

Then they tested the convergent and discriminant validity.  It turns out that this dispositional attitude is correlated with other scales like openness, extraversion, and self-esteem.  It is negatively correlated with neuroticism, behavioral inhibition, and prevention-focus.  It is uncorrelated with conscientiousness and imagination.  So we can start to see why this is an important finding in personality psychology.

But here is how it can be used in HF.  We often test systems by asking users whether they like a design or a system, often on a similar 7-point Likert scale.  If they rate it above neutral (say a 5 out of 7) then we assume they like it and if they rate it below neutral (say a 3 out of 7) then we assume they dislike it.  But this kind of measure can be made even more accurate (sensitive, precise, and valid) if we spend two minutes getting a customized baseline for each person using the dispositional attitude scale.  If a person who has a dispositional baseline of 2 rates your design as a 3, that is actually positive.  And if a person who has a dispositional baseline of 6 rates your design as a 5, that is actually negative.  And perhaps more importantly, a rating of 2 from a person with a dispositional baseline of 6 indicates a serious dislike that you might not realize otherwise.  So instead of recording only the raw Likert scale ratings of your system, you should record the difference between each person's rating of your system and their dispositional baseline.  It only adds 2 minutes to your study but it can make your results much more useful.
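
To make the arithmetic concrete, here is a minimal Python sketch of that adjustment. It is my own illustration, not the authors' scoring procedure: I am assuming the baseline is simply the mean of a participant's ratings on the baseline items, expressed on the same 1-to-7 scale as the design rating, and the function names (dispositional_baseline, adjusted_rating) are hypothetical.

```python
from statistics import mean


def dispositional_baseline(item_ratings):
    """Average a participant's 1-7 ratings on the baseline items.

    Assumption: the baseline is the plain mean of the item ratings.
    The paper may score the scale differently.
    """
    return mean(item_ratings)


def adjusted_rating(design_rating, baseline):
    """Design rating relative to the participant's own baseline.

    Positive means they like the design more than they like things
    in general; negative means they like it less.
    """
    return design_rating - baseline


# The three cases from the paragraph above, as (baseline, design rating).
cases = [(2, 3), (6, 5), (6, 2)]
for baseline, rating in cases:
    diff = adjusted_rating(rating, baseline)
    print(f"baseline={baseline} rating={rating} adjusted={diff:+}")
```

With this adjustment, the 3 from the baseline-2 participant comes out positive, the 5 from the baseline-6 participant comes out negative, and the 2 from the baseline-6 participant stands out as the strongest dislike of the three.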