Tuesday, March 08, 2011

"I Opt" Style Reliability Stress Test

By: Gary J. Salton, Ph.D.
Chief: Research & Development
Professional Communications, Inc.

A natural experiment offered an opportunity to test “I Opt” reliability. It used a “worst case possible” design. The experiment was biased AGAINST "I Opt" reliability. The outcome was compared to industry standards. The standards accepted were the most favorable reported by those with a vested interest in these traditional tools. The worst possible outcomes of “I Opt” were compared to the best reported reliability results of alternative tools. This created a natural “stress” test.

The study found that the worst “I Opt” results exceeded the best results of alternatives. These results give the practitioner and scholar confidence that “I Opt” is a tool that can be relied upon even in difficult field situations. You can access a video summarizing the research by clicking the icon to the right.

A program to increase the visibility of “I Opt”® technology created a natural experiment. Random people were offered a free Advanced Leader, Career or Emotional Impact Management Report. They could take the “I Opt Survey” on-line without user codes or passwords. The use was anonymous (i.e., fake names were an option). The report was automatically generated and sent to any email address designated.

An accompanying email invited people to use it as they wished. They could retake the survey without penalty. Table 1 outlines common reasons for retest.

Table 1

The reasons cited in Table 1 involve TRYING to change the original outcome. Most retest protocols seek to eliminate this possibility. They try to ensure that motivations and conditions are constant between test and retest. This creates a bias toward consistency (i.e., reliability). This experiment does just the opposite. It burdens “I Opt” with a bias toward inconsistency (i.e., unreliability).

Thus the structure of the experiment acts as a stress test. It measures “I Opt” reliability under "worst case" conditions. Passing this stress test offers strong evidence of the inherent reliability of “I Opt” technology.

Any test requires a standard of judgment. The natural standard would be the results of the reliability studies of comparable tests. The most stringent would be results published by those with a vested interest in the success of these tests. Accepting this standard takes the “stress” of the stress test up to another level.

The Center for the Application of Psychological Type (2011) reports that using MBTI® “on retest, people come out with three to four type preferences the same 75-90% of the time.” That means that the best that can be expected under controlled conditions is 90%. In addition, this result counts a retest as consistent when only three of the four letters in a type (e.g., ESTJ, INFP) stay the same. An ESTJ could be retested as an ISTJ and still qualify as a successful retake, because 3 of the 4 preferences stayed the same. A practitioner who had to explain to a client why their dominant style changed might view this as less than a “success.”

The Consulting Psychologists Press, the publisher of MBTI®, does not cite reliability data on its website. However, a book published by the organization does cite results (Harvey, 1996). About 50% of people tested within nine months remain the same overall type, and 36% remain the same type after more than nine months (Wikipedia, 2011). Averaging the MBTI results gives an overall standard of about 63% (i.e., the average of 75%, 90%, 36% and 50%).

Internet research on DiSC® provided no simply described evidence on test-retest reliability. However, Inscape Publishing (the publisher of DiSC) does provide a table of correlation coefficients (Inscape Publishing, 2005). That table reports correlation coefficients of between .89 and .71 depending on the time between retests. The timing was ~1 week (n=142), 5-7 months (n=174) and 10-14 months (n=138).

A correlation coefficient measures the relationship between two sets of scores, not the scores themselves. To make it meaningful it has to be converted. Squaring the correlation coefficient (e.g., .89 x .89) does this. The result is called the Coefficient of Determination or r² (called "r squared"; see Biddle, p. 14). Applying this to the highest DiSC correlation yields an r² of 79%.

The same method can be applied to the lowest DiSC correlation reported. The corresponding r² would be 50% (i.e., .71 x .71). Averaging all of the correlation coefficients reported by Inscape Publishing (12 in total) yields an overall correlation coefficient of .763. Squaring that gives a meaningful Coefficient of Determination of about 58%. Overall, DiSC can be expected to retest differently about 42% of the time (100% - 58%).
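The conversion used above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the text: the function name `r_squared_percent` is a hypothetical helper, and the three inputs are the DiSC correlation values quoted above.

```python
def r_squared_percent(r: float) -> float:
    """Return the coefficient of determination (r squared) as a percentage."""
    return r * r * 100.0


# Correlation values quoted in the text for DiSC.
highest = r_squared_percent(0.89)   # highest reported correlation
lowest = r_squared_percent(0.71)    # lowest reported correlation
overall = r_squared_percent(0.763)  # average of the 12 reported coefficients

print(f"highest: {highest:.0f}%")
print(f"lowest: {lowest:.0f}%")
print(f"overall: {overall:.0f}%")
print(f"expected to retest differently: {100 - overall:.0f}%")
```

Rounded to whole percentages, these reproduce the 79%, 50%, 58% and 42% figures cited in the text.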

FIRO-B® is published by CPP, Inc. On its website (CPP, 2009), CPP reports test-retest reliability as “ranging from .71 to .85” for three different samples, as reported in the FIRO-B® Technical Guide (Hammer & Schnell, 2000). Using r² this translates into an expected test-retest success rate of between 50% and 72%. Averaging these numbers gives an overall retest consistency of 61%.

The Sixteen Personality Factor or 16PF® publishers (IPAT, Inc.) do not cite reliability statistics on their site. However, Cattell and Mead in the SAGE Handbook of Personality Theory and Assessment do quote statistics. These were taken from the 16PF Fifth Edition Technical Manual, published by the Institute for Personality and Ability Testing (the predecessor of IPAT). They report a test-retest reliability of .8 over a two-week interval and .7 over a two-month interval. This translates to r² percentages of 64% and 49%, for an average r² of about 57%.

There are many more personality tests of this character. However, Table 2 shows a developing pattern.

Table 2

All of the instruments appear to approximate 60% test-retest repeatability. Since these results were published by organizations with a vested interest, it is a very high standard. It is likely that these organizations have published the most favorable rates available.
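As a rough check on the ~60% figure, the per-instrument averages quoted in the text above can be averaged directly. This is a sketch only: the dictionary name `reported` and the equal weighting of the four instruments are assumptions, and Table 2 itself may present the figures differently.

```python
from statistics import mean

# Average repeatability per instrument, as percentages quoted in the text.
reported = {
    "MBTI": 63,
    "DiSC": 58,
    "FIRO-B": 61,
    "16PF": 57,
}

# Unweighted mean across the four instruments (an assumption of this sketch).
overall = mean(reported.values())
print(f"mean across instruments: {overall:.2f}%")
```

The unweighted mean comes out just under 60%, in line with the approximation used in the text.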

Participants accessed the free reports via an internet connection. The internet server used in this experiment recorded the timing, score and email address of users. Table 3 is an outline of origins of the sample from the server data.

Table 3

The diversity of origins suggests that this is a fair sample of the universe of potential “I Opt” users. It is unlikely that there is a selection bias that might contaminate the results (e.g., all college students, all members of a single firm, etc.).

Participants could choose to rerun a report at any time. The service was entirely automatic. People could retake the survey without any worry about having to defend the retest to an administrator. The response could be anonymous, giving a further level of comfort. Users were effectively unconstrained.

Graphic 1

Graphic 1 shows the usage. A total of 6,298 reports were run. There were 171 retests, a 2.7% retest rate. This is a very low rate given the multiple possible reasons for retest (see Table 1), the ease of access and the penalty-free nature of a retake.
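The retest rate follows directly from the two counts reported above. A minimal sketch of that arithmetic, with variable names chosen here purely for illustration:

```python
# Counts reported in the text.
total_reports = 6298
retests = 171

# Share of all reports that were retests, as a percentage.
retest_rate = retests / total_reports * 100
print(f"retest rate: {retest_rate:.1f}%")
```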

This result confirms the high “I Opt” face validity found in the original validity study (Soltysik, 2000). Face validity is an “unscientific” measure of validity. However, reliability is not a measure of validity. Reliability is a measure of consistency. It is meant to provide assurance that you are not using a “rubber ruler.”

Most people did not choose to retest even though they could do it with ease. This suggests that they found the results consistent with their internal estimates. In other words, the low retest rate can be viewed as evidence of the reliability judgment of the participants as measured by their own internal standards. With a retest rate of only 2.7%, that judgment is very favorable.

The diversity of data sources (Table 3) indicates that there is little likelihood of an external selection bias (e.g., all college students). However, a question could arise as to whether a particular “I Opt” style is more inclined to take a retest. The answer is no.

Graphic 2 shows that the profiles of re-testers (n=171) are a mirror image of those of the general test taker (n=6,298). Statistical tests confirm that there is no significant difference (at the p<.05 level) in any “I Opt” dimension. The motivation for retesting does not reside in the “I Opt” style. Thus the chance of auto-correlation confounding the results is minimized.

Graphic 2
(n = 171 versus n=6,298)

The sample size is large and diverse and is a fair representation of people likely to use “I Opt.” The number of retests (n=171) is enough to give meaningful insights. The “mirror image” profiles between all testers and re-testers mean results are unlikely to be confounded by this dimension of auto-correlation. Finally, self-selected retesting means that all of the possible motives (Table 1) can operate, thus maximizing the “stress” in the stress test. The study rests on a firm foundation.

The time between test and retest is relevant to the stress test. Short time periods maximize the chance of producing an inconsistent result. Over short time periods people are likely to remember their responses to the original survey. If the motive is to change the result (see Table 1), a short retake cycle makes this much easier. Table 4 shows the retest timing of the experiment.

Table 4

Fully two-thirds of people retested almost immediately. This is a strong indication that they wanted to explore variation in the results. This reduces the likelihood of consistency (i.e., reliability). This short-cycle retake further increases the “stress” of the stress test.

A person’s dominant style is the practitioner’s most important measure. It is the one that the client is likely to see as characterizing their behavior. The last thing a practitioner wants is to argue with a client over a discrepant result.

Other tools (e.g., MBTI, DiSC, 16PF, etc.) do not specify dominant style repeatability rates. Rather, they tend to mix all of the styles (i.e., primary, secondary, peripheral, etc.). This strategy implies that all have equal importance. If the dominant style had fared better, it is likely to have been celebrated. It was not. The “I Opt” stress test does not avoid dominant style visibility, as is shown in Graphic 3.

Graphic 3

In spite of a strong bias against consistency, fully 74% of the “I Opt” retest surveys yielded exactly the same dominant style as was obtained in the initial test. This substantially exceeds the implied ~60% repeatability of the other “non-stressed” tools.
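The repeatability measure used here is simply the share of test/retest pairs whose dominant style did not change. A minimal sketch of that calculation follows. The function name, the placeholder style labels and the sample pairs are all hypothetical and are not drawn from the study data.

```python
def dominant_style_repeatability(pairs):
    """Percentage of (test, retest) pairs with the same dominant style."""
    if not pairs:
        raise ValueError("at least one test/retest pair is required")
    same = sum(1 for first, second in pairs if first == second)
    return same / len(pairs) * 100.0


# Hypothetical data: placeholder style labels, not actual study records.
sample_pairs = [("A", "A"), ("B", "B"), ("C", "A"), ("D", "D")]
rate = dominant_style_repeatability(sample_pairs)
print(f"{rate:.0f}% kept the same dominant style")
```

Applied to the actual server records, the same calculation would be expected to give the 74% figure reported above.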

Graphic 4

Graphic 4 shows a deeper examination of the 26% that changed styles. It further improves the outcome. Eighteen of the 45 people who changed dominant style took the survey 3 or more times (for a combined total of 48 surveys). Ultimately, 14 of these 18 “determined” people (78%) finally managed to change their primary style. If these 14 people were removed on the basis of gross distortion, the repeatability rate would jump from 74% to 81%.

Table 5

Table 5 shows that whether considered in its raw (74%) or refined (81%) form, “I Opt” clearly passes the stress test. It exceeds the ~60% average repeatability standard. It accomplishes this even with an experimental design heavily biased against it.

The natural experiment arising from an “I Opt” visibility program provides strong evidence of the inherent reliability of “I Opt” technology. This study confirms and extends the similar findings of the original Validity Study (Soltysik, 2000) of over a decade ago.

The result was that “I Opt” substantially exceeded the average reported reliability of traditional tools used in the field under heavily “stressed” conditions. It is reasonable to judge “I Opt” technology to be the most reliable tool available in the field. If there is an equal or superior, it has yet to make itself visible.

® IOPT is a registered trademark of Professional Communications Inc.
® MBTI, Myers-Briggs Type Indicator, and Myers-Briggs are registered trademarks of the MBTI Trust, Inc.
® FIRO-B is a registered trademark of CPP, Inc.
® DiSC is a registered trademark of Inscape Publishing, Inc.
® 16PF is a registered trademark of the Institute for Personality and Ability Testing, Inc.

Biddle, Daniel (Publication date not provided). Retrieved from http://www.biddle.com/documents/bcg_comp_chapter2.pdf, January 1, 2011.

Cattell, Heather and Mead, Alan (2000). “The Sixteen Personality Factor Questionnaire (16PF)” in The SAGE Handbook of Personality Theory and Assessment: Personality Theories and Models (Volume 1). Retrieved from http://www.gl.iit.edu/reserves/docs/psy504f.pdf, January 1, 2011.

Center for the Application of Psychological Type (2010). “The Reliability and Validity of the Myers-Briggs Type Indicator® Instrument.” Retrieved from http://www.capt.org/mbti-assessment/reliability-validity.htm, January 1, 2011.

Conn, S.R. and Rieke, M.L. (1994) The 16PF Fifth Edition Technical Manual. Champaign, IL: Institute for Personality and Ability Testing.

CPP (2009). Retrieved from https://www.cpp.com/products/firo-b/firob_info.aspx, January 1, 2011

Harvey, R. J. (1996). Reliability and Validity, in MBTI Applications, A.L. Hammer, Editor. Consulting Psychologists Press: Palo Alto, CA. pp. 5-29.

Inscape Publishing (2005). DiSC Validation Research Report. Inscape Publishing, Minneapolis, MN. Retrieved from http://www.discprofile.com/downloads/DISC/ResearchDiSC_ValidationResearchReport.pdf January 4, 2011.

Schnell, E. R., & Hammer, A. (1993). Introduction to the FIRO-B in organizations. Palo Alto, CA: Consulting Psychologists Press, Inc.

Wikipedia (2011). “Myers-Briggs Type Indicator.” Retrieved from http://en.wikipedia.org/wiki/Myers-Briggs_Type_Indicator#cite_note-39, January 4, 2011.

Soltysik, Robert (2000), Validation of Organizational Engineering: Instrumentation and Methodology, Amherst: HRD Press.