Best Practices In Professional Certification Testing
Kathy | September 1, 2008
These authors discuss the arduous process organizations go through in order to put accredited testing in place.
Certification has been the talk of the Maintenance and Reliability (M&R) community for the past several years. This issue is being fueled by several factors that affect today’s M&R industry and its practitioners:
- Global competition;
- The increasing need for more highly skilled workers on the mill decks, operating floors and in management;
- Shortages of skilled workers at all levels of organizations in various regions of the world.
These factors are causing employers, employees and M&R product and service providers to desire professionally “certified” individuals to carry out particular tasks and manage and lead organizations in the maintenance and reliability field. This trend is projected to grow and accelerate over the next five years.
The tasks for which managers have traditionally expected certified people to execute the work involved predictive, non-destructive testing employing technologies such as radiography, magnetic particle testing, eddy current testing and pulse-echo ultrasonics. In the last decade, certifications for infrared thermography, vibration analysis, passive ultrasonic analysis and lubricant and wear particle analysis have come into being. Some predictive technologies, including electrical motor and circuit testing, don’t have certification testing schemes in place today, although the possibility for them is being explored by interested groups.
Both written and practical testing are generally applicable to the previously mentioned types of technologies. Technical certification typically is combined with “task qualification” to help an M&R organization assure that a person is “competent.” Technical certifications usually are narrowly focused on a specific technology, tool or method. Professional certification in overall management and leadership of M&R programs typically involves written testing on a broader array of subjects and years of practical experience on the job. This article focuses on the written test portion of professional certification programs, with emphasis on testing in the M&R field.
Characteristics of credibility
To be credible and to gain accreditation as a certifying organization, the certification-testing programs that the organization develops must have—at a minimum—the following characteristics:
- Content of material upon which tests are based (i.e., the Body of Knowledge – BoK) must be current and reasonably available to candidates.
- The BoK must be based on extensive input by recognized practitioners and potential candidates for certification in the field upon which the testing is to be based.
- The test must be comprehensive in coverage relative to the BoK upon which candidates are tested.
- The test—absolutely—must be graded fairly and impartially to determine who passes and who does not.
- For tests administered to an international candidate pool, the terms used in test questions must, to the extent possible, be universally understandable.
Basis of a test
Many organizations are supplying certification testing in the M&R area today. Even a scan of just a segment of the field makes it apparent that what these various organizations supply—and how they do it—differs greatly.
Some organizations employ best practice techniques in certification testing. Others may not. For example, many professionals frown upon sponsors that create a training course, then provide a final exam with “certification” conferred on course attendees who “pass” it. (The determination of what constitutes a “passing” score is addressed later in this article, in the section on making a test fair, comprehensive and universal.) The general consensus is that there is an apparent conflict of interest between training and certification, and that there should be an arms-length (i.e., non-profit-dependent) relationship between training and certification providers. The same problem exists for professional societies that make membership a condition for certification schemes they sponsor.
When the Body of Knowledge (BoK) upon which a test is based is established by too narrow a segment of the profession (e.g., a single commercial training firm or exclusively by members of a professional society), a very serious question is raised: Whose interest is being served when a candidate is certified upon passing a test on the related subject matter, no matter the motivation of the originator(s) of the BoK? This is a particular concern when the BoK is obtainable legitimately only by training course participation or purchase of course training materials.
The most credible certifying activities do not offer or endorse any training courses or even “recommend” any literature covering segments of the BoK. They may, however, provide listings of readily-available books or other sources of information related to the BoK, particularly those from which a number of questions on the certification exam may have been developed—with no guarantee, however, that the questions developed from those sources will appear on any future exams.
The BoK used as the basis for a certification exam must be open to input solicited from a large segment of the affected group of candidates in the profession. For example, “proposed” contents of a BoK may be initiated by a group within a professional society. For the initial attempt at listing the “skill-sets” needed for certification, a group of recognized experts in that field should be used. When an update is required later, those who have become “certified” may then be used for initiation of any proposed changes.
To be truly credible, however, the BoK’s initiating or updating group should make a determined effort to solicit comment on and validation of proposed final content from the widest possible professional segment that might be affected at the point of certification and beyond. This would require the proposed BoK to be open to review by all interested members of the profession—not just the members of the initiating organization. In today’s Internet-connected world, this is readily achieved and rather easily documented for future reference. BoK updates typically are done every five to seven years in many professions. The process of initiation or updating can take a year or more to complete and document.
Fair, comprehensive and universal testing
A range of certification tests for reliability engineers, maintenance engineers, quality assurance specialists, maintenance managers and maintenance and reliability leaders has surfaced in the marketplace over the past several years. In addition, several certification tests exist for mechanical, electrical and instrument crafts people—but wide acceptance of these has not yet occurred. There is little to no testing for other specialized roles in the M&R field, such as work planners or work schedulers.
To understand certification test best practices, one must start with the discussion of “psychometrics.” According to Wikipedia, “psychometrics is the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. The field is primarily concerned with the study of differences between individuals and between groups of individuals. It involves two major research tasks, namely: 1) the construction of instruments and procedures for measurement; and 2) the development and refinement of theoretical approaches to measurement.”
Simply put, the field of psychometrics creates a framework from which reliable and valid comparisons of individuals can be made across a whole field of study—such as maintenance and reliability—or a particular segment of it. Application of this framework and its rules is the foundation for valid and reliable certification processes.
Using a validated BoK, as described here, as a basis, test developers originate exam questions or “items” emphasizing major points that candidates should “master” in order to be certified in a whole field or a segment of it. Certification organizations may desire to have various levels of difficulty in their certification scheme and would design their test questions accordingly, from easy to difficult. For example, easy questions may have stems or introductions that are general or quite broad and possible responses that are dissimilar enough so that the correct answer is easily recognized. Difficult questions will have stems that are specific, with a narrow focus, and possible responses that are similar, only one of which is correct (without being “tricky”).
A more complex scheme may have multiple levels, referred to as “Taxonomic Levels,” that are characterized in Table I [Ref. 1].
Test developers have found that multiple choice questions lend themselves best and most objectively to the use of psychometrics. There are many types of multiple choice questions that may be used, including but not limited to the following [Ref. 2] (a simple scoring sketch appears after this list):
- Those requiring the candidate to choose the one best or correct answer from four or five choices provided (used generally for Knowledge and Comprehension level questions);
- Questions that require the candidate to match all of the given terms or phrases correctly in order to get credit (used generally for Knowledge, Comprehension and Application level questions);
- Multiple true/false questions where a candidate must evaluate four or five statements concerning a particular required skill for veracity and select the answer that correctly states which are true and false in the proper order (used generally for Comprehension, Application, Analysis and Synthesis level questions);
- Questions that require the candidate to choose from a group of items the correct set that solves the problem presented (used generally for Application, Analysis, Synthesis and Evaluation level questions).
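As referenced above, the following Python sketch illustrates one way the all-or-nothing scoring implied by these formats might be coded. The function names and the dichotomous (credit/no-credit) scoring are assumptions drawn from the format descriptions, not a published scoring specification.

```python
# Illustrative sketch only; names and 0/1 scoring are assumptions based on
# the question-format descriptions above.

def score_single_best(selected: str, key: str) -> int:
    """One best or correct answer chosen from four or five options."""
    return 1 if selected == key else 0

def score_matching(selected_pairs: dict, key_pairs: dict) -> int:
    """Matching format: every term or phrase must be matched correctly to earn credit."""
    return 1 if selected_pairs == key_pairs else 0

def score_multiple_true_false(selected_option: str, key_option: str) -> int:
    """Multiple true/false: the candidate selects the one option that states
    which of the four or five statements are true and false, in the proper order."""
    return 1 if selected_option == key_option else 0

def score_set_selection(selected: frozenset, key: frozenset) -> int:
    """Set-selection format: the chosen set of items must exactly match the set
    that solves the problem presented."""
    return 1 if selected == key else 0
```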
Typically, an exam for overall comprehensive certification will have a mixture of questions from all levels indicated in Table I. More specific segment certification exams may have fewer taxonomic levels.
Development of exam questions requires a thorough review for proper grammar and punctuation, as well as adherence to about 20 other rules for writing fair, accurate and truly valid items that test competence. A unique requirement for exams prepared for more than one nation speaking the same language is that the terminology be commonly understood internationally. In addition, questions that require knowledge of one country’s laws, customs or local practices are excluded from tests developed for international use.
Once a question has been “developed” (i.e., reviewed and modified as needed for adherence to the written rules for preparing items), it is subjected to statistical evaluation. A typical evaluation regimen has at least three or four interrelated processes. (For a diagram of the various processes, contact the authors via their e-mails at the end of this article.)
One of the most fundamental statistical assessments of proposed exam items is called the “Cut Score Workshop Process.” A group of at least seven (preferably more, up to about 20) recognized experts (e.g., persons already certified at the level for which the questions are being prepared) is given a set of “post development” questions in the form of a time-limited exam, so they can “feel the pain” in a manner similar to what a candidate for certification would experience. After taking the exam and scoring themselves with an exam “key,” participants in the Cut Score Workshop are asked to perform the following evaluation of each question and record their “estimates” in writing:
- “We are trying to define the acceptable minimum level of competency necessary for a candidate to pass the exam and to describe the minimum knowledge, skills and qualifications of the candidate for which these questions have been proposed.
- With this in mind, please review each question presented on this exam and estimate the percentage of minimally qualified candidates that would answer the question correctly, entering your estimate in the columns provided. Estimates from all certified people who took this exam will be aggregated and evaluated to determine validity using the processes and criteria established in [our] procedures.”
During the Cut Score Workshop, each group of (volunteer) experts also is provided an opportunity to discuss each question and make recommendations for further development, if needed. Some questions may be abandoned or require modification to a degree that forces them to be subjected to another Cut Score Workshop, as with a newly proposed item. The estimates provided by the group on questions that survive are averaged and evaluated for the variance or standard error of the responses received. The resulting number, called the Angoff Score, is established [Ref. 3]. The average of all estimates for each question must be within a reasonable range of the expected passing score for a full set of questions on any form of an actual certification exam. Each item also must have a reasonably low standard error in order to be accepted. That is, the experts who evaluated the item are in reasonable agreement on it, as reflected by the small distribution of their estimates. Large standard error of responses signifies little or no agreement, even when no adverse comments are received during discussion of an item.
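For illustration, the averaging and agreement check described above might be computed as in the following Python sketch. The judges’ estimates shown are hypothetical, and using the standard error of the mean is one reasonable reading of the “variance or standard error” test mentioned above.

```python
from math import sqrt
from statistics import mean, stdev

def angoff_score(estimates):
    """Average of the experts' estimates (percent of minimally qualified
    candidates expected to answer the item correctly)."""
    return mean(estimates)

def estimate_spread(estimates):
    """Standard error of the experts' estimates; a small value indicates the
    experts are in reasonable agreement on the item."""
    return stdev(estimates) / sqrt(len(estimates))

# Hypothetical estimates (in percent) from seven workshop participants for one item.
judges = [70, 65, 75, 60, 70, 68, 72]
print(round(angoff_score(judges), 1))     # 68.6 -> the item's Angoff Score
print(round(estimate_spread(judges), 2))  # 1.85 -> spread of the experts' judgments
```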
Note that the Angoff Score is determined for minimally qualified candidates for certification. A second statistical evaluation is performed using a group of average candidates for certification before any question is used for actual scoring on a “real” certification exam. This “Beta Assessment Process” is related to the Cut Score Workshop Process. In Beta Assessment, the questions are exposed to between 75 and 150 “average” candidates in an exam setting. Each question is evaluated for statistics that indicate the following (a computational sketch appears after this list):
- Degree of difficulty—as measured by number of candidates that select its correct answer;
- Effectiveness—a metric relating the number of examinees in the upper and lower third of the score distribution who answered the item correctly (should be positive);
- Correlation—between a given test item and knowledge of the profession that the entire examination (all items on the test taken together) is trying to measure (also should be positive).
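Below is a minimal sketch of how these three statistics might be computed from beta-exam results. The upper/lower-third split for effectiveness and the item-total correlation are reasonable readings of the descriptions above, not the certifying bodies’ exact formulas.

```python
import numpy as np

def beta_item_statistics(item_scores, total_scores):
    """item_scores: 0/1 result on one item for each beta candidate.
    total_scores: each candidate's score on the whole beta exam."""
    item_scores = np.asarray(item_scores, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)

    # Degree of difficulty: proportion of candidates answering correctly.
    difficulty = item_scores.mean()

    # Effectiveness: correct-answer rate in the upper third of the score
    # distribution minus that in the lower third (should be positive).
    order = np.argsort(total_scores)
    third = len(order) // 3
    effectiveness = item_scores[order[-third:]].mean() - item_scores[order[:third]].mean()

    # Correlation between the item and the exam as a whole (should be positive).
    correlation = np.corrcoef(item_scores, total_scores)[0, 1]

    return difficulty, effectiveness, correlation
```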
When all three of the foregoing statistical requirements meet established criteria, the item may be used for scoring on an actual exam to determine whether or not a candidate will be certified. The Angoff Score stays with its related question throughout the item’s useful life. It is used in conjunction with the Angoff Scores for all questions used for grading a given certification exam to determine the passing score or “mastery point.” To do this, a test developer will determine the average of all Angoff Scores for all the items used for grading on a given form of an exam. Remember that the Angoff estimates were made considering minimally qualified candidates, so the result for an entire exam-set of questions also reflects this assessment by experts in the professional field. That means that the minimally qualified candidate should receive a (barely) passing grade on the exam.
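Continuing the earlier sketch, the mastery point for a given form would then be the mean of the Angoff Scores carried by its graded items (the values below are hypothetical):

```python
def mastery_point(item_angoff_scores):
    """Passing score for one exam form: the average of the Angoff Scores of
    all items used for grading on that form."""
    return sum(item_angoff_scores) / len(item_angoff_scores)

# Hypothetical Angoff Scores for the graded items on one form.
print(mastery_point([68.6, 72.0, 75.4, 66.1, 70.3]))  # -> 70.48
```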
Statistical analysis of items actually in use for scoring on exams is performed on a continuing basis as part of the “Active Item Management Process.” If any item begins to perform statistically in a manner that does not conform to the established criteria set for difficulty, effectiveness and correlation, it is removed from the active pool of questions and subjected to the “Former Active Under Review Process.” If not abandoned outright, the item will be revised, given a new identification number and subjected again to the Cut Score Workshop and Beta Assessment Processes. Only when it meets the statistical criteria for acceptance as an active item will any semblance of its former content appear and be used for scoring on an actual certification exam.
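To illustrate the monitoring step, the following sketch flags an active item whose live statistics no longer conform to its criteria. The threshold values here are purely hypothetical, since actual acceptance criteria are set by each certifying organization.

```python
# Threshold values below are hypothetical; real programs define their own criteria.
DIFFICULTY_RANGE = (0.25, 0.90)   # acceptable proportion of correct responses
MIN_EFFECTIVENESS = 0.0           # upper/lower-third discrimination must stay positive
MIN_CORRELATION = 0.0             # item-total correlation must stay positive

def still_active(difficulty, effectiveness, correlation):
    """True if the item may stay in the active pool; False sends it to the
    'Former Active Under Review' process."""
    low, high = DIFFICULTY_RANGE
    return (low <= difficulty <= high
            and effectiveness > MIN_EFFECTIVENESS
            and correlation > MIN_CORRELATION)
```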
Any exam used for certification must also be comprehensive. That means the candidate is exposed during the exam to as much of the material covered by the validated BoK as the time allotted and the content of the exam question bank allow. Accrediting bodies, such as the American National Standards Institute using guidelines established by the International Organization for Standardization (ISO) of Geneva, Switzerland, mandate that an activity seeking accreditation for a certification scheme be able to prove the comprehensiveness of its exams. Certifying organizations may use one or more methods to prove that their exams are comprehensive.

In the broadest sense, a group of questions from each “skill set” described in the BoK is chosen based on the relative importance criteria established during the latest BoK validation process. For example, if there are five skill sets listed in the BoK and they are all considered to be of equal importance, the number of questions selected from each skill set in the question pool will be equal. Within each skill set there may be a different number of skills required; for example, there may be 30 individual skills spread unequally among the five skill sets. Common practice is to require that a high percentage (say 90%, or 27 of the total of 30 skills) be represented on any given form of an exam. When both of the above criteria are met (all skill sets covered in proportion to their relative importance and roughly 90% of all individual skills represented), an exam may be considered comprehensive.
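A rough sketch of such a comprehensiveness check might look like the following; the data structures, the 5% tolerance on proportional coverage and the 90% skill-coverage default are illustrative assumptions.

```python
from collections import Counter

def is_comprehensive(form_items, bok_skills, skill_set_weights,
                     skill_coverage=0.90, tolerance=0.05):
    """form_items: (skill_set, skill) tag for each graded item on the form.
    bok_skills: skill_set -> list of individual skills in the BoK.
    skill_set_weights: relative importance of each skill set from BoK validation."""
    # 1. Questions per skill set should track each set's relative importance.
    counts = Counter(skill_set for skill_set, _ in form_items)
    total_items = len(form_items)
    total_weight = sum(skill_set_weights.values())
    proportional = all(
        abs(counts.get(ss, 0) / total_items - w / total_weight) <= tolerance
        for ss, w in skill_set_weights.items()
    )
    # 2. A high percentage (here 90%) of all individual skills must appear on the form.
    skills_on_form = {skill for _, skill in form_items}
    all_skills = {s for skills in bok_skills.values() for s in skills}
    enough_skills = len(skills_on_form & all_skills) / len(all_skills) >= skill_coverage
    return proportional and enough_skills
```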
Another element of fairness is established between forms of an exam. For a variety of reasons, mostly related to security of exam content and administration, different forms of an exam will be provided to candidates within the same exam venue. There may be some overlap between forms (e.g., 60% of the questions on one exam form may be the same as on another); the other 40% of questions will be different. However, to be fair to all candidates, the mastery points of different forms must be very close, and the actual performance of (average) candidates taking the exams also must be close. This latter requirement applies not only to the overall exam but also within the various skill sets upon which questions are asked. Exam evaluators perform a psychometric statistical analysis process called “equating” to assure this is happening. Exam forms that don’t equate are adjusted by modifying the sets of questions selected until they do. To compensate for forms that don’t quite “equate,” the lowest of the mastery points for all current forms of the exam is used to determine who gets certified and who doesn’t, regardless of the form of the exam to which a candidate is exposed.
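The compensation described above might be expressed as simply as the following sketch, in which the form names and mastery points are hypothetical:

```python
def operational_cut_score(form_mastery_points):
    """When current forms do not quite equate, the lowest mastery point among
    them is applied to every candidate, regardless of which form was taken."""
    return min(form_mastery_points.values())

forms = {"Form A": 71.2, "Form B": 72.0, "Form C": 71.8}  # hypothetical values
print(operational_cut_score(forms))  # -> 71.2 is used to decide who is certified
```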
Conclusions
Best practice in certification testing requires that developers pay careful attention to and assure that examination content relates closely to a thoroughly validated, current and relevant Body of Knowledge that is easily accessible to candidates. Lists of sources from which exam questions are developed should be openly and easily available to candidates.
Furthermore, there must be no conflict of interest between those who teach on subjects that may be covered on a certification exam and those who confer the certification upon successful candidates. There also should be no conflict of interest between the certifying activity and its sponsor, be it a commercial entity or professional society.
Exam questions must be developed following many rules to assure that the content of each can be understood by those who speak and read various dialects of the language in which an exam is taken. Questions must be subjected to statistical assessments that clearly establish Angoff Scores, degree of difficulty, effectiveness and correlation with entire sets of questions used—before actual use to determine the outcome for any candidate.
Exams must be constructed in a manner that assures fairness and comprehensiveness—even when different sets of questions are used to produce various forms. The forms must equate to each other, both overall and internally.
All in all, the development, documentation, administration and on-going execution of certification schemes is a time-consuming and increasingly precise art that demands continuous attention to detail. It cannot be undertaken lightly. The application of psychometrics may demand the use of experts for consultation from the beginning of any certification scheme that an organization desires to be credible in the long-term. MT