Guideline for Constructing Effective Test Items

Making test items is not an easy task
Written by Tamanna Kalim

Measurement, assessment, and evaluation mean very different things, and yet many students are unable to adequately explain the differences. It is important to understand these terms in order to construct effective test items.

Measurement refers to the process by which the attributes or dimensions of some physical object are determined. One exception seems to be in the use of the word measure in determining the IQ of a person. The phrase, “this test measures IQ” is commonly used. Measuring such things as attitudes or preferences also applies. However, when we measure, we generally use some standard instrument to determine how big, tall, heavy, voluminous, hot, cold, fast, or straight something actually is. Standard instruments refer to instruments such as rulers, scales, thermometers, pressure gauges, etc.

We measure to obtain information about what is. Such information may or may not be useful, depending on the accuracy of the instruments we use and our skill at using them. We measure how big a classroom is in terms of square feet, we measure the temperature of the room by using a thermometer, and we use a multimeter to determine the voltage, amperage, and resistance in a circuit. In all of these examples, we are not assessing anything; we are simply collecting information relative to some established rule or standard. Assessment is therefore quite different from measurement and has uses that suggest very different purposes. When used in a learning objective, the definition provided for the behavioral verb measure [1] is: To apply a standard scale or measuring device to an object, series of objects, events, or conditions, according to practices accepted by those who are skilled in the use of the device or scale.

Assessment is a process by which information is obtained relative to some known objective or goal. Assessment is a broad term that includes testing. A test is a special form of assessment. Tests are assessments made under contrived circumstances especially so that they may be administered.  In other words, all tests are assessments, but not all assessments are tests. We test at the end of a lesson or unit. We assess progress at the end of a school year through testing, and we assess verbal and quantitative skills through such instruments as the SAT and GRE. Whether implicit or explicit, assessment is most usefully connected to some goal or objective for which the assessment is designed.

A test or assessment yields information relative to an objective or goal. In that sense, we test or assess to determine whether an objective or goal has been attained. Assessment of understanding is much more difficult and complex. Skills can be practised; understandings cannot. We can assess a person’s knowledge in a variety of ways, but there is always a leap, an inference that we make about what a person does in relation to what it signifies about what he knows. According to the behavioral verbs, assess [2] means to stipulate the conditions by which the behavior specified in an objective may be ascertained. Such stipulations are usually in the form of written descriptions.

Evaluation is perhaps the most complex and least understood of the terms. Inherent in the idea of evaluation is “value.” When we evaluate, what we are doing is engaging in some process that is designed to provide information that will help us make a judgment about a given situation. Generally, any evaluation process requires information about the situation in question. A situation is an umbrella term that takes into account such ideas as objectives, goals, standards, procedures, and so on. When we evaluate, we are saying that the process will yield information regarding the worthiness, appropriateness, goodness, validity, legality, etc., of something for which a reliable measurement or assessment has been made.

For example, I often tell my students that if they wanted to determine the temperature of the classroom, they would need to get a thermometer, take several readings at different spots, and perhaps average the readings. That is simple measuring. The average temperature tells us nothing about whether or not it is appropriate for learning. In order to determine that, students would have to be polled in some reliable and valid way.

Teachers, in particular, are constantly evaluating students, and such evaluations are usually done in the context of comparisons between what was intended (learning, progress, behavior) and what was obtained. When used in a learning objective, the definition provided for the behavioral verb evaluate [3] is: To classify objects, situations, people, conditions, etc., according to defined criteria of quality. Indication of quality must be given in the defined criteria of each class category. Evaluation differs from general classification only in this respect.

To sum up, we measure distance, we assess learning, and we evaluate results in terms of some set of criteria. These three terms are certainly connected, but it is useful to think of them as separate but connected ideas and processes.

Steps for Writing Test Items

1) Planning
a.    Selection of objectives and sub-units (content);
b.    Weightage to objectives;
c.    Weightage to different areas of content;
d.    Weightage to different forms of questions;
e.    Weightage to difficulty level;
f.    Scheme of options;
g.    Sections in the question paper.

2) Preparing
a.    Selection of test items;
b.    Grouping of test items;
c.    Sections in the question paper;
d.    Directions for the test and, if necessary, for individual items or sections;
e.    Preparing a marking scheme and scoring key.

3) Analyzing and Revising
a.    Question-wise analysis to determine difficulty, discrimination, and reliability.
b.    Retain, edit as necessary, or discard items based on analysis outcomes.
c.    Revise the test as a whole if necessary.
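The question-wise analysis in step 3 can be sketched in code. The following is a minimal Python sketch, assuming the commonly used upper-lower group method: difficulty is taken as the proportion of students who answer an item correctly, and discrimination as the difference in that proportion between the top and bottom 27% of scorers. The function names, the 27% fraction, and the data are illustrative assumptions, not part of the steps above. Reliability needs a separate statistic and is not covered here.

```python
def item_difficulty(responses):
    """Proportion of students who answered the item correctly (0.0 to 1.0)."""
    return sum(responses) / len(responses)


def item_discrimination(total_scores, responses, fraction=0.27):
    """Upper-group minus lower-group proportion correct for one item.

    total_scores: each student's total score on the whole test
    responses:    1 if that student answered this item correctly, else 0
    fraction:     share of students in each comparison group (assumed 27%)
    """
    # Rank students from lowest to highest total score.
    ranked = sorted(zip(total_scores, responses), key=lambda pair: pair[0])
    group_size = max(1, round(len(ranked) * fraction))
    lower = [r for _, r in ranked[:group_size]]
    upper = [r for _, r in ranked[-group_size:]]
    return sum(upper) / group_size - sum(lower) / group_size


# Hypothetical data: ten students, one item.
totals = [12, 15, 9, 18, 7, 14, 16, 5, 11, 17]
item_1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

difficulty = item_difficulty(item_1)
discrimination = item_discrimination(totals, item_1)
print(f"difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```

An item with a discrimination value near zero or below zero is a candidate for editing or discarding, since weaker students answer it as well as or better than stronger ones.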

To begin with, the design of the test is prepared so that it may be used as an effective instrument of evaluation. A proper design should increase the validity, reliability, objectivity, and suitability of the test. Specimens of the weightage tables and the corresponding blueprint are given in the annexure.
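As a rough illustration of how a weightage table works, the sketch below converts percentage weightages into marks for a paper with a fixed total. The objective names, the percentages, and the 50-mark total are hypothetical values chosen for the example; they are not taken from the annexure.

```python
def allocate_marks(weights_percent, total_marks):
    """Convert percentage weightages into marks for a paper of total_marks."""
    if sum(weights_percent.values()) != 100:
        raise ValueError("weightages must add up to 100 percent")
    return {name: total_marks * pct / 100 for name, pct in weights_percent.items()}


# Hypothetical weightage to objectives for a 50-mark paper.
objective_weights = {
    "Knowledge": 40,
    "Understanding": 30,
    "Application": 20,
    "Skill": 10,
}

marks = allocate_marks(objective_weights, total_marks=50)
for objective, value in marks.items():
    print(f"{objective:15s}{value:5.1f}")
```

The same function could be reused for weightage to content areas, forms of questions, or difficulty levels, since each is simply a set of percentages that must total 100.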

Test Construction
Step 1: General Steps

1.    Outline either a) the unit learning objectives or b) the major concepts of the unit content to be covered by the test.

2.    Produce a test blueprint, plotting the outline from step 1 against some hierarchy representing levels of cognitive difficulty or depth of processing.

3.    For each check on the blueprint, match the question level indicated with a question type appropriate to that level.

4.    For each check on the blueprint, jot down on a 3×5 card three or four alternative question ideas and item types which will get at the same objective.

5.    Put all the cards with the same item type together and write the first draft of the items following guidelines for the chosen type(s). Write these on the item cards.

6.    Put all the cards with the same topic together to cross check questions so that no question gives the answer to another question.

7.    Put the cards aside for one or two days.

8.    Reread the items from the standpoint of a student, checking for construction errors.

9.    Order the selected questions logically:
a.    Place some simpler items at the beginning to ease students into the exam
b.    Group item types together under common instructions to save reading time
c.    If desirable, order the questions logically from a content standpoint (e.g. chronologically, in conceptual groups, etc.).

10.    Put the questions away for one or two days before rereading them or have someone else review them for clarity.

11.    Time yourself in actually taking the test and then multiply that by four to six depending on the level of the students.  Remember, there is a certain absolute minimum amount of time required to simply physically record an answer, aside from the thinking time.

12.    Once the test is given and graded, analyze the items and student responses for clues about well-written and poorly written items as well as problems in understanding of instruction.
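The timing rule in step 11 is simple arithmetic and can be sketched as follows. The function name and the 10-minute figure are illustrative assumptions; the multipliers of four and six come from the step itself.

```python
def estimate_student_time(instructor_minutes, low_factor=4, high_factor=6):
    """Return the (shortest, longest) time to allow students, in minutes."""
    return instructor_minutes * low_factor, instructor_minutes * high_factor


# Hypothetical case: the instructor completes the test in 10 minutes.
shortest, longest = estimate_student_time(10)
print(f"Allow between {shortest} and {longest} minutes.")
```

The lower multiplier would suit more advanced students and the higher one less experienced students.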

Step 2: Blueprint
1.    Don’t make it overly detailed.

2.    It’s best to identify major ideas and skills rather than specific details.

3.    Use a cognitive taxonomy that is most appropriate to your discipline, including non-specific skills like communication skills or graphic skills or computational skills if such are important to your evaluation of the answer.

4.    Weigh the appropriateness of the distribution of checks against the students’ level, the importance of the test, and the amount of time available.  Obviously, one can fit more low-level questions into a given time period, for example.
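A blueprint of this kind can be pictured as a small grid of topics against cognitive levels. The sketch below is one possible representation, assuming three hypothetical topics and using a count of checks in each cell; the topic names and counts are invented for illustration.

```python
# Cognitive levels used as blueprint columns.
levels = ["Factual Knowledge", "Application", "Analysis and Evaluation"]

# Hypothetical blueprint: number of checks (planned items) per topic and level.
blueprint = {
    "Measurement": {"Factual Knowledge": 2, "Application": 1},
    "Assessment": {"Factual Knowledge": 1, "Application": 2, "Analysis and Evaluation": 1},
    "Evaluation": {"Application": 1, "Analysis and Evaluation": 2},
}


def checks_per_level(grid, level_names):
    """Total the checks in each column so the distribution can be weighed."""
    return {
        level: sum(row.get(level, 0) for row in grid.values())
        for level in level_names
    }


level_totals = checks_per_level(blueprint, levels)
for level, count in level_totals.items():
    print(f"{level:25s}{count}")
```

Seeing the column totals side by side makes it easier to judge whether the balance of low-level and high-level questions suits the students and the time available.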

Step 3: Question Types
The following array shows the most common question types used at various cognitive levels.

Factual Knowledge     Application         Analysis and Evaluation
Multiple Choice       Multiple Choice     Multiple Choice
Short Answer          Short Answer

Guidelines for Constructing Effective Test Items
1.    Multiple Choice Questions

Many users regard the multiple-choice item as the most flexible and probably the most effective of the objective item types. A standard multiple-choice test item consists of two basic parts:

1) Problem (stem) and

2) List of suggested solutions (alternatives).

The stem may be in the form of either a question or an incomplete statement, and the list of alternatives contains one correct or best alternative (answer) and a number of incorrect or inferior alternatives (distractors).

The purpose of the distractors is to appear as plausible solutions to the problem for those students who have not achieved the objective being measured by the test item. Conversely, the distractors must appear as implausible solutions for those students who have achieved the objective. Only the answer should appear reasonable to these students.

Multiple-choice items are flexible enough to measure all levels of cognitive skill. They permit a wide sampling of content and objectives, provide highly reliable test scores, reduce the guessing factor compared with true-false items, and can be machine-scored quickly and accurately. On the other hand, multiple-choice items are difficult and time-consuming to construct, and they depend on the student’s reading skills and the instructor’s writing ability. The ease of writing low-level knowledge items leads instructors to neglect writing items that test higher-level thinking. These questions may also encourage guessing (though less than true-false items).

Guidelines for Writing Multiple Choice Questions

1.    Design each item to measure an important learning outcome; present a single, clearly formulated problem in the stem of the item; put the alternatives at the end of the question, not in the middle; and put as much of the wording as possible in the stem.

Poor: Skinner developed programmed instruction in _____.
a.    1953
b.    1954 (correct)
c.    1955
d.    1956
Better: Skinner developed programmed instruction in the _____.
a.    1930s
b.    1940s
c.    1950s (correct)
d.    1970s

2.    All options should be homogeneous and reasonable, punctuation should be consistent, and every option should be grammatically consistent with the stem.

Poor: What is a claw hammer?
a. a woodworking tool (correct)
b. a musical instrument
c. a gardening tool
d. a shoe repair tool
Better: What is a claw hammer?
a. a woodworking tool (correct)
b. a metalworking tool
c. an auto body tool
d. a sheet metal tool

3.    Reduce the length of the alternatives by moving as many words as possible to the stem. The justification is that additional words in the alternatives have to be read four or five times.

Poor: The mean is _____.
a.    a measure of the average (correct)
b.    a measure of the midpoint
c.    a measure of the most popular score
d.    a measure of the dispersion of scores
Better: The mean is a measure of the _____.
a.    average (correct)
b.    midpoint
c.    most popular score
d.    dispersion of scores

4.    Construct the stem so that it conveys a complete thought, and avoid negatively worded items (such as “Which of the following is not…?”), textbook wording, and unnecessary words.

Poor: Objectives are _____.
a. used for planning instruction (correct)
b. written in behavioral form only
c. the last step in the instructional design process
d. used in the cognitive but not affective domain
Better: The main function of instructional objectives is _____.
a. planning instruction (correct)
b. comparing teachers
c. selecting students with exceptional abilities
d. assigning students to academic programs.

5.    Do not make the correct answer stand out because of its phrasing or length. Avoid overusing always and never in the alternatives and overusing all of the above and none of the above. When all of the above is used, students can eliminate it simply by knowing that one answer is false. Alternatively, they will know to select it if any two answers are true.

Poor: A narrow strip of land bordered on both sides by water is called an _____.
a. isthmus (correct)
b. peninsula
c. bayou
d. continent
(Note: Do you see why a would be the best guess given the phrasing?)
Better: A narrow strip of land bordered on both sides by water is called a(n) _____.

2. True-False Items

True-false items typically present a declarative statement that the student must mark as either true or false. Instructors generally use true-false items to measure the recall of factual knowledge such as names, events, dates, definitions, etc. However, this format has the potential to measure higher levels of cognitive ability, such as comprehension of significant ideas and their application in solving problems.

They are relatively easy to write and can be answered quickly by students. Students can answer 50 true-false items in the time it takes to answer 30 multiple-choice items. They provide the widest sampling of content per unit of time.

On the other hand, the problem of guessing is a major weakness. Students have a fifty percent chance of correctly answering an item without any knowledge of the content. Items are also often ambiguous because of the difficulty of writing statements that are unequivocally true or false.
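To see how much guessing can matter, the sketch below uses the binomial distribution to estimate the chance of reaching a given score by blind guessing alone. The 50-item test length and the 30-item target are hypothetical; the 0.25 chance per item assumes four-option multiple-choice items, which is used here only for comparison.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of at least k correct answers out of n by pure guessing,
    where p is the chance of guessing any single item correctly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_items = 50    # hypothetical test length
target = 30     # hypothetical score of interest (60% of the items)

# Average number of items a blind guesser gets right on a true-false test.
expected_correct = n_items * 0.5

p_true_false = prob_at_least(target, n_items, p=0.5)
p_four_option = prob_at_least(target, n_items, p=0.25)

print(f"Expected correct by guessing (true-false): {expected_correct:.0f}")
print(f"Chance of {target}+ correct, true-false:    {p_true_false:.4f}")
print(f"Chance of {target}+ correct, four options:  {p_four_option:.8f}")
```

The comparison illustrates why a true-false test generally needs more items than a multiple-choice test to keep guessing from distorting scores.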

Rules for Writing True/False Items:

1.    Be certain that the statement is completely true or false.

Poor: A good instructional objective will identify a performance standard. (True/False) (Note: The correct answer here is technically false. However, the statement is ambiguous: while a performance standard is a feature of some “good” objectives, it is not necessary to make an objective good.)

Better: A performance standard of an objective should be stated in measurable terms. (True/False). (Note: The answer here is clearly true.)

2.    Convey only one thought or idea in a true/false statement and avoid verbal clues (specific determiners like “always”) that indicate the answer.

Poor: Bloom’s cognitive taxonomy of objectives includes six levels of objectives, the lowest being knowledge. (True/False)
Better: Bloom’s cognitive taxonomy includes six levels of objectives. (True/False)

Knowledge is the lowest-level objective in Bloom’s cognitive taxonomy. (True/False)

3.    Do not copy sentences directly from textbooks or other written materials, and keep the word length of true statements about the same as that of false statements. Require learners to write a short explanation of why false answers are incorrect. This is to discourage students from cramming or memorizing.

Poor: Abstract thinking is intelligence. (True/False)
Better: According to Garett, abstract thinking is intelligence. (True/False)

3. Completion Items

Completion items provide a wide sampling of content and minimize guessing compared with multiple-choice and true-false items. However, they can rarely be written to measure more than simple recall of information, they are more time-consuming to score than other objective types, and they are difficult to write so that there is only one correct answer and no irrelevant clues.

Rules for Writing Completion Items:
1.    Start with a direct question, switch to an incomplete statement, and place the blank at the end of the statement.
Poor: What is another name for cone-bearing trees? (Coniferous trees)
Better: Cone-bearing trees are also called______. (Coniferous trees)

2.    Leave only one blank, and make it relate to the main point of the statement. Provide two blanks if the answer consists of two consecutive words.
Poor: The ___ is the ratio of the ___ to the ___.
Better: The sine is the ratio of the ___ to the ___. (opposite side, hypotenuse)

3.    Make the blanks uniform in length and avoid giving irrelevant clues to the correct answer.
Poor: The first president of the United States was _____. (Two words)
(Note: The desired answer is George Washington, but students may write “from Virginia”, “a general”, and other creative expressions.)
Better: Give the first and last name of the first president of the United States: _____.

4. Matching Items

A matching exercise typically consists of a list of questions or problems to be answered along with a list of responses. The examinee is required to make an association between each question and response. A large amount of material can be condensed to fit in less space. Students have substantially fewer chances of guessing correct associations than on multiple-choice and true/false tests. However, matching tests cannot effectively test higher-order intellectual skills.

Rules for Writing Matching Items

1.    Use homogeneous material in each list of a matching exercise. Mixing events and dates with events and names of persons, for example, makes the exercise two separate sets of questions and gives students a better chance to guess the correct response.

2.    Put the problems or the stems (typically longer than the responses) in a numbered column at the left and the response choices in a lettered column at the right. Always include more responses than questions. If the lists are the same length, the last choice may be determined by elimination rather than knowledge.

3.    Arrange the list of responses in alphabetical or numerical order if possible in order to save reading time. All the response choices must be likely, but make sure that there is only one correct choice for each stem or numbered question.

Column A                      Column B
1) Discovered radium          a) Lal Bahadur Shastri
2) Discovered America         b) Bell
3) First P.M. of India        c) Vasco Da Gama
4) Invented telephone         d) Marconi
                              e) Columbus
                              f) Madam Curie

Here, Column B mixes explorers, scientists, and politicians. For students in higher classes, such a list would be far too heterogeneous.


Column A                      Column B
1) Eye                        a) Digestion
2) Tongue                     b) Hearing
3) Stomach                    c) Breathing
4) Lung                       d) Smelling
5) Ear                        e) Seeing
                              f) Tasting
                              g) Chewing

5. Short Answer Type Items

Short-answer questions should be restrictive enough to evaluate whether the correct answer is given.  Allow a small amount of answer space to discourage the shotgun approach. These tests can test a large amount of content within a given time period. Again, these test items are limited to testing lower-level cognitive objectives, such as the recall of facts. Scoring may not be as straightforward as anticipated.

Poor: What is the area of a rectangle whose length is 6m and breadth 75 cm?
Better: What is the area of a rectangle whose length is 6m and breadth 75 cm? Express your answer in sq. m.
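The difference between the two versions becomes clear when the arithmetic is worked through. The short sketch below computes the area in both square metres and square centimetres; without the instruction in the better version, either result could be defended as correct. The variable names are illustrative only.

```python
length_m = 6        # length given in metres
breadth_cm = 75     # breadth given in centimetres

# Convert the breadth to metres before multiplying.
area_sq_m = length_m * (breadth_cm / 100)

# Alternatively, convert the length to centimetres before multiplying.
area_sq_cm = (length_m * 100) * breadth_cm

print(f"{area_sq_m} sq. m or {area_sq_cm} sq. cm")
```

Stating the required unit removes this ambiguity and makes the scoring key straightforward.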

6. Constructing Essay Type Items

“A test item which requires a response composed by the examinee, usually in the form of one or more sentences, of a nature that no single response or pattern of responses can be listed as correct, and the accuracy and quality of which can be judged subjectively only by one skilled or informed in the subject.” [4]

The difference between short-answer and essay questions is more than just in the length of response required. On essay questions there is more emphasis on the organization and integration of the material, such as when marshaling arguments to support a point of view or method.  Essay questions can be used to measure attainment of a variety of objectives. Stecklein (1955) has listed 14 types of abilities that can be measured by essay items:

1.    Comparisons between two or more things

2.    The development and defense of an opinion

3.    Questions of cause and effect

4.    Explanations of meanings

5.    Summarizing of information in a designated area

6.    Analysis

7.    Knowledge of relationships

8.    Illustrations of rules, principles, procedures, and applications

9.    Applications of rules, laws, and principles to new situations

10.    Criticisms of the adequacy, relevance, or correctness of a concept, idea, or information

11.    Formulation of new questions and problems

12.    Reorganization of facts

13.    Discriminations between objects, concepts, or events

14.    Inferential thinking.

All these involve the higher-level skills mentioned in Bloom’s Taxonomy. Essay questions therefore provide an effective way of assessing complex learning outcomes when a paper-and-pencil test is necessary (e.g., assessing students’ ability to make judgments that are well thought through and justifiable).

Essay questions require students to demonstrate their reasoning and thinking skills, which gives teachers the opportunity to detect problems students may have with their reasoning processes. When educators detect problems in students’ thinking, they can help them overcome those problems.

Rules for Constructing Essay Questions:

1.    Ask questions that are relatively specific and focused and which will elicit relatively brief responses.
Poor: Describe the role of instructional objectives in education. Discuss Bloom’s contribution to the evaluation of instruction.

Better: Describe and differentiate between behavioral (Mager) and cognitive (Gronlund) objectives with regard to their (1) format and (2) relative advantages and disadvantages for specifying instructional intentions.

2.    If you are using many essay questions in a test, ensure reasonable coverage of the course objectives. Follow the test specifications in writing prompts. Questions should cover the subject areas as well as the complexity of behaviors cited in the test blueprint. Pitch the questions at the students’ level.

Poor: What are the major advantages and limitations of essay questions?

Better: Given their advantages and limitations, should an essay question be used to assess students’ abilities to create a solution to a problem? In answering this question, provide brief explanations of the major advantages and limitations of essay questions. Clearly state whether you think an essay question should be used and explain the reasoning for your judgment.

The poor example assesses recall of factual knowledge, whereas the better example requires more of students. It requires them not only to recall facts but also to make an evaluative judgment and to explain the reasoning behind that judgment, which calls for more complicated thinking.

3.    Formulate questions that present a clear task to perform, indicate the point value for each question, provide ample time for answering, and use words that themselves give directions, e.g., define, illustrate, outline, select, classify, summarize.

Poor: Discuss the analytical method of teaching Mathematics.
Better: Discuss the analytical method of teaching Mathematics, giving its characteristics, merits, demerits, and practicability. Give illustrations.

The construction of clear, unambiguous essay questions that call forth the desired responses is a much more difficult task than is commonly presumed.

Rules for Scoring Essay Type Tests

As we noted earlier, one of the major limitations of the essay test is the subjectivity of the scoring. That is, the feeling of the scorers is likely to enter into the judgments they make concerning the quality of the answers. This may be a personal bias toward the writer of the essay, toward certain areas of content or styles of writing, or toward shortcomings in such extraneous areas as legibility, spelling, and grammar. These biases, of course, distort the results of a measure of achievement and tend to lower their reliability.

The following rules are designed to minimize the subjectivity of the scoring and to provide as uniform a standard of scoring from one student to another as possible. These rules will be most effective, of course, when the questions have been carefully prepared in accordance with the rules for construction.

Evaluate answers to essay questions in terms of the learning outcomes being measured. The essay test, like the objective test, is used to obtain evidence concerning the extent to which clearly defined learning outcomes have been achieved. Thus, the desired student performance specified in these outcomes should serve as a guide both for constructing the questions and for evaluating the answers.

If a question is designed to measure “The Ability to Explain Cause-Effect Relations,” for example, the answer should be evaluated in terms of how adequately the student explains the particular cause-effect relations presented in the question.

All other factors, such as interesting but extraneous factual information, style of writing, and errors in spelling and grammar, should be ignored (to the extent possible) during the evaluation. In some cases, separate scores may be given for spelling or writing ability, but these should not be allowed to contaminate the scores that represent the degree of achievement of the intended learning outcomes.

Score restricted-response answers by the point method, using a model answer as a guide. Scoring with the aid of a previously prepared scoring key is possible with the restricted-response item because of the limitations placed on the answer. The procedure involves writing a model answer to each question and determining the number of points to be assigned to it and to the parts within it. The distribution of points within an answer, of course, takes into account all scorable units indicated in the learning outcomes being measured. For example, points may be assigned to the relevance of the examples used and to the organization of the answer, as well as to the content of the answer, if these are legitimate aspects of the learning outcome. As indicated earlier, it is usually desirable to make clear to the student at the time of testing the bases on which each answer will be judged (content, organization, and so on).
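The point method can be sketched as a scoring key that lists the maximum points for each part of the model answer. The parts and point values below are hypothetical; in practice they would follow from the learning outcomes being measured.

```python
# Hypothetical scoring key for one restricted-response question (10 points).
scoring_key = {
    "content": 6,
    "relevance of examples": 2,
    "organization": 2,
}


def score_answer(awarded, key):
    """Total the points awarded, checking each part against the key."""
    for part, points in awarded.items():
        if part not in key:
            raise ValueError(f"'{part}' is not a scorable unit in the key")
        if not 0 <= points <= key[part]:
            raise ValueError(f"points for '{part}' must be between 0 and {key[part]}")
    return sum(awarded.values())


# Points a grader might award to one student's answer.
student_points = {"content": 4, "relevance of examples": 2, "organization": 1}
total = score_answer(student_points, scoring_key)
print(f"{total} out of {sum(scoring_key.values())}")
```

Fixing the key before reading any answers helps keep the same standard from the first paper to the last.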

Grade extended-response answers by the rating method, using defined criteria as a guide. Extended-response items allow so much freedom in answering that the preparation of a model answer is frequently impossible. Thus, the test maker usually grades each answer by judging its quality in terms of a previously determined set of criteria, rather than scoring it point by point with a scoring key. The criteria for judging the quality of an answer are determined by the nature of the question and thus by the learning outcomes being measured.

Evaluate all of the students’ answers to one question before proceeding to the next question. Scoring or grading essay tests question by question, rather than student by student, makes it possible to maintain a uniform standard for judging the answers to each question. This procedure also helps offset the halo effect in grading. When all of the answers on one paper are read together, the grader’s impression of the paper as a whole is apt to influence the grades he assigns to the individual answers. Grading question by question, of course, prevents the formation of this overall impression of a student’s paper. Each answer is more apt to be judged on its own merits when it is read and compared with other answers to the same question than when it is read and compared with other answers by the same student.

Evaluate answers to essay questions without knowing the identity of the writer. This is another attempt to control personal bias during scoring. Answers to essay questions should be evaluated in terms of what is written, not in terms of what is known about the writers from other contacts with them. The best way to prevent our prior knowledge from biasing our judgment is to evaluate each answer without knowing the identity of the writer. This can be done by having the students write their names on the back of the paper or by using code numbers in place of names.
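One way to use code numbers in place of names is sketched below. The function name, the three-digit code format, and the student names are assumptions made for the example; any scheme works as long as the grader cannot map a code back to a student while scoring.

```python
import random


def assign_code_numbers(names, seed=None):
    """Give each student a unique code number in random order.

    The returned mapping should be kept away from the grader until
    scoring is finished.
    """
    rng = random.Random(seed)
    codes = [f"{number:03d}" for number in range(1, len(names) + 1)]
    rng.shuffle(codes)
    return dict(zip(names, codes))


# Hypothetical class list.
students = ["Student A", "Student B", "Student C"]
code_for = assign_code_numbers(students)

for name, code in code_for.items():
    print(f"{name}: write code {code} on the answer sheet")
```

Shuffling the codes avoids a predictable link between the alphabetical class list and the numbers on the papers.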

Whenever possible, have two or more persons grade each answer. The best way to check on the reliability of the scoring of essay answers is to obtain two or more independent judgments. Although this may not be a feasible practice for routine classroom testing, it might be done periodically with a fellow teacher (one who is equally competent in the area). Obtaining two or more independent ratings becomes especially vital where the results are to be used for important and irreversible decisions, such as in the selection of students for further training or for special awards.
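A simple way to work with two independent judgments is sketched below: scores that agree closely are averaged, and answers on which the graders differ by more than a chosen tolerance are flagged for a further reading. The tolerance of one point, the code numbers, and the scores are hypothetical, and this is only one of several possible ways to combine ratings.

```python
def compare_ratings(rater_a, rater_b, tolerance=1):
    """Average close scores and flag large disagreements between two graders.

    rater_a, rater_b: dicts mapping the same answer codes to scores
    tolerance:        largest difference accepted without a further reading
    """
    averaged, flagged = {}, []
    for code in rater_a:
        a, b = rater_a[code], rater_b[code]
        if abs(a - b) > tolerance:
            flagged.append(code)
        else:
            averaged[code] = (a + b) / 2
    return averaged, flagged


# Hypothetical scores out of 10 from two graders, keyed by code number.
first_grader = {"001": 8, "002": 5, "003": 6}
second_grader = {"001": 7, "002": 9, "003": 6}

averaged, flagged = compare_ratings(first_grader, second_grader)
print("Averaged:", averaged)
print("Needs another reading:", flagged)
```

A large share of flagged answers would suggest that the question or the scoring criteria need to be made clearer.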

Be on the alert for bluffing. Some students who do not know the answer may write a well-organized, coherent essay that nevertheless contains material irrelevant to the question. Decide in advance how to treat irrelevant or inaccurate information contained in students’ answers. We should not give credit for irrelevant material. It is not fair to other students who may also have preferred to write on another topic, but instead wrote on the required question.

Write comments on the students’ answers. Teacher comments make essay tests a good learning experience for students. They also serve to refresh your memory of your evaluation should the student question the grade.

Reviewing Test Items [5]

Once the teacher has constructed his test items, regardless of the type and the format, he should ask himself the following questions:

1. Do the items truly measure what I am trying to measure?

2. Will the intent of the items be clear to someone reading them for the first time?

3. Do my learners have all of the information they need to answer the items?

4. Is the wording as clear and concise as possible? If not, can the item be revised and still understood?

5. Is the correct answer clearly correct and up-to-date according to experts in the field?

Checklist for Reviewing Multiple-Choice Items

1.    Has the item been constructed to assess a single written objective?

2.    Is the item based on a specific problem stated clearly in the stem?

3.    Does the stem include as much of the item as possible, without including irrelevant material?

4.    Is the stem stated in positive form?

5.    Are the alternatives worded clearly and concisely?

6.    Are the alternatives mutually exclusive?

7.    Are the alternatives homogeneous in content?

8.    Are the alternatives free from clues as to which response is correct?

9.    Have the alternatives “all of the above” and “none of the above” been avoided?

10.    Does the item include as many functional distractors as are feasible?

11.    Does the item include one and only one correct or clearly best answer?

12.    Has the answer been randomly assigned to one of the alternative positions?

13.    Is the item laid out in a clear and consistent manner?

14.    Are the grammar, punctuation, and spelling correct?

15.    Has unnecessarily difficult vocabulary been avoided?

16.    If the item has been administered before, has its effectiveness been analyzed?
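The checklist item on randomly assigning the answer to one of the alternative positions can be handled mechanically. The sketch below shuffles the alternatives of one item and records which letter ends up holding the correct answer; the function name is hypothetical, and the item reuses the isthmus example from the multiple-choice guidelines.

```python
import random


def shuffle_alternatives(correct, distractors, rng=None):
    """Place the correct answer at a random position among the distractors.

    Returns the lettered alternatives and the letter of the correct answer.
    """
    rng = rng or random.Random()
    options = [correct] + list(distractors)
    rng.shuffle(options)
    letters = "abcdefgh"
    answer_letter = letters[options.index(correct)]
    return list(zip(letters, options)), answer_letter


alternatives, answer_key = shuffle_alternatives(
    "isthmus", ["peninsula", "bayou", "continent"]
)

for letter, text in alternatives:
    print(f"{letter}. {text}")
print("Key:", answer_key)
```

Recording the key at the moment of shuffling avoids mistakes when the scoring key is prepared later.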




[4] Based on Stalnaker’s definition.

[5] Morrison, G.R., Ross, S.M., Kemp, J.E. (2004). Designing Effective Instruction. Fourth Edition.

[6] Cantor, J.A. (2001). Delivering Instruction to Adult Learners. Revised Edition.

About the author

Tamanna Kalim

Tamanna Kalim has worked in the development sector (health and education) for over 10 years in different multicultural settings and holds postgraduate degrees in both Public Health (MPH) and Education (MEd). Her current position is Clinical Administrator in Vancouver, BC, Canada.
