Measurement, Assessment and Evaluation: Measurement, assessment, and evaluation mean very different things, and yet most students are unable to explain the differences adequately.
Measurement refers to the process by which the attributes or dimensions of some physical object are determined. One exception seems to be in the use of the word measure in determining the IQ of a person; the phrase “this test measures IQ” is commonly used, as is the idea of measuring such things as attitudes or preferences. However, when we measure, we generally use some standard instrument to determine how big, tall, heavy, voluminous, hot, cold, fast, or straight something actually is. Standard instruments refer to instruments such as rulers, scales, thermometers, pressure gauges, and the like. We measure to obtain information about what is. Such information may or may not be useful, depending on the accuracy of the instruments we use and our skill at using them. We measure how big a classroom is in terms of square feet, we measure the temperature of the room by using a thermometer, and we use a multimeter to determine the voltage, amperage, and resistance in a circuit. In all of these examples, we are not assessing anything; we are simply collecting information relative to some established rule or standard. Assessment is therefore quite different from measurement, and has uses that suggest very different purposes. When used in a learning objective, the definition provided for the behavioral verb measure1 is: To apply a standard scale or measuring device to an object, series of objects, events, or conditions, according to practices accepted by those who are skilled in the use of the device or scale.
Assessment is a process by which information is obtained relative to some known objective or goal. Assessment is a broad term that includes testing. A test is a special form of assessment. Tests are assessments made under contrived circumstances especially so that they may be administered. In other words, all tests are assessments, but not all assessments are tests. We test at the end of a lesson or unit. We assess progress at the end of a school year through testing, and we assess verbal and quantitative skills through such instruments as the SAT and GRE. Whether implicit or explicit, assessment is most usefully connected to some goal or objective for which the assessment is designed. A test or assessment yields information relative to an objective or goal. In that sense, we test or assess to determine whether an objective or goal has been obtained. Assessment of understanding is much more difficult and complex. Skills can be practiced; understandings cannot. We can assess a person’s knowledge in a variety of ways, but there is always a leap, an inference that we make about what a person does in relation to what it signifies about what he knows. According to behavioral verbs, to assess2 means to stipulate the conditions by which the behavior specified in an objective may be ascertained. Such stipulations are usually in the form of written descriptions.
Evaluation is perhaps the most complex and least understood of the three terms. Inherent in the idea of evaluation is “value.” When we evaluate, we are engaging in some process that is designed to provide information that will help us make a judgment about a given situation. Generally, any evaluation process requires information about the situation in question. A situation is an umbrella term that takes into account such ideas as objectives, goals, standards, procedures, and so on. When we evaluate, we are saying that the process will yield information regarding the worthiness, appropriateness, goodness, validity, legality, etc., of something for which a reliable measurement or assessment has been made. For example, I often tell my students that if they wanted to determine the temperature of the classroom, they would need to get a thermometer, take several readings at different spots, and perhaps average the readings. That is simple measurement. The average temperature tells us nothing about whether or not the room is appropriate for learning; to determine that, students would have to be polled in some reliable and valid way.
Teachers, in particular, are constantly evaluating students, and such evaluations are usually done in the context of comparisons between what was intended (learning, progress, behavior) and what was obtained. When used in a learning objective, the definition provided for the behavioral verb evaluate3 is: To classify objects, situations, people, conditions, etc., according to defined criteria of quality. Indication of quality must be given in the defined criteria of each class category. Evaluation differs from general classification only in this respect.
To sum up, we measure distance, we assess learning, and we evaluate results in terms of some set of criteria. These three terms are certainly connected, but it is useful to think of them as separate but connected ideas and processes.
Steps for Writing Test Items
1) Designing the Test
a. Selection of objectives and sub-units (content);
b. Weightage to objectives;
c. Weightage to different areas of content;
d. Weightage to different forms of questions;
e. Weightage to difficulty level;
f. Scheme of options;
g. Sections in the question paper.
2) Assembling the Test
a. Selection of test items;
b. Grouping of test items;
c. Sections in the question paper;
d. Directions for the test and, if necessary, directions for individual items or sections;
e. Preparing a marking scheme and scoring key.
3) Analyzing and Revising
a. Questionwise analysis to determine difficulty, discrimination, and reliability.
b. Retain, edit as necessary, or discard items based on analysis outcomes.
c. Revise the test as a whole if necessary.
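The question-wise analysis described above can be sketched in code. This is a minimal illustration under stated assumptions: items are scored dichotomously (0/1), the difficulty index is the proportion of students answering correctly, and the discrimination index compares the upper and lower 27% of students by total score; all data below are invented.

```python
# Sketch of question-wise item analysis: difficulty (proportion correct)
# and discrimination (upper 27% minus lower 27%), assuming 0/1 item scores.

def item_analysis(scores):
    """scores: one list of 0/1 item scores per student."""
    n_students = len(scores)
    n_items = len(scores[0])
    # Rank students by total score to form the upper and lower 27% groups.
    ranked = sorted(scores, key=sum, reverse=True)
    k = max(1, round(0.27 * n_students))
    upper, lower = ranked[:k], ranked[-k:]
    results = []
    for i in range(n_items):
        difficulty = sum(s[i] for s in scores) / n_students
        discrimination = (sum(s[i] for s in upper) - sum(s[i] for s in lower)) / k
        results.append({"item": i + 1,
                        "difficulty": round(difficulty, 2),
                        "discrimination": round(discrimination, 2)})
    return results

# Invented data: 6 students (rows) by 3 items (columns).
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
for row in item_analysis(scores):
    print(row)
```

Items with very high or very low difficulty, or with discrimination near zero or negative, are the usual candidates for editing or discarding.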
To begin with, the design of the test is prepared so that it may serve as an effective instrument of evaluation. A proper design should increase the validity, reliability, objectivity, and suitability of the test. Specimens of the weightage tables and the corresponding blueprint are given in the annexure.
Step 1: General Steps
1. Prepare an outline of the content to be covered by the test.
2. Produce a test blueprint, plotting the outline from step 1 against some hierarchy representing levels of cognitive difficulty or depth of processing.
3. For each check on the blueprint, match the question level indicated with a question type appropriate to that level.
4. For each check on the blueprint, jot down on a 3×5 card three or four alternative question ideas and item types which will get at the same objective.
7. Put the cards aside for one or two days.
8. Reread the items from the standpoint of a student, checking for construction errors.
9. Order the selected questions logically:
a. Place some simpler items at the beginning to ease students into the exam
b. Group item types together under common instructions to save reading time
c. If desirable, order the questions logically from a content standpoint (e.g. chronologically, in conceptual groups, etc.).
10. Put the questions away for one or two days before rereading them or have someone else review them for clarity.
Step 2: Blueprint
1. Don’t make it overly detailed.
2. It’s best to identify major ideas and skills rather than specific details.
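The blueprint idea can be made concrete as a small table of marks: content areas against cognitive levels, with the weightage totals checked against the full marks of the paper. The areas, levels, and mark allocations below are hypothetical, purely for illustration.

```python
# Hypothetical test blueprint: content areas (rows) against cognitive
# levels (columns); each cell holds the marks allotted.

blueprint = {
    "Fractions":  {"Knowledge": 4, "Application": 6, "Analysis": 2},
    "Geometry":   {"Knowledge": 3, "Application": 5, "Analysis": 4},
    "Statistics": {"Knowledge": 3, "Application": 2, "Analysis": 1},
}

def weightages(bp):
    """Return marks per content area and per cognitive level."""
    by_content = {area: sum(cells.values()) for area, cells in bp.items()}
    by_level = {}
    for cells in bp.values():
        for level, marks in cells.items():
            by_level[level] = by_level.get(level, 0) + marks
    return by_content, by_level

by_content, by_level = weightages(blueprint)
print(by_content)                 # weightage to content areas
print(by_level)                   # weightage to cognitive levels
print(sum(by_content.values()))   # grand total: should match the paper's full marks
```

Checking the row and column totals this way keeps the paper from drifting toward one content area or one cognitive level.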
Step 3: Question Types
The following array shows the most common question types used at various cognitive levels.
|Factual Knowledge |Application |Analysis and Evaluation |
|Multiple Choice | | |
Guidelines for Constructing Effective Test Items
1. Multiple Choice Questions
A multiple-choice item consists of two parts:
1) a problem (stem), and
2) a list of suggested solutions (alternatives).
The stem may be in the form of either a question or an incomplete statement, and the list of alternatives contains one correct or best alternative (answer) and a number of incorrect or inferior alternatives (distractors).
The purpose of the distractors is to appear as plausible solutions to the problem for those students who have not achieved the objective being measured by the test item. Conversely, the distractors must appear as implausible solutions for those students who have achieved the objective. Only the answer should appear reasonable to these students.
Guidelines for Writing Multiple Choice Questions
Poor: Skinner developed programmed instruction in _____.
b. 1954 (correct)
Better: Skinner developed programmed instruction in _____.
c. the 1950s (correct)
Poor: What is a claw hammer?
a. a woodworking tool (correct)
b. a musical instrument
c. a gardening tool
d. a shoe repair tool
Better: What is a claw hammer?
a. a woodworking tool (correct)
b. a metalworking tool
c. an auto body tool
d. a sheet metal tool
Poor: The mean is _____.
a. a measure of the average (correct)
b. a measure of the midpoint
c. a measure of the most popular score
d. a measure of the dispersion of scores
Better: The mean is a measure of the _____.
a. average (correct)
c. most popular score
d. dispersion of scores
Poor: Objectives are _____.
a. used for planning instruction (correct)
b. written in behavioral form only
c. the last step in the instructional design process
d. used in the cognitive but not affective domain
Better: The main function of instructional objectives is _____.
a. planning instruction (correct)
b. comparing teachers
c. selecting students with exceptional abilities
d. assigning students to academic programs.
Poor: A narrow strip of land bordered on both sides by water is called an _____.
a. isthmus (correct)
(Note: Do you see why a would be the best guess given the phrasing?)
Better: A narrow strip of land bordered on both sides by water is called a(n) _____.
2. True-False Items
Rules for Writing True/False Items:
1. Be certain that the statement is completely true or false.
Better: A performance standard of an objective should be stated in measurable terms. (True/False). (Note: The answer here is clearly true.)
Better: Bloom’s cognitive taxonomy includes six levels of objectives. (True/False)
Knowledge is the lowest-level objective in Bloom’s cognitive taxonomy. (True/False)
Poor: Abstract thinking is intelligence. (True/False)
Better: According to Garrett, abstract thinking is intelligence. (True/False)
3. Completion Items
Rules for Writing Completion Items:
1. Start with a direct question, then convert it to an incomplete statement, placing the blank at the end of the statement.
Poor: What is another name for cone-bearing trees? (Coniferous trees)
Better: Cone-bearing trees are also called______. (Coniferous trees)
2. Leave only one blank, and make sure it relates to the main point of the statement. Use two blanks only when the answer consists of two consecutive words.
Poor: The ___ is the ratio of the ___ to the___.
Better: The sine is the ratio of the ___ to the ___. (opposite side, hypotenuse)
3. Make the blanks uniform in length and avoid giving irrelevant clues to the correct answer.
Poor: The first president of the United States was _____. (Two words)
(Note: The desired answer is George Washington, but students may write “from Virginia”, “a general”, and other creative expressions.)
Better: Give the first and last name of the first president of the United States: _____.
4. Matching Items
Rules for Writing Matching Items
Column A                      Column B
1) Invented radium            a) Lal Bahadur Shastri
2) Discovered America         b) Bell
3) First P.M. of India        c) Vasco Da Gama
4) Invented Telephone         d) Marconi
                              f) Madame Curie
Here, column B consists of explorers, scientists, and politicians. For higher-class students, the list would be very heterogeneous.
5. Short Answer Type Items
Poor: What is the area of a rectangle whose length is 6 m and breadth 75 cm?
Better: What is the area of a rectangle whose length is 6 m and breadth 75 cm? Express your answer in sq. m.
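The arithmetic behind the better version, as a quick check (the units must agree before multiplying):

```python
# Area of the rectangle in sq. m: convert the breadth from cm to m first.
length_m = 6.0
breadth_m = 75 / 100      # 75 cm = 0.75 m
area_sq_m = length_m * breadth_m
print(area_sq_m)          # 4.5
```

A student who multiplies 6 by 75 without converting gets 450, which is exactly why the better version pins down the unit of the answer.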
6. Constructing Essay Type Items
“A test item which requires a response composed by the examinee, usually in the form of one or more sentences, of a nature that no single response or pattern of responses can be listed as correct, and the accuracy and quality of which can be judged subjectively only by one skilled or informed in the subject.”4
The difference between short-answer and essay questions is more than just in the length of response required. On essay questions there is more emphasis on the organization and integration of the material, such as when marshaling arguments to support a point of view or method. Essay questions can be used to measure attainment of a variety of objectives. Stecklein (1955) has listed 14 types of abilities that can be measured by essay items:
1. Comparisons between two or more things
2. The development and defense of an opinion
3. Questions of cause and effect
4. Explanations of meanings
5. Summarizing of information in a designated area
7. Knowledge of relationships
8. Illustrations of rules, principles, procedures, and applications
9. Applications of rules, laws, and principles to new situations
10. Criticisms of the adequacy, relevance, or correctness of a concept, idea, or information
11. Formulation of new questions and problems
12. Reorganization of facts
13. Discriminations between objects, concepts, or events
14. Inferential thinking.
All of these involve the higher-level skills in Bloom’s taxonomy, so essay questions provide an effective way of assessing complex learning outcomes when a paper-and-pencil test is necessary (e.g., assessing students’ ability to make judgments that are well thought through and justifiable). Essay questions require students to demonstrate their reasoning and thinking skills, which gives teachers the opportunity to detect problems students may have with their reasoning processes. When educators detect such problems in students’ thinking, they can help them overcome those problems.
Rules for Constructing Essay Questions:
1. Ask questions that are relatively specific and focused and which will elicit relatively brief responses.
Poor: Describe the role of instructional objectives in education. Discuss Bloom’s contribution to the evaluation of instruction.
Better: Describe and differentiate between behavioral (Mager) and cognitive (Gronlund) objectives with regard to their (1) format and (2) relative advantages and disadvantages for specifying instructional intentions.
2. If you are using many essay questions in a test, ensure reasonable coverage of the course objectives. Follow the test specifications in writing prompts. Questions should cover the subject areas as well as the complexity of behaviors cited in the test blueprint. Pitch the questions at the students’ level.
Poor: What are the major advantages and limitations of essay questions?
Better: Given their advantages and limitations, should an essay question be used to assess students’ abilities to create a solution to a problem? In answering this question, provide brief explanations of the major advantages and limitations of essay questions. Clearly state whether you think an essay question should be used and explain the reasoning for your judgment.
The poor question assesses recall of factual knowledge, whereas the better one requires more of students: it asks them not only to recall facts, but also to make an evaluative judgment and to explain the reasoning behind that judgment.
3. Formulate questions that present a clear task to perform; indicate the point value of each question; provide ample time for answering; and use words that themselves give direction, e.g., define, illustrate, outline, select, classify, summarize.
Poor: Discuss the analytical method of teaching Mathematics.
Better: Discuss the analytical method of teaching Mathematics, giving its characteristics, merits, demerits, and practicability. Give illustrations.
The construction of clear, unambiguous essay questions that call forth the desired responses is a much more difficult task than is commonly presumed.
Rules for Scoring Essay Type Tests
As we noted earlier, one of the major limitations of the essay test is the subjectivity of the scoring. That is, the feeling of the scorers is likely to enter into the judgments they make concerning the quality of the answers. This may be a personal bias toward the writer of the essay, toward certain areas of content or styles of writing, or toward shortcomings in such extraneous areas as legibility, spelling, and grammar. These biases, of course, distort the results of a measure of achievement and tend to lower their reliability.
The following rules are designed to minimize the subjectivity of the scoring and to provide as uniform a standard of scoring from one student to another as possible. These rules will be most effective, of course, when the questions have been carefully prepared in accordance with the rules for construction.
Evaluate answers to essay questions in terms of the learning outcomes being measured. The essay test, like the objective test, is used to obtain evidence concerning the extent to which clearly defined learning outcomes have been achieved. Thus, the desired student performance specified in these outcomes should serve as a guide both for constructing the questions and for evaluating the answers.
If a question is designed to measure “The Ability to Explain Cause-Effect Relations,” for example, the answer should be evaluated in terms of how adequately the student explains the particular cause-effect relations presented in the question.
All other factors, such as interesting but extraneous factual information, style of writing, and errors in spelling and grammar, should be ignored (to the extent possible) during the evaluation. In some cases, separate scores may be given for spelling or writing ability, but these should not be allowed to contaminate the scores that represent the degree of achievement of the intended learning outcomes.
Score restricted-response answers by the point method, using a model answer as a guide. Scoring with the aid of a previously prepared scoring key is possible with the restricted-response item because of the limitations placed on the answer. The procedure involves writing a model answer to each question and determining the number of points to be assigned to it and to the parts within it. The distribution of points within an answer, of course, takes into account all scorable units indicated in the learning outcomes being measured. For example, points may be assigned to the relevance of the examples used and to the organization of the answer, as well as to the content of the answer, if these are legitimate aspects of the learning outcome. As indicated earlier, it is usually desirable to make clear to the student at the time of testing the bases on which each answer will be judged (content, organization, and so on).
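The point method can be sketched as a scoring key: a model answer broken into scorable units, each with points attached. The units and point values here are hypothetical, purely for illustration.

```python
# Hypothetical scoring key for a restricted-response question: each
# scorable unit of the model answer carries a fixed number of points.

scoring_key = [
    ("states the cause correctly",  2),
    ("states the effect correctly", 2),
    ("gives a relevant example",    1),
]

def score_answer(units_present, key):
    """units_present: the set of unit labels the grader judged present."""
    return sum(points for unit, points in key if unit in units_present)

max_points = sum(points for _, points in scoring_key)
earned = score_answer({"states the cause correctly", "gives a relevant example"},
                      scoring_key)
print(f"{earned}/{max_points}")   # 3/5
```

Writing the key down before grading begins is what keeps the standard uniform from one student's answer to the next.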
Grade extended-response answers by the rating method, using defined criteria as a guide. Extended-response items allow so much freedom in answering that the preparation of a model answer is frequently impossible. Thus, the test maker usually grades each answer by judging its quality in terms of a previously determined set of criteria, rather than scoring it point by point with a scoring key. The criteria for judging the quality of an answer are determined by the nature of the question and thus by the learning outcomes being measured.
Evaluate all of the students’ answers to one question before proceeding to the next question. Scoring or grading essay tests question by question, rather than student by student, makes it possible to maintain a uniform standard for judging the answers to each question. This procedure also helps offset the halo effect in grading. When all of the answers on one paper are read together, the grader’s impression of the paper as a whole is apt to influence the grades he assigns to the individual answers. Grading question by question, of course, prevents the formation of this overall impression of a student’s paper. Each answer is more apt to be judged on its own merits when it is read and compared with other answers to the same question than when it is read and compared with other answers by the same student.
Evaluate answers to essay questions without knowing the identity of the writer. This is another attempt to control personal bias during scoring. Answer to essay questions should be evaluated in terms of what is written, not in terms of what is known about the writers from other contacts with them. The best way to prevent our prior knowledge from biasing our judgment is to evaluate each answer without knowing the identity of the writer. This can be done by having the students write their names on the back of the paper or by using code numbers in place of names.
Whenever possible, have two or more persons grade each answer. The best way to check on the reliability of the scoring of essay answers is to obtain two or more independent judgments. Although this may not be a feasible practice for routine classroom testing, it might be done periodically with a fellow teacher (one who is equally competent in the area). Obtaining two or more independent ratings becomes especially vital where the results are to be used for important and irreversible decisions, such as in the selection of students for further training or for special awards.
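A simple check on such independent judgments is the correlation between two raters' marks on the same set of answers. A sketch with invented scores, computing the Pearson coefficient directly:

```python
# Pearson correlation between two raters' marks (scores invented).

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

rater_a = [8, 6, 9, 4, 7, 5]   # first rater's marks on six essays
rater_b = [7, 6, 9, 5, 8, 4]   # second rater's marks on the same essays

print(round(pearson(rater_a, rater_b), 2))
```

A low coefficient signals that the scoring criteria need to be defined more sharply before the results are used for important decisions.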
Be on the alert for bluffing. Some students who do not know the answer may write a well-organized, coherent essay, but one containing material irrelevant to the question. Decide how to treat irrelevant or inaccurate information in students’ answers; we should not give credit for irrelevant material. That would not be fair to other students who might also have preferred to write on another topic, but who instead answered the required question.
Write comments on the students’ answers. Teacher comments make essay tests a good learning experience for students. They also serve to refresh your memory of your evaluation should the student question the grade.
Reviewing Test Items5
Once the teacher has constructed his test items, regardless of the type and the format, he should ask himself the following questions:
1. Do the items truly measure what I am trying to measure?
2. Will the intent of the items be clear to someone reading it for the first time?
3. Do my learners have all of the information they need to answer the items?
4. Is the wording as clear and concise as possible? If not, can the item be revised and still understood?
5. Is the correct answer clearly correct and up-to-date according to experts in the field?
Checklist for Reviewing Multiple-Choice Items
1. Has the item been constructed to assess a single written objective?
2. Is the item based on a specific problem stated clearly in the stem?
3. Does the stem include as much of the item as possible, without including irrelevant material?
4. Is the stem stated in positive form?
5. Are the alternatives worded clearly and concisely?
6. Are the alternatives mutually exclusive?
7. Are the alternatives homogeneous in content?
8. Are the alternatives free from clues as to which response is correct?
9. Have the alternatives “all of the above” and “none of the above” been avoided?
10. Does the item include as many functional distractors as are feasible?
11. Does the item include one and only one correct or clearly best answer?
12. Has the answer been randomly assigned to one of the alternative positions?
13. Is the item laid out in a clear and consistent manner?
14. Are the grammar, punctuation, and spelling correct?
15. Has unnecessarily difficult vocabulary been avoided?
16. If the item has been administered before, has its effectiveness been analyzed?
4 Based on Stalnaker’s definition.
5 Morrison, G.R., Ross, S.M., & Kemp, J.E. (2004). Designing Effective Instruction (4th ed.).
6 Cantor, J.A. (2001). Delivering Instruction to Adult Learners (Rev. ed.).
Author: Material Developer, BRAC Education Programme, BRAC, Dhaka, Bangladesh.