The Examiners Examined
An Examination of Examinations. By Sir Philip Hartog, K.B.E., C.I.E., and E. C. Rhodes, D.Sc. (Macmillan. 1s.)
This lucid and fascinating pamphlet is a summary of the results of a series of investigations on examinations carried out by committees appointed for the purpose at the International Conference on Examinations held in May, 1931, under the auspices of the Carnegie Corporation, the Carnegie Foundation, and the International Institute of Teachers College, Columbia University. The object of the investigations was to make a systematic comparison of the marks allotted by a number of independent examiners to sets of scripts actually written at certain public examinations. The examiners were all qualified for their task by previous experience of the kind of examination investigated and, in order to reproduce the psychological conditions of a real examination, were all paid for their work at the usual rates. The scripts used had been written at School Certificate Examinations, at Special Place Examinations (the examinations, held between the ages of 10 and 12, on the results of which elementary school children are admitted to secondary and central schools), at a College Scholarship Examination in English Essay, and at University Honours Examinations in History and in Mathematics. The results of these investigations are almost uniformly disquieting, and make it clear that in the verdicts given at the various public examinations, on which the start of a career or the chance of pursuing a particular kind of education may depend, a predominant part is played by pure chance.
Some of the most interesting results were obtained in the investigation on the School Certificate History Examination. Fifteen papers which had received exactly the same moderate mark from the original School Certificate examiner were marked independently by fourteen examiners, who were asked to assign to them both numerical marks and awards of Failure, Pass and Credit. After an interval of a year the same scripts, bearing of course no traces of the previous examination, were sent to the same examiners for re-marking. On the first occasion, though the scripts had all received exactly the same mark from the original School Certificate authority, they received from the fourteen examiners marks varying from 21 to 70 out of a maximum of 96. On the second occasion the marks varied from 16 to 71. The marks allotted by the same examiner to the same candidate on the two occasions were frequently notably dissimilar, in one case the difference being as much as 30 marks. Moreover, in assigning the verdicts of Failure, Pass or Credit, in 92 cases out of 210 an examiner judging a paper gave a different verdict on the second occasion from that which he had given on the first. In nine cases a candidate was moved two classes up or down, and one examiner was revealed to have changed his verdict in eight cases out of the fifteen. It is difficult to believe in the reliability of a process of measurement which, applied to the same material by the same examiner, can yield such conflicting results.
Yet this is the kind of result that was produced in each of the investigations. In the investigation on the School Certificate examination in English, 48 candidates were examined by 7 independent examiners, and in only one case out of the 48 were all the examiners agreed about the class in which the candidate should be placed, let alone the numerical mark which he should receive. In the investigation on the College Scholarship Examination in English Essay, the candidates were examined by five independent examiners, who assigned marks and divided them into classes of merit. It was found that the average range of difference in the marks awarded to a particular candidate was 19.6, that in one case there was a difference of 36 marks out of a maximum of 100, and that one of the candidates was placed in the 1st class by one examiner and in the 4th class by another. Moreover, though 17 candidates were placed in the 1st class by one or other of the examiners, not a single candidate was placed in this class by more than three out of the five examiners. In the investigation on the University History Honours Examination, for which the Oxford system of literal marking was used, wide and apparently irreconcilable differences of standard were revealed among the examiners, who were, as in all of these investigations, men of position and experience, and who incidentally included nine university professors. Thus in marking a paper on Mediaeval and Modern History one examiner marked all the scripts as β or better, and two examiners marked them all as β or worse. For one paper a candidate was awarded α by one examiner and γ+ by another, a difference of 18 grades out of a possible range of 24. In the marking of other candidates' papers there were differences of 17, 16, and 15 grades in the awards of different examiners. Thus in several cases a candidate placed in the third class by one examiner was given a first by another, and on the average there was a difference of a whole class in the marks awarded by the different examiners to the same candidate. In the investigation on the Special Place Examination the examiners worked in pairs, but it was shown that this method did not substantially diminish the element of chance. Although the examinations were of an elementary nature, and the examiners used carefully drawn-up marking schemes, there was a range of 63 marks out of 200 in the marks given to one candidate.
In addition to written examinations on a particular subject, the viva voce examination to test "alertness, intelligence, and general outlook," so dear to the Civil Service Commissioners and other authorities, was systematically investigated. A group of candidates, consisting of university graduates certified by their university authorities to be suitable as candidates for the Home Civil Service, was collected, and a prize of IWO was offered to the candidate who in the opinion of the examiners should be placed at the head of the list. Two independent boards of examiners were formed, and each candidate appeared before both. Each member of each board gave each candidate a mark, and a mark representing the view of the board as a whole was also awarded, either as the result of discussion among the members of the board or, if agreement could not be reached in this manner, by taking an average of their separate marks. The results reveal quite astounding differences between the views of the two boards. Thus the candidate placed 1st by the first board was placed 13th by the other, and the candidate placed 1st by the second board was placed 11th by the first. The average difference between the marks awarded to a candidate by the two boards was 37 marks out of a maximum of 300, with an extreme disagreement between them of 92 marks. In the marks awarded to a candidate by the separate members of the boards there were differences of up to 140 marks.
There would be little point in emphasising the conclusions to which these results must unequivocally lead. It is clear from them that examinations, as they are at present conducted, are always unreliable and generally unfair as tests of intellectual achievement. Since examinations today dominate the educational world, and as a result most of the professions and careers, detailed and systematic consideration should therefore be given to these investigations. But it is unfortunately easier to criticise than to suggest a practicable substitute for examinations. Even the authors of this pamphlet, and the committee on whose behalf they have produced it, have little definite to suggest. They do no more than proclaim the need for "careful and systematic experiment" through which methods of examination "not liable to the distressing uncertainties of the present system" can be devised. If, as a result of this experiment, a reliable and equitable method of examination is ever achieved, no small part of the credit will undoubtedly go to those who, like the promoters of the investigations described in this pamphlet, have provided the bases from which experiment could proceed. But the ideal seems so distant that one regretfully doubts