
British Educational Research Journal, Vol. 22, No. 5, 1996






Meanings and Consequences: a basis for distinguishing formative and summative functions of assessment?





DYLAN WILIAM & PAUL BLACK, King's College London




ABSTRACT The assessment process is characterised as a cycle involving elicitation of evidence, which when interpreted appropriately may lead to action, which, in turn, can yield further evidence and so on. An assessment is defined as serving a formative function when it elicits evidence that yields construct-referenced interpretations that form the basis for successful action in improving performance, whereas summative functions prioritise the consistency of meanings across contexts and individuals. Aspects of the interplay of meanings and consequences are explored for each of the three phases, and it is suggested that this interplay may be fruitful in distinguishing the two functions. Tensions between summative and formative functions of assessment are illustrated in the context of the National Curriculum, and although it is shown that such tensions will always exist, it is suggested that the separation of the elicitation of evidence from its interpretation can mitigate that tension.

Introduction

The terms formative and summative assessment are not very common in the technical literature on assessment. For example, the third edition of the classic text in the field, Educational Measurement (Linn, 1989), indexes only a single mention of each, both in the chapter by Nitko (1989). Part of this is explained by differences in the terms used on the two sides of the Atlantic, but in our view a much more significant factor is that the day-to-day activities of teachers have historically been of little interest to academic researchers in this area.

The term 'formative evaluation' was first used by Michael Scriven (1967) in connection with the improvement of curriculum, but Bloom et al. (1971) were the first to extend the usage to its generally accepted current meaning. They defined summative evaluation tests as those assessments given at the end of units, at mid-term and at the end of a course, which are designed to judge the extent of students' learning of the material in a course, for the purpose of grading, certification, evaluation of progress or even for researching the effectiveness of a curriculum (Bloom et al., 1971, p. 117).




They contrasted these with 'another type of evaluation which all who are involved—student, teacher, curriculum maker—would welcome because they find it so useful in helping them improve what they wish to do' (p. 117), which they termed 'formative evaluation'.

From the earliest use of these terms, it was stressed that the terms applied not to the assessments themselves, but to the functions they served. On the one hand, the results of an assessment that had been designed originally to fulfil a summative function might be used formatively, as is the case when a teacher administers a paper from a previous year in order to help students to prepare for an examination. On the other hand, one does not have to go far to find examples of assessments intended to have some formative value whose results are used simply as summative judgements of the achievement of students.

Subsequently, Airasian & Madaus (1972) augmented the classification by the addition of diagnostic and placement functions for assessment. However, this fourfold classification is neither particularly useful nor illuminating since the relationship between the four kinds of function is not clearly drawn out, and in any case, the terms are used in rather different senses from those currently accepted, at least in the UK. Nitko (1989) does not use the term 'formative' for types of assessments, preferring to classify assessment decisions as placement decisions, diagnostic decisions, monitoring decisions, and attainment decisions.


Characteristics of Formative Assessment

Whatever the labels that are used to describe it, formative assessment itself is, of course, nothing new. Almost all successful teaching (and certainly any teaching that is successful over a sustained period) relies heavily on adapting the teaching in the light of evidence about the success of previous episodes. These adaptations vary in terms of both their scope (e.g. the number of students involved) and time-scale.

At one extreme, a teacher explaining something to an individual student may amend his or her approach almost instantaneously in response to a frown of puzzlement on the student's face or some other aspect of body language. The adaptations may occur in teaching situations with just as narrow a focus, but over a much longer time-scale, as happens when a music teacher decides that a particular approach to teaching the violin, which has worked with many students, is not working with a particular individual (narrow focus, longer time-scale). Conversely, when teaching to a whole class, many teachers use a 'reference group' of students (Dahllof, 1971) to judge the pacing of their lessons (broader focus, short time-scale). At the other extreme, an adaptation can have both a broad scope and take place over a long time-scale, as in the case of a 5-yearly review of an undergraduate degree programme.

The common element in all of these assessment processes is that of feedback, defined by Ramaprasad (1983) as 'information about the gap between the actual level and the reference level of a system parameter which is used to alter the gap in some way' (p. 4). As Sadler (1989) notes:

An important feature of Ramaprasad's definition is that information about the gap between actual and reference levels is considered as feedback only when it is used to alter the gap. If the information is simply recorded, passed to a third party who lacks either the knowledge or the power to change the outcome, or is too deeply coded (for example, as a summary grade given by the teacher) to lead to appropriate action, the control loop cannot be closed, and 'dangling data' substituted for effective feedback. (p. 121, emphasis in original)
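Ramaprasad's control-loop view can be sketched informally as follows. The sketch is offered only as an illustration of the distinction between feedback and 'dangling data'; the function names, numerical 'levels' and adjustment rate are hypothetical and are not drawn from Ramaprasad or Sadler.

```python
# Illustrative sketch only: Ramaprasad's control-loop notion of feedback.
# All names and numbers are hypothetical stand-ins for a learner's
# actual and reference levels of performance.

def gap(actual: float, reference: float) -> float:
    """Information about the discrepancy between actual and reference levels."""
    return reference - actual

def record_only(actual: float, reference: float) -> float:
    """'Dangling data': the gap is measured and stored, but nothing is altered."""
    return gap(actual, reference)  # the control loop is never closed

def feedback_step(actual: float, reference: float, adjustment: float = 0.5) -> float:
    """Feedback proper: the gap information is used to alter the gap."""
    return actual + adjustment * gap(actual, reference)

# Repeated feedback steps close the gap; merely recording it does not.
level, target = 2.0, 10.0
for _ in range(5):
    level = feedback_step(level, target)
print(round(level, 2))  # approaches 10.0 (here 9.75)
```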

Formative functions of assessment are therefore validated in terms of their consequences as much as their meanings. Until recently, however, the consequences of summative assessments have been excluded from validity arguments, but reformulations by Messick (1980), Cronbach (1988) and Madaus (1988) have emphasised that assessments are validated not simply by their outcomes, but also by what happens as a result of those outcomes:

Validity is an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (Messick, 1989, p. 13, emphases in original)

A consideration of the relationship between formative and summative functions of assessment must therefore take place in the light of discussion of how the more general concepts of validity and reliability impinge upon the relationship.

Validity and Reliability

If a response to (say) a single multiple-choice question or to a question asked in a class discussion is held to be inadequate, it is by no means clear what we are entitled to conclude. This is because the answer is a measure not only of the pupil's ability in relation to the standard of understanding intended by the questioner, but also of a variety of other features such as the language, the context of the question, or the pupil's tiredness at the time. The response is therefore a measure of a variety of properties and its validity is contaminated, because it does not measure only the feature intended (i.e. it embodies construct-irrelevant variance), or, to follow a different view of validity, action based on the response may not be appropriate.

Such difficulties may be dealt with in a variety of ways. One may ask a variety of questions, which differ in language and in context and have in common only the particular understanding that is of interest; one might go further and set such questions on different occasions so that the pupils may be in different states of mind. In such manoeuvres, the aim is to average out all those effects which are not desired and so enhance the signal to noise ratio for the effect of interest. This can be seen as enhancement of validity by repetition or averaging. This is what is commonly called reliability, and this first response to the problem of the mixed or contaminated validity of an item is the classical one for summative testing.
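In classical measurement terms, the gain from averaging comes from the fact that independent, irrelevant influences tend to cancel as more comparable questions are combined. As a purely illustrative aside (a standard statistical result rather than part of the original argument), if each response is affected by an error component of standard deviation sigma_e, then the error in the mean of n independent responses is

```latex
% Standard result, included only to illustrate 'enhancement by averaging':
% the error in the mean of n independent responses shrinks with sqrt(n).
\sigma_{\bar{e}} = \frac{\sigma_e}{\sqrt{n}}
```

so that, other things being equal, quadrupling the number of comparable questions roughly halves the 'noise' against which the understanding of interest has to be detected.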

A different response may be more appropriate in the case of the question asked in the class discussion. The teacher may respond with a correction, or with a follow-up question that has a diagnostic purpose. This is a quite different type of response. It may be justified for two reasons. First, if the pupil's response were due (say) to a misunderstanding or misinterpretation of the original question, this may quickly become apparent. That is to say, the consequences allow correction for invalidity of the evidence. Secondly, for the teacher to ask the pupil several other questions in different language and contexts, in order to be sure that the interpretation of misunderstanding was well founded, would be impracticable, except in extraordinary circumstances—for example, where the teacher might judge that follow-up responses might simply create confusion if the issue had not been properly identified.

Thus, in the fluid action of the classroom where rapid feedback is important, optimum validity depends upon the self-correcting nature of the consequent action. In the formal test, which is on a different time-scale and for which correction by follow-up is out of the question, optimum validity requires a collection of items. Consequently, in the way that this discussion is developed, reliability is not a separate issue—it is subsumed into validity.

However, these two scenarios, the classroom discussion and the formal test, are best regarded as two ends of a spectrum. There are many intermediates. One would be an oral examination—where fewer questions are usually possible than in a formal test but validity can be enhanced through the use of responsive and interactive dialogue. Another would be a reflective appraisal of a pupil's last three essays in order to give an overall grade. On one interpretation, the average over the three is 'reliable' because it might iron out influences on performance which are not relevant to interpreting the evidence as a prediction for future writing (i.e. it would iron out the invalidities). On another interpretation, the best of the three would be more valid as it might indicate what could be achieved if the pupil were strongly interested or motivated in the task. Or again, it might be the last of the three because this would indicate where the pupil had arrived after having the full advantage of the teacher's feedback. In this example, the choice between these is not to do with any tension between formative and summative functions, either in terms of the data collected or in terms of the criteria for interpretation. Furthermore, only a teacher who knew the particular student and the context of the work could make that best choice—he or she would be giving meaning to the data in the light of the predictive consequences that might follow from the result. There might also be feedback consequences. Sadler (1989) points out that a mechanical approach in which students' course-work marks are accumulated to give a summative result can have the effect of making the student unwilling to repeat work in order to improve it, because the same investment of time might add more to the total through production of yet another mediocre essay.


The Assessment Cycle

Both formative and summative functions of assessment require that evidence of performance or attainment is elicited, interpreted and acted upon, in some way. These actions may then directly or indirectly generate more evidence so that the cycle is repeated. The key agents in this process are, of course, the assessed and the assessor (often called the teacher and the student below, for simplicity), although sometimes the assessed and the assessor will be the same individual. However, many aspects of the process cannot be understood without acknowledging that this relationship between the assessed and the assessor is itself influenced by the relationships that each of the agents has to a wider social context.

Although there is no natural beginning or ending to the process of assessment, for the sake of discussion, it is convenient to start with the elicitation of evidence.

Eliciting Evidence

Before any inferences can be made, or actions taken, some evidence about the level of performance must be generated and observed. We can immediately distinguish between purposive and incidental evidence. Purposive evidence is that which is elicited as a result of a deliberate act by someone (usually the teacher) that is designed to provide evidence about a student's knowledge or capabilities in a particular area. This most commonly takes the form of direct questioning (whether orally or in writing). Of course, this will not guarantee that if the student has any knowledge or understanding in the area being assessed, then evidence of this attainment will be elicited. One way of asking a question might produce no answer from the student, while a slightly different approach may elicit evidence of achievement. We can never be absolutely sure that we have exhausted all the possibilities, so that we can never be sure that the student does not know something, but some assessments will be better than others in this respect. The extent to which an assessment can be relied upon to yield evidence of attainment where it exists has been called the disclosure of the assessment (Wiliam, 1992a).

Disclosure can be regarded as a technical issue—essentially an aspect of test-retest reliability—but this would ignore the crucial fact that all assessments take place in essentially social settings. For a variety of reasons, students may choose to fabricate or withhold evidence of attainment (MacNamara & Roper, 1992), and this has led some to conclude that true formative assessment can occur only when no external agency is involved:

The indispensable conditions for improvement are that the student comes to hold a concept of quality roughly similar to that held by the teacher, is able to monitor continuously the quality of what is being produced during the act of production itself, and has a repertoire of alternative moves or strategies from which to draw at any given point. (Sadler, 1989, p. 121, emphases in original)

Another example of difficulty with disclosure arises in the assessment of teaching practice for students engaged in courses of initial teacher education, where students often state to their tutors that everything is going smoothly even when it is not, because the potential gain (in terms of advice about possible solutions) is outweighed by the loss entailed in admitting that they are having difficulties.

Certainly locating the responsibility for elicitation, interpretation and action within the individual obviates some of the difficulties of disclosure (particularly that of the conscious withholding of evidence), although we should not assume that incomplete disclosure ceases to be a problem when no external agency is involved. We all have considerable capacity for deluding ourselves about the actual level of performance, particularly in emotionally-charged settings.

In the absence of any direct probing (whether by the individual interrogating his or her own performance or by an external agency), evidence of achievement is also spontaneously and continuously generated. This 'windfall' evidence, because it has not been generated as a result of particular probing, can often be more robust than evidence gained purposively: the fact that a student chose to use a particular skill, and did so successfully, is often evidence of deeper understanding than being able to apply that particular skill when told to do so. However, as the name 'windfall' implies, we may have to wait a very long time (or forever!) for the appropriate evidence to be generated. In this sense, robustness and disclosure are in tension: the more we make it clear what we want, the more likely we are to get it, but the less likely it is to mean anything.

As well as the means by which it is generated, evidence also differs in the form in which it is generated. Traditionally, only evidence that exists in some permanent form (as writing, artefacts, or on audio- or videotape) has been relied upon in formal assessment settings, with their concern to establish consistency across raters, while ephemeral evidence has been largely discounted. However, as far as formative assessment is concerned, inter-rater consistency is of secondary importance, and ephemeral evidence can be an entirely appropriate form of evidence.



Unfortunately, the evanescent nature of ephemeral evidence means that it must be captured immediately or lost. Where students are working in small groups within a classroom, they may well demonstrate very high-quality speaking and listening skills, but this may not be observed by the teacher, because she was in another part of the room at the time. For this reason, many teachers encourage students to write down, or record in some permanent form, the otherwise ephemeral evidence of their attainment. It is tempting to regard this as a process of conversion, from one form of evidence to another, but many students have difficulties in, for example, expressing in writing what they have articulated quite fluently in oral form, while others are 'tongue-tied' in classroom discussions, and can only really express themselves in writing. It is, therefore, perhaps more appropriate to regard the two forms of assessment as existing in parallel, with each being an imperfect representation of the quality of thought that gave rise to it (these can be regarded as issues of fidelity; see Wiliam, 1992a).

Even when disclosure and fidelity are not problematic, the nature or timing of the data can limit the kinds of function that it can serve. For example, the information in the data might be too coarse-grained to be useful, or might just come at the wrong time. As Sadler (1989) points out, evidence that is elicited at the end of a course cannot serve a formative function for the students involved (although it could be formative in terms of the course for future students).


Interpretation

Of course evidence by itself is not information until it is interpreted, and the same evidence can be interpreted in different ways. In most classrooms, the interpreter is the teacher. She has a notion of what she would like the students to be able to do, and by examining the evidence, determines whether there is, in fact, a gap. For example, in a Key Stage 1 science lesson on 'floating and sinking', a student may correctly predict that a 'boat' made of metal foil floats, while predicting that a pellet of the same foil sinks. The teacher may conclude that the student has 'understood' floating and sinking. However, it could be that the student has a rather different set of conceptions: the student might believe that compressing the foil into a ball makes it heavier, which is why it sinks. The same pattern of responses is consistent with many different sets of student conceptions.

Wiliam (1992b) describes a situation where a pupil had generated some data about a mathematical relationship in which the number of free edges in an arrangement of octagonal tiles can be found by multiplying the number of tiles by three and adding eight.

The student had stated:

If you want to get the number of free edges, then you take the number of tiles, like 6, and times it by three, so you get 18 and then add eight so you get 26.

One teacher inferred from this that the student had derived a general rule, but was presenting the example by virtue of a 'generic example', and felt that the activity demonstrated attainment of a particular statement in the mathematics national curriculum (make generalisations). The other teacher concluded that the student had only presented a specific example, and thus one could not infer that the student had made a general statement. (p. 11)

Both teachers observed the same evidence of attainment, but they disagreed about its interpretation. For the first teacher there was no gap between the actual level and the reference level (in this case, whether the student could, in fact, make generalisations), but for the second teacher, there was a gap.

Questioning as a Turing test. The mathematician Alan Turing once proposed a simple test for deciding whether a machine was intelligent (Turing, 1950). He proposed that a person (the 'judge') should sit in a room with two keyboard terminals, one connected to a computer, and the other connected to a human operator. The judge would then ask a series of questions of the computer and the human operator, trying to discover which was which. If the judge were unable to determine which was which, then the computer could be said to be intelligent. In many ways, classroom questioning has this character. The teacher elicits and examines evidence of attainment based on his or her model of what it is to 'understand' the ideas in question, trying to establish whether the student(s) share this model. Provided the students' answers are consistent with the teacher's model, they will be regarded as having 'understood' the topic. However, as von Glasersfeld (1987, p. 13) has pointed out, all such a process establishes is that the teacher's schemas and the students' schemas both 'fit' the frame established by the questions, not that they 'match'. No amount of probing can establish conclusively that the schemas match, but the more demanding the 'Turing test', the more likely there is to be a match.


Action

Assessments yield evidence that can be interpreted in different ways for different purposes, but these interpretations are means to an end rather than ends in themselves. With very few exceptions, assessments are conducted for a purpose, and certain actions are contingent on the outcomes. Indeed, if nothing different can happen as the result of an assessment, there can be little point in conducting the assessment in the first place.

For placement decisions, the consequences can be acceptance for, or rejection from, employment or a course of study. Separate from, but related to, these direct consequences are the social consequences of the decisions and the way they are made. However, such a function of an assessment is not formative according to the view presented here because Ramaprasad's definition of feedback requires that the information generated is actually used to close the gap between actual and desired levels of performance. If we discover that there is a gap, but have no idea about the nature of the discrepancy between actual and desired performance, then this (almost inevitably norm-referenced) information does not help us close the gap, and therefore fails to qualify as feedback. Such a process would be better described as simply monitoring.

To qualify as feedback, as well as alerting us to the existence of a gap, the information must actually be useful in closing the gap between actual and desired levels of performance. The information must therefore have embedded within it some degree of prescription about what must be done. The information must be related to a developmental model of growth in the domain being addressed—in short, it must be construct-referenced (Messick, 1975).

To sum up, in order to serve a formative function, an assessment must yield evidence that, with appropriate construct-referenced interpretations, indicates the existence of a gap between actual and desired levels of performance, and suggests actions that are in fact successful in closing the gap. Crucially, an assessment that is intended to be formative (i.e. has a formative purpose) but does not, ultimately, have the intended effect (i.e. lacks a formative function) would not, with this definition, be regarded as formative.



The Relationship between Formative and Summative Functions

The definition of the formative function of assessment adopted here places conditions on both the interpretations made of the evidence and on the consequent actions. Any assessment must elicit evidence of performance, which is capable of being interpreted (however invalidly). Whether or not these interpretations and actions satisfy the conditions for formative functions, the fact that interpretable evidence has been generated means that the assessment can serve a summative function. Therefore all assessments can be summative (i.e. have the potential to serve a summative function), but only some have the additional capability of serving formative functions. The question is not, therefore, whether an assessment can serve both functions, but the extent to which serving one has an adverse effect on its ability to serve the other.

As noted above, summative and formative functions are, for the purpose of this discussion, characterised as the ends of a continuum along which assessment can be located. At one extreme (the formative) the problems of creating shared meanings beyond the immediate setting are ignored: assessments are evaluated by the extent to which they provide a basis for successful action. At the other extreme (summative) shared meanings are much more important, and the considerable distortions and undesirable consequences that arise are often justified by appeal to the need to create consistency of interpretation. Presenting this argument somewhat starkly, when formative functions are paramount, meanings are validated by their consequences, and when summative functions are paramount, consequences are validated by meanings.


Formative Assessment in the National Curriculum

In 1988, the British Government's National Curriculum Task Group on Assessment and Testing (TGAT) published its proposals for an assessment and reporting structure for the National Curriculum in England and Wales. The group took the view (NCTGAT, 1988) that a single assessment system could serve both summative and formative functions, provided the formative function was the foundation of the system:

It is possible to build up a comprehensive picture of the overall achievements of a pupil by aggregating, in a structured way, the separate results of a set of assessments designed to serve a formative purpose. However, if assessments were designed only for summative purposes, then formative information could not be obtained, since the summative assessments occur at the end of a phase of learning and make no attempt at throwing light on the educational history of the pupil. It is realistic to envisage, for the purpose of evaluation, ways of aggregating the information on individual pupils into accounts of the success of a school, or LEA [local education authority], in facilitating the learning of those for whom they are responsible; again the reverse is an impossibility. (para. 25)


This view has been criticised by many who have asserted that formative and summative functions cannot co-exist in any meaningful way. It is certainly true that the involvement of external agencies creates difficulties in terms of disclosure as noted above, but many other difficulties can be alleviated by separating the elicitation of the evidence from the interpretation of the evidence.

For example, in the first version of the National Curriculum, the attainment targets for mathematics and science were presented in terms of statements of attainment (296 for mathematics and 407 for science), each of which was allocated to one of the 10 levels of the National Curriculum. Many teachers devised elaborate record sheets that would allow them to indicate, for each statement of attainment, whether it had been achieved by a student. Originally, such a record sheet served a formative function: it gave detailed construct-referenced information on a student's current attainment, and, just as importantly, what had not yet been attained. While some teachers did question the notion of progression inherent in the allocation of the statements of attainment to levels, most seemed happy to accept that the student's next objectives were defined in terms of those statements just beyond the 'leading edge' of attained statements.

When a student produced evidence that indicated that she or he had partially achieved a statement (perhaps by demonstrating a skill in only a limited variety of contexts), then teachers would often not 'tick off' the statement, so that they would be reminded to re-evaluate the student's performance in this area at some later date. Since there are typically many opportunities to 'revisit' a student's understanding of a particular area, this seems a good strategy, especially since a false-negative attribution (assuming that a student does not know something they do, in fact, know) is, in an educational setting, likely to be far less damaging than a false-positive (assuming that they do know something they do not).

However, many schools subsequently chose to derive the summative levels required in National Curriculum assessment by the inflexible application of a formula—more often than not the 'n-1' rule suggested by the School Examinations and Assessment Council (SEAC) [2]. This immediately created a tension between formative and summative functions of the assessment. Where teachers had left statements 'unticked' in order to prompt them to return to those aspects at a later date, students who had relatively complete understandings were often regarded as not having met the criterion. In order to prevent this happening, teachers then stopped using the record sheets in this formative way, and started using them to record when the student had achieved a sufficient proportion of the domain addressed by the statement. The record sheets became entirely summative records of statements covered (a process that has been described as 'scalphunting'!).
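The mechanical character of this kind of aggregation can be seen by writing out the 'n-1' rule described in note [2] as a short procedure. The sketch below is illustrative only: the record format is hypothetical (teachers' record sheets were not, of course, computerised in this way), and the assumption that levels are attained in sequence is ours for the purposes of the example.

```python
# Illustrative sketch of the SEAC 'n-1' rule described in note [2].
# The record format is hypothetical; only the counting rule is taken from the note.

def level_awarded(record: dict[int, list[bool]]) -> int:
    """Highest level awarded for one attainment target under the 'n-1' rule.

    'record' maps each National Curriculum level to the ticked/unticked
    statements of attainment at that level. With one or two statements all
    must be attained; with three or more, all but one must be attained.
    """
    awarded = 0
    for level in sorted(record):
        ticks = record[level]
        required = len(ticks) if len(ticks) <= 2 else len(ticks) - 1
        if sum(ticks) >= required:
            awarded = level
        else:
            break  # assume levels must be attained in sequence (our assumption)
    return awarded

# A statement left deliberately 'unticked' for formative reasons can depress
# the summative outcome: this pupil misses level 3 by one unticked statement.
sheet = {1: [True, True], 2: [True, True, False], 3: [True, True, False, False]}
print(level_awarded(sheet))  # 2
```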

The tension between summative and formative functions arose in this situation because of the inflexible application of a mechanical rule for aggregation that had the effect of conflating the elicitation of evidence with its interpretation. The distorting effect of the summative assessment can be mitigated if, instead of using an algorithmic formula, aggregation is by a process of reassessment (Wiliam, 1995). In other words, instead of relying on the results that already incorporate interpretations in order to serve a formative function, the teacher could look at the original evidence. The same point can be made in the context of the 'three-essay' example discussed earlier. Using some predetermined algorithm applied to the three scores already allocated to the three essays is one method of deriving an aggregate score, which may serve some purposes, but others may be better served by going back to the original essays. Evidence that was interpreted one way to serve a formative function can be interpreted quite differently to serve a summative function.

As long as a distinction is maintained between the elicitation and the interpretation of evidence, formative functions need not be incompatible with National Curriculum assessment. However, having said this, it is worth noting that there is a large and growing body of research evidence that shows how difficult it is to introduce effective formative assessment into classroom practice (see, for example, Andrews, 1987, 1988; Torrance, 1991; Pole, 1993; Fairbrother, 1995).


Conclusion

There can be no doubt that significant tensions are created when the same assessments are required to serve both formative and summative functions. One response to this would be to say that the two functions require completely different approaches to elicitation, so that there is no prospect that different interpretations of the same evidence, or different actions based on the same interpretation can ever serve formative and summative functions adequately. This would have one of two effects. One possible outcome is that teachers' activities would be restricted to formative functions, with all summative assessment being undertaken by external agencies—an option teachers do not want (Brown et al., 1995). The other possible outcome is that teachers are required to administer two parallel but completely separate assessment systems. Quite apart from the workload, which is likely to be inordinate, it seems likely that there would be serious backwash into teaching and learning.

The other response, and the one we have explored here, is to regard the two functions as the extremes of a continuum. Clearly, the problems identified above with regard to elicitation of evidence mean that not all evidence generated to serve a formative function can be used to serve a summative one. It would be very difficult to argue that responses to an 'off-the-cuff' question to a class in the middle of an episode of teaching would have any significance beyond the immediate context of the classroom. Conversely, evidence elicited at the end of a sequence of teaching can have very little formative influence on the students assessed. However, between these clear cases, it seems to us that there may be some common ground between the formative and summative functions. Finding this common ground will be difficult, since the issues are subtle and complex, and we have made only a small contribution here.

We are more confident that we have established that there are clear benefits to be gained in separating the interpretation of evidence from its elicitation, and the consequent actions from the interpretations, although we are still aware that more needs to be done in clarifying these issues. We hope that others will join in this debate, for we believe the potential advantages are significant, while the alternatives—teachers completely disconnected from all summative assessment or required to administer two separate assessment systems—would be disastrous.

Correspondence: Dylan Wiliam, King's College London, School of Education, Cornwall House, Waterloo Road, London SE1 8WA, UK. Email: dylan.wiliam@kcl.ac.uk


NOTES

This paper was presented as part of a Symposium at the 1995 British Educational Research Association conference entitled 'Formative and summative assessment: resolving the tension'.

[1]  In much of the technical literature a distinction is drawn between selection decisions, which are made when deciding whether a candidate has the necessary potential to benefit from a college degree course on the one hand, and placement decisions, which are taken subsequently to decide which course would be most suitable.



[2]  This rule suggested that where there were one or two statements of attainment at a particular level in an attainment target, all had to be attained in order for the student to be awarded that level, and where there were three or more statements, all but one had to be attained. It was widely assumed by schools that the use of this rule was a legal requirement, although this was not, in fact, the case (Wiliam, 1992c).


REFERENCES

AIRASIAN, P.W. & MADAUS, G.F. (1972) Functional types of student evaluation, Measurement and Evaluation in Guidance, 4, pp. 221-233.

ANDREWS, S. (1987) The Achievements of Robert Arthur Essex: sixty employers give their perceptions of the Essex Records of Achievement Summary Portfolio (London, Industrial Society).

ANDREWS, S. (1988) Records of Achievement: R. A. Essex leaves school (sixty school/college leavers speak about involvement in Records of Achievement) (London, Industrial Society).

BLOOM, B.S., HASTINGS, J.T. & MADAUS, G.F. (Eds) (1971) Handbook on the Formative and Summative Evaluation of Student Learning (New York, McGraw-Hill).

BROWN, M.L., MCCALLUM, E., TAGGART, B., BRANSON, J. & GIPPS, C.V. (1995) Validity and impact of national tests in the primary school: the teacher's view, paper presented at the twenty-first Annual Conference of the British Educational Research Association held at University of Bath, September (London, King's College London School of Education).

CRONBACH, L.J. (1988) Five perspectives on validity argument, in: H. WAINER & H. I. BRAUN (Eds) Test Validity, pp. 3-17 (Hillsdale, NJ, Lawrence Erlbaum Associates).

DAHLLÖF, U. (1971) Ability Grouping, Content Validity and Curriculum Process Analysis (New York, NY, Teachers College Press).

FAIRBROTHER, R.W. (1995) Pupils as learners, in: R. W. FAIRBROTHER, P. J. BLACK & P. N. G. GILL (Eds) Teachers Assessing Pupils: lessons from science classrooms, pp. 105-120 (Hatfield, Association for Science Education).

LINN, R.L. (Ed.) (1989) Educational Measurement, 3rd edn (Washington DC, American Council on Education/Macmillan).

MACNAMARA, A. & ROPER, R. (1992) Attainment target 1 – is all the evidence there? Mathematics Teaching, 140, pp. 26-27.

MADAUS, G.F. (1988) The influence of testing on the curriculum, in: L.N. TANNER (Ed.) Critical Issues in Curriculum: the 87th yearbook of the National Society for the Study of Education (part 1), pp. 83-121 (Chicago, IL, University of Chicago Press).

MESSICK, S. (1975) The standard problem: meaning and values in measurement and evaluation, American Psychologist, 30, pp. 955-966.

MESSICK, S. (1980) Test validity and the ethics of assessment, American Psychologist, 35, pp. 1012-1027.

MESSICK, S. (1989) Validity, in: R. L. LINN (Ed.) Educational Measurement, pp. 13-103 (Washington DC, American Council on Education/Macmillan).

NATIONAL CURRICULUM TASK GROUP ON ASSESSMENT AND TESTING (1988) A Report (London, Department of Education and Science).

NITKO, A.J. (1989) Designing tests that are integrated with instruction, in: R. L. LINN (Ed.) Educational Measurement, pp. 447-474 (Washington DC, American Council on Education/Macmillan).

POLE, C.J. (1993) Assessing and Recording Achievement: implementing a new approach in School (Buckingham, Open University Press).

RAMAPRASAD, A. (1983) On the definition of feedback, Behavioral Science, 28, pp. 4-13.

SADLER, D.R. (1989) Formative assessment and the design of instructional systems, Instructional Science, 18, pp. 145-165.

SCRIVEN, M. (1967) The Methodology of Evaluation (Washington DC, American Educational Research Association).

TORRANCE, H. (1991) Records of achievement and formative assessment: some complexities of practice, in: R. E. STAKE (Ed.) Advances in Program Evaluation: using assessment to reform education (Greenwich, CT, JAI Press).

TURING, A.M. (1950) Computing machinery and intelligence, Mind, 59(236), pp. 433-460.

VON GLASERSFELD, E. (1987) Learning as a constructive activity, in: C. JANVIER (Ed.) Problems of Representation in the Teaching and Learning of Mathematics (Hillsdale, NJ, Lawrence Erlbaum).

WILIAM, D. (1992a) Some technical issues in assessment: a user's guide, British Journal for Curriculum and Assessment, 2(3), pp. 11-20.



WILIAM, D. (1992b) Inset for national curriculum assessment: lessons from the key stage 3 SATs trials and pilots, British Journal for Curriculum and Assessment, 2(2), pp. 8-11.

WILIAM, D. (1992c) National curriculum assessment arrangements – the legal minimum, Education and the Law, 4, pp. 135-144.


WILIAM, D. (1995) Combination, aggregation and reconciliation: evidential and consequential bases, Assessment in Education: principles, policy and practice, 2, pp. 53-73.
