Language Testing
1. Achievement test
It refers to the mastery of what has been learnt, what has been taught or what is in the syllabus, textbook, materials, etc. It therefore is an instrument designed to measure what a person has learned within or up to a given time. It is based on a clear and public indication of the instruction that has been given. The content of achievement tests is a sample of what has been in the syllabus during the time under scrutiny and as such they have been called parasitic on the syllabus. Because achievement tests are typically used at the end of a period of learning, a school year or a whole school or college career, their results are often used for decision making purposes, notably selection.
2. Aptitude test
An instrument to measure the extent to which an individual possesses specific language learning ability. Such tests are usually used for selection and diagnosis and for prediction of language learning success. Research is somewhat unclear on the existence of a general aptitude variable and the tests that exist normally claim to predict success only in terms of defined learning outcomes or distinct methodologies.
3. Central tendency
A term used to summarise the central point in the distribution of values in a data set. There are three common measures of central tendency: the mode, the median and the mean. The choice of which measure to use will depend on the type of data (and hence the meaningfulness of any one of these measures) and the purpose of the analysis. Different statistical procedures may require the use of one or other measure of central tendency.
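As a minimal sketch of the three measures, the following uses Python's standard `statistics` module on a small set of hypothetical raw scores. The data are invented for illustration and do not come from any real test.

```python
from statistics import mean, median, mode

# Hypothetical raw scores for ten test takers (illustrative data only).
scores = [12, 15, 15, 16, 17, 18, 18, 18, 20, 21]

central = {
    "mean": mean(scores),      # arithmetic average: 170 / 10 = 17
    "median": median(scores),  # average of the two middle scores: 17.5
    "mode": mode(scores),      # most frequently occurring score: 18
}
print(central)
```

Here the three measures differ slightly, which is typical of a small, slightly skewed set of scores.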
4. Communicative language tests
Tests of communicative skills, typically used in contradistinction to tests of grammatical knowledge. Such tests often claim to operationalise theories of communicative competence, although the form they take will depend on which dimensions they choose to emphasise, be it specificity of context, authenticity of materials or the simulation of real-life performance.
5. Diagnostic test
Used to identify test takers’ strengths and weaknesses, by testing what they know or do not know in a language, or what skills they have or do not have. Information obtained from such tests is useful at the beginning of a language course, for example, for placement purposes, for selection, for planning of courses of instruction or for identifying areas where remedial instruction is necessary.
6. Difficulty
The extent to which a test or test item is within the ability range of a particular candidate or group of candidates. Most tests are designed in such a way that the majority of items are not too difficult or too easy for the relevant sample of test candidates. Preliminary trialling is often undertaken to ensure this is the case. The notion of test difficulty is always relative to the underlying ability that is being measured and is therefore a key consideration in establishing a test’s validity or discriminability.
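One common way of quantifying item difficulty during trialling is the proportion of candidates who answer each item correctly, often called item facility. The entry above does not prescribe an index, so the sketch below is an assumption about how difficulty might be estimated; the function name `item_facility` and the response data are hypothetical.

```python
# Hypothetical scored responses: one row per candidate, one column per item
# (1 = correct, 0 = incorrect). Illustrative data only.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

def item_facility(rows):
    """Return the proportion of candidates answering each item correctly."""
    n_candidates = len(rows)
    # zip(*rows) regroups the data by item (column) rather than by candidate.
    return [sum(column) / n_candidates for column in zip(*rows)]

facility = item_facility(responses)
print(facility)  # [1.0, 0.6, 0.2, 0.8]
```

For this sample the first item (1.0) is answered correctly by everyone and the third (0.2) by very few, so both tell the tester little about differences between candidates in this group.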
7. Discrimination
A fundamental property of language tests in their attempt to capture the range of individual abilities. On that basis the more widely discriminating the test the better it is. In classical test theory item discrimination is an important indicator of a test’s reliability.
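Item discrimination can be estimated in several ways. The sketch below uses one simple classical approach, the upper-lower index: the proportion of high scorers who answer an item correctly minus the proportion of low scorers who do. The choice of this index, the use of top and bottom thirds, the function name and the data are all assumptions made for illustration.

```python
def discrimination_index(item_scores, total_scores):
    """Upper-lower discrimination index for one dichotomous item.

    item_scores:  1/0 result on the item for each candidate.
    total_scores: total test score for each candidate, in the same order.
    The top and bottom thirds of candidates are compared (an assumed cut-off).
    """
    # Rank candidates from highest to lowest total score.
    ranked = sorted(zip(total_scores, item_scores),
                    key=lambda pair: pair[0], reverse=True)
    group_size = max(1, len(ranked) // 3)
    upper = [item for _, item in ranked[:group_size]]
    lower = [item for _, item in ranked[-group_size:]]
    return sum(upper) / group_size - sum(lower) / group_size

# Hypothetical data for six candidates.
totals = [28, 25, 22, 18, 14, 10]
item = [1, 1, 1, 0, 1, 0]

d = discrimination_index(item, totals)
print(d)  # 0.5
```

A value near 1 means the item separates stronger from weaker candidates well; a value near 0, or a negative value, suggests that it does not.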
8. Distractor
Any response in a forced-choice item which is not the key, ie it is not the correct choice but is offered as a means of ascertaining whether candidates are able enough to distinguish the right answer from a range of alternatives. While distractors should not make greater demands on candidates’ ability than the key, they should be sufficiently plausible to be selected as the correct option by a good number of candidates. Distractors which are chosen by very few candidates contribute little to the item because they effectively reduce the number of alternatives, thereby increasing the probability that candidates will arrive at the correct answer by guessing alone.
9. Criterion-referenced test
A test that examines the level of knowledge of, or performance on, a specific domain of target behaviors (ie the criterion) which the candidate is required to have mastered. The test domain is typically, but not necessarily, a specific course of instruction. In this case criterion-referenced tests are useful for teachers both in clarifying teaching objectives and in determining the degree to which they have been met. Criterion-referenced tests are also often used for professional accreditation purposes. Test scores report a candidate’s ability in relation to the criterion. Strictly speaking, such a test is concerned only with whether candidates have reached a given point rather than with how far above or below the criterion they may be.
10. Median
A descriptive statistic, measuring central tendency: the middle score or value in a set. If the distribution contains an even number of scores, the median is the average of the middle two. Half of the scores in the set are higher than the median, and half are lower. Although the median is more subject to chance variations than the mean, it may be appropriate to use the median in preference to the mean when the set of scores includes a small number of outliers which would distort the mean.
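The effect of outliers on the two measures can be illustrated with a short sketch. The scores are hypothetical; the second list simply adds two very low scores to the first.

```python
from statistics import mean, median

# Hypothetical scores; an even number, so the median averages the middle two.
scores = [55, 58, 60, 62, 64, 66]
# The same scores with two outlying low scores added.
scores_with_outliers = [3, 5] + scores

median_before = median(scores)                # (60 + 62) / 2 = 61.0
median_after = median(scores_with_outliers)   # (58 + 60) / 2 = 59.0
mean_before = mean(scores)                    # 365 / 6, about 60.8
mean_after = mean(scores_with_outliers)       # 373 / 8 = 46.625

print(median_before, median_after)
print(round(mean_before, 1), mean_after)
```

The two outliers shift the median by only two points but pull the mean down by about fourteen.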
11. Mode
A descriptive statistic, measuring central tendency: the most frequently occurring score or score interval in a distribution. The mode is the easiest measure of central tendency to locate, but it is also the least stable and the least used, since a chance variation of a single mark might make a considerable difference to the mode. Unlike the mean and the median, it must of necessity be an actually occurring score. The mode is useful for requirements such as reporting the most common score amongst members of a particular group.
12. Normal distribution
Also normal distribution curve, Gaussian curve, normal probability curve.
A theoretical concept central to most statistical thinking, based on the common observation that events or physical characteristics show a similar symmetrical pattern or distribution. This distribution is bell-shaped, and in language tests the scores of a population are assumed to be similarly distributed, with most test takers scoring around the average and progressively fewer scoring towards the extremes. The mid-point of the normal curve is both median and mode as well as the mean. The spread of the distribution on either side of the mean is indicated by the standard deviation. The assumption that the data are normally distributed underlies the application of parametric statistical tests.
13. Observed score
A test candidate’s actual test score. This observed score is assumed to imperfectly represent the true score due to measurement error. Classical test theory is based on the recognition of the fact that abilities, being abstract, can never be measured directly. It assumes that the observed score consists of two components, the true score and the error score (variation which is not due to ability and which is unsystematic). Thus the variance of a set of observed test scores consists of the true score variance plus the error variance.
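The decomposition of an observed score into a true score and an error score can be sketched numerically. In practice neither component is observable; the values below are hypothetical and are chosen so that the errors are uncorrelated with the true scores, which is the condition under which the variances add up exactly.

```python
from statistics import pvariance

# Hypothetical true scores and error scores (illustrative only).
true_scores = [10, 10, 20, 20]
error_scores = [-1, 1, -1, 1]   # unsystematic: uncorrelated with true scores

# Each observed score is the sum of its true score and its error score.
observed_scores = [t + e for t, e in zip(true_scores, error_scores)]

var_true = pvariance(true_scores)          # 25
var_error = pvariance(error_scores)        # 1
var_observed = pvariance(observed_scores)  # 26

print(observed_scores)                     # [9, 11, 19, 21]
print(var_true, var_error, var_observed)
```

With these values the observed score variance (26) equals the true score variance (25) plus the error variance (1).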
14. Reliability
The actual level of agreement between the results of one test with itself or with another test. Such agreement, ideally, would be the same if there were no measurement error, which may arise from bias of item selection (parallel forms, split-half, rational equivalence), from bias due to time of testing (test-retest) or from examiner bias (inter-rater reliability checks). It is common to say that reliability is a necessary but not a sufficient quality of a test. While reliability focuses on the empirical aspects of the measurement process, validity focuses on the theoretical aspects and seeks to interweave these concepts with the empirical ones. For this reason it is easier to assess reliability than validity.
15. Classical testing theory
Also known as classical true score measurement theory, according to which an observed score (on a test) is made up of a true score and an error score. The standard error of measurement of a test is an index of the extent to which the observed score is influenced by the error score. Since the purpose of a test is to achieve reliable observed scores, ie as close as possible to true scores, much of the effort put into test construction concerns ways of promoting and estimating test reliability. Although classical theory is still much in vogue, its inability to handle different types of error and its total reliance on the sample under test have been criticized.
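The entry does not give a formula for the standard error of measurement. A commonly used classical estimate is the standard deviation of the observed scores multiplied by the square root of one minus the reliability coefficient. The sketch below assumes that formula; the function name and the figures are hypothetical.

```python
import math

def standard_error_of_measurement(sd_observed, reliability):
    """Classical estimate: SEM = SD of observed scores * sqrt(1 - reliability)."""
    return sd_observed * math.sqrt(1 - reliability)

# Hypothetical test: observed-score SD of 10 and reliability of 0.91.
sem = standard_error_of_measurement(sd_observed=10.0, reliability=0.91)

# A band of one SEM either side of a hypothetical observed score of 65.
observed = 65
band = (observed - sem, observed + sem)
print(round(sem, 2), tuple(round(x, 2) for x in band))
```

Under these assumptions the SEM is about 3 score points, so the more reliable the test, the narrower the band around any observed score.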
16. Validity
The quality which most affects the value of a test, prior to, though dependent on, reliability. A measure is valid if it does what it is intended to do, which is typically to act as an indicator of an abstract concept (for example height, weight, time, etc.) which it claims to measure. The validity of a language test therefore is established by the extent to which it succeeds in providing an accurate concrete representation of an abstract concept (for example proficiency, achievement, aptitude). The most commonly referred to types of validity are face validity, content validity, construct validity and empirical validity.
17. Standard deviation
The standard deviation is a property of the normal curve. Mathematically, it is the square root of the variance of a test. Along with the mean, it is one of the more widely used statistics in language testing. The standard deviation provides an informative summary of the variation or distribution of a set of scores around the mean. In principle, for a norm-referenced test, the larger the standard deviation the better, on the grounds that what such a test aims to describe is the range of abilities in the group under test. It follows that in norm-referenced tests a small standard deviation indicates that the true variation of the ability in the population has not been captured by the test. And since the standard deviation is an indicator of reliability, the larger the standard deviation the more reliable the test.
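The relationship between variance and standard deviation can be shown with a short sketch on hypothetical scores. It treats the scores as a complete population and so uses `pvariance` and `pstdev`; that choice is an assumption, and for a sample drawn from a larger population `variance` and `stdev` (which divide by n - 1) would be used instead.

```python
import math
from statistics import mean, pstdev, pvariance

# Hypothetical scores for eight test takers (illustrative data only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

score_mean = mean(scores)         # 5
score_variance = pvariance(scores)  # mean squared deviation from the mean: 4
score_sd = pstdev(scores)         # square root of the variance: 2.0

print(score_mean, score_variance, score_sd)
```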
18. Standardized score
A transformation of raw scores which provides a measure of relative standing in a group and allows comparison of raw scores from different distributions. It does this by converting a raw score into a standard frame of reference which is expressed in terms of its relative position in the distribution of scores. The Z score is the most commonly used standardized score.
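A Z score expresses a raw score as its distance from the mean in standard deviation units. The sketch below compares two raw scores from different distributions; the test names, means and standard deviations are hypothetical.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

# Hypothetical results for one candidate on two tests with different scales.
z_test_a = z_score(70, mean=50, sd=20)  # 1.0
z_test_b = z_score(66, mean=60, sd=4)   # 1.5

print(z_test_a, z_test_b)
```

Although 70 is the higher raw score, the candidate's relative standing is better on the second test, where the score lies 1.5 standard deviations above the mean.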
19. T score
A transformation of a Z score, equivalent to it but with the advantage of avoiding negative values, and hence often used for reporting purposes.
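The entry does not state the transformation itself. By the usual convention a T score rescales Z scores to a mean of 50 and a standard deviation of 10; the sketch below assumes that convention, and the function name is hypothetical.

```python
def t_score(z, mean=50, sd=10):
    """Rescale a Z score to the conventional T scale (mean 50, SD 10)."""
    return mean + sd * z

# Hypothetical Z scores, including a negative one.
z_values = [-1.5, 0.0, 1.0, 2.0]
t_values = [t_score(z) for z in z_values]

print(t_values)  # [35.0, 50.0, 60.0, 70.0]
```

A Z score of -1.5 becomes a T score of 35, so the candidate's relative position is preserved while the negative sign disappears.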
20. Norm-referenced test
A type of test whereby a candidate’s scores are interpreted with reference to the performance of the other candidates. Thus the quality of each performance is judged not in its own right, or with reference to some external criterion, but according to the standard of the group as a whole. In other words, norm-referenced tests are more concerned with spreading individuals along an ability continuum, the normal curve, than with the nature of the task to be attained, which is the focus of criterion-referenced tests.