"Only when they are proven to be reliable and valid do measurements contain information; otherwise they provide only numbers or categories which convey a false impression of credibility." (Rothstein, 1985)
The development of outcome measures has become a relevant issue in rehabilitation and has contributed to a better understanding of how outcomes are linked to specific elements of treatment. Functional assessment and outcome analysis require a measurement process in which numerical values or categories are assigned to "latent" variables, so called because they manifest themselves through various behaviors at different times: functional independence, mood, communication, social participation, etc. (Franchignoni, 2008).
It should also be remembered that the relevance of an outcome can be judged from different perspectives (for example that of the disabled person or of society, of the researcher or of the public administrator), each of which implies a different scale of values for a given outcome; among these, the patient's perspective must always be central. In practice, what we sometimes wish to measure and analyze is a complex phenomenon whose causal factors cannot be captured by a single evaluation tool. It would often be more appropriate to analyze several distinct outcomes, each with the most suitable measurement tool.
The most common methods of collecting these variables involve rating scales or questionnaires. In rating scales, an examiner observes a specific parameter and assigns scores based on his or her own judgment, with minimal patient involvement. Questionnaires instead directly collect the patient's point of view: the patient reports the experience of subjective phenomena (pain, fatigue, etc.) or his or her own evaluations, also in relation to personal perspectives and expectations (for example, satisfaction indices).
What are they for?
The outcome evaluation represents the starting point to allow the clinician to:
• identify and characterize signs and symptoms, and the structural and functional limitations resulting from a clinical picture;
• plan the therapeutic program, establishing realistic rehabilitation objectives;
• monitor changes over time, also in order to verify the effectiveness of the treatments used and to formulate reliable prognoses;
• increase the number and quality of the interventions carried out, for the same resources used.
How are they structured?
Recognizing the level of measurement being carried out is essential to understanding how much information can be obtained from it. Basically, the data produced by measurement processes can be grouped into four categories or levels: nominal, ordinal, interval and ratio (Portney, 2000; Domholdt, 2005). The first two can be defined as "discrete" (the distance between two adjacent values is not constant or not known), the other two as "continuous" (i.e. with values that vary by known, equal increments).
“Discrete” categories
• Nominal level: describes relationships of equality and diversity, such as race, sex, nationality or clinical diagnosis, providing "identification labels" without defining any hierarchical order of importance or priority (this is not a true measurement in the strict sense). The only mathematical operation that can be performed is counting the members of each category.
• Ordinal level: the variable is ordered according to a progressive rank and classified by a greater-than/less-than criterion. The data can therefore be placed into adjacent categories without the intervals between them being known (for example: none, minimal, moderate, severe; good, fair, poor; and so on). The ordinal level therefore does not allow quantification of the variable in question (it cannot be stated that an item with score 2 corresponds to double an item with score 1), but only the definition of a relative position within a distribution.
“Continuous” categories
• Interval level: an interval scale has the characteristics of an ordinal scale and, in addition, has known and equal distances between the units of measurement. Examples of interval measurements are joint range of motion measured in angular degrees, temperature, and calendar time. An interval scale allows arithmetic operations (addition/subtraction, calculation of the mean) but not ratios, since zero, corresponding to the absence of the quantity in question, is not absolute but an arbitrarily chosen reference value.
• Ratio level: in addition to the characteristics of interval scales, in this case there is a non-arbitrary zero which represents the total absence of the quantity examined. Starting from this type of scale it is therefore possible to carry out all arithmetic operations and statistical analyses, obtaining the maximum level of information from the data. Examples are length, strength, speed.
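The interval-versus-ratio distinction can be checked numerically: because the Celsius zero is arbitrary, ratios of Celsius readings are meaningless, which becomes obvious when the same temperatures are re-expressed on the ratio (Kelvin) scale. A minimal illustration:

```python
def to_kelvin(celsius):
    """Convert an interval-scale reading (Celsius) to a ratio scale (Kelvin)."""
    return celsius + 273.15

# 20 degC looks like "twice" 10 degC only because 0 degC is an arbitrary zero
apparent_ratio = 20 / 10                      # 2.0
true_ratio = to_kelvin(20) / to_kelvin(10)    # about 1.035

assert apparent_ratio == 2.0
assert abs(true_ratio - 1.035) < 0.001
```

The same reasoning explains why it is legitimate to say that a 4-metre walk is twice a 2-metre walk (length is a ratio scale), but not that 40 degrees of knee flexion is "twice" 20 degrees of disability.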
The use of ordinal scales or questionnaires (the most common instruments) does not actually produce true measurements and limits the logical inferences that can be drawn, in particular those relating to the progress obtained following treatment.
To overcome this limitation, rigorous statistical procedures can be used, such as Rasch analysis (Wright, 1982; Tesio, 2003; Bond, 2001). Rasch analysis is based on item-response theory: in simple terms, if the ordinal scale measures only one latent variable, then more able subjects are more likely than others to obtain better scores on the more difficult items. The model transforms each individual's raw ordinal scores into true interval measures expressed in logits (log-odds units on a natural-logarithm scale), which can be presented together with an estimate of the standard error of measurement (Tesio, 2003).
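As an illustration of the core idea only (not of a full Rasch estimation procedure, which requires iterative fitting by dedicated software), the dichotomous Rasch model states that the probability of passing an item depends solely on the difference, in logits, between person ability and item difficulty:

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: probability that a person with the given
    ability (in logits) passes an item of the given difficulty (in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A more able person is more likely to pass the same hard item...
assert rasch_probability(2.0, 1.5) > rasch_probability(-0.5, 1.5)
# ...and when ability equals difficulty the probability is exactly 0.5
assert rasch_probability(1.5, 1.5) == 0.5
```

The ability and difficulty values above are invented for illustration; in practice they are estimated jointly from the full person-by-item response matrix.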
Rasch analysis allows a detailed validation of the measurement instrument, identifying items that are poorly written or improvable, or not homogeneous with the construct of interest. Furthermore, people with anomalous scores may be highlighted, due to deficits in specific domains, incorrect answers, compilation or transcription errors, and so on (Bond, 2001).
What are the selection criteria?
The choice between competing outcome measures is based on the psychometric and practical properties that each has been demonstrated to possess. These are psychometric requirements (such as reliability, validity and responsiveness) and technical and practical attributes (appropriateness, precision, interpretability, acceptability and feasibility).
Psychometric requirements
The presence of adequate levels of reliability and validity is sufficient for discriminative purposes (detecting differences between subjects or groups) and predictive purposes (classifying subjects into predefined classes for prognostic reasons), while for evaluative purposes (i.e. detecting changes within subjects over time, as in the analysis of the effectiveness of therapeutic interventions) a good level of responsiveness is also necessary.
Reliability is the degree to which a measurement is free from error and therefore the observed score is close to the "true" one.
It refers to the ability of the measurement system to provide consistent results even when measurements are carried out at different times and by different operators, obviously provided that the quantity under examination has not changed. The evaluation of reliability includes two aspects: internal consistency (or homogeneity) and reproducibility (or stability).
Internal consistency: represents the degree to which the items of a scale measure the same characteristic. This property is estimated in various ways, chiefly through Cronbach's alpha coefficient and the item-total correlation.
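As a concrete sketch, Cronbach's alpha can be computed from the variance of each item and the variance of the total scores; the person-by-item data below are invented for illustration:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of persons, each a list of item scores."""
    k = len(item_scores[0])                      # number of items

    def variance(xs):                            # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([person[i] for person in item_scores]) for i in range(k)]
    total_var = variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 4-person, 3-item questionnaire (0-4 Likert-type scores)
data = [[4, 3, 4], [2, 2, 3], [1, 1, 1], [3, 3, 4]]
alpha = cronbach_alpha(data)
assert 0.9 < alpha <= 1.0   # highly homogeneous items in this toy sample
```

Values of alpha close to 1 indicate that the items move together across persons, i.e. that they plausibly tap the same characteristic.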
Reproducibility: evaluates the degree to which an instrument provides the same results in repeated administrations, provided that no real changes have occurred in the variable under examination. Different types can be distinguished:
1. test-retest, which evaluates the stability of a measurement obtained without the involvement of external evaluators, for example in a self-administered questionnaire;
2. intra-operator and inter-operator, which evaluate the stability of the data recorded respectively by a single observer at different times or by two or more observers who evaluate the same variable separately;
3. alternate-forms reliability, which evaluates agreement between different forms of administration of an instrument (for example direct interview, pen-and-paper test, telephone questionnaire, etc.).
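For the inter-operator case with nominal or ordinal categories, one widely used index is Cohen's kappa, which corrects raw agreement for the agreement expected by chance alone. A minimal sketch (the ratings are invented for illustration):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Proportion of subjects on whom the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies
    categories = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

a = ["good", "fair", "poor", "good", "fair", "good"]
b = ["good", "fair", "fair", "good", "fair", "poor"]
kappa = cohens_kappa(a, b)
assert 0 < kappa < 1   # partial agreement beyond chance
```

Kappa equals 1 for perfect agreement and 0 when agreement is no better than chance; for continuous measurements, intraclass correlation coefficients play the analogous role.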
Validity is the degree to which the test actually measures what it is intended to measure.
Three main types of validity are recognised:
1. content validity: the degree to which the content of the items covers all the domains and truly significant aspects of the area that the instrument intends to measure. It is usually judged using the opinion of experts;
2. criterion validity: the degree to which an instrument predicts the results obtained from another that measures the same concept, administered at the same time (concurrent validity) or at a later time (predictive validity). It is evaluated by calculating a correlation coefficient or by sensitivity and specificity analysis;
3. construct validity: evaluates how well a measurement instrument fits a previously defined theoretical construct that is not directly observable (such as strength, functional independence, pain, quality of life, etc.). The validation process is never definitively concluded, as it is always possible to search, in an ever more extensive and precise way, for convergence or divergence of the instrument under examination with other variables considered representative of similar or different constructs, respectively.
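When criterion validity is assessed against a dichotomous gold standard, sensitivity and specificity follow directly from the four cells of the classification table. A minimal sketch with invented screening results:

```python
def sensitivity_specificity(test_positive, condition_present):
    """Sensitivity and specificity of a dichotomous test versus a criterion
    (gold standard). Both arguments are lists of booleans, one per subject."""
    pairs = list(zip(test_positive, condition_present))
    tp = sum(t and c for t, c in pairs)           # true positives
    fn = sum(not t and c for t, c in pairs)       # false negatives
    tn = sum(not t and not c for t, c in pairs)   # true negatives
    fp = sum(t and not c for t, c in pairs)       # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening results for 6 subjects
test = [True, True, True, False, False, False]
gold = [True, True, False, False, False, True]
sens, spec = sensitivity_specificity(test, gold)
assert abs(sens - 2 / 3) < 1e-9   # 2 of 3 true cases detected
assert abs(spec - 2 / 3) < 1e-9   # 2 of 3 non-cases correctly negative
```

Sensitivity answers "how many true cases does the instrument catch?", specificity "how many non-cases does it correctly exclude?"; both are needed to judge agreement with the criterion.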
Responsiveness is defined as the ability of an instrument to identify clinically significant changes in the measured variable.
This attribute is a fundamental property of an outcome measure, in both the clinical and research fields. The magnitude of score change that should be considered clinically important (the minimal clinically important difference) should be specified a priori in each study and be known to the clinician. It should be remembered, however, that it can vary depending on various factors (for example the type, severity or duration of the pathology) and therefore does not represent an absolute value attributable to the instrument in itself. Furthermore, the methodologies used to define responsiveness still lack unambiguous consensus, and the results obtained must be interpreted with caution due to the possible presence of multiple sources of error.
The concept of responsiveness differs from that of sensitivity to change (sometimes erroneously used as a synonym), which is the ability of an instrument to measure changes in a state (via indices such as the effect size, the standardized response mean, etc.) regardless of whether those changes are clinically significant (Franchignoni, 2006).
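The two indices just mentioned can be sketched numerically: the effect size divides the mean change by the standard deviation of the baseline scores, while the standardized response mean divides it by the standard deviation of the change scores (the pre/post scores below are invented):

```python
def sample_sd(xs):
    """Sample standard deviation."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

def effect_size_and_srm(baseline, follow_up):
    """Effect size (mean change / SD of baseline scores) and standardized
    response mean (mean change / SD of the change scores)."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    mean_change = sum(changes) / len(changes)
    return mean_change / sample_sd(baseline), mean_change / sample_sd(changes)

# Hypothetical pre/post scores on an outcome measure
before = [10, 12, 14, 16]
after = [12, 15, 18, 19]
es, srm = effect_size_and_srm(before, after)
assert es > 0 and srm > 0   # scores improved on average
assert srm > es             # change scores vary less than baseline scores here
```

Neither index, by itself, says whether the detected change matters clinically: that judgment requires the minimal clinically important difference discussed above.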
Technical and practical attributes
Appropriateness: represents the degree to which the instrument responds to clinical or scientific needs.
The reliability and validity possessed by each instrument are not valid in an absolute sense, but in relation to its application within a specific context. It is therefore always advisable to analyze the items of the scale or questionnaire in order to judge whether its construct allows you to measure exactly what you intend to measure, or to rely on a review of the literature to analyze to what extent the psychometric properties of the instrument have already been verified under the specific conditions of interest.
Precision: refers to the accuracy with which the instrument is able to capture real differences.
In "discrete" measurements this property is linked to the type and number of response categories (dichotomous responses, Likert-type scales, visual analogue scales, etc.), but also to the relationship between the range of difficulties covered by the various items and the true distribution of what is measured (detectable through analysis of scalability, hierarchical order and distribution of items by relative difficulty, as provided by Guttman or Rasch analysis). Furthermore, "ceiling" and "floor" effects, characterized by >20% of subjects in the analyzed sample obtaining the maximum or minimum score respectively, reflect limited precision of the instrument in discriminating between subjects and detecting their variations over time.
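The >20% rule of thumb for ceiling and floor effects is easy to check on a sample of scores; a minimal sketch (sample data invented):

```python
def ceiling_floor_effects(scores, min_score, max_score, threshold=0.20):
    """Return (ceiling_effect, floor_effect): True when more than `threshold`
    of the sample sits at the maximum or minimum possible score."""
    n = len(scores)
    ceiling = scores.count(max_score) / n > threshold
    floor = scores.count(min_score) / n > threshold
    return ceiling, floor

# Hypothetical scores on a 0-10 scale: many subjects max out
sample = [10, 10, 10, 9, 7, 8, 10, 6, 5, 8]
ceiling, floor = ceiling_floor_effects(sample, 0, 10)
assert ceiling is True   # 4/10 = 40% at the maximum: ceiling effect
assert floor is False    # nobody at the minimum
```

An instrument showing a ceiling effect in a given population cannot register further improvement in the subjects already at the top of the scale, which is precisely the loss of precision described above.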
Interpretability is the degree to which an instrument can be understood with relative ease by all those who come into contact with it professionally, and can be reused by any operator with adequate background and training.
Instruments with interpretability problems or in any case too complicated tend to produce greater variability in the responses, with a consequent reduction in the reliability values of the measurement.
Acceptability by patients is judged through direct and indirect parameters, such as interviewing the patient, the percentage of questions answered, or the completion time.
Feasibility refers to the simplicity of overall data management, i.e. it considers the technical-administrative burden and, more generally, the total costs of administering the tool. Among the numerous variables to consider are the time needed to train staff in correct analysis of the parameters, to explain the completion rules to the patient, and to collect and process the data (think of questionnaires distributed by post or of complex procedures for calculating final scores).
The physiotherapist who is about to use a measurement instrument must base his or her choice not only on the presence of the psychometric characteristics required by the specific objective and context (preferring instruments whose application has already been tested in conditions similar to those of interest), but also on aspects of a practical and applicative nature.
The language problem: the process of cultural adaptation of an instrument.
In Italy, as in numerous other nations, applying most of the more established measurement instruments, which were developed in English-speaking countries, requires an accurate cross-cultural adaptation that guarantees maximum semantic, idiomatic, conceptual and practical equivalence between the original and the new version (Franchignoni, 2003). This process is complex and involves the following phases:
• production of several independent translations of the scale, ideally each performed by a small group of individuals translating into their mother tongue who are only partially aware of the objectives and concepts connected with the scale in question;
• reverse translation of the scale into the source language (back-translation), carried out by other operators translating into their mother tongue who are not aware of the objectives and concepts connected with the material to be translated;
• definition of the final version by a multidisciplinary committee of experts, which compares the previous translations (highlighting errors and inconsistencies), applies structured techniques to resolve discrepancies and doubts, evaluates whether to modify or eliminate irrelevant, inadequate or ambiguous items, and generates the substitutions best suited to the cultural situation of reference, using simple and easily understandable language;
• an accurate pilot study on a representative sample of subjects, as a field verification (through targeted interviews) of any residual linguistic, stylistic and cultural problems in the population in which the instrument will be used clinically (Hagell, 2003).
The instruments translated and validated in Italian, which are constantly increasing in number, are collected and classified on this site (link). For some of them, chosen on the basis of strict criteria of clinimetric robustness and diffusion, an in-depth fact sheet will soon be available, illustrating their characteristics, areas of application and methods of interpretation.
To conclude
Many advances in this area are relatively recent and there is still a need to acquire knowledge about the complex relationships between therapeutic interventions, clinical and contextual variables on the one hand, and patient outcomes on the other. Therefore, despite the continuous and significant progress, the conscious application of outcome measures to health economics procedures and to improve the quality of individual clinical decisions requires further scientific experience (Wade, 2003).
Finally, it should be remembered that numerous limitations remain inherent in current outcome analysis methods, which require extreme caution, especially in the interpretation of "raw" scores taken from ordinal rating scales. In order to limit technical problems of a psychometric nature, it is desirable that statistical models (such as Rasch analysis) be increasingly used to derive the real metric characteristics of a scale, in particular its unidimensionality, the linearity of its scores and the degree of "difficulty" of its items (Wright, 1982; Bond, 2001; Tesio, 2003).
Essential bibliography:
- Bond TG, Fox CM. Applying the Rasch model: fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates, 2001.
- Domholdt E. Rehabilitation research - Principles and applications. St. Louis: Elsevier Saunders, 2005.
- Franchignoni F, Bazzini G. La valutazione dell’outcome in Medicina Fisica e Riabilitativa. In: Trattato di Medicina Fisica e Riabilitazione-Vol.1 (a cura di N. Valobra) - UTET Editore, Torino 2008, cap 27.
- Franchignoni F, Michail X. Selecting an outcome measure in Rehabilitation Medicine. Eura Medicophys 2003;39:67-8.
- Franchignoni F, Ring H. Measuring change in rehabilitation medicine. Eura Medicophys 2006;42:1-3.
- Hagell P, McKenna SP. International use of health status questionnaires in Parkinson’s disease: translation is not enough. Parkinsonism Relat Disord 2003;10:89-92.
- Portney LG, Watkins MP. Foundations of clinical research. Applications to practice. 2nd ed. New Jersey: Prentice Hall Health, 2000.
- Tesio L. Measuring behaviours and perceptions: Rasch analysis as a tool for rehabilitation. J Rehabil Med 2003;35:105-15.
- Wade DT. Outcome measures for clinical rehabilitation trials: impairment, function, quality of life, or value? Am J Phys Med Rehabil. 2003;82(10 Suppl):S26-31.
- Wright BD, Masters GN. Rating scale analysis. Chicago, IL: MESA Press, 1982.