How can validity be measured?

To establish content validity, you consult experts in the field and look for a consensus of judgment. Measuring content validity therefore entails a certain amount of subjectivity, albeit subjectivity tempered by consensus.

Criterion-Related Validity

The next part of the tripartite model is criterion-related validity, which does have a measurable component. Concurrent validity measures correlations with criteria that are assessed at the same time. It is often used in education, where a new test of, say, mathematical ability is correlated with other math scores held by the school.

Predictive validity measures correlations with other criteria separated by a set period of time.

Construct Validity

Constructs, like usability and satisfaction, are intangible and abstract concepts. Convergent validity indicates how well a measure correlates with other measures that ostensibly measure the same thing. To measure convergent validity, have participants in a study answer your questions along with a previously validated instrument.

Discriminant validity establishes that one measure is not related to another measure. When we create a new measure of, say, customer excitement, we want to show that it is distinct from existing measures of related constructs. Ideally, you are able to show both discriminant and convergent validity with your measures to establish construct validity.

Criterion-related validity: Correlate the measure with some external gold-standard criterion that your measure should predict, such as conversion rates, sales, recommendation rates, or actual usage by customers.

Construct validity: Correlate the measure with other known measures. Correlate a new measure of usability with the SUS.

Correlate a new measure of loyalty with the Net Promoter Score. High correlations indicate convergent validity. If your measure is supposed to measure something different—delight versus satisfaction—then look for low or no correlation to establish discriminant validity.
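As a rough sketch of how these convergent and discriminant correlations might be computed in practice, the following Python snippet correlates a hypothetical new usability measure with scores from a validated instrument (SUS-style) and with a supposedly distinct delight measure. All scores are invented for illustration:

```python
import numpy as np

# Hypothetical scores for 8 participants (invented for illustration)
new_usability = np.array([72, 85, 60, 90, 78, 55, 88, 66], dtype=float)
sus_scores    = np.array([70, 88, 62, 92, 75, 50, 85, 68], dtype=float)  # validated instrument
delight       = np.array([60, 55, 58, 62, 50, 65, 57, 53], dtype=float)  # supposedly distinct construct

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

convergent = pearson_r(new_usability, sus_scores)   # should be high: same construct
discriminant = pearson_r(new_usability, delight)    # should be near zero: different construct

print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```

A high convergent correlation together with a low discriminant correlation is the pattern that supports construct validity.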

In any case, a minimum of two experts from different fields (for example, one content expert and one psychometrician) should make the decisions together. The results of the analysis and the discussion of the experts' assignments can take various forms.

Usually, some items are clearly assigned to a specific dimension, while others turn out to be so equivocal that they are eliminated. In some cases, however, the conceptualization of the dimensions needs to be reconsidered. For example, as mentioned above, if a number of items are assigned to two dimensions with about equal weight, this may mean that the two dimensions need to be collapsed, or that an additional dimension is required that is conceptually located between the two.

If the comments of experts provide new insights for possible dimension definitions or labels, these comments can also be included in the formulation of new definitions. In the present study, it was not possible to discuss the results with all experts.

Thus, the third author, an expert on the topic of wisdom and psychometrics, and the first author, a psychometrician not familiar with the concept of wisdom, discussed the results, performed the final assignment of the items, and formulated new names and definitions for the resulting dimensions where they differed from the original ones. The results based on the assignments and the final discussion of the two experts are given in Tables 3A–C. Only the most important results are presented here.

The last columns present the summarized comments without any categorization because the number of comments was generally low. In the following, we describe the content of the scales that emerged from the final assignments and propose psychometric hypotheses for each subdimension.

The four items in this scale all describe aspects of knowing and accepting oneself, including possibly diverging aspects and positive and negative sides (see Table 3A). Thus, the overarching theme of this subscale is knowing, accepting, and integrating the aspects of oneself and one's life. The two subdimensions were merged into one scale based on the rationale that self-knowledge can be considered a precondition for integration. All items of this scale are about valuing and maintaining one's tranquility even in the face of reasons to get angry or upset (see Table 3A).

This scale comprises items concerning the individual's independence of external things, namely, other people's opinions, a busy social life, or material possessions, and of other people and things in general (see Table 3B). Thus, it clearly corresponds to Curnow's concept of non-attachment. All items in this scale were predominantly assigned to the non-attachment component, but also, with percentages ranging from 29 to 40, to the self-transcendence component.

This suggests that the experts considered the individual's independence of external sources of reinforcement as a part or precondition of self-transcendence. One reason may be that our definition of self-transcendence included the statement that self-transcendent individuals are detached from external definitions of self, which was based on the idea that self-transcendence is the last stage of a development through the other stages.

As mentioned above, for the independent measurement of the four dimensions, it would seem important to avoid such conceptual overlaps. In any case, the common characteristic of the four items is their reference to non-attachment.

A goal of the analyses was to test whether the hypotheses gained from the expert judgments could be used to improve the psychometric functioning of the ASTI. Specifically, we wanted to test whether the ASTI as a whole formed a unidimensional scale and, if not, whether the five subdimensions derived from the expert assignments of the items would form unidimensional scales.

Also, we wanted to test whether single items within each scale diverged from the others. For the theory-based item analysis, we summarized the comments from Tables 3A–C into psychometric categories. These categories are useful not only for the interpretation of non-conforming items, but also for the construction of new additional items.

We identified three main categories of expert comments. For one item (I14), the experts suspected differences between men and women. Sometimes researchers have theoretical assumptions about relationships between the various dimensions; item response models can be used to test such hypotheses. In the current example, we only explored the latent correlations between the dimensions. In the current study, we used item response models to test the psychometric functioning of the ASTI based on the results of the expert assignments.

Data were collected individually from participants in Austria and Germany by trained students as part of their class work. Participants filled out a set of paper-and-pencil scales and answered demographic questions. Overall, participation took about 25 min on average.

The questionnaire included the ASTI and additional scales outside the scope of this paper. To test the unidimensionality of the new subscales, we used an approach from the family of Rasch models.

Rasch models (Rasch; for an overview, see Fischer and Molenaar) and their extensions for graded response categories are very useful for testing specific hypotheses about the dimensionality of items within a scale. First, they test an assumption that is usually taken for granted when a score is computed by summing up the items of a scale: the sum score is a valid indicator of a construct only if all items measure the same latent dimension (Rasch). If, for example, some items in our scale measure non-attachment while others measure self-knowledge, and these two constructs can occur independently of each other, then summing up across all items is not informative about a person's actual construct levels.

One would need to know their separate scores for the two subdimensions. Only if all items measure the same construct is the raw score a good indicator of a person's level of that construct. The main indicators that Rasch-family models use are item parameters and person parameters, which are placed on the same latent dimension. Persons' positions on the latent dimension, their so-called person parameters, are determined by their raw scores. The higher a person's score, the more likely the person is to agree with the items of the test.

Items are represented by monotonically increasing asymptotic probability curves on the same latent dimension: the probability that a person agrees with an item depends on the relation between the position of the item and the position of the person on the latent dimension. Each item's position is described by its item parameter, i.e., its location on the latent dimension.

For items with graded responses, as in the current case, the parameters describe the thresholds between adjacent response categories. Here, we used the Partial Credit Model (PCM; Masters; Masters and Wright), which assumes unidimensionality of the items but does not assume that the distances between categories are equal across items.
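To make the model concrete, here is a minimal Python sketch of how the PCM assigns category probabilities for a single item. The person location and threshold parameters are made-up values for illustration; in a real analysis they would be estimated from data:

```python
import numpy as np

def pcm_probs(theta, thresholds):
    """Category probabilities for one item under the Partial Credit Model.
    theta: person location; thresholds: step parameters delta_1..delta_M,
    which may be unevenly spaced and may differ across items."""
    # cumulative sums of (theta - delta_j); category 0 contributes a sum of 0
    sums = np.concatenate(([0.0], np.cumsum(theta - np.asarray(thresholds))))
    weights = np.exp(sums)
    return weights / weights.sum()

# Hypothetical 4-category item with step parameters -1.0, 0.2, 1.5
probs = pcm_probs(theta=0.5, thresholds=[-1.0, 0.2, 1.5])
print(np.round(probs, 3))
```

Because the thresholds enter only through item-specific cumulative sums, nothing forces the category spacing to be equal across items, which is exactly the flexibility the PCM adds over more restrictive rating-scale models.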

The item parameters were estimated using marginal maximum likelihood estimation (MML) and the person parameters using weighted maximum likelihood estimation (WLE). The item analysis procedure follows Pohl and Carstensen, who outlined an approach for item analysis for use in large-scale assessment settings. We believe that this approach is also useful for smaller-scale studies. As explained earlier, a main goal of the study was to integrate the proposed dimensions and the experts' hypotheses concerning item fit with a psychometric investigation of the items.

Accordingly, in the psychometric analysis, these predictors will be used to test for significant item misfit. Before starting the actual analyses, the category frequencies for each item were assessed, because low frequencies can cause estimation problems.

If the frequency of a response category fell below a minimum threshold, it was collapsed with the next category (see Pohl and Carstensen). In 11 items, the two lowest categories were merged, and in two other items, the two highest categories.
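The collapsing step can be sketched as follows. The minimum count of 10 is an assumed placeholder (the exact cutoff is not stated in this excerpt), and for simplicity the sketch only merges sparse categories upward into the next higher category:

```python
import numpy as np

def collapse_sparse_categories(responses, min_count=10):
    """Merge response categories whose frequency falls below min_count into
    the adjacent (next higher) category, scanning from the lowest category up.
    min_count is an assumed threshold for illustration only."""
    responses = np.asarray(responses)
    for c in np.unique(responses)[:-1]:  # the top category can only absorb others
        if np.sum(responses == c) < min_count:
            responses = np.where(responses == c, c + 1, responses)
    # relabel the remaining categories as consecutive integers starting at 0
    relabel = {old: new for new, old in enumerate(np.unique(responses))}
    return np.vectorize(relabel.get)(responses)

raw = np.array([0] * 3 + [1] * 15 + [2] * 40 + [3] * 22)  # category 0 is sparse
clean = collapse_sparse_categories(raw)
print(np.bincount(clean))
```

A full implementation would also merge a sparse top category downward, as was done for two items in the study.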

In the remaining 12 items, all category frequencies were above the threshold. In the development of new measures, it is often a goal to have few items with very low frequencies in some response categories. With constructs like wisdom, however, which are very positively valued, few participants disagree with positively worded items, and the variance that does exist is mostly located between the middle and the highest category. If such items represent theoretically important aspects of the construct, they may well be kept as part of the scale.

In the current case, low frequencies in the lowest categories were particularly typical for the SI and PG subdimensions (four items each), and removing these items would have depleted both scales of important content. In the following, we describe the analyses that were performed. Person-item maps display the distribution of the person parameters and the range of item parameters. These plots show whether any participants showed extreme response tendencies, which might lead to particularly high or low raw scores, and how the item parameters are distributed over the latent dimension.

Thus, it can be examined whether the items cover the whole spectrum of the latent dimension or cluster in one part of it. If there are few items in a segment of the spectrum, the latent trait cannot be measured well in that segment.

Up to now, the ASTI has been scored as a unidimensional instrument, although the items were constructed so as to represent the subdimensions described earlier. Based on this theoretical background and the expert judgments, the five-dimensional model in Tables 3A–C was used as the starting point for the following analyses.

In order to test whether the five-dimensional model fit better than the one-dimensional model, the two were compared using the Bayesian information criterion (BIC; Schwarz). Chi-square tests were also computed (see Table 6); however, they may be oversensitive due to the relatively large sample size. The five-dimensional model was estimated using quasi-Monte Carlo integration (Kiefer et al.). The latent correlations between the five dimensions were estimated.
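The BIC comparison can be sketched with the standard formula BIC = k·ln(n) − 2·ln(L), where k is the number of estimated parameters and n the sample size. The log-likelihoods, parameter counts, and sample size below are invented placeholders, not the paper's actual values:

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate better fit
    after penalizing the number of estimated parameters."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Invented values for a one-dimensional vs. a five-dimensional model
bic_1d = bic(log_likelihood=-21500.0, n_params=80, n_obs=1215)
bic_5d = bic(log_likelihood=-21350.0, n_params=94, n_obs=1215)

preferred = "five-dimensional" if bic_5d < bic_1d else "one-dimensional"
print(f"BIC 1D = {bic_1d:.1f}, BIC 5D = {bic_5d:.1f} -> prefer {preferred}")
```

Unlike a chi-square difference test, the BIC's ln(n) penalty grows with the sample, which is why it is less prone to favoring the larger model purely because the sample is big.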

Once the dimensionality of the ASTI is established, we can test the fit of the Rasch model within each subscale, analyzing several indicators of fit for each individual item.

First, the assumption of Rasch homogeneity was tested by comparing the PCM against the generalized partial credit model (GPCM; Muraki), which includes different discrimination parameters across items. Only if the PCM does not fit significantly worse than the GPCM do the assumptions of the Rasch family hold for a scale, and the raw score is then a sufficient statistic for the person parameter. Additionally, the expected score curves of each item were examined. Figure 3 shows some examples of the results of this analysis.

With this kind of graphical display, it is possible to examine whether the observed score curve differs from the expected curve (misfit) and whether the discrimination slope of an item is higher or lower than assumed by the PCM. The fit of individual items was assessed using infit and outfit statistics, i.e., mean-square statistics based on the standardized residuals. Following Wright and Linacre, values within a limited range around 1 are considered acceptable. Generally, a value below 1 indicates overfit (the data are more predictable than expected by the PCM) and a value above 1 indicates underfit (the data are less predictable than expected by the PCM).
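These statistics can be sketched as mean squares of residuals: outfit is the unweighted average of squared standardized residuals, while infit weights each squared residual by the model-implied variance, making it less sensitive to outlying responses. The observed scores, expected scores, and variances below are invented:

```python
import numpy as np

def infit_outfit(observed, expected, variance):
    """Infit (variance-weighted) and outfit (unweighted) mean-square
    statistics for one item across persons."""
    observed, expected, variance = map(np.asarray, (observed, expected, variance))
    sq_resid = (observed - expected) ** 2
    outfit = np.mean(sq_resid / variance)        # unweighted: sensitive to outliers
    infit = np.sum(sq_resid) / np.sum(variance)  # weighted: dominated by typical persons
    return infit, outfit

# Hypothetical model-expected scores and variances for one item, 6 persons
obs = [2, 3, 1, 2, 3, 0]
exp = [1.2, 2.2, 1.8, 2.8, 2.2, 0.7]
var = [0.7, 0.5, 0.6, 0.7, 0.5, 0.4]

infit, outfit = infit_outfit(obs, exp, var)
print(f"infit = {infit:.2f}, outfit = {outfit:.2f}")
```

Values near 1, as in this made-up example, would indicate that the item's responses are about as predictable as the PCM expects.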

Overfit is generally considered less problematic, since it merely means the data are more predictable than the model expects. Thus, underfit should receive more attention in the evaluation of the items. Differential item functioning (DIF) means that the pattern of response probabilities for some items differs between groups of participants.

For example, gender-related DIF would mean that men are more likely than women to agree with some items of a scale. If that were the case, the scale as a whole would measure a somewhat different construct for men than for women. To assess DIF, the fit of two models was compared by means of the BIC: a main-effect model, which allows only for a main effect of the DIF variable across all items, and an interaction model, which additionally includes an interaction between the DIF variable and the items.

If the interaction model fits significantly better than the main-effect model, there is a significant amount of DIF; that is, the patterns of item difficulties vary between the levels of the DIF variable. Descriptive statistics (M, SD) were calculated for each item of each dimension separately.
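As a crude, purely descriptive analogue of that model comparison (not the IRT-based approach used in the study), one can check whether each item's between-group difference deviates from the common shift implied by a pure main effect. All numbers below are invented:

```python
import numpy as np

# Hypothetical per-item mean scores for two groups (4 items, invented)
mean_men   = np.array([2.1, 2.4, 1.9, 2.8])
mean_women = np.array([2.3, 2.6, 2.1, 2.2])

diff = mean_women - mean_men
main_effect = diff.mean()            # common shift under a pure main-effect model
item_deviation = diff - main_effect  # item-specific departure -> possible DIF

flagged = np.where(np.abs(item_deviation) > 0.3)[0]  # 0.3 is an arbitrary cutoff
print("items flagged for possible DIF:", flagged.tolist())
```

In this toy example, three items shift by the same amount between groups while the fourth moves against the trend, which is the pattern an interaction model would pick up as DIF.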

The item parameters of all items (based on the final scale assignment) and the item intercorrelations are reported in the Appendix in the Supplementary Material to this article. The person-item map in Figure 2 shows that the item parameters mostly covered the left-hand side of the middle range of the ability parameter distribution. For the ASTI to also differentiate well among high-scoring individuals, more items should be constructed that participants are less likely to agree with.

Performance measures of wisdom, such as the Berlin Wisdom Paradigm (Baltes and Staudinger), tend to produce far lower average levels of wisdom than self-report measures do. Next, four different models were estimated and compared by means of the BIC.

Furthermore, the comparison between five-dimensional and one-dimensional models suggested that the five-dimensional models generally fit the data better than the unidimensional ones. This may be due to the relatively low variance in the item responses. The latent correlations supported the assumption of a five-dimensional structure of the ASTI. Self-knowledge and integration, peace of mind, and presence in the here-and-now and growth were quite highly correlated, which may suggest that they all represent an accepting and appreciative stance toward oneself and the experiences of one's life.

Non-attachment and self-transcendence seem to be less closely related to the others (except for the correlation between non-attachment and peace of mind), possibly because they both, although in different ways, represent the individual's relationship with the external world: non-attachment describes an independence from other people and material things, and self-transcendence represents a connectedness with others and the world at large.

Both may not be part of everyone's experience of inner peace. Table 5. Next, we assessed the items of each dimension separately. In general, the infit and outfit statistics showed no misfit of items (see Table 6).

Because of the complexity of analyses, the following results are reported for each dimension separately. Log likelihoods for both models are also reported, although likelihood ratio tests are likely to be somewhat oversensitive due to the large sample size.

Table 6. The score curves suggest that, generally, the observed slopes were steeper than expected; the observed slope of item 10 also showed small deviations from the expected slope (see Figure 3). Therefore, the PCM was considered to fit the scale sufficiently well when item 10 was excluded. As explained earlier, DIF was assessed with respect to gender, age, and professional group. However, the model comparisons in Table 7 indicated DIF for age and group. Note that item 10 had not received an unequivocal assignment by the experts either (see Table 3A).

However, the magnitude of DIF was small and could therefore be ignored. When the analyses were repeated excluding item 10, the PCM fit the data well and there was no considerable DIF for any item. Thus, the PCM was preferred. It is somewhat unclear, however, what causes the difference in fit, as the two examples of score curves in Figure 3 represent the general result for all items of this scale, indicating no substantial underfit or overfit.

It seems important to reanalyze the self-transcendence scale with new data. As Tables 6 and 7 show, no substantial DIF was found for this subscale. The score curves (for an example, see Figure 3, bottom left) showed that the observed slopes were slightly higher than the expected slopes. It was also the only negative item (see Table 3C) in the subdimension. This subscale should also be reanalyzed once new data are available.

A re-analysis without item 14 showed that the PCM fit the data well. In the following, we first discuss the methodological implications of our research and then its substantive implications concerning the use of the ASTI to measure wisdom. This paper introduced the CSS procedure for evaluating content validity and discussed its advantages for the theory-based evaluation of scale items.

In our experience, the method provides highly interesting practical and theoretical insights into target constructs. It not only allows for evaluating and validating existing instruments and improving the operationalization of a target construct, but also offers advantages for constructing new items for existing instruments or even for developing entirely new instruments.

The procedure can be applied in all subdisciplines of psychology and in other fields, wherever the goal is to measure specific constructs. In addition, the procedure is independent of the kinds of items used.

The in-depth examination of the target construct is likely to increase the validity of any assessment. We propose following certain quality criteria in studies using our approach. First, to optimize replicability, all steps should be carefully documented. A detailed documentation of procedures increases the validity of the study, irrespective of whether the data collection is more quantitative (as in the present study) or more qualitative.

Second, the selection of experts is obviously crucial. Objectivity may be compromised if the group of experts is too homogeneous. The instructions that the experts receive also need to be carefully written so as to avoid inducing any biases. Third, it is important that the expert judgments are complemented by actual data collected from a sample representative of the target population. Our experience is that the data are often astonishingly consistent with the expert ratings; however, experts may also occasionally be wrong, for example, if they assume more complex interpretations of item content than the actual participants use.

As we have demonstrated here, item response models may be particularly suited for testing hypotheses about individual items, but factor-analytic approaches are also very useful for testing hypotheses about the structural relationships between subscales.

For example, it would be worthwhile to test the current data for a bi-factor structure, i.e., a model with one general factor in addition to the specific subscale factors. Next steps in our work will include the comparison of these different methods of data analysis. Another important future goal is the definition of a quantitative content-validity index based on the current method. In addition to utilizing the ASTI to demonstrate our approach, we believe that we have gained important insights about the ASTI, as well as about self-transcendence in general, from this study. Through the exercise of assigning and reassigning the items to the dimensions of the construct and discussing the contradictions and difficulties we encountered, we gained a far deeper understanding of the measure itself.

Some of the ASTI items nicely evade this problem by being difficult to understand for individuals who have not achieved the respective levels of self-transcendence.

The positive German version of this item had the lowest mean, i.e., it was the item participants were least likely to endorse. It may be worthwhile to try to construct more items of this kind. For now, we have identified five subdimensions that include the 24 positive items (25 in the German version) of the ASTI.

The 10 negative items measuring alienation were not included in this analysis, as negative items tend to be difficult to assign to the same dimension as positive items.

We recommend leaving them in the questionnaire in order to increase the range of item content, but excluding them from score computations. In further applications of the ASTI, should the five subdimensions be scored separately or should the total score be used?

Strong advocates of the Rasch model would certainly argue that using the total score across the subdimensions amounts to mixing apples and oranges. However, other self-report scales of wisdom, such as the 3D-WS (Ardelt) or the SAWS (Webster), measure several dimensions of wisdom that are conceptually and empirically related to about the same degree as the subdimensions of the ASTI we have identified here.

Both these authors suggest using the mean across the subdimensions as an indicator of wisdom and considering as wise only individuals who have a high mean, i.e., consistently high scores across the subdimensions. The same may be a good idea here: for an individual to be considered as highly wise (in the sense of self-transcendence), he or she would need to have high scores in all five subdimensions, as all of them are considered relevant components of wisdom. For individuals with lower means, we recommend considering their profile across the subdimensions rather than computing a single score.

The subdimensions are ordered so as to represent a possible developmental order, as suggested by Levenson et al. It is important to note that in addition to producing valid and reliable subdimensions, the CSS procedure has also led us to conceptually redefine some of the subdimensions so as to better differentiate them (for example, independence of external sources of well-being was originally included in the definitions of both non-attachment and self-transcendence).

We first give definitions for all subdimensions and then discuss their relationships to each other and to age and gender. The first subdimension includes items that were originally intended to measure Curnow's separate dimensions of self-knowledge and integration.

It includes items that refer to broad and deep knowledge about as well as acceptance of all aspects of one's own self, including ambivalent or undesirable ones.

Thus, the distinction between being aware of certain aspects of the self and accepting them was not supported empirically. The idea that self-knowledge and the acceptance of all aspects of the self are key to wisdom can be found in Erikson's idea of integrity, i.e., the acceptance of one's one and only life as it was lived. Individuals high in this dimension of the ASTI are aware of the different, sometimes contradictory, facets of their self and their life, and they are able to accept all sides of their personality and integrate the different facets of their life.

Therefore, it seems advisable to add new items that refer to self-knowledge as well as items that differentiate between different kinds of integration. With a higher number of items, the distinction between knowing and accepting aspects of one's self might also receive more empirical support. Non-attachment describes an individual's awareness of the fundamental independence of his or her internal self from external possessions or evaluations: non-attached individuals' self-esteem does not depend on what others think about them or how many friends they have.

The scale comprises four items concerning the individual's independence of external things, such as other people's opinions, a busy social life, or material possessions. It is important to note that non-attachment does not mean that people are not committed to others or to important issues in their current world; the main point is that they do not depend on external sources for self-enhancement.

The fact that they are not affected by other people's judgments enables them to lead the life that is right for them and to accept others non-judgmentally. Like other ideas originating from Buddhism, non-attachment as a path to mental health is currently receiving some attention in clinical psychology (Shonin et al.). Individuals high in this dimension, which was not part of Curnow's original conception, are able to live in the moment and enjoy the good times in their life without clinging to them, because they know that everything changes and that change may also foster growth.

The items of this subdimension describe individuals who are able to live life mindfully in any given moment: they find joy in their life and in what they are doing. They are aware that things are always changing, oriented toward learning from others, and aware that they have grown through losses, and they have accepted the finitude of life.

The consistency of a measure across time: do you get the same results when you repeat the measurement?

A group of participants complete a questionnaire designed to measure personality traits. If they repeat the questionnaire days, weeks or months apart and give the same answers, this indicates high test-retest reliability.

The consistency of a measure across raters or observers: do you get the same results when different people conduct the same measurement? Based on an assessment criteria checklist, five examiners submit substantially different results for the same student project.

This indicates that the assessment checklist has low inter-rater reliability (for example, because the criteria are too subjective). The consistency of the measurement itself: do you get the same results from different parts of a test that are designed to measure the same thing?

You design a questionnaire to measure self-esteem. If you randomly split the results into two halves, there should be a strong correlation between the two sets of results. If the two results are very different, this indicates low internal consistency. The adherence of a measure to existing theory and knowledge of the concept being measured.
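The split-half check described above can be sketched in a few lines of Python: correlate the sums of the odd- and even-numbered items, then apply the Spearman-Brown correction to estimate the reliability of the full-length test. The questionnaire data below are made up for illustration:

```python
import numpy as np

def split_half_reliability(item_scores):
    """Correlate odd- and even-numbered item halves, then apply the
    Spearman-Brown correction to estimate full-test reliability."""
    item_scores = np.asarray(item_scores, dtype=float)
    half_a = item_scores[:, 0::2].sum(axis=1)  # odd-numbered items
    half_b = item_scores[:, 1::2].sum(axis=1)  # even-numbered items
    r = np.corrcoef(half_a, half_b)[0, 1]      # correlation between the halves
    return 2 * r / (1 + r)                     # Spearman-Brown step-up formula

# rows = participants, columns = 6 questionnaire items (invented data)
scores = np.array([
    [4, 4, 5, 4, 4, 5],
    [4, 5, 5, 4, 4, 5],
    [5, 5, 4, 5, 5, 4],
    [3, 2, 3, 3, 2, 2],
    [1, 2, 1, 1, 2, 1],
])
rel = split_half_reliability(scores)
print(f"split-half reliability = {rel:.2f}")
```

The Spearman-Brown step is needed because the raw half-half correlation understates the reliability of the test at its full length.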

A self-esteem questionnaire could be assessed by measuring other traits known or assumed to be related to the concept of self-esteem such as social skills and optimism. Strong correlation between the scores for self-esteem and associated traits would indicate high construct validity. The extent to which the measurement covers all aspects of the concept being measured.

Experts agree that listening comprehension is an essential aspect of language ability, so a Spanish test that omits it lacks content validity for measuring the overall level of ability in Spanish. The extent to which the result of a measure corresponds to other valid measures of the same concept: a survey is conducted to measure the political opinions of voters in a region. If the results accurately predict the later outcome of an election in that region, this indicates that the survey has high criterion validity.

How did you plan your research to ensure reliability and validity of the measures used?


