Definitions of Critical Thinking
One of the most debated aspects of critical thinking is what constitutes it, that is, its definition. Table 1 shows definitions of critical thinking drawn from the frameworks reviewed by Markle et al. (2013). The frameworks' different sources (e.g., higher education and workforce) focus on different aspects of critical thinking. Some emphasize the reasoning process specific to critical thinking, while others emphasize its outcomes, such as whether it can be used for decision making or problem solving. Notably, none of the frameworks referenced in the Markle et al. paper offers an actual assessment of critical thinking based on the group's definition. For example, in the case of the VALUE (Valid Assessment of Learning in Undergraduate Education) initiative, part of the AAC&U's LEAP campaign, VALUE rubrics were developed to serve as generic guidelines for faculty members designing their own assessments or grading activities. This approach provides great flexibility to faculty and accommodates local needs. However, it also raises concerns about reliability in how faculty members use the rubrics. A recent AAC&U research study found that percent agreement in scoring was fairly low when multiple raters scored the same student work using the VALUE rubrics (Finley, 2012). For example, when the critical thinking rubric with four scoring categories was applied, the rate of perfect agreement across multiple raters was only 36%.
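Percent perfect agreement, the statistic reported in the Finley (2012) study, is simply the proportion of student artifacts on which every rater assigns the same rubric category. A minimal sketch (the scores below are hypothetical and illustrative, not AAC&U data):

```python
def exact_agreement(ratings):
    """Proportion of artifacts on which all raters assigned the same category.

    ratings: list of per-artifact score lists, one score per rater.
    """
    agree = sum(1 for scores in ratings if len(set(scores)) == 1)
    return agree / len(ratings)

# Hypothetical scores from two raters on a four-category (0-3) rubric
ratings = [[3, 3], [2, 1], [0, 0], [1, 3]]
print(exact_agreement(ratings))  # 0.5: raters matched on 2 of 4 artifacts
```

Because chance agreement inflates this statistic, chance-corrected indices such as Cohen's kappa are often reported alongside it.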
In addition to the frameworks discussed by Markle et al. (2013), there are other influential research efforts on critical thinking. Unlike those frameworks, these research efforts have led to commercially available critical thinking assessments. For example, in a study sponsored by the American Philosophical Association (APA), Facione (1990b) spearheaded an effort to identify a consensus definition of critical thinking using the Delphi method, an expert consensus approach. In the APA study, 46 members recognized as having experience or expertise in critical thinking instruction, assessment, or theory shared reasoned opinions about critical thinking. The experts were asked to provide their own lists of the skill and dispositional dimensions of critical thinking. After rounds of discussion, the experts reached agreement on the core cognitive dimensions (i.e., key skills or dispositions) of critical thinking: (a) interpretation, (b) analysis, (c) evaluation, (d) inference, (e) explanation, and (f) self-regulation, making it clear that a person does not have to be proficient at every skill to be considered a critical thinker. The experts also reached consensus on the affective, dispositional components of critical thinking, such as “inquisitiveness with regard to a wide range of issues,” “concern to become and remain generally well-informed,” and “alertness to opportunities to use CT [critical thinking]” (Facione, 1990b, p. 13). Two decades later, the approach AAC&U took to define critical thinking was heavily influenced by the APA definitions.
Halpern also led a noteworthy research and assessment effort on critical thinking. In her 2003 book, Halpern defined critical thinking as
…the use of those cognitive skills or strategies that increase the probability of a desirable outcome. It is used to describe thinking that is purposeful, reasoned, and goal directed—the kind of thinking involved in solving problems, formulating inferences, calculating likelihoods, and making decisions, when the thinker is using skills that are thoughtful and effective for the particular context and type of thinking task. (Halpern, 2003, p. 6)
Halpern's approach to critical thinking has a strong focus on the outcome or utility aspect of critical thinking, in that critical thinking is conceptualized as a tool to facilitate decision making or problem solving. Halpern recognized several key aspects of critical thinking, including verbal reasoning, argument analysis, assessing likelihood and uncertainty, making sound decisions, and thinking as hypothesis testing (Halpern, 2003).
These two research efforts, led by Facione and Halpern, gave rise to two commercially available assessments of critical thinking, the California Critical Thinking Skills Test (CCTST) and the Halpern Critical Thinking Assessment (HCTA), respectively, which are described in detail in the following section on existing assessments. Interested readers are also directed to research on constructs that overlap with critical thinking, such as argumentation (Godden & Walton, 2007; Walton, 1996; Walton, Reed, & Macagno, 2008) and reasoning (Carroll, 1993; Powers & Dwyer, 2003).
Existing Assessments of Critical Thinking
Multiple Themes of Assessments
As with the definitions offered for critical thinking, assessments of critical thinking also tend to capture multiple themes. Table 2 presents some of the most popular assessments of critical thinking, including the CCTST (Facione, 1990a), the California Critical Thinking Disposition Inventory (CCTDI; Facione & Facione, 1992), the Watson–Glaser Critical Thinking Appraisal (WGCTA; Watson & Glaser, 1980), the Ennis–Weir Critical Thinking Essay Test (Ennis & Weir, 1985), the Cornell Critical Thinking Test (CCTT; Ennis, Millman, & Tomko, 1985), the ETS® Proficiency Profile (EPP; ETS, 2010), the Collegiate Learning Assessment+ (CLA+; Council for Aid to Education, 2013), the Collegiate Assessment of Academic Proficiency (CAAP; CAAP Program Management, 2012), and the HCTA (Halpern, 2010). The last column in Table 2 shows how critical thinking is operationally defined in these widely used assessments. The assessments overlap on a number of key themes, such as reasoning, analysis, argumentation, and evaluation. They also differ along a few dimensions, such as whether critical thinking should include decision making and problem solving (e.g., CLA+, HCTA, and the California Measure of Mental Motivation [CM3]), be integrated with writing (e.g., CLA+), or involve metacognition (e.g., CM3).
| Assessment | Publisher | Item format | Delivery mode | Testing time | Length | Constructs measured |
| --- | --- | --- | --- | --- | --- | --- |
| California Critical Thinking Disposition Inventory (CCTDI) | Insight Assessment (California Academic Press) | Selected-response (Likert scale: extent to which students agree or disagree) | Online or paper/pencil | 30 min | 75 items (seven scales: 9–12 items per scale) | Seven scales of critical thinking: (a) truth-seeking, (b) open-mindedness, (c) analyticity, (d) systematicity, (e) confidence in reasoning, (f) inquisitiveness, and (g) maturity of judgment (Facione, Facione, & Sanchez, 1994) |
| California Critical Thinking Skills Test (CCTST) | Insight Assessment (California Academic Press) | Multiple-choice (MC) | Online or paper/pencil | 45 min | 34 items (vignette based) | Returns scores on the following scales: (a) analysis, (b) evaluation, (c) inference, (d) deduction, (e) induction, and (f) overall reasoning skills (Facione, 1990a) |
| California Measure of Mental Motivation (CM3) | Insight Assessment (California Academic Press) | Selected-response (4-point Likert scale: strongly disagree to strongly agree) | Online or paper/pencil | 20 min | 72 items | Measures and reports scores on: (a) learning orientation, (b) creative problem solving, (c) cognitive integrity, (d) scholarly rigor, and (e) technological orientation (Insight Assessment, 2013) |
| Collegiate Assessment of Academic Proficiency (CAAP) Critical Thinking | ACT | MC | Paper/pencil | 40 min | 32 items (includes four passages representative of issues commonly encountered in a postsecondary curriculum) | Measures students' skills in analyzing elements of an argument, evaluating an argument, and extending arguments (CAAP Program Management, 2012) |
| Collegiate Learning Assessment+ (CLA+) | Council for Aid to Education (CAE) | Performance task (PT) and MC | Online | 90 min (60 min for PT; 30 min for MC) | 26 items (one PT; 25 MC) | The PTs measure higher order skills including: (a) analysis and problem solving, (b) writing effectiveness, and (c) writing mechanics. The MC items assess (a) scientific and quantitative reasoning, (b) critical reading and evaluation, and (c) critiquing an argument (Zahner, 2013) |
| Cornell Critical Thinking Test (CCTT) | The Critical Thinking Co. | MC | Computer based (using the software) or paper/pencil | 50 min (can also be administered untimed) | Level X: 71 items; Level Z: 52 items | Level X (intended for students in Grades 5–12+) measures: (a) induction, (b) deduction, (c) credibility, and (d) identification of assumptions. Level Z (intended for students in Grades 11–12+) measures: (a) induction, (b) deduction, (c) credibility, (d) identification of assumptions, (e) semantics, (f) definition, and (g) prediction in planning experiments (The Critical Thinking Co., 2014) |
| Ennis–Weir Critical Thinking Essay Test | Midwest Publications | Essay | Paper/pencil | 40 min | Nine-paragraph essay/letter | Measures the following areas of critical thinking competence: (a) getting the point, (b) seeing reasons and assumptions, (c) stating one's point, (d) offering good reasons, (e) seeing other possibilities, and (f) responding appropriately to and/or avoiding argument weaknesses (Ennis & Weir, 1985) |
| ETS Proficiency Profile (EPP) Critical Thinking | ETS | MC | Online and paper/pencil | About 40 min (full test is 2 h) | 27 items (standard form) | Measures a student's ability to: (a) distinguish between rhetoric and argumentation in a piece of nonfiction prose, (b) recognize assumptions and the best hypothesis to account for information presented, (c) infer and interpret a relationship between variables, and (d) draw valid conclusions based on information presented (ETS, 2010) |
| Halpern Critical Thinking Assessment (HCTA) | Schuhfried Publishing, Inc. | Forced choice (MC, ranking, or rating of alternatives) and open-ended | Computer based | Untimed; typically 60–80 min (Form S1) or 20 min (Form S2) | 25 scenarios of everyday events (five per subcategory); S1 contains both open-ended and forced-choice items, S2 all forced-choice items | Measures five critical thinking subskills: (a) verbal reasoning skills, (b) argument and analysis skills, (c) skills in thinking as hypothesis testing, (d) using likelihood and uncertainty, and (e) decision-making and problem-solving skills (Halpern, 2010) |
| Watson–Glaser Critical Thinking Appraisal (WGCTA) | Pearson | MC | Online and paper/pencil | Standard (Forms A and B): 40–60 min if timed; Short Form: 30 min if timed; Watson–Glaser II: 40 min if timed | Standard: 80 items; Short Form: 40 items; Watson–Glaser II: 40 items | The standard WGCTA is composed of five tests: (a) inference, (b) recognition of assumptions, (c) deduction, (d) interpretation, and (e) evaluation of arguments. Each test contains both neutral and controversial reading passages and scenarios encountered at work, in the classroom, and in the media; although there are five tests, only the total score is reported (Watson & Glaser, 2008a, 2008b). Watson–Glaser II provides interpretable subscores for three contemporary, business-relevant skill domains: the ability to (a) recognize assumptions, (b) evaluate arguments, and (c) draw conclusions (Watson & Glaser, 2010) |
The majority of the assessments exclusively use selected-response items such as multiple-choice or Likert-type items (e.g., CAAP, CCTST, and WGCTA). EPP, HCTA, and CLA+ use a combination of multiple-choice and constructed-response items (though the essay is optional in EPP), and the Ennis–Weir test is an essay test. Given the limited testing time, only a small number of constructed-response items can typically be used in a given assessment.
Test and Scale Reliability
Although constructed-response items have great face validity and have the potential to offer authentic contexts in assessments, they tend to have lower levels of reliability than multiple-choice items for the same amount of testing time (Lee, Liu, & Linn, 2011). For example, according to a recent report released by the sponsor of the CLA+, the Council for Aid to Education (Zahner, 2013), the reliability of the 60-min constructed-response section is only .43. The test-level reliability is .87, largely driven by the reliability of CLA+'s 30-min short multiple-choice section.
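The time-versus-reliability trade-off can be made concrete with the Spearman–Brown prophecy formula, which projects the reliability of a test lengthened by a factor of k from its current reliability. A sketch using the CLA+ figure above as an illustrative input (the projection assumes added tasks are parallel to existing ones, which real performance tasks may not satisfy):

```python
def spearman_brown(reliability, k):
    """Projected reliability when a test is lengthened by a factor of k,
    assuming the added items are parallel to the existing ones."""
    return k * reliability / (1 + (k - 1) * reliability)

# A constructed-response section with reliability .43 would need to be roughly
# tripled in length before its projected reliability approaches .70.
print(round(spearman_brown(0.43, 2), 2))  # doubled: ~.60
print(round(spearman_brown(0.43, 3), 2))  # tripled: ~.69
```

This is why, per unit of testing time, multiple-choice sections tend to dominate the reliability of a composite score.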
Because of the multidimensional nature of critical thinking, many existing assessments include multiple subscales and report subscale scores. The main advantage of subscale scores is that they provide detailed information about test takers' critical thinking ability. The downside, however, is that these subscale scores typically suffer from unsatisfactory reliability and a lack of distinction among scales. For example, the CCTST reports a score on overall reasoning skills and subscale scores on five aspects of critical thinking: (a) analysis, (b) evaluation, (c) inference, (d) deduction, and (e) induction. However, Leppa (1997) reported that the subscales have low internal consistency (.21 to .51), much lower than the reliabilities (.68 to .70) reported by the authors of the CCTST (Ku, 2009). Similarly, the WGCTA provides subscale scores on inference, recognition of assumptions, deduction, interpretation, and evaluation of arguments, yet studies found that the internal consistency of some of these subscales was low and varied widely, from .17 to .74 (Loo & Thorpe, 1999). Additionally, there is no clear evidence of distinct subscales: a meta-analysis of 60 published studies recovered a single-component structure (Bernard et al., 2008). Studies have also reported unstable factor structure and low reliability for the CCTDI (Kakai, 2003; Walsh & Hardy, 1997; Walsh, Seldomridge, & Badros, 2007).
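The internal consistency figures cited above are typically Cronbach's alpha, which compares the sum of the item variances to the variance of the total score. A minimal computation for an examinee-by-item score matrix (toy data, for illustration only):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]
    sum_item_variances = X.var(axis=0, ddof=1).sum()
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Three examinees, two perfectly consistent items -> alpha = 1.0
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```

Short subscales with heterogeneous items drive the sum of item variances toward the total variance, which is why 5- to 10-item subscales so often land in the .2 to .5 range reported above.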
Comparability of Forms
For reasons such as test security and construct representation, most assessments employ multiple forms, and the comparability among forms is another source of concern. For example, Jacobs (1999) found that Form B of the CCTST was significantly more difficult than Form A. Other studies have also found low comparability between the two CCTST forms (Bondy, Koenigseder, Ishee, & Williams, 2001).
Table 3 presents some of the more recent validity studies for existing critical thinking assessments. Most studies focus on the correlation of critical thinking scores with scores on other general cognitive measures. For example, critical thinking assessments have shown moderate correlations with general cognitive assessments such as the SAT® or GRE® tests (e.g., Ennis, 2005; Giancarlo, Blohm, & Urdan, 2004; Liu, 2008; Stanovich & West, 2008; Watson & Glaser, 2010). They have also shown moderate correlations with course grades and GPA (Gadzella et al., 2006; Giancarlo et al., 2004; Halpern, 2006; Hawkins, 2012; Liu & Roohr, 2013; Williams et al., 2003). A few studies have examined the relationship of critical thinking to behaviors, job performance, or life events. Ejiogu, Yang, Trent, and Rose (2006) found that WGCTA scores correlated positively and moderately with job performance (corrected r = .32 to .52). Butler (2012) examined the external validity of the HCTA and concluded that those with higher critical thinking scores had fewer negative life events than those with lower critical thinking skills (r = −.38).
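The "corrected" coefficients in the Ejiogu et al. study reflect the standard correction for attenuation due to criterion unreliability: the observed validity coefficient is divided by the square root of the criterion's reliability. A quick sketch with invented, illustrative numbers:

```python
def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Disattenuate an observed validity coefficient for an unreliable criterion
    (supervisor ratings, in the Ejiogu et al. case)."""
    return r_observed / criterion_reliability ** 0.5

# An observed r of .40 against ratings with reliability .64 corresponds
# to a corrected r of about .50.
print(correct_for_criterion_unreliability(0.40, 0.64))
```

Because the correction always increases the coefficient, corrected values should be interpreted as upper-bound estimates of the operational validity.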
| Study | Assessment | Sample | N | Key findings |
| --- | --- | --- | --- | --- |
| Butler (2012) | HCTA | Community college students, state university students, and community adults | 131 | Significant moderate correlation with the real-world outcomes of critical thinking inventory (r(131) = −.38), meaning those with higher critical thinking scores reported fewer negative life events |
| Ejiogu et al. (2006) | WGCTA Short Form | Analysts in a government agency | 84 | Significant moderate correlations, corrected for criterion unreliability, ranging from .32 to .52 with supervisory ratings of job performance behaviors; highest correlations were with analysis and problem solving (r(68) = .52) and with judgment and decision making (r(68) = .52) |
| Ennis (2005) | Ennis–Weir Critical Thinking Essay Test | Undergraduates in an educational psychology course (Taube, 1997) | 198 | Moderate correlation with WGCTA (r(187) = .37); low to moderate correlations with personality assessments ranging from .24 to .35; low to moderate correlations with SAT verbal (r(155) = .40), SAT quantitative (r(155) = .28), and GPA (r(171) = .28) |
| | | Malay undergraduates with English as a second language (Moore, 1995) | 60 | Correlations with SAT verbal (pretest: r(60) = .34, posttest: r(60) = .59), TOEFL® (pre: r(60) = .35, post: r(60) = .48), ACT (pre: r(60) = .25, post: r(60) = .66), TWE® (pre: r(60) = −.56, post: r(60) = −.07), and SPM (pre: r(60) = .41, post: r(60) = .35) |
| | | 10th-, 11th-, and 12th-grade students (Norris, 1995) | 172 | Low to moderate correlations with WGCTA (r(172) = .28), CCTT (r(172) = .32), and Test on Appraising Observations (r(172) = .25) |
| Gadzella et al. (2006) | WGCTA Short Form | State university students (psychology, educational psychology, and special education undergraduate majors; graduate students) | 586 | Low to moderately high significant correlations with course grades ranging from .20 to .62 (r(565) = .30 for total group; r(56) = .62 for psychology majors) |
| Giddens and Gloeckner (2005) | CCTST; CCTDI | Baccalaureate nursing program in the southwestern United States | 218 | Students who passed the NCLEX had significantly higher total critical thinking scores on the CCTST entry test (t(101) = 2.5*, d = 1.0), CCTST exit test (t(191) = 3.0**, d = .81), and CCTDI exit test (t(183) = 2.6**, d = .72) than students who failed the NCLEX |
| Halpern (2006) | HCTA | Study 1: Junior and senior students from high school and college in California | 80 high school, 80 college | Moderate significant correlations with the Arlin Test of Formal Reasoning (r = .32) for both groups |
| | | Study 2: Undergraduate and second-year masters students from California State University, San Bernardino | 145 undergraduates, 32 masters | Moderate to moderately high correlations with the Need for Cognition scale (r = .32), GPA (r = .30), SAT Verbal (r = .58), SAT Math (r = .50), and GRE Analytic (r = .59) |
| Giancarlo et al. (2004) | CM3 | 9th- and 11th-grade public school students in northern California (validation study 2) | 484 | Statistically significant correlations between four CM3 subscales (learning, creative problem solving, mental focus, and cognitive integrity) and measures of mastery goals (r(482) = .09 to .67), self-efficacy (r(482) = .22 to .47), SAT9 Math (r(379) = .18 to .33), SAT9 Reading (r(387) = .13 to .43), SAT9 Science (r(380) = .11 to .22), SAT9 Language/Writing (r(382) = .09 to .17), SAT9 Social Science (r(379) = .09 to .18), and GPA (r(468) = .19 to .35) |
| | | 9th- to 12th-grade all-female college preparatory students in Missouri (validation study 3) | 587 | Statistically significant correlations between four CM3 subscales (learning, creative problem solving, mental focus, and cognitive integrity) and PSAT Math (r(434) = .15 to .37), PSAT Verbal (r(434) = .20 to .31), PSAT Writing (r(291) = .21 to .33), PSAT selection index (r(434) = .23 to .40), and GPA (r(580) = .21 to .46) |
| Hawkins (2012) | CCTST | Students enrolled in undergraduate English courses at a small liberal arts college | 117 | Moderate significant correlation between total score and GPA (r = .45); moderate significant subscale correlations with GPA ranged from .27 to .43 |
| Liu and Roohr (2013) | EPP | Community college students from 13 institutions | 46,402 | Students with higher GPA and more credit hours performed higher on the EPP than students with lower GPA and fewer credit hours; GPA was the strongest significant predictor of critical thinking (β = .21, η² = .04) |
| Watson and Glaser (2010) | WGCTA | Undergraduate educational psychology students (Taube, 1997) | 198 | Moderate significant correlations with SAT Verbal (r(155) = .43), SAT Math (r(155) = .39), GPA (r(171) = .30), and Ennis–Weir (r(187) = .37); low to moderate correlations with personality assessments ranging from .07 to .33 |
| | | Three semesters of freshman nursing students in eastern Pennsylvania (Behrens, 1996) | 172 | Moderately high significant correlations with fall semester GPA ranging from .51 to .59 |
| | | Education majors in an educational psychology course at a southwestern state university (Gadzella, Baloglu, & Stephens, 2002) | 114 | Significant correlation between total score and GPA (r = .28); significant correlations between the five WGCTA subscales and GPA ranging from .02 to .34 |
| Williams et al. (2003) | CCTST; CCTDI | First-year dental hygiene students from seven U.S. baccalaureate universities | 207 | Significant correlations between the CCTST and CCTDI at baseline (r = .41) and at second semester (r = .26); significant correlations between CCTST and knowledge, faculty ratings, and clinical reasoning ranging from .24 to .37 at baseline and from .23 to .31 at second semester; for the CCTDI, significant correlations with knowledge, faculty ratings, and clinical reasoning ranged from .15 to .19 at baseline, and with faculty reasoning (r = .21) at second semester; the CCTDI was a more consistent predictor of student performance (4.9–12.3% variance explained) than traditional predictors such as age, GPA, and number of college hours (2.1–4.1% variance explained) |
| Williams, Schmidt, Tilliss, Wilkins, and Glasnapp (2006) | CCTST; CCTDI | First-year dental hygiene students from three U.S. baccalaureate dental hygiene programs | 78 | Significant correlation between CCTST and CCTDI at baseline (r = .29); significant correlations between CCTST and NBDHE Multiple-Choice (r = .35) and Case-Based (r = .47) tests at baseline and at program completion (r = .30 and .33, respectively); significant correlations between CCTDI and NBDHE Case-Based at baseline (r = .25) and at program completion (r = .40); CCTST was a more consistent predictor of student performance on both NBDHE Multiple-Choice (10.5% variance explained) and NBDHE Case-Based scores (18.4% variance explained) than traditional predictors such as age, GPA, and number of college hours |
Our review of validity evidence for existing assessments revealed that the quality and quantity of research support vary significantly across assessments. Common problems include insufficient evidence of distinct dimensionality, unreliable subscores, noncomparable test forms, and unclear evidence of differential validity across groups of test takers. In a review of the psychometric quality of existing critical thinking assessments, Ku (2009) observed that studies conducted by researchers unaffiliated with a test's authors tend to report lower psychometric quality than studies conducted by the authors and their affiliates.
For future research, a component of validity that is missing from many existing studies is the incremental predictive validity of critical thinking. As Kuncel (2011) pointed out, evidence is needed to clarify whether critical thinking skills predict desirable outcomes (e.g., job performance) beyond what other general cognitive measures already predict. Without controlling for other types of general cognitive ability, it is difficult to evaluate the unique contribution that critical thinking skills make to various outcomes. For example, the Butler (2012) study did not control for any measure of participants' general cognitive ability, leaving room for the alternative explanation that other aspects of general cognitive ability, rather than critical thinking, may have contributed to participants' life success.
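Incremental predictive validity of the sort Kuncel (2011) calls for is usually quantified as the change in R² when critical thinking scores are added to a regression that already contains a general cognitive measure. A simulated sketch (all data-generating values are invented for illustration):

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    return 1 - residuals @ residuals / ((y - y.mean()) @ (y - y.mean()))

def incremental_r2(general, critical, outcome):
    """Outcome variance explained by critical thinking scores beyond a
    general cognitive measure (hierarchical regression: step 2 - step 1)."""
    step1 = r_squared(general.reshape(-1, 1), outcome)
    step2 = r_squared(np.column_stack([general, critical]), outcome)
    return step2 - step1

# Simulated data: critical thinking overlaps with general ability (r ~ .5)
# but also carries some unique signal for the outcome.
rng = np.random.default_rng(0)
general = rng.standard_normal(2000)
critical = 0.5 * general + 0.866 * rng.standard_normal(2000)
outcome = general + 0.5 * critical + rng.standard_normal(2000)
print(round(incremental_r2(general, critical, outcome), 3))
```

A near-zero increment would suggest the critical thinking measure is redundant with general ability for that outcome, which is precisely the alternative explanation left open by studies such as Butler (2012).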
Challenges in Designing Critical Thinking Assessment
Authenticity Versus Psychometric Quality
A major challenge in designing an assessment for critical thinking is to strike a balance between the assessment's authenticity and its psychometric quality. Most current assessments rely on multiple-choice items when measuring critical thinking. The advantages of such assessments lie in their objectivity, efficiency, high reliability, and low cost. Typically, within the same amount of testing time, multiple-choice items are able to provide more information about what the test takers know as compared to constructed-response items (Lee et al., 2011). Wainer and Thissen (1993) reported that the scoring of 10 constructed-response items costs about $30, while the cost for scoring multiple-choice items to achieve the same level of reliability was only 1¢. Although multiple-choice items cost less to score, they typically cost more in assessment development than constructed-response items. That being said, the overall cost structure of multiple-choice versus constructed-response items will depend on the number of scores that are derived from a given item over its lifecycle.
Studies also show high correlations between multiple-choice and constructed-response items measuring the same constructs (Klein et al., 2009). Rodriguez (2003) investigated the construct equivalence of the two item formats through a meta-analysis of 63 studies and concluded that the formats are highly correlated when measuring the same content: the mean correlation was approximately .95 with item stem equivalence and .92 without it. The Klein et al. (2009) study compared the construct validity of three standardized assessments of college learning outcomes (i.e., EPP, CLA, and CAAP), including critical thinking; the school-level correlation between a multiple-choice and a constructed-response critical thinking test was .93.
Given that constructed-response items can be more expensive to score and that multiple-choice items can, in some cases, measure the same constructs equally well, one might argue that it makes more sense to use all multiple-choice items and disregard constructed-response items. However, constructed-response items make it possible to create more authentic contexts and to assess students' ability to generate rather than select responses. In real-life situations that call for critical thinking, choices are rarely provided; people must generate their own options and determine which is preferable given the question at hand. Research has long established that the ability to recognize is different from the ability to generate (Frederiksen, 1984; Lane, 2004; Shepard, 2000). In the case of critical thinking, constructed-response items could therefore be a better proxy for real-world scenarios than multiple-choice items.
We agree with researchers who call for multiple item formats in critical thinking assessments (e.g., Butler, 2012; Halpern, 2010; Ku, 2009). Constructed-response items alone will not be able to meet the psychometric standards due to their low internal consistency, one type of reliability. A combination of multiple item formats offers the potential for an authentic and psychometrically sound assessment.
Instructional Value Versus Standardization
Another challenge in designing a standardized critical thinking assessment for higher education is the need to attend to the assessment's instructional relevance. Faculty members are sometimes concerned about the limited relevance of results from general student learning outcomes assessments, as these assessments tend to be created in isolation from curriculum and instruction. For example, although most institutions consider critical thinking a necessary skill for their students (AAC&U, 2011), few offer courses specifically designed to foster it. Therefore, even if assessment results show that students at a particular institution lack critical thinking skills, no specific department, program, or faculty would claim responsibility, which greatly limits the practical use of the results. It is important to identify the common goals of general higher education and translate them into the design of the learning outcomes assessment. The VALUE rubrics created by AAC&U (Rhodes, 2010) are good examples of how a common framework can be created to align expectations about college students' critical thinking skills.
While attending to an assessment's instructional relevance, one should also keep in mind that tension will always exist between instructional relevance and standardization. Standardized assessment offers comparability and generalizability across institutions and across programs within an institution. An assessment designed to closely reflect the objectives and goals of a particular program will have great instructional relevance and will likely offer rich diagnostic information about the students in that program, but it may not serve as a meaningful measure of outcomes for students in other programs.
When designing an assessment for critical thinking, it is essential to find that balance point so the assessment results bear meaning for the instructors and provide information to support comparisons across programs and institutions.
Institutional Versus Individual Use
Another concern is whether the assessment should be designed to provide results for institutional use or individual use, a decision with implications for psychometric considerations such as reliability and validity. For an institution-level assessment, the results need to be reliable only at the group level (e.g., major, department), while for an individual assessment, the results must be reliable at the level of the individual test taker. Typically, more items are required to achieve acceptable individual-level reliability than institution-level reliability. When assessment results are used only at an aggregate level, which is how most institutions currently use them, the validity of the test scores is in question because students may not expend maximum effort when answering the items. Student motivation on low-stakes assessments has long been a source of concern. A recent study by Liu, Bridgeman, and Adler (2012) confirmed that motivation plays a significant role in student performance on low-stakes learning outcomes assessments in higher education: conclusions about students' learning gains in college could vary significantly depending on whether students are motivated to take the test. If possible, the assessment should be designed to provide reliable information about individual test takers, allowing them to benefit directly from the test (e.g., by obtaining a certificate of achievement); the increased stakes may help boost students' motivation.
General Versus Domain-Specific Assessment
Critical thinking has been defined as a generic skill in many existing frameworks and assessments (e.g., Bangert-Drowns & Bankert, 1990; Ennis, 2003; Facione, 1990b; Halpern, 1998). On one hand, many educators and philosophers believe that critical thinking is a set of skills and dispositions that can be applied across specific domains (Davies, 2013; Ennis, 1989; Moore, 2011). These generalists depict critical thinking as an enabling skill similar to reading and writing and argue that it can be taught outside the context of a specific discipline. On the other hand, specifists view critical thinking as a domain-specific skill, holding that the critical thinking required for nursing would differ substantially from that practiced in engineering (Tucker, 1996). To date, much of the debate remains at the theoretical level, with little empirical evidence confirming the generality or specificity of critical thinking (Nicholas & Labig, 2013). One empirical study yielded mixed findings: Powers and Enright (1987) surveyed 255 faculty members in six disciplinary domains to understand the kinds of reasoning and analytical abilities required for successful performance at the graduate level. The authors found that some general skills, such as “reasoning or problem solving in situations in which all the needed information is not known,” were valued by faculty in all domains (p. 670). Despite this consensus on some skills, faculty members across subject domains showed marked differences in their perceptions of the importance of other skills. For example, “knowing the rules of formal logic” was rated as highly important for computer science but not for other disciplines (p. 678).
Tuning USA is one of the efforts that considers critical thinking in a domain-specific context. Tuning USA is a faculty-driven process that aims to align goals and define competencies at each degree level (i.e., associate's, bachelor's, and master's) within a discipline (Institute for Evidence-Based Change, 2010). For Tuning USA, there are goals to foster critical thinking within certain disciplinary domains, such as engineering and history. For example, for engineering students who work on design, critical thinking suggests that they develop “an appreciation of the uncertainties involved, and the use of engineering judgment” (p. 97) and that they understand “consideration of risk assessment, societal and environmental impact, standards, codes, regulations, safety, security, sustainability, constructability, and operability” at various stages of the design process (p. 97).
In addition, there is insufficient empirical evidence showing that, as a generic skill, critical thinking is distinguishable from other general cognitive abilities measured by validated assessments such as the SAT and GRE tests (see Kuncel, 2011). Kuncel, therefore, argued that instead of being a generic skill, critical thinking is more appropriately studied as a domain-specific construct. This view may be correct, or at least plausible, but it also requires empirical evidence demonstrating that critical thinking is a domain-specific skill. It is true that examples of critical thinking offered by members of the nursing profession may be very different from those cited by engineers, but content knowledge plays a significant role in this distinction. Would it be reasonable to assume that skillful critical thinkers can be successful when they transfer from one profession to another, given sufficient content training? Whether and how content knowledge can be disentangled from higher order critical thinking skills, as well as from other cognitive and affective faculties, awaits further investigation.
Despite the debate over the nature of critical thinking, most existing critical thinking assessments treat this skill as generic. Apart from the theoretical reasons, it is much more costly and labor-intensive to design, develop, and score a critical thinking assessment for each major field of study. If assessments are designed only for popular domains with large numbers of students, students in less popular majors are deprived of the opportunity to demonstrate their critical thinking skills. From a score user perspective, because of the interdisciplinary nature of many jobs in the 21st century workforce, many employers value generic skills that can be transferable from one domain to another (AAC&U, 2011; Chronicle of Higher Education, 2012; Hart Research Associates, 2013), which makes an assessment of critical thinking in a particular domain less attractive.
Total Versus Subscale Scores
Another challenge related to critical thinking assessment is whether to offer subscale scores. Given the multidimensional nature of the critical thinking construct, it is a natural tendency for assessment developers to consider subscale scores for critical thinking. Subscale scores have the advantage of offering detailed information about test takers' performance on each of the subscales and also have the potential to provide diagnostic information for teachers or instructors if the scores are to be used for formative purposes (Sinharay, Puhan, & Haberman, 2011). However, one should not lose sight of the psychometric requirements when offering subscale scores. Evidence is needed to demonstrate that there is a real and reliable distinction among the subscales. Previous research reveals that for some of the existing critical thinking assessments, there is a lack of support for the factor structure on which subscale score reporting is based (e.g., CCTDI; Kakai, 2003; Walsh & Hardy, 1997; Walsh et al., 2007).
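One common way to probe whether subscales are truly distinct is to estimate each subscale's reliability (e.g., Cronbach's alpha) and then compute the correlation between subscales corrected for unreliability; a disattenuated correlation near 1.0 suggests the subscales measure the same thing and separate scores are not warranted. The sketch below is purely illustrative, using simulated data in which two hypothetical subscales (labeled here as "analysis" and "evaluation") share a single common factor:

```python
# Illustrative sketch with simulated (hypothetical) data: checking whether
# two subscales are empirically distinguishable before reporting them
# separately. A disattenuated correlation near 1.0 argues against
# reporting separate subscale scores.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def disattenuated_r(x: np.ndarray, y: np.ndarray, ax: float, ay: float) -> float:
    """Correlation of subscale totals corrected for unreliability."""
    r = np.corrcoef(x.sum(axis=1), y.sum(axis=1))[0, 1]
    return r / np.sqrt(ax * ay)

# Simulate 500 examinees whose responses on both "subscales" are driven
# by one common factor g (i.e., no real distinction between subscales).
rng = np.random.default_rng(0)
g = rng.normal(size=(500, 1))
analysis = g + rng.normal(size=(500, 6))    # 6 hypothetical items
evaluation = g + rng.normal(size=(500, 6))  # 6 hypothetical items

a1, a2 = cronbach_alpha(analysis), cronbach_alpha(evaluation)
print(disattenuated_r(analysis, evaluation, a1, a2))  # close to 1.0
```

In this simulated case each subscale looks respectably reliable on its own, yet the disattenuated correlation hovers near 1.0, which is precisely the pattern that undermines a multi-factor score report of the kind questioned in the CCTDI studies cited above.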