

  • Building capacity to understand and use evidence
  • Planning rigorous and relevant evaluations
  • Strengthening research methods and standards

Welcome to the Center for Improving Research Evidence

High quality evidence is fundamental for decisions to improve public well-being. For policymakers, philanthropies, practitioners, and others concerned with evidence-based decision making, the Center for Improving Research Evidence (CIRE) provides training and assistance in designing, conducting, assessing, and using a range of scientific policy research and evaluations in worldwide settings.

What We Do

Recommendations for Conducting High Quality Systematic Evidence Reviews

Systematic reviews are a useful tool for decision makers because they identify relevant studies about a policy or program of interest, and summarize the findings across the various studies. A new issue brief from the Center for Improving Research Evidence provides recommendations for conducting high quality systematic reviews to support policy and program decisions.

Video: Overview of CIRE

Ann Person, director of Mathematica's Center for Improving Research Evidence (CIRE), discusses how CIRE bridges the gap between program and policy research and practice.

  • "Compendium of Student, Teacher, and Classroom Measures Used in NCEE Evaluations of Educational Interventions." April 2010. This NCEE Reference Report is a ready resource to help evaluators and researchers select outcome measures for their future studies and to help policymakers understand the measures used in existing IES studies. The two-volume Compendium provides comparative information about the domain, technical quality, and history of use of outcome measures used in IES-funded evaluations between 2005 and 2008. The Compendium is intended to facilitate comparisons of results across studies, thus expanding an understanding of these measures within the educational research community. It focuses exclusively on studies that employed randomized controlled trials or regression discontinuity designs, and on outcome measures that were available to other researchers and had information available about psychometric properties. Volume I describes typical or common considerations when selecting measures and the approach used to collect and summarize information on the 94 measures reviewed, while Volume II provides detailed descriptions of these measures, including source information and references.

    Volume I: Measures Selection Approaches and Compendium Development Methods
    Kimberly Boller, Sally Atkins-Burnett, Lizabeth M. Malone, Gail P. Baxter, and Jerry West.

    Volume II: Technical Details, Measure Profiles, and Glossary (Appendices A-G)
    Lizabeth M. Malone, Charlotte Cabili, Jamila Henderson, Andrea Mraz Esposito, Kathleen Coolahan, Juliette Henke, Subuhi Asheer, Meghan O'Toole, Sally Atkins-Burnett, and Kimberly Boller.
  • "Survey of Outcomes Measurement in Research on Character Education Programs." Technical Methods Report. Ann E. Person, Emily Moiduddin, Megan Hague-Angus, and Lizabeth M. Malone, December 2009. Character education programs are school-based programs that have as one of their objectives promoting the character development of students. This report systematically examines the outcomes that were measured in evaluations of a delimited set of character education programs and the research tools used for measuring the targeted outcomes. The multifaceted nature of character development and many possible ways of conceptualizing it, the large and growing number of school-based programs to promote character development, and the relative newness of efforts to evaluate character education programs using rigorous research methods combine to make the selection or development of measures relevant to the evaluation of these programs especially challenging. This report is a step toward creating a resource that can inform measure selection for conducting rigorous, cost-effective studies of character education programs. The report, however, does not provide comprehensive information on all measures or types of measures, guidance on specific measures, or recommendations on specific measures.
  • "Using State Tests in Education Experiments: A Discussion of the Issues." Technical Methods Report. Henry May, Irma Perez-Johnson, Joshua Haimson, Samina Sattar, and Phil Gleason, November 2009. Securing data on students' academic achievement is typically one of the most important and costly aspects of conducting education experiments. As state assessment programs have become practically universal and more uniform in terms of grades and subjects tested, the relative appeal of using state tests as a source of study outcome measures has grown. However, the variation in state assessments—in both content and proficiency standards—complicates decisions about whether a particular state test is suitable for research purposes and poses difficulties when planning to combine results across multiple states or grades. This discussion paper aims to help researchers evaluate and make decisions about whether and how to use state test data in education experiments. It outlines the issues that researchers should consider, including how to evaluate the validity and reliability of state tests relative to study purposes; factors influencing the feasibility of collecting state test data; how to analyze state test scores; and whether to combine results based on different tests. It also highlights best practices to help inform ongoing and future experimental studies. Many of the issues discussed are also relevant for nonexperimental studies.
  • "Do Typical RCTs of Education Interventions Have Sufficient Statistical Power for Linking Impacts on Teacher Practice and Student Achievement Outcomes?" Technical Methods Report. Peter Z. Schochet, October 2009. For randomized controlled trials (RCTs) of education interventions, it is often of interest to estimate associations between student and mediating teacher practice outcomes, to examine the extent to which the study's conceptual model is supported by the data, and to identify specific mediators that are most associated with student learning. This paper develops statistical power formulas for such exploratory analyses under clustered school-based RCTs using ordinary least squares (OLS) and instrumental variable (IV) estimators, and uses these formulas to conduct a simulated power analysis. The power analysis finds that for currently available mediators, the OLS approach will yield precise estimates of associations between teacher practice measures and student test score gains only if the sample contains about 150 to 200 study schools. The IV approach, which can adjust for potential omitted variables and simultaneity biases, has very little statistical power for mediator analyses. For typical RCT evaluations, these results may have design implications for the scope of the data collection effort for obtaining costly teacher practice mediators.
  • "The Estimation of Average Treatment Effects for Clustered RCTs of Education Interventions." Technical Methods Report. Peter Z. Schochet, August 2009. This paper examines the estimation of two-stage clustered RCT designs in education research using the Neyman causal inference framework that underlies experiments. The key distinction between the considered causal models is whether potential treatment and control group outcomes are considered to be fixed for the study population (the finite-population model) or randomly selected from a vaguely defined universe (the super-population model). Appropriate estimators are derived and discussed for each model. Using data from five large-scale clustered RCTs in the education area, the empirical analysis estimates impacts and their standard errors using the considered estimators. For all studies, the estimators yield identical findings concerning statistical significance. However, standard errors sometimes differ, suggesting that policy conclusions from RCTs could be sensitive to the choice of estimator. Thus, a key recommendation is that analysts test the sensitivity of their impact findings using different estimation methods and cluster-level weighting schemes.
  • "Estimation and Identification of the Complier Average Causal Effect Parameter in Education RCTs." Technical Methods Report. Peter Z. Schochet and Hanley Chiang, April 2009. In RCTs in the education field, the complier average causal effect (CACE) parameter is often of policy interest because it pertains to intervention effects for students who receive a meaningful dose of treatment services. This report uses a causal inference and instrumental variables framework to examine the identification and estimation of the CACE parameter for two-level clustered RCTs. The report also provides simple asymptotic variance formulas for CACE impact estimators measured in nominal and standard deviation units. In the empirical work, data from 10 large RCTs are used to compare significance findings using correct CACE variance estimators and commonly used approximations that ignore the estimation error in service receipt rates and outcome standard deviations. Our key finding is that the variance corrections have very little effect on the standard errors of standardized CACE impact estimators. Across the examined outcomes, the correction terms typically raise the standard errors by less than one percent, and change p-values at the fourth or higher decimal place.
  • "The Late Pretest Problem in Randomized Control Trials of Education Interventions." Peter Z. Schochet, October 2008. This report addresses pretest-posttest experimental designs that are often used in RCTs in the education field to improve the precision of the estimated treatment effects. For logistical reasons, however, pretest data are often collected after random assignment, so that including them in the analysis could bias the posttest impact estimates. Thus, the issue of whether to collect and use late pretest data in RCTs involves a variance-bias tradeoff. This paper addresses this issue both theoretically and empirically for several commonly used impact estimators using a loss function approach that is grounded in the causal inference literature. The key finding is that for RCTs of interventions that aim to improve student test scores, estimators that include late pretests will typically be preferred to estimators that exclude them or that instead include uncontaminated baseline test score data from other sources. This result holds as long as test score impacts do not grow very quickly early in the school year.
  • "Statistical Power for Regression Discontinuity Designs in Education Evaluations." Technical Methods Report. Peter Z. Schochet, August 2008. This report examines theoretical and empirical issues related to the statistical power of impact estimates under clustered regression discontinuity (RD) designs. The theory is grounded in the causal inference and hierarchical linear modeling literature, and the empirical work focuses on commonly used designs in education research to test intervention effects on student test scores. The main conclusion is that RD designs typically require samples three to four times larger than experimental clustered designs to produce impact estimates with the same level of statistical precision. Thus, the viability of using RD designs for new impact evaluations of educational interventions may be limited, and will depend on the point of treatment assignment, the availability of pretests, and key research questions.
  • "Guidelines for Multiple Testing in Impact Evaluations." Technical Methods Report. Peter Z. Schochet, May 2008. This report presents guidelines for education researchers that address the multiple comparisons problem in impact evaluations in the education area. The problem occurs due to the large number of hypothesis tests that are typically conducted across outcomes and subgroups in evaluation studies, which can lead to spurious significant impact findings.
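
The multiple comparisons problem described in the last entry can be illustrated with a short calculation. The sketch below is not taken from the report and does not reproduce its guidelines. It assumes independent hypothesis tests, each run at the same significance level, with no true impacts, and it shows how quickly the chance of at least one spurious significant finding grows with the number of tests. The function names are hypothetical, and the Bonferroni adjustment is included only as one familiar correction, not as the report's recommendation.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across `num_tests`
    independent tests at level `alpha` when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level under a Bonferroni adjustment."""
    return alpha / num_tests


if __name__ == "__main__":
    # Compare the unadjusted and Bonferroni-adjusted chance of a spurious finding.
    for m in (1, 5, 10, 20):
        unadjusted = familywise_error_rate(m)
        adjusted = familywise_error_rate(m, bonferroni_threshold(m))
        print(f"{m:>2} tests: unadjusted {unadjusted:.3f}, Bonferroni {adjusted:.3f}")
```

Under these assumptions, ten outcome or subgroup tests at the 5 percent level carry roughly a 40 percent chance of at least one spurious significant result, whereas the adjusted per-test threshold keeps that chance below 5 percent.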