2007-2020 MarketScan Data at Boston University

Boston University is now in its third year of licensing the MarketScan Commercial Claims and Encounters databases. The data are available free of charge to Boston University faculty, staff, and students for unfunded research, but researchers are required to request funding for the data in any externally funded research project. Interested researchers should contact Randy Ellis, the data manager.

The Truven Analytics MarketScan Commercial Claims Databases provide individual-level clinical utilization, expenditures, and enrollment across inpatient, outpatient, prescription drug, and carve-out services from a selection of large employers and health plans. The MarketScan Databases link paid claims and encounter data to detailed patient information across sites and types of providers, and over time. The annual medical databases include private sector health data from approximately 100 payers. Historically, more than 500 million claim records are available in the MarketScan Databases. These data represent the medical experience of insured employees and their dependents, covering active employees, early retirees, COBRA continuees, and Medicare-eligible retirees with employer-provided Medicare Supplemental plans.

While the information about the individuals is rather limited (age, gender, employment status, industry, MSA, enrollment information, plan type), the information about their utilization of medical care is highly detailed. Some of the most useful variables are: out-of-pocket payment (subdivided into deductible, coinsurance, and copayments) and total payment by service rather than by admission, detailed diagnosis and procedure codes, service codes, precise dates of visits and admissions, provider type, and facility information. The data also include detailed information on prescription drug claims, including information for identifying the specific drug purchased (down to the dose), the amount purchased, and the date of refills.

This vast amount of information allows researchers to construct general variables such as the financial risk of an enrollee (in terms of an age-gender and diagnosis-based risk score), an enrollee's annual out-of-pocket expenses, geographic variation in spending, and geographic variation in the use of a particular procedure or drug down to the state and MSA level (state, county, and 3-digit zip code level in the 2007-2010 data). It also allows researchers to construct more detailed individual-level variables such as cancer diagnosis and subsequent chemotherapy use, ER admission and subsequent readmissions, individual preferences for brand vs. generic pharmaceuticals, etc.
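As a rough illustration of how one such general variable might be built, the sketch below sums the three out-of-pocket components (deductible, coinsurance, copayment) over service-level claims to get an annual total per enrollee. The field names and the sample records are hypothetical and are not the actual MarketScan variable names or data.

```python
from collections import defaultdict

# Hypothetical service-level claim records. Field names are illustrative
# assumptions, not the actual MarketScan variable names.
claims = [
    {"enrollee_id": 1, "year": 2010, "deductible": 50.0, "coinsurance": 10.0, "copay": 20.0},
    {"enrollee_id": 1, "year": 2010, "deductible": 0.0, "coinsurance": 15.0, "copay": 20.0},
    {"enrollee_id": 2, "year": 2010, "deductible": 100.0, "coinsurance": 0.0, "copay": 0.0},
]


def annual_out_of_pocket(records):
    """Sum deductible + coinsurance + copay for each (enrollee, year) pair."""
    totals = defaultdict(float)
    for r in records:
        key = (r["enrollee_id"], r["year"])
        totals[key] += r["deductible"] + r["coinsurance"] + r["copay"]
    return dict(totals)


oop = annual_out_of_pocket(claims)
```

With files of the size described here, the same aggregation would more realistically be run in a database or statistical package; the point is only the grouping logic.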

There are separate tables for enrollee information (individual-level), outpatient claims (service-level), inpatient services (service-level), inpatient admissions (admission-level; aggregated version of inpatient services), prescription drug claims (prescription/refill-level), and facility information (facility-level). All of these tables can be linked using a unique enrollee ID. The unique enrollee IDs are constant across years, allowing researchers to follow individuals over time as long as they remain insured by the same payer.
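The linkage described above can be sketched as a simple lookup on the enrollee ID. This is a minimal example with hypothetical field names and made-up rows. Matching on year in addition to the ID is an assumption made here because enrollee attributes such as plan type can change from year to year; the source only states that the tables link on the enrollee ID.

```python
# Hypothetical miniature tables; field names and values are illustrative only.
enrollees = [
    {"enrollee_id": 1, "year": 2009, "age": 44, "plan_type": "PPO"},
    {"enrollee_id": 1, "year": 2010, "age": 45, "plan_type": "HMO"},
    {"enrollee_id": 2, "year": 2010, "age": 30, "plan_type": "PPO"},
]
outpatient = [
    {"enrollee_id": 1, "year": 2009, "payment": 120.0},
    {"enrollee_id": 1, "year": 2010, "payment": 80.0},
    {"enrollee_id": 3, "year": 2010, "payment": 60.0},  # no enrollment row
]


def link_claims(enrollee_rows, claim_rows):
    """Attach enrollee attributes to each claim, matching on (ID, year).

    Claims without a matching enrollment row are dropped (an inner join).
    """
    lookup = {(e["enrollee_id"], e["year"]): e for e in enrollee_rows}
    linked = []
    for c in claim_rows:
        e = lookup.get((c["enrollee_id"], c["year"]))
        if e is not None:
            linked.append({**e, **c})
    return linked


linked = link_claims(enrollees, outpatient)
```

Because the ID is constant across years, enrollee 1 appears in both 2009 and 2010 and can be followed through a change of plan type.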

The information in these tables comes directly from the payers (employers and insurance plans). Truven Analytics then cleans and verifies the data from each payer, de-identifies it, and combines it to form the final dataset. Because the data come from the payers, and the payers are paying Truven Analytics to provide them with accurate information and analysis about the claims, the incentives are aligned to provide accurate data.

The data includes an electronic copy of the Red Book list of all pharmaceuticals marketed in the US, along with information about each of the 350,000+ NDC (National Drug Code) values. Significant detail about the data is available in the accompanying data description and data quality appendices.

The versions we have use a six-month claims "runout", which is to say that claims for 2011 services are accepted through June 30, 2012.
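The runout rule can be expressed as a small date calculation. The sketch below assumes the cutoff is the last day of the month that falls the given number of months after December of the service year, and that acceptance is judged against a single claim date; the function and parameter names are hypothetical.

```python
import calendar
from datetime import date


def runout_cutoff(service_year, runout_months=6):
    """Last date on which claims for services in `service_year` are accepted."""
    # Count months forward from December of the service year.
    month_index = 12 + runout_months
    year = service_year + (month_index - 1) // 12
    month = (month_index - 1) % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


def within_runout(service_date, claim_date, runout_months=6):
    """True if a claim dated `claim_date` falls inside the runout window."""
    return claim_date <= runout_cutoff(service_date.year, runout_months)


cutoff_2011 = runout_cutoff(2011)
```

For 2011 services this gives a cutoff of June 30, 2012, matching the example in the text.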

The following table includes additional year-specific information about the data files:

Year    Number of Individuals    Size of all files    Geographic Detail
2007    35,305,924               203 GB               MSA, 3-digit zip code & county
2008    41,275,020               251 GB               MSA, 3-digit zip code & county
2009    39,970,145               263 GB               MSA, 3-digit zip code & county
2010    45,239,752               281 GB               MSA, 3-digit zip code & county
2011    52,194,324               321 GB               MSA and state ONLY
Total   213,985,165              1.319 TB

