How complete are the CDC's COVID-19 Case Surveillance datasets for race/ethnicity at the state and county levels?
Abstract
The Covid Tracking Project was the most reliable source for COVID-19 data with race/ethnicity at the state level until it stopped collecting data on March 7, 2021. The CDC's Case Surveillance Restricted Access and Public Use with Geography datasets are the only available replacements for the Covid Tracking Project's dataset, and they additionally include county-level data and age along with race/ethnicity. This paper evaluates the completeness of the CDC datasets at the state and county levels in terms of (1) the total number of cases included compared to the New York Times, and (2) the number of cases included with race/ethnicity data compared to the Covid Tracking Project.
The CDC's Restricted Access dataset contains 78% of the cases in the New York Times up to April 15, 2021, and 65% of cases have race/ethnicity information vs. 67% in the Covid Tracking Project. The dataset's completeness has steadily and gradually improved over time; e.g., the first available version from May 2020 had race/ethnicity information for only 43% of cases. At the state and county levels, the dataset's completeness has also improved with a state-level average of 62% of cases with race/ethnicity in April 2021 vs. 46% in June 2020. However, the dataset's completeness at the state level is highly variable; for example, Minnesota has 102% of the cases included in the New York Times, while Louisiana has only 4% of the cases in the New York Times. Minnesota has 91% of cases with race/ethnicity, while Louisiana has only 19% with race/ethnicity (vs. 94% in the Covid Tracking Project). Texas alone is missing 2.8M cases, accounting for more than a third of the total 7.1M missing cases. New York is missing race/ethnicity for 1.3M cases and California for 1.1M cases, accounting for more than a quarter of the 8.6M cases missing race/ethnicity when combined.
The CDC's Public Use with Geography dataset is similar to the Restricted Access dataset for total case counts, but is less complete due to more privacy suppression; e.g., only 49% of cases have race/ethnicity information.
The CDC's Restricted Access dataset contains 78% of the cases in the New York Times up to April 15, 2021, and 65% of cases have race/ethnicity information vs. 67% in the Covid Tracking Project. The dataset's completeness has steadily and gradually improved over time; e.g., the first available version from May 2020 had race/ethnicity information for only 43% of cases. At the state and county levels, the dataset's completeness has also improved with a state-level average of 62% of cases with race/ethnicity in April 2021 vs. 46% in June 2020. However, the dataset's completeness at the state level is highly variable; for example, Minnesota has 102% of the cases included in the New York Times, while Louisiana has only 4% of the cases in the New York Times. Minnesota has 91% of cases with race/ethnicity, while Louisiana has only 19% with race/ethnicity (vs. 94% in the Covid Tracking Project). Texas alone is missing 2.8M cases, accounting for more than a third of the total 7.1M missing cases. New York is missing race/ethnicity for 1.3M cases and California for 1.1M cases, accounting for more than a quarter of the 8.6M cases missing race/ethnicity when combined.
The CDC's Public Use with Geography dataset is similar to the Restricted Access dataset for total case counts, but is less complete due to more privacy suppression; e.g., only 49% of cases have race/ethnicity information.