On Data Quality
August 31, 2012 by Andrea Spillmann
I've been getting a lot of questions recently on data quality, how Captricity handles it, and what steps you can take to get higher quality data. We can think of quality as a combination of completeness and accuracy. Data collection has four essential steps, each with its own quality assurance mechanisms. I'll give a broad introduction to all four in this post, then do a series of individual posts on each:
- Form completion
- Transcription
- Data cleaning/coding
- Ambiguity resolution
Complete data sets have all their pages accounted for and filled in. Incomplete data can stem from a number of issues: people skip questions, coffee spills, forms get lost.
Quality assurance mechanisms:
One of the best ways to ensure complete data is to run an on-site check, as soon as possible after the forms are filled out, to make sure every form is completed and every page is accounted for. That way, if pages or data are missing, they can be found and corrected on the spot.
How Captricity helps:
When you upload data sets to Captricity, we’ll let you know if a document isn’t the right number of pages, so you can immediately find the missing information. With our mobile application, users can snap and review pages on-site, instantly. Additionally, Captricity lets you review data almost as soon as it is completed. Upload forms as they’re filled in, and within a few hours you can see the results, easily noting if an employee or enumerator is skipping questions, or if whole sheets of data weren’t included.
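The two completeness checks described above, page counts and skipped questions, can be sketched in a few lines of Python. This is only an illustration, not Captricity's implementation: the function names, the (document_id, page_number) pairs, and the `enumerator` field are assumptions made for the example.

```python
from collections import Counter


def find_incomplete_documents(pages, expected_pages):
    """Return {document_id: [missing page numbers]} for documents missing pages.

    `pages` is an iterable of (document_id, page_number) pairs, one per
    scanned page. Both names are hypothetical.
    """
    seen = {}
    for doc_id, page_no in pages:
        seen.setdefault(doc_id, set()).add(page_no)
    expected = set(range(1, expected_pages + 1))
    return {
        doc_id: sorted(expected - found)
        for doc_id, found in seen.items()
        if expected - found
    }


def blank_rate_by_enumerator(records, fields):
    """Return the share of blank answers per enumerator.

    `records` is a list of dicts with a hypothetical "enumerator" key
    plus one key per question in `fields`.
    """
    totals = Counter()
    blanks = Counter()
    for record in records:
        who = record["enumerator"]
        for field in fields:
            totals[who] += 1
            value = record.get(field)
            # Treat None and whitespace-only strings as skipped questions.
            if value is None or not str(value).strip():
                blanks[who] += 1
    return {who: blanks[who] / totals[who] for who in totals}
```

A high blank rate for one enumerator is a prompt to follow up while the forms are still on site, not proof that anything went wrong.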
Transcription, or data entry, is inherently error-prone. Typically, workers sit at a computer with handwritten text on paper in front of them, typing the text in. They can misread a word, or read it correctly and mistype it, or do both correctly but enter it into the wrong place.
Quality assurance mechanism:
Double entry is the standard technique for ensuring data accuracy. Two people independently enter the same document, the results are compared, and discrepancies are re-entered or verified by a third party. While double entry improves transcription accuracy, it is only one part of a complete quality assurance strategy. “Gold standard” values – entries whose value is already known – ensure workers are not cheating.
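As a rough illustration of the comparison step in double entry, the sketch below takes two transcriptions of the same form and lists the fields that need a second look. The dict-per-form layout and the choice to ignore surrounding whitespace are assumptions made for the example.

```python
def normalize(value):
    """Treat a missing value as empty and ignore surrounding whitespace."""
    return "" if value is None else str(value).strip()


def compare_double_entry(entry_a, entry_b):
    """Return {field: (value_a, value_b)} for every field where the two
    independent transcriptions disagree."""
    fields = sorted(set(entry_a) | set(entry_b))
    return {
        field: (entry_a.get(field), entry_b.get(field))
        for field in fields
        if normalize(entry_a.get(field)) != normalize(entry_b.get(field))
    }


first = {"name": "Ana Lopez", "age": "34", "village": "Kisumu"}
second = {"name": "Ana Lopez ", "age": "84", "village": "Kisumu"}
discrepancies = compare_double_entry(first, second)
# Only "age" is flagged; the trailing space in "name" is ignored.
```

Each flagged field would then go back for re-entry or to a third person for verification.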
How does Captricity ensure quality?
Captricity outperforms double entry, assuming the same workers.
Captricity’s algorithm combines both of these established methods, double entry and gold standard values. A human worker’s answer is compared against our machine vision algorithm’s answer. If they differ, a third worker is prompted to select the more accurate answer and to fix it if necessary. If the third worker’s answer still differs from the first two, a fourth worker is asked the same. We escalate to as many as five workers to reach agreement on the final answer. Gold standard values are sprinkled throughout. If a worker does not get these right, all of his or her work is rejected. The image on the top right shows how those responses come back to us.
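The escalation described above can be approximated with a short Python sketch. It is a simplified reading of the process, not Captricity's actual code: it assumes the machine answer counts as one of the five answers, and that "agreement" means a new answer matches any earlier one. All function and parameter names are hypothetical.

```python
def resolve_field(machine_answer, worker_answers, max_answers=5):
    """Return (final_answer, answers_used), or (None, answers_used)
    if no two answers agree within `max_answers`.

    `worker_answers` lists human answers in the order they would be
    requested; later answers are only consulted when needed.
    """
    seen = [machine_answer]
    for answer in worker_answers:
        if len(seen) >= max_answers:
            break
        if answer in seen:
            # Agreement reached: this answer matches an earlier one.
            return answer, len(seen) + 1
        seen.append(answer)
    return None, len(seen)


def workers_to_reject(submissions, gold):
    """Return the set of workers who got any gold standard value wrong.

    `submissions` maps worker id -> {field id: answer};
    `gold` maps field id -> known correct value.
    """
    return {
        worker
        for worker, answers in submissions.items()
        if any(
            field in answers and answers[field] != known
            for field, known in gold.items()
        )
    }
```

For example, `resolve_field("42", ["47", "47"])` settles on "47" after three answers, because the second human answer agrees with the first. A field that returns `None` would need separate handling, which the post does not describe.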
Once data is entered, the task of cleaning and coding begins.
Quality assurance mechanisms: In this stage, you correct misspellings, expand abbreviations, look up codes to see if they exist in your database, and analyze or cross-check answers on related questions.
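Several of these cleaning steps are easy to script. The sketch below is one possible approach using only the Python standard library; the abbreviation table, the vocabulary, the code list, and the field names `has_children` and `number_of_children` are invented for illustration.

```python
import difflib


def expand_abbreviations(text, abbreviations):
    """Replace known abbreviations word by word (case-insensitive keys)."""
    return " ".join(
        abbreviations.get(word.lower().rstrip("."), word)
        for word in text.split()
    )


def correct_spelling(value, vocabulary, cutoff=0.8):
    """Snap a value to its closest known spelling, or leave it unchanged
    when nothing in the vocabulary is similar enough."""
    matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else value


def unknown_codes(records, field, valid_codes):
    """Return the records whose code is not in the reference list."""
    return [r for r in records if r.get(field) not in valid_codes]


def inconsistent_children(record):
    """Cross-check two related questions (hypothetical field names).
    Assumes the count, when present, is numeric."""
    count = int(record.get("number_of_children") or 0)
    return record.get("has_children") == "no" and count > 0
```

Automatic spelling correction can silently change a valid but unusual answer, so it is worth logging every replacement and reviewing it against the original form.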
How Captricity helps: As with manual entry, after Captricity finishes, there may be more to do. If you find a coding mistake, Captricity makes it easy to re-run a set of values or edit individual values. The Captricity API can automatically port data into any other website, software or system you use for data analysis or business processes, simplifying both data validation and use.
Once data is cleaned and in use, ambiguities may still arise.
Image 3: View original form snippet alongside entered value
Quality assurance mechanisms: Typically, you would resolve these by keeping and referring back to the original data. With stacks of paper forms, this can become complicated and is often the hardest part of the workflow.
How Captricity helps: Captricity simplifies this issue. We keep both the full form image and the individual form snippet images, so digitized values can be compared to the original handwriting from which they were entered, and full pages can be examined.