Paper Data Capture for OpenGov: Captricity digitizes IRS 990 forms
Ingredients for OpenGov
When government agencies — DARPA and the NSF — pioneered the Internet, they looked to civilians like Vint Cerf for the enabling technologies that knit together disparate research networks, including ARPANET and NSFNET. The agencies provided the will, and innovative technologists provided the way. Together, they seeded the most transformative platform of their generation. The growing Open Government (OpenGov) movement, flying the banner of “Government as a platform,” is no less ambitious. Its champions, like Jen Pahlka, Tim O’Reilly, and Gavin Newsom, push for opening government data so that citizens can enjoy better services. Their challenge is similar to that of the Internet’s early days: governments need the will to provide the raw data, and enabling technologies must be available to knit the data together.
Captricity extracts 1M+ values from latest IRS 990Ts
We are proud to be an enabling technology for OpenGov. To demonstrate our capabilities, we sifted through 12 years of IRS 990 forms to auto-magically pull out all the latest Form 990-T filings (what’s been released of 2011 tax year filings, so far). We then released them as structured, machine-readable public datasets on our Open Data Portal. Check them out! Set 1, Set 2
Why IRS 990 forms?
Beth Noveck best explains the power of the 990-T data in the Aspen Institute’s influential report on nonprofit sector data:
“The data that the IRS collects about nonprofit organizations present a great opportunity to learn about the sector and make it more effective. Yet this data could be made far more useful than it is today. It’s time to ‘liberate’ 990 data and make it easier to gain insight into the workings of America’s nonprofits.”
You see, while these forms have technically been publicly available for years, they were locked away on CDs costing tens of thousands of dollars. PublicResource.org led the charge to post 12 years’ worth of returns online as freely available PDFs. That was step one toward putting the data in the public’s hands. Still, you can’t map, sum, or otherwise analyze data trapped in PDF format, making it impossible to expose trends or gain the insight needed to guide research or inform policy at a high level.
How did Captricity extract the data?
Captricity uses massively parallel machine- and human-intelligence-powered algorithms in the cloud to extract data (even handwriting) from paper forms. We call it human-guided machine learning. What were the steps?
- Document Identification: First we used “clustering” algorithms to group all the documents of the same type together. Of the set we examined, there were 105 different “clusters,” or types of forms.
- Template creation: Because there are so many different types of forms, we did not want to spend even the 5–10 minutes per form type it would take to mark each one up manually. Instead, we used computer vision algorithms to analyze each cluster and automatically figure out where on each page the data we wanted appeared.
- Page Extraction: Next, we used “classification” algorithms to find the key pair of pages within each filing, which sometimes ran to over 100 pages. This was necessary because the key pages were often out of order.
- Data Extraction: Once all the pre-processing was completed, we used Captricity’s standard service to “shred” and digitize the data.
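To make the four steps concrete, here is a minimal, toy sketch of the pipeline in Python. Everything in it is a hypothetical stand-in: a page’s “signature” string substitutes for the visual features a real clustering or classification model would learn, and field names substitute for the bounding boxes that computer vision would locate. Captricity’s actual algorithms are not shown here.

```python
from collections import defaultdict

def cluster_by_layout(pages):
    """Step 1 (Document Identification): group pages whose layout
    signatures match into clusters (stand-in for real clustering)."""
    clusters = defaultdict(list)
    for p in pages:
        clusters[p["signature"]].append(p)
    return dict(clusters)

def build_template(cluster):
    """Step 2 (Template creation): infer a cluster's field set. Here we
    take the union of field names seen on the cluster's pages; a real
    system would locate field bounding boxes with computer vision."""
    fields = set()
    for p in cluster:
        fields.update(p["fields"])
    return sorted(fields)

def select_key_pages(pages, key_signature):
    """Step 3 (Page Extraction): classify which pages of a filing are
    the key pages, regardless of where they fall in the document."""
    return [p for p in pages if p["signature"] == key_signature]

def extract(pages, template):
    """Step 4 (Data Extraction): 'shred' each key page into a
    structured record keyed by the template's fields."""
    return [{f: p["fields"].get(f) for f in template} for p in pages]

# A two-page toy filing: a cover page plus one 990-T-style key page.
filing = [
    {"signature": "cover", "fields": {"org": "Example Org"}},
    {"signature": "990T-p1", "fields": {"org": "Example Org", "ubti": "1200"}},
]

clusters = cluster_by_layout(filing)
template = build_template(clusters["990T-p1"])
key_pages = select_key_pages(filing, "990T-p1")
records = extract(key_pages, template)
print(records)  # [{'org': 'Example Org', 'ubti': '1200'}]
```

The key design point the sketch preserves is that clustering and page classification happen before field extraction, so the expensive per-field work runs only on the pages that matter.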