February 21, 2013 | Success Stories
Paper Data Capture for OpenGov: Captricity digitizes IRS 990 forms
By Kuang Chen
Ingredients for OpenGov
When government agencies — DARPA and the NSF — pioneered the Internet, they looked to civilians like Vint Cerf for the enabling technologies that knit together the various private networks, including ARPANET and NSFNET. The agencies provided the will, and innovative technologists provided the way. Together, they seeded the most transformative platform of their generation. The growing Open Government (OpenGov) movement, flying the banner of "Government as a platform," is no less ambitious. Its champions, like Jen Pahlka, Tim O'Reilly, and Gavin Newsom, push for opening government data so that citizens can enjoy better services. Their challenge is similar to that of the Internet's early days: governments need the will to provide the raw data, and enabling technologies must be available to knit that data together.
Captricity extracts 1M+ values from latest IRS 990Ts
We are proud to be an enabling technology for OpenGov. To demonstrate our capabilities, we sifted through 12 years of IRS 990 forms to auto-magically pull out all of the latest Form 990-T filings (what has been released of the 2011 tax year filings so far). We then released them as structured, machine-readable public datasets on our Open Data Portal. Contact us if you're interested in learning more!
How did Captricity extract the data?
Captricity uses massively parallel machine- and human-intelligence-powered algorithms in the cloud to extract data (even handwriting) from paper forms. We call it human-guided machine learning. What were the steps?
- Document Identification: First we used “clustering” algorithms to group all the documents of the same type together. Of the set we examined, there were 105 different “clusters,” or types of forms.
- Template creation: Because there are so many different types of forms, we did not want to spend even the 5-10 minutes per type marking each one up by hand. Instead, we used computer vision algorithms to analyze each cluster and automatically locate the data we wanted on each page.
- Page Extraction: Next, we used “classification” algorithms to find the key pair of pages in filings that sometimes ran over 100 pages. This was necessary because the key pages were often out of order.
- Data Extraction: Once all the pre-processing was completed, we used Captricity's standard service to “shred” and digitize the data.
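To make the clustering step concrete, here is a toy sketch in Python. Captricity's actual algorithms are not public; this only illustrates the idea of grouping pages by layout. It assumes each scanned page has already been binarized into a 2-D grid of 0/1 pixels, reduces each page to a coarse "ink density" fingerprint, and greedily groups pages whose fingerprints are close.

```python
def layout_signature(page, grid=4):
    """Downsample a page (2-D list of 0/1 pixels) into a grid x grid
    fingerprint of ink densities, one value per cell."""
    rows, cols = len(page), len(page[0])
    sig = []
    for gr in range(grid):
        for gc in range(grid):
            r0, r1 = rows * gr // grid, rows * (gr + 1) // grid
            c0, c1 = cols * gc // grid, cols * (gc + 1) // grid
            cell = [page[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            sig.append(sum(cell) / len(cell))
    return sig

def cluster_pages(pages, threshold=0.15):
    """Greedy nearest-representative clustering: a page joins the first
    cluster whose representative signature is within `threshold` (mean
    absolute difference); otherwise it starts a new cluster."""
    clusters = []  # list of (representative_signature, [page_indices])
    for i, page in enumerate(pages):
        sig = layout_signature(page)
        for rep, members in clusters:
            dist = sum(abs(a - b) for a, b in zip(rep, sig)) / len(sig)
            if dist < threshold:
                members.append(i)
                break
        else:
            clusters.append((sig, [i]))
    return [members for _, members in clusters]
```

Pages of the same form type share a layout, so their fingerprints land close together; a production system would use richer image features and a proper clustering algorithm, but the grouping principle is the same.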
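The page-classification step can be sketched the same way. This toy version scores each page of a filing against anchor phrases expected on the target form page, so the key pages are found even when they appear out of order; the anchor phrases and the text-matching approach here are illustrative assumptions, not Captricity's actual classifier.

```python
# Illustrative anchor phrases one might expect on a 990-T key page.
ANCHORS = ["form 990-t", "unrelated business taxable income", "employer identification number"]

def score_page(text):
    """Count how many anchor phrases appear in a page's OCR'd text."""
    lower = text.lower()
    return sum(1 for phrase in ANCHORS if phrase in lower)

def find_key_pages(pages, min_hits=2):
    """Return the indices of pages matching at least `min_hits` anchors,
    regardless of where they fall in the filing."""
    return [i for i, text in enumerate(pages) if score_page(text) >= min_hits]
```

Because pages are scored independently, a key page buried at position 87 of a 100-page filing is found just as easily as one on top of the stack.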
How did it all turn out? We cranked through 1M+ values in a couple of days, at one-fifth the cost of manual outsourcing (at double-entry quality). Contact us to find out more!