Documents and OCR
Posted: 06 Nov 2009 12:31 PM PST
Part of what we believe is our job here at thestory.ie is not just to dig out new information via FOI requests. Another important part of the work we are doing will be to make existing information more accessible. We have already started this work through importing TD donations and expenses into Google spreadsheets, centralising the data and opening it up to Google bots. It also allows anyone else to come in and retool or visualise the data we share.
But another important effort is this: publishing existing documents in a more accessible format. We have already found hundreds of Government documents online that were scanned without OCR, meaning the contents of the documents are not searchable, nor (for the moment anyway) are they indexed by Google in a consistent way. Many of these documents are legacy, some from as far back as the mid 1990s.
We have begun a process of downloading these documents, OCRing them, and reuploading them. We will publish all documents to the new thestory.ie Scribd account, as well as to Google Documents. This will mean two things. First, the documents will be indexed by Google. Second, the documents will become instantly more usable to the general public, thus in a small way creating a more transparent government, and one slightly more accountable to the people. We are under no illusions that this effort will have any instant or major effect, but it will have a gradual one. And this furthers our aim of helping to create a more transparent Ireland.