Q: What concerns people most about putting data on the Web, as opposed to keeping it in a data warehouse?
Rick Schaffer: I would have to say data integrity.
Q: And why is that?
RS: I don’t know, sometimes data is like a family pet to people — they get very attached to it. They want to hold it close and control everything about it, thinking they are protecting it from the outside world.
Q: But isn’t there something to that logic?
RS: Actually, no. It’s a myth that controlled data is accurate data. The truth, which we see played out again and again with our clients, is that the more data is exposed, the better it gets.
Q: Is that counterintuitive?
RS: Not at all. If two people proofread your book, would they catch as many errors as two hundred people would?
Q: No, but a book isn’t critical data that can shift around, get changed or lost, and cause problems.
RS: Remember, we aren’t writing data here. We’re reading data that has already been written in another system so it can be analyzed, manipulated, and reported on. So if there’s a data integrity issue, chances are it rests with the system of origin, and/or the people doing the input. And our system will actually expose those kinds of issues.
Q: More than a data warehouse/business intelligence solution?
RS: Oh yes. One of the first things we tell clients is that our system is going to expose their current data integrity issues, and most clients will say, “We don’t have data integrity issues!”
Q: And what typically happens?
RS: They start using our system and find a bunch of data integrity issues. [laughs]
Q: But what makes your system so much more effective at exposing issues? Is it just the number of people looking at the data?
RS: That helps, but it’s also the power and flexibility of our toolset. Again, the more access and control you hand out to users, the more they are going to slice and dice things, and the more quickly they will expose anomalies. The users know the data; they know what looks right and what doesn’t. They’ll find the mistakes.
Q: Any examples come to mind?
RS: I’ll give just one tiny example. We have a really cool feature that plots zones on a Google map. We took the property valuation dataset from the County Assessor, filtered it by one zip code, and plotted the results. All of the properties were naturally bunched together on the map, except a handful of red blips that immediately popped up across town, in the wrong zip code. Bam, data integrity issue.
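[Editor's sketch: the zip-code check described above can be approximated in a few lines of code. This is only an illustration of the idea, not the system discussed in the interview. The record fields (`id`, `zip`, `lat`, `lon`), the sample parcels, the zip codes, and the 5 km threshold are all hypothetical. The sketch filters records to one zip code, takes the median coordinates as the center of the cluster, and flags any record that sits far from that center.]

```python
import math
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def flag_zip_outliers(records, zip_code, max_km=5.0):
    """Return records labeled with zip_code that lie more than max_km
    from the median location of all records with that zip code.

    max_km is an assumed threshold; a real one depends on zone size.
    """
    subset = [r for r in records if r["zip"] == zip_code]
    if not subset:
        return []
    # The median is used instead of the mean so a few bad points
    # do not drag the center toward themselves.
    center_lat = median(r["lat"] for r in subset)
    center_lon = median(r["lon"] for r in subset)
    return [
        r for r in subset
        if haversine_km(r["lat"], r["lon"], center_lat, center_lon) > max_km
    ]


# Hypothetical sample data: four parcels clustered together, one parcel
# carrying the same zip code but located across town, and one parcel
# that belongs to a different zip code.
parcels = [
    {"id": "P1", "zip": "00001", "lat": 32.220, "lon": -110.970},
    {"id": "P2", "zip": "00001", "lat": 32.221, "lon": -110.971},
    {"id": "P3", "zip": "00001", "lat": 32.222, "lon": -110.972},
    {"id": "P4", "zip": "00001", "lat": 32.223, "lon": -110.973},
    {"id": "P5", "zip": "00001", "lat": 32.300, "lon": -110.800},
    {"id": "P6", "zip": "00002", "lat": 32.301, "lon": -110.801},
]

flagged = flag_zip_outliers(parcels, "00001")
print([r["id"] for r in flagged])  # prints ['P5']
```

[A fixed distance threshold is the simplest possible rule. It would likely need tuning per zone, and a plotted map, as in the example above, lets a person who knows the area judge the result directly.]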
Q: And those anomalies wouldn’t have been spotted in the old system?
RS: They came from the old system. They’ve been like that for who knows how long. By giving users full access to the data and tools to work with it and experience it in new ways, you’re going to expose those kinds of things daily. And the data will get better and better.
Q: Then why do some people resist the concept of open data?
RS: Human nature, I guess. I don’t know, maybe some people still have fears relating to the early days of the PC, when there were a lot more issues with data getting messed up. Intuit had a glitch in a very early version of Quicken where if you hit the wrong button all of your financial records got deleted. Maybe things like that made people a little paranoid and possessive of their data and they haven’t gotten over it.
Q: You think the need to control has something to do with it?
RS: I do. And in some cases that’s legitimate, but remember, we’re talking about government data here. This is not intended to be proprietary information, and if there are legitimate security or privacy concerns, we have mechanisms for handling those situations. It doesn’t have to stop the flow of data.
Q: So what should the take-away be, regarding data integrity?
RS: Number one, it’s a myth that you can achieve perfect data quality, and we have to get over it. Number two, it’s certainly a myth that you can achieve data quality by controlling more and sharing less. The more you give users access, tools, and context, the better your feedback loop will be and the better your data will be.
