As cofounder of Syndetic, I’ve talked to a lot of people about their data dictionaries. At this point, probably dozens of people, ranging from data governance managers at large enterprises to founders of early stage tech companies. And every single one of the lot hates their data dictionary. When I say hates, I mean that they say something like “Ugh. I won’t even show it to you. It’s an embarrassment.”
Why does everyone hate their data dictionary? A sort of meta-spreadsheet, a data dictionary on its face sounds like a relatively simple thing. It is a document describing the meaning of a dataset. Typically this includes field names and types (e.g. string, text, varchar) and maybe some annotations that describe the lineage of the data (where did it come from) and the business definition. But as with many workflows that are captured in spreadsheets, things can go awry very quickly.
- They are difficult to maintain.
The first person to create a data dictionary for their company usually has great intentions. They may be the first data scientist hired there, or the first data governance professional. They are diligent and organized, and dedicated to the mission of ferreting out every last bit of information about their information. They meticulously craft a spreadsheet (or google sheet) that contains the best information available to them at the time. They double and triple-check it for accuracy. But then, of course, things go off the rails.
Maintaining a data dictionary is not a full time job. And so, the person who created it cannot be expected to be thinking about this document at all times. They go back to doing their day job, and bit by bit, changes start to happen. Engineers change schemas without thinking to alert the person who created or keeps the data dictionary. Data salespeople call their prospects and walk them through the fields of the dataset, but realize that the annotations don’t quite make sense for the use case of their prospect. So they make a copy of the spreadsheet, change the annotation, and send it out. Product teams buy a system to capture data that used to be captured manually. And the dictionary very quickly gets out of date, often before anyone even realizes.
It is dangerous to have any company asset that is so dependent on one person in the organization, in this case the creator of the dictionary. If that person leaves the organization, all history of the document often goes with them, and a new person in that role may be tempted to just scrap it and start over. But then the problem repeats.
- They are necessary, but not sufficient, to fully explain the data.
Oftentimes, datasets are shared with business analysts or other non-technical people inside organizations who are tasked with assessing whether the data they are being provided is useful. For these people, receiving a data dictionary containing a bunch of field names and types is the standard. But it doesn’t really help them make the assessment they need to judge whether the data is of high quality; whether it is of better quality than can be purchased elsewhere by a different provider, or whether it has improved or decreased in quality over time. For these people, a data dictionary is often filed away and they go about other means to try to assess the data’s quality. Can you send me a sample of 100 rows? What are the coverage rates for each field? What are the most common values that I’m likely to see?
If it all looks good, and they start receiving the data, often they will move on to the next data provider and next assessment. Rarely do teams have the resources to monitor their incoming data files on an ongoing basis for anomalies, like a sudden increase in null values in a particular field. Even more rarely do teams conduct regular data assessments to ask for new sample sets or statistics on the data. They simply move on.
Data as a product is very different than an application or a service because its value is dependent on many other variables besides the data being good or not. For example, the usability of the data is extremely important. You can have the most complete dataset in the world on say, university rankings, but if the data is not usable, it is worthless. By useable, I really mean that it can be easily joined with other datasets. And that’s because people in the market to buy data on university rankings aren’t just curious whether Stanford is ranked #1 again this year. They want to answer questions that require the data to be joined with data on say, student populations, geography, or fundraising. Rarely can a dataset be so valuable in isolation. Data providers should understand this, and work as hard as possible to make their data easily combined with other datasets.
Another area where data as a product is special is that it is (usually) a collection of facts. However the data was collected, if a dataset contains information about property transactions, there is an objective truth to the amount of those transactions. A prospective buyer of that dataset is primarily concerned with whether the data is actually accurate or not. If it’s not, it’s not only worthless, but also potentially very damaging to that company’s business, as decisions will be made (such as pricing) in reliance on that data. It is in every data provider’s interest to invest as many resources as possible to the accuracy of its data with rigorous testing and monitoring.
- They’re ugly.
The standard in data dictionaries is the good old Excel spreadsheet, closely followed by a word document that has been saved as a PDF. It’s curious to me that for all of the time and money that companies spend on product marketing, they do not do a good job of marketing their actual product, which is the data itself. Software companies often pride themselves on design and on making their application as user-friendly and intuitive as possible. But when they receive an inquiry about their data, they send over a spreadsheet. Surely there is a better way.
- They cause confusion within the organization.
As with any workflow trapped in a spreadsheet, users of the spreadsheet often don’t know if they can trust it. Before sending out the dictionary to an important prospect, a salesperson may look it over and ask a few people in engineering or product if it’s still accurate. They are unlikely to know. If a current customer has a problem with the data, like if the file breaks, and they call up the support team at the data provider, the support team is going to check the actual file that was sent to the customer. They are not going to check the data dictionary. And so you have a reference document that is not really reliable, which sows confusion among many teams that need to work closely together to support their product. Confusion = time wasted that can be spent on other more valuable things.
Hate your data dictionary? Drop us a line at firstname.lastname@example.org.