In our last post, we explored how the meaning and importance of data dictionaries have changed over time. 20 years ago, a data dictionary referred to a list of field names and types spit out by a database. Now, a data dictionary serves as a sort of spec sheet for a dataset: it must be both accurate and attractive, putting the dataset in its best light so potential buyers understand what the data means and also why it’s valuable.
Let’s move one level up to the so-called vendor tear sheet. A tear sheet is a one page summary document describing a data vendor and the dataset offering. It too must serve several functions – to describe the vendor at a high level, such as legal company name and contact information, but also to describe the data product as a whole – what type of data it is, if it’s a time series dataset or not, how often the data is updated, how the data is delivered, what are the legal and contractual requirements when buying the dataset, does it contain any PII, etc. etc. At Syndetic we’ve read hundreds of vendor tear sheets, spoken with dozens of data buyers, and come to the conclusion that just as with data dictionaries, it’s time for tear sheets to be reformed.
Recently a group of researchers in the machine learning community published a paper called Datasheets for Datasets, proposing a standardized process for documenting datasets. They cite inspiration from the electronics industry, in which electronic parts “no matter how simple or complex” are accompanied by a datasheet outlining their characteristics, recommended use cases, test results, and so on. The authors propose an analogous standard in machine learning, where every dataset is accompanied by a datasheet outlining its motivation, composition, collection process, and recommended uses.
We’d like to expand this recommendation beyond the machine learning community to argue that any dataset exchanged between two companies should be accompanied by a standard datasheet, or tear sheet. Both sides of the market benefit. For data providers, completing the tear sheet encourages reflection on how their data is sourced, delivered, maintained, and used. Often a tear sheet also acts as an SLA between buyer and seller, with the data provider promising to notify their customer if, for example, the data is updated less frequently than it is stated in the tear sheet. The burden to monitor conformance to tear sheets today still falls predominantly upon the buyer. For data buyers, a standard tear sheet makes it easier to compare data vendors and datasets with each other. It encourages accountability and transparency within the industry. It also gives the buyer something to refer to if the data doesn’t seem to conform to the specs outlined in the tear sheet after purchase. In this way, it acts as another contract between data buyer and seller – not legally binding like the data licensing agreement signed by the parties, but equally important to the success of the ongoing relationship.
Managing tear sheets is becoming increasingly complex for both data providers and data buyers. This makes the adoption of a standard all the more urgent. Account managers at data providers must manage tear sheets across all sales channels: direct to customer and indirectly through data aggregators which have proliferated in the past 12 months. Just keeping track of which tear sheets have been sent to who can be a nightmare. Similarly, data buyers must manage the tear sheets they receive from vendors and view in the aggregators. Important points about the dataset are often lost in Sharepoint attachments or in the personal notes of an employee. Standards, and a centralized tear sheet management system, make it much more likely that these attributes don’t get lost.
Some industries are starting to adopt their own tear sheet standards, such as the one adopted by the Alternative Data Council within the financial services industry. We are a member of the council, and we built the recommended FISD standard into Syndetic so that our customers, data providers who often sell into financial services, can create and manage their tear sheets from one central system. You can create one for free here.
We expect that other industries will follow the lead of financial services and move to adopt their own best practices and standards for data tear sheets. We will be closely following the standards as they evolve. If you know of an industry working to adopt a standard data tear sheet, drop us a line at tearsheets@getsyndetic.com.