In trying to set up a Data Products team recently, I quickly realised I didn’t have a good working definition of what Data Products are or should be, at least not in the context I was operating in. Sure, a quick Google search will surface many generic definitions of Data Products from respected sources like Forbes and McKinsey. They are usually variations of DJ Patil’s definition: “A data product is a product that facilitates an end goal through the use of data.” But that generic definition is just not precise enough to help my team structure and prioritise all the “stuff” currently on our plate and in our backlog. The much longer definition covered in Patil’s book http://radar.oreilly.com/2012/07/data-jujitsu.html is much better, but it is not concise enough to be easily internalised as a mental model.
Clearly, we need to do something about this state of affairs, and here is my current working definition of Data Products, from a capitalist viewpoint if you like: Data products are durable, reusable information assets constructed from raw data and/or other information assets, that deliver measurable value to an addressable market of consumers, where the marginal cost of production is low.
Here, an asset is as defined in financial accounting: a resource that can produce positive economic value, measured in terms of the total sum of future cash flow (or savings) that can be achieved, discounted to today’s value using a (risk-free) rate of return. In Patil’s definition, a data product can be pretty much anything, but I have narrowed that to an Information Asset in our case, which can take one of the following non-exhaustive forms: (1) a conceptual or logical data model; (2) the physical realisation of a logical data model in the form of an enterprise curated and integrated dataset that can be queried; (3) a report or dashboard that presents the key performance indicators of an organisation; (4) a report or data visualisation that presents new insights or quantification of known business knowledge, including business and data anomalies, through linking and non-trivial inferences on previously unlinked datasets; (5) a statistical or causal model that takes as input a description of the state of the world and produces a correlated state or a set of possible next states; (6) a digital twin in the form of a mechanical / chemical / causal AI model that takes as input the current state and a desired future state and produces possible courses of action that can achieve the desired future state.
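To make the accounting definition of an asset a little more concrete, here is a minimal Python sketch of discounting future cash flows (or savings) to today's value. The function name, the yearly granularity, and all the numbers are my own illustrative assumptions, not figures from any real data product.

```python
def present_value(cash_flows, discount_rate):
    """Discount a series of yearly cash flows to today's value.

    cash_flows[0] is assumed to arrive at the end of year 1,
    cash_flows[1] at the end of year 2, and so on.
    discount_rate is the yearly (risk-free) rate of return, e.g. 0.05 for 5%.
    """
    return sum(
        cash_flow / (1 + discount_rate) ** year
        for year, cash_flow in enumerate(cash_flows, start=1)
    )


# Hypothetical example: a dashboard expected to save 100 (in any currency
# unit) per year for three years, discounted at an assumed 5% rate.
expected_savings = [100.0, 100.0, 100.0]
value_today = present_value(expected_savings, discount_rate=0.05)
print(f"Value today: {value_today:.2f}")
```

An information asset only qualifies as an asset in this sense if that discounted sum is positive once the cost of building and sustaining it is taken into account.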
Consumers can be internal or external to an organisation, and they are the ultimate assessors of a product’s value. Except in rare scenarios, the consumer should have a choice about whether to consume the product; this is so we can benefit from the creative destruction of capitalism, where lousy products are allowed to die and money and resources can flow to build and sustain good products. We definitely don’t want self-licking ice creams here.
The final two qualities of a Data Product are durability and reusability. Durability is what mandates the use of a (multi-disciplinary) product team to design, build, and sustain a Data Product using end-to-end product life cycle management practices. (Information assets with only a short shelf life can be built and decommissioned using a time-boxed project team, rather than a permanent product team.) Reusability is related to the economic value of packaging an information asset as a Data Product (as opposed to rebuilding the information asset from scratch each time there is a demand). This means the marginal cost of production to service additional consumers should be much lower than the total cost that would be incurred by a new consumer or producer wishing to build the data product from scratch. This is a measure of the scalability of a Data Product, and obviously we prefer to build highly scalable data products when given a choice.
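The reusability argument can be sketched as a simple cost comparison. This is only a rough model under assumptions of my own: a one-off build cost, a constant marginal cost per additional consumer, and a constant cost for any consumer to rebuild the asset from scratch. The class name, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    build_cost: float     # one-off cost to build the data product (assumed)
    marginal_cost: float  # cost to service each additional consumer (assumed)
    rebuild_cost: float   # cost for one consumer to build it from scratch (assumed)

    def product_total(self, consumers: int) -> float:
        """Total cost of serving `consumers` consumers from one shared product."""
        return self.build_cost + self.marginal_cost * consumers

    def rebuild_total(self, consumers: int) -> float:
        """Total cost if every consumer rebuilds the asset independently."""
        return self.rebuild_cost * consumers

    def break_even_consumers(self) -> float:
        """Number of consumers at which the shared product starts to pay off."""
        saving_per_consumer = self.rebuild_cost - self.marginal_cost
        if saving_per_consumer <= 0:
            # Serving a consumer costs as much as rebuilding: never pays off.
            return float("inf")
        return self.build_cost / saving_per_consumer


# Hypothetical numbers, for illustration only.
model = CostModel(build_cost=500.0, marginal_cost=10.0, rebuild_cost=200.0)
print(model.product_total(5))   # cost of one product serving five consumers
print(model.rebuild_total(5))   # cost of five independent rebuilds
print(model.break_even_consumers())
```

The lower the marginal cost relative to the rebuild cost, the fewer consumers are needed before packaging the asset as a product pays off, which is one way to read "scalability" here.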
There you have it, my heavily biased and opinionated definition of a Data Product.