In this third post on setting up a data science practice, I address the analytics processes that need to be in place for an organisation to maximise the value it derives from analytics.
After more than two decades of practice and development, there are now well-established data analytics frameworks like the Cross Industry Standard Process for Data Mining (CRISP-DM). These standard processes, enriched with ideas from the latest developments in big data best practices and methodologies, form the subject matter of this post.
Journey Towards a Predictive Enterprise
The transition to a predictive enterprise involves a two-stage evolution. In the first stage, the organisational goal is to achieve the status of “sensing and responding in a timely manner”. Here, we would expect the company to become aware of its environment and data holdings. An example would be understanding customers at the detailed individual level using a Customer Lifetime Value (CLV) analysis. The CLV metrics can then be continuously updated and used to produce ongoing reports that enable the organisation to respond rapidly to any changes in the marketplace.
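To make this concrete, here is a minimal sketch of a simple historical CLV computation in Python, assuming a hypothetical transactions table with customer_id, order_date, and revenue columns; real CLV models (discounting, survival modelling, and so on) are considerably more sophisticated.

```python
import pandas as pd

# Hypothetical transactions table: customer_id, order_date, revenue.
transactions = pd.read_csv("transactions.csv", parse_dates=["order_date"])

summary = transactions.groupby("customer_id").agg(
    first_purchase=("order_date", "min"),
    last_purchase=("order_date", "max"),
    n_orders=("order_date", "count"),
    total_revenue=("revenue", "sum"),
)

# Average revenue per order times an (assumed) expected number of future
# orders gives a crude forward-looking CLV estimate per customer.
summary["avg_order_value"] = summary["total_revenue"] / summary["n_orders"]
tenure_days = (summary["last_purchase"] - summary["first_purchase"]).dt.days.clip(lower=1)
summary["orders_per_year"] = summary["n_orders"] / (tenure_days / 365.25)

EXPECTED_LIFETIME_YEARS = 3  # assumed planning horizon
summary["clv_estimate"] = (
    summary["avg_order_value"] * summary["orders_per_year"] * EXPECTED_LIFETIME_YEARS
)
```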
In the second stage, the organisation moves on from “sensing and responding in a timely manner” to “predicting and acting”. For example, using the CLV metrics computed, the organisation can start to construct models for customer acquisition, product recommendation, and churn prediction. These predictive models, when plugged into a digital layer with sensors and actuators connected to the analytics platform, then allow the organisation to take actions in real time to achieve better business outcomes. Another example is a smart drilling platform in oil and gas production. Data from several sources can be fed into analytical models that continuously monitor the drill position, drilling fluid pressure, and other important metrics, and automatically correct for any deviation from acceptable parameters. Additionally, the models can predict the status of the drill hole in the immediate future and, when necessary, raise an alarm and take appropriate actions to prevent blowouts and other accidents.
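As an illustration of this predict-and-act pattern, here is a minimal sketch that extrapolates a monitored metric a few steps ahead and raises an alarm when the forecast leaves an acceptable band; the window size, thresholds, and trigger_failsafe() hook are all illustrative assumptions, not a prescription for a real drilling system.

```python
from collections import deque

WINDOW, HORIZON = 20, 5    # samples used for the fit; samples ahead to predict
LOW, HIGH = 180.0, 220.0   # acceptable pressure band (hypothetical units)

readings = deque(maxlen=WINDOW)

def trigger_failsafe(forecast):
    # Placeholder for the corrective action / alarm.
    print(f"ALARM: forecast pressure {forecast:.1f} outside [{LOW}, {HIGH}]")

def on_new_reading(value):
    readings.append(value)
    if len(readings) < WINDOW:
        return
    # Fit a straight line through the window and extrapolate HORIZON steps ahead.
    n = len(readings)
    x_mean, y_mean = (n - 1) / 2, sum(readings) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(readings)) / \
            sum((x - x_mean) ** 2 for x in range(n))
    forecast = y_mean + slope * ((n - 1) - x_mean + HORIZON)
    if not LOW <= forecast <= HIGH:
        trigger_failsafe(forecast)
```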
The following figure shows the dynamic processes inside a predictive enterprise.
Solid processes for democratising data, identifying analytics opportunities, building predictive models, and deploying those models in the business environment need to be in place for an organisation to make the transition to an analytics-driven enterprise.
Democratising Data
The phrase “Democratising Data” was made famous by the launch of the website http://data.gov by the US government, which brought together, on a single website, economic, healthcare, environmental, and other government data that were hitherto fragmented across multiple sites and formats, making them hard to use and often hard to access in the first place. The initiative was created as part of the US government’s commitment to open government and democratising information, and it has since seen widespread adoption by local and national governments throughout the world. (See, for example, http://data.gov.uk, http://data.gov.au, and http://data.gov.sg.)
I believe the philosophy behind data democratisation applies equally well inside the walls of enterprises. The central tenet behind a modern data lake, that of pulling all sources of business data into a central repository which is then made widely accessible for all forms of analysis, is in fact a realisation of the general data democratisation idea.
Our particular approach to the design and operation of such a central repository is the Magnetic, Agile, Deep (MAD) approach to data warehousing and analytics. The central MAD philosophy is to get the organisation’s data into the data lake as soon as possible; in other words, the platform must be magnetic instead of repellent like a traditional Enterprise Data Warehouse, especially with respect to new sources of data and business users. Secondary to that goal, the cleaning and integration of the data should be staged intelligently. The agile component refers to the data lake’s support for data scientists performing their analysis within the platform, on all available data. This minimises unnecessary and time-consuming data movement, reduces analytics cycle time, and makes it easier for successful analytics models to be deployed in the enterprise. Finally, the deep component refers to the data lake’s support for deep predictive analytical techniques on big data, via in-database analytical techniques like those being developed in the open-source libraries MADlib, Mahout, and MLlib.
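As a small illustration of the deep theme, the following sketch fits a model with Spark’s MLlib directly on a table that already lives in the platform, avoiding data extraction to a separate tool; the table and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("in-platform-analytics").getOrCreate()

# The data stays in the platform; only the computation comes to it.
df = spark.table("production.customer_features")  # hypothetical table

features = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_calls"],  # hypothetical columns
    outputCol="features",
).transform(df)

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(features)
print(model.coefficients)
```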
To turn these themes into practice, a three-layer approach to warehouse design is required. A Staging schema should be used when loading raw fact tables or logs. Only data engineers and data scientists are permitted to manipulate data in this schema. The Production schema holds the aggregates that serve most users. Sophisticated users comfortable in a large-scale SQL/MapReduce environment are given access to this schema. A separate Reporting schema is maintained to hold specialised, static aggregates that support reporting tools and casual users. This last schema should be tuned to provide rapid access to modest amounts of data.
The three layers are not physically separated. Users with the right set of permissions are able to cross-join between layers and schemas. Data scientists should also be given a fourth class of schema called sandboxes. The sandbox schema is under the data scientist’s full control, and is to be used for managing their experimental processes.
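A minimal sketch of how the three layers plus sandboxes might be set up on a PostgreSQL-style platform follows; the role and schema names are illustrative assumptions.

```python
import psycopg2

ddl = """
CREATE SCHEMA IF NOT EXISTS staging;        -- raw fact tables and logs
CREATE SCHEMA IF NOT EXISTS production;     -- aggregates serving most users
CREATE SCHEMA IF NOT EXISTS reporting;      -- static aggregates for casual users
CREATE SCHEMA IF NOT EXISTS sandbox_alice;  -- one sandbox per data scientist

GRANT ALL   ON SCHEMA staging       TO data_engineers, data_scientists;
GRANT USAGE ON SCHEMA production    TO power_users;
GRANT USAGE ON SCHEMA reporting     TO reporting_users;
GRANT ALL   ON SCHEMA sandbox_alice TO alice;
"""

# Connection string and role names are placeholders for a real deployment.
with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
```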
Identifying and Scoping Analytics Projects
A key step in transitioning an organisation to a predictive enterprise is the creation of an Analytics Roadmap. At the inception of the data science practice, the business champion and senior data scientist responsible for each of the major value chains/business units will lead, in collaboration with key stakeholders, the creation of an analytics roadmap that identifies and prioritises the major analytics projects within the respective business areas. These roadmaps form the basis of all work activities in the data science practice for the foreseeable future, and they also inform the architectural requirements for the technology platform itself.
The following is a summary of the process for creating an analytics roadmap:
- Identify key stakeholders and their business goals.
- Identify data flows and generation within the business and confirm data availability.
- Identify and map out possible analytical projects in alignment with the stakeholders’ business goals, including an assessment of possible value generation for each project.
- Identify requirements, dependencies, and estimated amount of effort required for each analytics project.
- Conduct a prioritisation exercise on the identified projects (a simple value-versus-effort scoring heuristic is sketched below).
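As one illustration of the final step, the following sketch ranks hypothetical candidate projects by estimated value relative to estimated effort, discounting projects whose data availability is unconfirmed; the projects and scores are entirely made up.

```python
projects = [
    {"name": "Churn prediction",       "value": 8, "effort": 5, "data_ready": True},
    {"name": "Product recommendation", "value": 9, "effort": 8, "data_ready": True},
    {"name": "CLV reporting",          "value": 6, "effort": 3, "data_ready": False},
]

def priority(p):
    # Projects without confirmed data availability are pushed down the list.
    penalty = 1.0 if p["data_ready"] else 0.5
    return penalty * p["value"] / p["effort"]

for p in sorted(projects, key=priority, reverse=True):
    print(f"{p['name']}: score {priority(p):.2f}")
```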
Proper communication channels need to be established between the different parties to facilitate the roadmapping process. Senior executives need to be involved at an early stage to set strategic directions. There must be at least one business sponsor from the line of business committed to spending time sharing information and facilitating discussions between the analytics team and the business end users. When there are significant solution-deployment issues, a person from IT also needs to be at the discussion table from early on. Transparently involving all stakeholders through socialisation, joint prioritisation meetings, and consensual sign-off on documents is critical to the success of any analytics roadmapping effort.
A well-constructed roadmap will typically be segmented into several phases. These phases will be designed so that there are concrete deliverables at the end of every phase and later phases can build upon the infrastructure created in the early ones. A good roadmap should also be flexible enough to allow lessons learned from the earlier phases to be used to modify the later phases, if necessary. Naturally, analytics roadmaps need to be reviewed and updated on an ongoing basis at the conclusion of each key phase of execution (typically quarter by quarter).
Building Analytics
Consistent with the guiding principles in this post, the activities within the data science practice will be dynamically organised around the conduct and management of a collection of iterative, short-duration projects, each ranging from three to twelve weeks. We now discuss the general philosophies and processes governing the conduct of these analytics projects.
Iterative Solution Development via Rapid Prototyping

A typical execution plan for an analytics project has four major steps: Data, Modelling, Statistical Validation, and Business Validation. The Data step involves identifying all relevant data and figuring out how different data sources can be fused and transformed into a convenient form for analysis. The Modelling step involves a careful formulation of the business problem as a rigorous mathematical problem and investigations into how existing algorithms and tools can be used to solve it. Once a suitable model is constructed, we then move to the Statistical Validation step to study the robustness and the predictive/descriptive power of the model. If a model passes the statistical tests, we then move to the Business Validation step to verify that the model makes sense in the business context and to study how it can be deployed. The four major steps do not form a strictly linear process, of course. At every step, the analyst will inevitably discover new insights that require revisiting previous steps to revise the work done there. To emphasise a point made repeatedly throughout this post, the process is iterative, as shown in this figure.
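To make the loop concrete, here is a minimal sketch of the Data, Modelling, and Statistical Validation steps using scikit-learn on synthetic data; the acceptance threshold is an illustrative assumption, and Business Validation remains a human review step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Data step stand-in: in practice this is the fused, transformed business data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)                       # Modelling step
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")  # Statistical Validation

if scores.mean() >= 0.75:  # assumed acceptance threshold
    print(f"Mean AUC {scores.mean():.3f}: promote to Business Validation")
else:
    print("Model rejected: return to the Data or Modelling step")
```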
To allow mistakes to be made quickly and cheaply in this iterative process towards a final solution, we recommend the strategy of rapid prototyping at every step. Building a prototype is the easiest way to crystallise abstract thoughts, and that, in turn, can help in uncovering problems and unrealistic assumptions quickly. To manage the volume of software prototypes and documentation that will be generated in the iterative process, the data science practice will need a good collaborative platform that properly supports information management and sharing activities.
In certain applications where experiments are cheap to run (e.g. finding optimal ways of displaying products on a web page), real-time randomised testing can be conducted to collect actual performance data on different strategies/models. This is something Yahoo and eBay, for example, do on a daily basis.
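For illustration, here is a minimal sketch of how the results of such a randomised test might be evaluated with a two-proportion z-test; the conversion counts are made up.

```python
from math import sqrt
from statistics import NormalDist

conv_a, n_a = 120, 2400   # conversions and visitors under the current layout
conv_b, n_b = 150, 2400   # conversions and visitors under the new layout

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test

print(f"Lift: {p_b - p_a:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```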
A business sponsor from the line of business needs to be available throughout the conduct of an analytics project to provide ongoing information support to the data scientists. The business sponsor also plays an important role in trialling and obtaining feedback on the many prototypes that will be developed in the course of an analytics project. Such joint participation in the development of an analytics solution will ease change management when it comes time to implement the new solution.
Peer Review

The nature of statistical modelling is such that there are usually multiple ways of formulating a business problem mathematically and even more ways of solving it. To make sure optimal approaches (given what is known) are always taken in tackling a problem, a peer review process needs to be put in place and made a core part of every analytics project. A simple process is to have each project team give a public seminar at each of the initial, middle, and final stages of a project. Group participation in these information-sharing and feedback-solicitation sessions needs to be encouraged within the data science practice.
Discover New Insights

While most analytical projects conducted in the data science practice will be initiated by the business and focused on a particular business problem, there needs to be a small number of completely exploratory projects whose aim is to discover the applicability of newly available analytical techniques/tools or simply to improve understanding of what lies hidden inside an organisation’s data holdings. Such projects are valuable because they encourage self-learning across a broader range of topics than a person’s day-to-day work provides and that, in turn, can lead to fresh perspectives and solutions to old or existing problems. We recommend that approximately 10-20% of a data scientist’s time be devoted to such projects.
Prediction Markets

The effectiveness of prediction markets at accurately estimating the probability of highly complex and dynamic events (like election results and significant events like the collapse of the Euro by a certain date) that do not succumb to standard statistical analysis is now reasonably well understood. Crowdsourcing solutions to difficult problems within an organisation can be encouraged by setting up a prediction market like http://intrade.com. The overwhelming success of the Netflix and Goldcorp challenges bears testimony to the potential of these solution-finding strategies.
Deploying Analytics
There is a third category of projects that take on a more IT focus, in that they involve the deployment of analytical solutions developed by the data scientists. The conduct of such projects is now described.
Moving From Prototype To Production

The first step in deploying analytics solutions involves the conversion of prototype R&D software produced by the data scientists and data engineers into industrial-strength software that can be integrated with existing business systems. The software-engineering best practices needed to support such activities are well understood and will not be discussed further here. The deployment model needs some consideration, however. The recommended approach is to have the new analytics solution and the existing solution run side by side for a period of time, collect data to compare the performance of the two systems, and then cut over to the new analytics solution once a definite measurable improvement, or some other agreed KPI, can be established. Changes to business practices arising from the deployment of new analytics solutions need to be clearly documented in the business process blueprint.
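As an illustration of the side-by-side comparison, the following sketch applies a paired t-style check to hypothetical daily KPI readings from the two systems; in practice the KPIs and the decision rule would be agreed with the business beforehand.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical daily KPI readings collected while both systems run side by side.
kpi_existing = [0.62, 0.60, 0.61, 0.63, 0.59, 0.62, 0.60]
kpi_new      = [0.66, 0.64, 0.63, 0.67, 0.65, 0.66, 0.64]

diffs = [n - e for n, e in zip(kpi_new, kpi_existing)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# With 6 degrees of freedom, t > ~2.45 corresponds to p < 0.05 (two-sided).
if t > 2.45:
    print(f"t = {t:.2f}: measurable improvement established, cut over")
else:
    print(f"t = {t:.2f}: keep running side by side")
```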
Online Self-Learning Systems

The models deployed in most analytics solutions are non-self-learning, in the sense that they are constructed using historical data and rigorously validated, both statistically and with business users. They are well understood and they do not change while in deployment. However, in a small number of analytics applications, the models need to be modified in real time with every piece of new data to achieve maximum adaptability. Deployment of such self-learning models requires a more thorough implementation plan that puts in place an ongoing monitoring framework, one that collects feedback on system performance and defaults to a fail-safe mechanism when the self-learning models fail to perform for whatever reason.
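A minimal sketch of this pattern, assuming scikit-learn’s SGDClassifier as the self-learning model, follows; the rolling-accuracy monitor, the threshold, and the fallback rule are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online model updated with every observation (requires a recent scikit-learn
# for loss="log_loss"; older versions use loss="log").
model = SGDClassifier(loss="log_loss")
model.partial_fit(np.zeros((1, 3)), [0], classes=[0, 1])  # bootstrap the classes

rolling, WINDOW, FLOOR = [], 200, 0.7  # assumed monitoring window and accuracy floor

def fallback_rule(x):
    return 0  # fail-safe default action (placeholder)

def predict_and_learn(x, y_true):
    x = np.asarray(x).reshape(1, -1)
    # Revert to the fail-safe when rolling accuracy drops below the floor.
    degraded = len(rolling) == WINDOW and np.mean(rolling) < FLOOR
    y_pred = fallback_rule(x) if degraded else int(model.predict(x)[0])
    rolling.append(int(y_pred == y_true))
    if len(rolling) > WINDOW:
        rolling.pop(0)
    model.partial_fit(x, [y_true])  # learn from every new data point
    return y_pred
```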
Governance
Data Security and Privacy

Bringing all sources of data generated inside and outside of an organisation onto an analytics platform like a data lake poses considerable data security and privacy issues that need to be addressed. Policies, procedures, and systems need to be put in place right from the beginning to ensure appropriate levels of security and access for all users. Personally-identifiable information and enterprise-sensitive data, in particular, need special protection.
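As one small illustration, the following sketch pseudonymises personally-identifiable fields with a keyed hash before records land in broadly accessible schemas; the field names are hypothetical, and key management is assumed to be handled by a proper secrets service.

```python
import hashlib
import hmac

# Placeholder: in production this key comes from a managed secrets service.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymise(value: str) -> str:
    # Keyed hashing gives stable pseudonyms without exposing the raw identifier.
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"customer_id": "C-10482", "email": "jane@example.com", "spend": 131.50}
safe_record = {
    "customer_id": pseudonymise(record["customer_id"]),
    "email": pseudonymise(record["email"]),
    "spend": record["spend"],  # non-identifying fields pass through unchanged
}
```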
Intellectual Property Management

In the course of its business, the data science practice is expected to produce a significant amount of intellectual property in the form of innovative solutions to business problems. A balance needs to be struck between keeping this IP private to maintain competitive advantage and contributing it back to the community (through publications or open-source software) to promote a viable open research community whose work will continue to benefit the organisation. A patent application process that rewards innovation through profit-sharing with patent inventors also needs to be established.
Risk Assessment and Mitigation
The following lists the key risks facing the implementation of a data science practice, together with mitigation plans:
- Hiring and retention – Focus on compensation and professional growth
- Business risk – Executive backing and public communications
- Data availability and accessibility – Escalation management and removal of technology hurdles
- Data quality and usefulness – Iterative adaptive methodology for risk reduction
- Adoption resistance – Evangelise success stories and democratise data
Many of the key risks are associated with change management, and one effective way to address them is to facilitate proper communication between all stakeholders. Towards that end, I recommend setting up processes that allow staff from the data science practice to be seconded to a line of business for a period of time to gain a proper understanding of the business, and staff from the lines of business to spend time at the practice to obtain a better understanding of the benefits of analytics. This is in addition to the usual evangelising and executive support that are needed to push through change adoption in a large enterprise.
A seldom-mentioned risk is over-reliance on analytics. In our day-to-day decision making, there is usually a balance to be struck between intuition or gut feelings gained from years of experience and what we can infer from (existing) data. In many enterprises today, the main decision-making risk lies in not paying enough attention to data. In a predictive enterprise, the main risk is not failing to consider data in decision making, but being slavish to it. Every decision should be informed by data, but in a way that recognises the strengths and limitations of data analytics.