In the previous post, we discussed the key principles of setting up a data science practice. In this post, we’ll discuss the people dimension. Read what follows as suggestions, not prescriptions; there is more than one way to set up a data science practice.
Critical to the success of a data science practice are the following people elements:
- Strong in-house team of industry-leading data science practitioners.
- Strong confidence among business stakeholders in the data itself as well as the tools that are enabling data access and analysis.
- Investment in building a data-driven culture (the organisation’s “genetic wiring”) through data democratization, training, and alignment on measures of success across the businesses.
The sections below discuss the main aspects of these elements.
Teams in the Data World
The following five teams are needed in a data science practice:
- Data Science Team, whose role it is to generate meaningful and actionable insights from data. Their performance metrics include the number of analytics projects conducted and the value (dollar benefits and others) delivered by those projects.
- Data Governance Team, whose role it is to govern access to data. Their performance metrics include the time taken to grant access and the handling of access breaches.
- Data Quality Team, whose role it is to identify and resolve data quality issues. Their performance metrics include the number of data quality issues resolved and the time to resolution.
- Data Engineering Team, whose role it is to service data requests from the Data Science team and to provide proper understanding of data from a business use perspective. Their performance metrics include time to delivery of new data requests and the quality of those deliveries.
- Data Operations Team, whose role it is to manage data production, processing, and usage. Their performance metrics include time to delivery of data and feeds and the quality of those deliveries.
The roles of the Governance, Quality, and Operations teams are familiar and will not be discussed further in this document. We will focus instead on the operations of the Data Science and Data Engineering teams.
Roles and Responsibilities
The following roles are required within a data science practice.
Program Manager: The role of this person is to manage and coordinate the activities of the entire practice to meet the key performance indicators. The Program Manager will need to have deep knowledge of business, data, analytics, and general project management. The Program Manager’s main responsibilities are to set up the practice and roadmap, fill the key positions, and manage the day-to-day operations of the practice, including managing key stakeholder relationships and putting in place the right incentive schemes for the entire organisation to embrace data analytics as a way of life. The person needs to be a strong leader, and preferably leads by example. The Program Manager would need to be supported by a program administrator.
Lead Data Scientist: The role of this person is to manage and coordinate the activities of the Data Science team. The Lead Data Scientist will need to have deep knowledge of business and a good understanding of the strengths and limitations of data analytics. The Lead Data Scientist’s main responsibilities are to establish the right channels to connect the different stakeholders of an analytics project, set the right expectations for all stakeholders, including senior management, and set up the right training infrastructure to ensure proper and efficient knowledge transfer to staff. The person would typically be a leading data scientist with good research and development credentials, practical problem-solving experience in multiple industry verticals, and teaching experience at university level or in industry certification courses.
Lead Data Architect: The role of this person is to manage and coordinate the activities of the Data Engineering, Data Operations, and Data Quality teams. The Lead Data Architect will have deep knowledge of the architectures and operations of large-scale data warehousing and analytics platforms, as well as the requirements of the different users of such platforms. The Lead Data Architect’s main responsibility is to set up and maintain the analytics platform, including the management of all the data ingest and data request servicing processes across the organisation. The person would typically have extensive prior experience in building and managing such infrastructures.
Senior Data Scientist: The role of this person is to initiate and lead analytics projects. Senior Data Scientists will need to have broad knowledge of statistics, computer science (including parallel computations), business, and change management. The Senior Data Scientist’s main responsibility is to scope out and manage the execution of analytics projects, including stakeholder management and change management. The person would typically have extensive prior experience in conducting analytics projects and is expected to be hands-on when it comes to doing the analysis.
Data Scientist: The role of this person is to work with stakeholders and data to conduct appropriate analysis to answer business questions and improve business processes. The Data Scientist will need to have good knowledge of statistics, computer science, and business, usually with specialisation in one or two areas of analytical techniques and/or business domains. The Data Scientist’s main responsibilities are to understand business problems and to apply innovative and rigorous analytical techniques to solve them. The person would typically hold a postgraduate qualification in statistics and/or computer science and have a strong passion for deriving business value from data.
Data Engineer: The role of this person is to provide programming and data support to the data scientists. The Data Engineer will need to have good system administration skills and database skills such as SQL. The Data Engineer’s main responsibility is to help the data scientists with data access and manipulation tasks as well as the implementation of prototype systems. The person would typically hold a degree in computer science or IT.
Data Architect: The role of this person is to build up and maintain aspects of the analytics platform. The Data Architect will need to have good knowledge of system administration and data ingest processes. The Data Architect’s main responsibility is the installation and maintenance of data ingest, business intelligence reporting, and advanced analytics components. The person would typically be someone with prior experience in installing and maintaining large-scale data analytics systems.
Business Champion: The role of this person is to be a bridge between the data scientists and the lines of business. The Business Champion will have excellent knowledge of the different lines of business and the processes inside them. The Business Champion’s main responsibility is to engage with business to identify opportunities for using analytics to improve business processes and to initiate projects to capitalise on those opportunities. The person would typically be someone coming from a business background but with a strong interest in analytics.
Business Analyst: The role of this person is to support the Business Champion’s activities. The Business Analyst will have good knowledge of specific lines of business and the complex processes within them. The Business Analyst’s main responsibility is to engage with business to understand and translate their analytical needs and requirements. The person would typically be someone coming from a business background with an eye for detail.
Analytics Evangelist: The role of this person is to communicate the benefits and promote the uptake of data analytics across an organisation. The Analytics Evangelist will have a good knowledge of marketing, corporate education, and change management. The Analytics Evangelist’s main responsibility is to promote awareness of the benefits of data analytics through effective circulation of success stories; they would also provide training and establish channels for business to engage with the data science team. The person would typically be someone coming from a marketing or business background but with a strong interest in analytics.
Software Engineer: The role of this person is to provide software engineering support to the data scientists and data engineers. The Software Engineer will have good knowledge of modern software design and implementation methodologies as well as system integration skills. The Software Engineer’s main responsibility is to take prototype systems and produce industry-scale software that can be deployed in the enterprise. The person would typically have a software engineering degree and at least several years of experience in software development.
To be agile, the Data Science Team needs to have a relatively flat hierarchy that can be easily organised around the many iterative short-duration projects that are expected to be in place at any one time. The Lead Data Scientist will maintain overall management responsibility. Below that, the group members will be dynamically organised around analytics projects on the major value chains identified in an organisation.
There will be one senior data scientist and one business champion with shared responsibility for each major value chain or business unit. Below that, we have a shared pool of data scientists, data engineers, business analysts, and software engineers that can be brought into any one project across the major value chains. Each project will need, in general, two data scientists, one data engineer, and one business analyst. Software engineers are typically only required in solution deployment projects. Each of the data scientists, business analysts, data engineers, and software engineers in the shared pool will have skill specialisations. Different analytics projects will require different combinations of those skill sets.
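As a rough illustration of the staffing arithmetic above, the following Python sketch estimates how large the shared pool would need to be for a given set of concurrent projects. It is only a sketch under stated assumptions: the names `Project`, `is_deployment`, and `pool_demand` are hypothetical, each person is assumed to work on one project at a time, and one software engineer per deployment project is an assumed default rather than a figure from the text.

```python
from dataclasses import dataclass

# Per-project team from the text: two data scientists, one data engineer,
# and one business analyst.
BASE_TEAM = {"data_scientist": 2, "data_engineer": 1, "business_analyst": 1}


@dataclass
class Project:
    name: str
    is_deployment: bool = False  # hypothetical flag: solution deployment project


def pool_demand(projects, software_engineers_per_deployment=1):
    """Return the headcount per role needed to staff all projects at once.

    Assumes nobody is shared between concurrent projects (an assumption,
    not something the text states).
    """
    demand = {role: 0 for role in BASE_TEAM}
    demand["software_engineer"] = 0
    for project in projects:
        for role, count in BASE_TEAM.items():
            demand[role] += count
        # Software engineers are only added for deployment projects.
        if project.is_deployment:
            demand["software_engineer"] += software_engineers_per_deployment
    return demand


if __name__ == "__main__":
    running = [
        Project("churn analysis"),
        Project("pricing engine rollout", is_deployment=True),
        Project("fraud detection"),
    ]
    print(pool_demand(running))
```

In practice people in the pool are specialised, so a real plan would also need to match skills to projects rather than only count heads.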
Performance management within the Data Science team will be via peer reviews, consistent with how a scientist’s work is usually evaluated.
Specialty Skills Required
This section lists some of the specialty skills that would be required within a data science practice. It is impossible to expect any one person/role within the practice to know everything listed below, but the practice as a whole must possess all the skills listed. Towards that end, a balanced representation of these skill sets needs to be achieved in the process of staffing the practice.
Data scientists work at the intersection of data, computational science, statistics, and business. They thus need to be equipped with a broad education in all four areas. On the technical side, specialist training in one or more of the following topic areas is particularly useful: machine learning, artificial intelligence, Bayesian probability theory, decision theory, computational logic, parallel databases, optimisation techniques, operations research, automation, supply chain management, information retrieval, natural language processing, open source software development, behavioural economics, finance, accounting, and change management.
In addition to these technical skills, domain knowledge is also crucial. The exact type of domain knowledge required is context dependent, but analytics experience in one or more areas is very useful, since it should translate into an ability to learn new domains.
On the data engineering and architecture side, specialty skills required include deep knowledge of parallel database systems and large-scale data processing architectures like Hadoop, Spark and related systems.
Hiring and Retention
Given that a data science practice is all about its human infrastructure, it is crucial that the practice gets its hiring and retention policies right as a first order of business.
The hiring process must be rigorous for the flat organisational structure within the Data Science team to work well. These are the specific job requirements for a data scientist:
- A postgraduate degree from a reputable university, or equivalent industry experience, in one or more of the following topics: machine learning, artificial intelligence, robotics, computational linguistics, mathematics, economics, physics, and related areas.
- Prior experience at a leading organisation known for analytics innovations.
- Good programming skills.
- Excellent communication skills, both written and spoken.
- A team player who is also self-motivated.
Note that the above requirements are conjunctive: a person lacking any one of these skills or attributes would not be an effective data scientist.
The data science practice cannot compromise on the quality of the people it hires. I recommend adopting the interview process employed at Google, where a candidate goes through, usually within the same day, different kinds of interviews with the six to seven people the candidate is expected to work with in the role, and is only admitted when all the interviewers reach a consensus to employ. Places to look for suitable talent include top universities, innovative companies, and consulting firms like Deloitte and McKinsey.
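The conjunctive job requirements and the consensus rule for interviews described above amount to two simple "all must hold" checks. The Python sketch below shows that logic only; the `Candidate` fields, `meets_requirements`, and `hire_decision` are hypothetical names introduced for illustration, and reducing each requirement to a yes/no flag is a simplifying assumption.

```python
from dataclasses import dataclass, astuple


@dataclass
class Candidate:
    # One yes/no flag per listed requirement (a simplification).
    has_postgrad_or_equivalent: bool
    has_analytics_leader_experience: bool
    good_programmer: bool
    excellent_communicator: bool
    self_motivated_team_player: bool


def meets_requirements(candidate):
    """Conjunctive check: every requirement must be satisfied."""
    return all(astuple(candidate))


def hire_decision(candidate, interviewer_votes):
    """Hire only if all requirements hold and every interviewer agrees.

    An empty list of votes is treated as "no consensus" (an assumption).
    """
    return (
        meets_requirements(candidate)
        and len(interviewer_votes) > 0
        and all(interviewer_votes)
    )


if __name__ == "__main__":
    strong = Candidate(True, True, True, True, True)
    print(hire_decision(strong, [True] * 6))            # unanimous panel
    print(hire_decision(strong, [True] * 5 + [False]))  # one dissent blocks the hire
```

The point of writing it this way is that no single strength can offset a missing requirement or a dissenting interviewer.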
An effective way to organically grow an ongoing supply of new talent is to work with leading university researchers on linkage projects whereby postgraduate students are provided with incentives to work on joint projects and to spend time interning with an organisation. This strategy has proven very successful in many innovative companies.
A big challenge in maintaining a knowledge workforce is that the practice’s best assets walk out the door every evening. Maintaining a high level of job satisfaction among staff is thus as important as bringing the best people onboard in the first instance. A focus on proper remuneration, as well as being attentive to the professional needs of data scientists (like attending international research conferences and devoting some time to fundamental research and university teaching), goes a long way towards keeping them happy. Cycling staff through the different business units would also give them a sense of always having new challenges to solve. A well-run data science practice with a good reputation would also instil a sense of pride and belonging in its staff.
Learning and Growth
Data scientists need to be continually trained on the latest technologies and best practices in data analytics. Besides the standard curriculum covered in a multi-year, multi-disciplinary course on data and business analytics, in today’s world, knowledge of massively parallel relational databases, large-scale data processing architectures like Hadoop and Spark, in-database approaches to machine learning, and the latest developments in artificial intelligence and machine learning, two fields which are changing rapidly, is essential. Such knowledge can be acquired through courses offered by leading universities and companies that specialise in industry training.
Good candidates for the data scientist role are typically self-driven individuals who would naturally do a lot on their own to learn new skills. A flexible R&D environment needs to be set up to encourage such self-learning activities, with an emphasis on hands-on experimentation and experience.
The data science practice needs to provide its staff with online access to academic journals. Subscriptions to such online access (e.g. ScienceDirect) can be expensive and are best arranged through affiliations with the libraries of local universities. To serve the long-term interest of the practice and the community in general, the practice should have in place policies and actions that support Open Access Publishing. A physical library inside the practice is useful but not essential; books in the emerging data science field become outdated rather quickly anyway.
An important part of a data scientist’s professional development is regular attendance at top international research conferences. This allows them to keep in touch with the latest developments in their professional interest areas. Most organisations require staff to have papers published in a conference before supporting their travel. This can have the undesirable effect of having company data scientists spending too much time on publications instead of solving practical business problems, resulting in lost productivity for the organisation. An effective way to handle this issue is to unconditionally support data scientists to attend a number of research conferences every year.
In the next post, I will talk about some of the processes that need to be in place for a data science practice to run smoothly.