I have given quite a few data science training courses over the years, and those conducted for industry participants are, without a doubt, the most challenging. There are a few reasons for this:
- The training course tends to be quite short, typically 3-5 days, so the trade-offs between depth and breadth, and between theory and practice, need to be handled carefully.
- The participants usually come from diverse backgrounds; it’s not unusual to have (budding) data scientists, business analysts, data engineers, data/solution architects, and senior executives all in the same room, each expecting to learn something new.
- To make matters worse, participants are usually impatient, often expecting to pick up a few practical (magic) tricks they can apply in their work straightaway, and many are math-averse, programming-averse, or both at the same time.
So what should one try to cover or learn in a 3-5 day industry data science course? What is the core curriculum?
In trying to answer that question, we must first recognise that data science is a multi-disciplinary field. I have tried to distill what I consider the key skills a well-rounded data scientist needs into the following DS Skills Tree, which covers four main areas and gives a breakdown and progression of the topics that need to be learned in each area. (The breakdown and progression are necessarily subjective and reflect my own biases.)
I will focus on the Technical domain of the Skills Tree in this post, in particular the statistical modelling aspect; the other domains will be treated in their own right. As is commonly done elsewhere, the broad topic of statistical modelling can be divided into two separate categories: supervised learning and unsupervised learning. (This is actually a rough categorisation that fails to capture topics like semi-supervised learning and reinforcement learning, but it will do for now.) Both learning settings are important in practice, although the former is much better defined than the latter. A first course on data science naturally needs to cover both, but what are the most fundamental principles and algorithms that students would need to learn first?
Here are some factors that should be considered in answering the question:
- The expected long-term value of an idea/algorithm: Would people still be studying and using it 5-10 years from now?
- The immediate real-life practicality of an idea/algorithm: Are there existing good implementations and use cases to guide us in its use?
- The theoretical foundation of an idea/algorithm: Is it founded on solid mathematical/computational principles and techniques or is it a hack that happens to work (sometimes)?
- The extensibility of an idea/algorithm: Are there natural extensions that allow it to be applied to less common but still important special cases?
I would argue that only ideas and algorithms that pass all four filters should be considered for a 3-5 day data science course. Of course, judgement needs to be exercised in assessing the four factors (the durability or long-term value of an idea/algorithm is probably the hardest to judge correctly), and different instructors will come with different biases. What follows are my biases.
In the Supervised Learning setting, I believe linear methods and their generalisations, being the workhorses of real-life statistical practice, should form the core of the curriculum, augmented with key non-linear methods in common use as well as a short introduction to unifying concepts like probabilistic graphical models. The following diagram gives a breakdown of key problems in supervised learning and common practical algorithms for each problem. I would suggest that a good data scientist needs to be comfortable with most, if not all, of the listed algorithms and models, and that a good portion of these can be covered in a 3-5 day data science course.
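To make the "linear workhorse" point concrete, here is a minimal sketch, assuming scikit-learn and its built-in breast cancer dataset (both illustrative choices, not part of any particular curriculum), of a regularised logistic regression inside a standard scale-then-fit pipeline:

```python
# Illustrative sketch only: an L2-regularised logistic regression, the kind of
# linear workhorse a first course would centre on. Assumes scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Standardise the features, then fit the regularised linear classifier.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(C=1.0, max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```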
What to learn in a first course in the Unsupervised Learning setting is a bit trickier to determine. Common clustering and outlier-detection algorithms like k-means, hierarchical clustering, and local outlier factor obviously need to be covered. A short discourse on distance functions, including common distances on vector spaces as well as more unusual but useful distance functions on strings and sequences like Levenshtein distance, Dynamic Time Warping and Normalised Compression Distance, seems appropriate given their importance in clustering and outlier-detection algorithms. I would also suggest that data compression techniques, including the matrix factorisation techniques lying at the heart of many state-of-the-art collaborative filtering algorithms, should be covered in a first short data science course.
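As an illustration of the clustering and outlier-detection staples mentioned above, here is a minimal sketch, assuming scikit-learn and a small synthetic dataset, of k-means and local outlier factor run side by side:

```python
# Illustrative sketch only: k-means clustering plus local-outlier-factor
# detection on synthetic data. Assumes scikit-learn and NumPy.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
X = np.vstack([X, [[8, 8], [-8, 8]]])  # append two obvious outliers

# Partition the points into three clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Flag points whose local density is much lower than their neighbours'.
is_outlier = LocalOutlierFactor(n_neighbors=20).fit_predict(X) == -1
print("cluster sizes:", np.bincount(labels),
      "outliers flagged:", is_outlier.sum())
```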
To round off the above topics, a discussion of general algorithm-selection strategy (when to use what, and where) is also needed to make sure students can go on to apply the acquired techniques and algorithms with some confidence. A short discourse on theoretical foundations, for example an introduction to Bayesian probability theory and statistical learning theory, would also seem appropriate.
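On the "when to use what" point, one simple habit worth teaching is to compare candidate algorithms empirically rather than on faith. The sketch below, assuming scikit-learn and a synthetic classification dataset, pits a regularised linear model against a random forest via cross-validation:

```python
# Illustrative sketch only: cross-validated comparison of a linear model
# against a non-linear one on the same (synthetic) data. Assumes scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)
candidates = {
    "regularised logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```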
I will discuss other aspects of data science training in upcoming posts.