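No matter where you are on your path to a career in data, it probably seems daunting to consider all the skills you still need to be recruiter-ready. Typically, data workers come from one of three backgrounds:
- no experience in either programming or math
- programming experience but little math
- math experience but little programming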
Given your starting point, what is your best path to your first data science job? What skills can you use to build your foundations in the most efficient and effective way?
That’s where we come in. It’s helpful to examine each of those three scenarios—zero experience, programming but no math, math but no programming—in terms of the building blocks you’ll need to build your ultimate data skill set.
Starting From Zero
So you want to be a data analyst? Congrats! You’ve chosen a lucrative, geographically flexible, and super-secure career in a field that’s only going to continue to blossom in the years to come.
Of course, you’ve got to do the up-front work of learning and sharpening the necessary skills before you can reap the benefits. Follow this step-by-step, from-the-ground-up guide to acquiring the tools to become an ultra-hireable data analyst.
Programming is an integral aspect of data analysis. It’s the core skill that sets data analysts apart from business analysts. You’ll need to be able to program well in one or more programming languages—start with Python or R—and to have a good grasp of the landscape of the most commonly used data science libraries and packages (such as ggplot2, reshape2, numpy, pandas, and scipy).
What good is all that programming prowess without the ability to interpret the data? An understanding of statistics, including statistical tests, distributions, and maximum likelihood estimators, is essential in data analysis.
Acquaint yourself with both descriptive and inferential statistics. The former refers to quantitative measures that describe the properties of a sample; the latter, to methods that infer properties of the larger population from that sample. Start with the basics, many of which will sound familiar from high school or college (mean, median, and mode; standard deviation and variance; hypothesis testing). On top of those you will layer more complex statistical skills: common data distributions (standard normal, exponential, Poisson, binomial, chi-squared) and tests for significance (Z-test, t-test, Mann-Whitney U, chi-squared, ANOVA).
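To make the descriptive/inferential split concrete, here is a minimal sketch using only Python’s standard library. The two samples are made-up page-load times, and the helper names `describe` and `welch_t` are ours, not part of any library. The first function summarizes a sample; the second computes Welch’s t statistic, one common way to ask whether two sample means differ by more than chance would suggest.

```python
import math
import statistics

# Hypothetical samples: page-load times (seconds) for two site versions.
group_a = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6, 2.4]
group_b = [1.9, 2.0, 2.2, 1.8, 2.1, 2.0, 2.3, 1.9]


def describe(sample):
    """Descriptive statistics: summarize the sample itself."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
        "variance": statistics.variance(sample),  # sample variance (n - 1)
        "stdev": statistics.stdev(sample),
    }


def welch_t(sample_1, sample_2):
    """Inferential statistics: Welch's t statistic for a difference in means."""
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    standard_error = math.sqrt(var_1 / len(sample_1) + var_2 / len(sample_2))
    return (mean_1 - mean_2) / standard_error


summary_a = describe(group_a)
t_stat = welch_t(group_a, group_b)
print(summary_a)
print(f"Welch t statistic: {t_stat:.2f}")
```

A large absolute t statistic suggests the gap between the two means is unlikely to be noise; turning it into a p-value requires the t distribution, which a statistics package would normally handle for you.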
Beyond descriptive and inferential stats, data analysts need to be adept at statistical experimental design. That’s the systematic process of selecting parameters in order to make results both valid and significant. For example, you’ll need to determine how many samples to collect, how different factors should be interwoven, how to choose good control and testing groups, and the like. To execute strong experimental design using tools like A/B testing and concepts like statistical power, a useful yardstick is the idea of “SMART” (Specific, Measurable, Actionable, Realistic, Timely) experiments.
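The “how many samples to collect” question can be sketched with a standard textbook approximation for comparing two conversion rates in an A/B test. The function name `samples_per_group`, the 10% and 12% rates, and the default significance and power levels are illustrative assumptions, not recommendations.

```python
import math
from statistics import NormalDist


def samples_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical experiment: detect a lift from a 10% to a 12% conversion rate.
n = samples_per_group(0.10, 0.12)
print(f"Roughly {n} users per group")
```

Notice how the required sample shrinks quickly as the expected effect grows: detecting a jump to 15% needs far fewer users than detecting a jump to 12%.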
The language of data analysts is numbers, so it follows that a strong foundation in math is an essential building block on the path to becoming a data analyst.
At a basic level, you should be comfortable with college algebra. You’ll have to translate what you once knew as “word problems” (real-world equivalent: business problems) into mathematical expressions; you’ll need to be able to manipulate algebraic expressions and solve equations; and you’ll need to be able to graph different types of functions, with a deep understanding of the relationship between a function’s graph and its equation.
Beyond that, a solid grasp of multivariable calculus and linear algebra will serve you well as a data analyst. Think: matrix manipulations, dot product, eigenvalues and eigenvectors, and multivariable derivatives.
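If those terms feel abstract, a few lines of plain Python can show what they compute. This is a teaching sketch under simplifying assumptions (small lists instead of a numerical library, a symmetric 2x2 matrix, a finite-difference gradient); all function names are our own.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matmul(A, B):
    """Matrix product: each entry is a row of A dotted with a column of B."""
    columns = list(zip(*B))
    return [[dot(row, col) for col in columns] for row in A]


def eigenvalues_symmetric_2x2(M):
    """Eigenvalues of [[a, b], [b, d]] from the characteristic polynomial."""
    (a, b), (_, d) = M
    trace, det = a + d, a * d - b * b
    root = math.sqrt(trace ** 2 - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


def gradient(f, point, h=1e-6):
    """Approximate the partial derivatives of f at a point (central differences)."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


A = [[2, 1], [1, 2]]
print(dot([1, 2, 3], [4, 5, 6]))        # 32
print(eigenvalues_symmetric_2x2(A))     # (3.0, 1.0)
print(matmul(A, [[1], [1]]))            # [[3], [3]]: [1, 1] is an eigenvector
print(gradient(lambda p: p[0] ** 2 + 3 * p[1], [1.0, 2.0]))  # close to [2, 3]
```

In day-to-day work you would reach for a library such as numpy for all of this; the point here is only to connect the vocabulary to arithmetic you can check by hand.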
Multivariable calculus and linear algebra, along with statistics, make up the basic foundation of machine learning, which enables data professionals to make predictions or calculated suggestions based on huge amounts of data. For a career as a data analyst, you won’t need to invent new machine-learning algorithms (advanced skills like that qualify you to become a data scientist), but you should know the most common of them. A few examples include principal component analysis, neural networks, support vector machines, and k-means clustering. Note that you may not need to know the theory and implementation details behind these algorithms, but you should understand the pros and cons, as well as when to (and when not to) apply them to a dataset.
In supervised learning, the “learner” (computer program) is provided with two sets of data, a training set and a test set. The computer “learns” from a set of labeled examples in the training set so that it can identify unlabeled examples in the test set accurately. The goal is for the learner to develop a rule that can identify the elements in the test set. It is supervised learning that makes it possible for your phone to recognize your voice, and your email to filter spam. Specific tools you’ll use include:
- decision trees
- Naive Bayes classification
- Ordinary Least Squares regression
- logistic regression
- neural networks
- support vector machines
- and ensemble methods.
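As one small illustration of the train-then-test idea, here is a sketch of ordinary least squares regression with a single feature, written from scratch. The study-hours data is invented, and `fit_ols` is a hypothetical helper rather than a library function.

```python
def fit_ols(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical labeled data: (hours studied, exam score).
train = [(1, 52), (2, 55), (3, 61), (4, 64), (5, 70), (6, 73)]
test = [(7, 78), (8, 82)]

# Learn the rule from the training set only.
slope, intercept = fit_ols([x for x, _ in train], [y for _, y in train])

# Check the rule on examples the learner has never seen.
predictions = [slope * x + intercept for x, _ in test]
mean_abs_error = sum(abs(p - y) for p, (_, y) in zip(predictions, test)) / len(test)
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"Mean absolute error on the test set: {mean_abs_error:.2f}")
```

Holding the test set back until the end is the important habit: error measured on the training data alone tends to flatter the model.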
Unsupervised learning is what you’ll use when faced with the challenge of discovering implicit relationships, and thus hidden structure, in a given “unlabeled” dataset. Unsupervised learning makes it possible for Netflix to recommend movies you’d enjoy, and Amazon to predict products you’ll like. Specific tools you’ll use include:
- clustering algorithms
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- and Independent Component Analysis (ICA).
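To see what “discovering hidden structure” means in practice, here is a deliberately tiny k-means clustering sketch on one-dimensional data. The spending figures and the starting centroids are made up, and `k_means_1d` is our own simplified helper; real projects would usually rely on a library implementation.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group unlabeled numbers around the nearest of k centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its closest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical monthly spend per customer, with no labels attached.
spend = [12, 15, 14, 13, 98, 102, 95, 101]
centroids, clusters = k_means_1d(spend, centroids=[0, 50])
print(centroids)  # [13.5, 99.0]
print(clusters)   # low spenders and high spenders, found without labels
```

Nobody told the algorithm there were “low” and “high” spenders; the two groups emerge from the data itself.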
Lastly, reinforcement learning applies to situations that fall between the two extremes of supervised and unsupervised, i.e., when there is some form of feedback available for each predictive step or action, but no precise label or error measure. You can apply reinforcement learning when you want to figure out how to maximize rewards, for instance in arenas like robot control, chess, backgammon, checkers, and other activities that a software agent can learn. Specific tools you’ll use include:
- genetic algorithms.
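The idea of learning from rewards rather than labels can be sketched with an epsilon-greedy “multi-armed bandit.” Note that this is a different, simpler technique than the one in the list above, chosen only because it fits in a few lines. The three reward averages, the noise level, and the function name are all assumptions for illustration.

```python
import random


def epsilon_greedy(true_means, steps=2000, epsilon=0.1, seed=0):
    """Learn which action pays best using only noisy reward feedback."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # current guess of each action's value
    pulls = [0] * len(true_means)        # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))  # explore: try a random action
        else:
            arm = max(range(len(true_means)),
                      key=lambda i: estimates[i])  # exploit: best guess so far
        reward = rng.gauss(true_means[arm], 0.1)   # feedback, but no "right answer"
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]  # running mean
    return estimates, pulls


# Hypothetical average payoffs, hidden from the agent.
estimates, pulls = epsilon_greedy([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda i: estimates[i])
print(f"Agent prefers action {best_arm}; pulls per action: {pulls}")
```

The agent is never told which action is correct. It only sees a reward after each choice and gradually shifts toward the action that pays best, while still exploring occasionally.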
Still with us? The last three abilities crucial to your development as a data analyst pertain to manipulating, displaying, and interpreting data. To transform raw material into useful, organized datasets, data wrangling (also known as “data munging”) comes into play. This is the process of collecting and cleaning data so it can be easily explored and analyzed.
You’ll need to equip yourself with knowledge of database systems (both SQL-based and NoSQL-based) that act as a central hub to store information. It’ll be useful to be familiar with relational databases such as PostgreSQL, MySQL, Netezza, and Oracle, as well as Hadoop, Spark, and MongoDB.
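You can get a feel for SQL without installing a database server, since Python ships with SQLite. The sketch below uses an in-memory database and a made-up `orders` table; the same `GROUP BY` query would look nearly identical on the larger relational systems named above.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("ben", 35.5), ("ana", 15.0), ("cy", 8.25)],
)

# Total spend per customer, biggest spender first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

The `?` placeholders matter even in toy examples: they keep values separate from the SQL text, which is the habit that prevents injection bugs later.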
Other concepts and tools essential to data wrangling include regular expressions, mathematical transformations, and Python’s built-in string methods for string manipulation. You’ll also need to know how to parse common file formats such as CSV and XML, and how to bring a right-skewed distribution closer to normal with a log-10 transformation.
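Here is a small sketch that strings several of these wrangling tools together: parsing CSV, cleaning text with string methods and a regular expression, and applying a log-10 transformation. The messy input and the `clean_income` helper are invented for the example.

```python
import csv
import io
import math
import re

# Hypothetical messy export: stray spaces, inconsistent case, currency symbols.
raw_csv = """name,income
 Alice ,"$52,000"
BOB,"$1,250,000"
carol,n/a
"""


def clean_income(text):
    """Strip everything except digits and dots; return None if nothing is left."""
    digits = re.sub(r"[^\d.]", "", text)
    return float(digits) if digits else None


records = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    income = clean_income(row["income"])
    if income is None:
        continue  # skip rows with missing income
    records.append({
        "name": row["name"].strip().title(),   # string methods tidy the name
        "income": income,
        "log10_income": math.log10(income),    # compress the long right tail
    })

for record in records:
    print(record)
```

Whether to drop, flag, or fill in a row like the one with `n/a` is a judgment call that depends on the analysis; dropping it is just the simplest option for this sketch.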
It may all sound overwhelming right now, especially if you’re brand new not only to the skills involved, but to some of the terms themselves. Remember that all of these skills are stackable: each one you master will help you build the next, and the next after that, until you’re a fully equipped data analyst ready to kick butt and take some names.
Once you’ve cleaned, organized, arranged, plied, and interpreted the data, you want to be able to illustrate your findings visually so that stakeholders, including the data-illiterate, can fully understand. You won’t get any credit for your data analysis chops if you don’t communicate your insights clearly and effectively.
It’ll be helpful to be familiar with data visualization tools like ggplot, matplotlib, seaborn, and D3.js. Of course, it’s key to be familiar not just with the tools necessary to actually display the data visually, but also with the principles underlying the visual encoding of that data. To wit, you’ll need to intimately understand the context of the business situation in order to determine how to situate your data visualization to be maximally relevant.
Bolstered by the technical knowledge of the combined skills above, you’ve got to know how to think, how to ask the right questions. You could spend the rest of your life analyzing a single dataset and visualizing your interpretation in a multitude of formats with a plethora of findings. The reality is, you’ll only ever have a limited amount of time and space to address your associates’ questions in analyzing the data at hand. Therefore, it’s important to nurture an intuition about what things are important, and what things aren’t.
Work toward developing a deep understanding of the field in which you’re working, whether it’s the stock market or consumer packaged goods. Invest the time to work through as many datasets as you can, for example by participating in Kaggle competitions, to learn how to avoid dead ends. Learn to sense the “question behind the question” in assignments, digging down, in other words, to discover the exact business issues driving the need to analyze the data.
Building on a Programming Background
Did some, or a lot, of that content overview sound familiar to you? Have you trained as a software engineer, or perhaps studied programming in college, but still lack the solid mathematical foundation required to become a data analyst?
No sweat. You’re in a great position to launch a learning journey, at the culmination of which you’ll be situated for maximum data analysis success.
Here’s what you’ll need to learn next, in order, on the road to clicking “apply” on a data analyst job opening.
- Statistics: You’ll need to be able to rigorously interpret, make inferences, and compare different types of data by applying the right approach, technique, or statistical tests to different types of distributions. Check out the above breakdown for specific tools and skills.
- Probability: In order to draw accurate conclusions, data analysts need to be able to reason about the likelihood that an event could have happened or that it will happen. Check out the above breakdown for specific tools and skills.
- Multivariable calculus/linear algebra: These advanced math skills are less important to know than statistics and probability, but will definitely be useful if you want to understand how machine learning actually works. In addition, if you envision wanting to leverage your data analyst chops into a career as a data scientist at some point, multivariable calc and linear algebra will provide the foundational knowledge to build your own algorithms.
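If you already program, probability is a good place to start because you can check your reasoning in code. As a minimal sketch, the snippet below asks how likely it is to see at least 8 heads in 10 flips of a fair coin, using the binomial distribution. The helper names are our own.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


# How surprising are 8 or more heads in 10 flips of a fair coin?
chance = prob_at_least(8, 10, 0.5)
print(f"{chance:.4f}")  # 0.0547
```

A result of about 5.5% is exactly the kind of number a statistical test turns into a decision: unusual, but not impossible, for a fair coin.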
Building on a Mathematical Background
OK, so maybe you’re a math whiz, but have no knowledge of programming. Here’s a step-by-step guide to building that programming knowledge that’s so crucial to becoming a data analyst.
- Variables, control flow, loops, functions: These are the basic building blocks of programming. Know them and love them.
- Debugging: Your code will probably not work correctly the first time around, or could break when unexpected situations occur. When that happens, you’ll need to be able to figure out what the problem is and why it’s happening. This is where debugging skills will come in handy.
- Object-oriented programming: Learn how to structure your code into object-oriented design patterns, so it can be easily reused, tested, and shared with other people.
- Data structures: For extra credit, familiarize yourself with Stacks, Queues, Lists, Arrays, Hashmaps, Priority Queues, Tries, and Graphs. There are certain situations in which one data structure will be superior to others (in terms of memory usage and runtime efficiency), and if you understand these relationships, you can optimize your program to run faster and more efficiently. That’ll impress your team, and set you apart among other data professionals.
- Algorithms: Knowing which algorithm to apply in which situation can reduce the running time of your program from a few days to a few hours, or the memory requirement from a few gigabytes to a few hundred megabytes. Work towards understanding divide and conquer (D&C) algorithms, greedy algorithms, dynamic programming, linear programming, and graph algorithms (depth-first vs. breadth-first traversal, minimum spanning trees, and shortest path between two nodes).
- Software design patterns: Want to make your code robust, reusable, and testable? Many pioneering software engineers and computer scientists have developed software design patterns to help you do so. Become comfortable with them so you can excel at your data analysis.
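Several items on the list above fit into one short sketch: a class (object-oriented programming), a dictionary and a queue (data structures), and breadth-first search (a graph algorithm). The `Graph` class and the tiny network of nodes are hypothetical and kept deliberately simple.

```python
from collections import deque


class Graph:
    """An undirected graph stored as a dictionary of neighbor lists."""

    def __init__(self):
        self.neighbors = {}

    def add_edge(self, a, b):
        self.neighbors.setdefault(a, []).append(b)
        self.neighbors.setdefault(b, []).append(a)

    def shortest_path(self, start, goal):
        """Breadth-first search: explores nodes in order of distance from start."""
        queue = deque([[start]])  # queue of partial paths, shortest first
        visited = {start}
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == goal:
                return path
            for nxt in self.neighbors.get(node, []):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(path + [nxt])
        return None  # goal is unreachable


graph = Graph()
for a, b in [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]:
    graph.add_edge(a, b)

route = graph.shortest_path("A", "D")
print(route)  # ['A', 'E', 'D']
```

Swapping the `deque` for a stack (popping from the end instead of the front) would turn this into depth-first search, which still finds a path but not necessarily the shortest one. That is the kind of trade-off the data structures and algorithms bullets are pointing at.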
The Bottom Line
Data analysis is a fast-growing field, and there are a lot of voices out there sharing what you need to learn, in what order. The variety of information can be confusing, overwhelming, and discouraging.
Know that you can rely on this breakdown as the definitive guide to what you really do need to learn in order to land that first data analyst job, along with prescriptions for where to start, depending on your specific background.
The investment in a career as a data analyst is huge, no matter if you’re just starting out or if you’re expanding on existing abilities. But the payoff, we promise, is even bigger.