As an instructor in the Data Analyst Nanodegree, I know how important it is to have a learning plan when starting something new. When I first started learning about data analysis and data science three years ago, I came across the following roadmap of skills, and I couldn’t help but feel overwhelmed. Where do I start? What programming language should I learn first? And why are the names of animals included in the list of skills (I’m looking at you, Python, Pandas, and Pig)?
Source: Map of Data Science Skills to Learn
Credit: Swami Chandrasekaran
Learning data analysis shouldn’t feel so overwhelming that it discourages you from starting. So I’m here to share my advice for getting started in data analysis.
For starters, you will want to use a programming language so that you can record your work and share it with others. R is one programming language well suited to data analysis and statistics. It lets the computer do the heavy lifting of computation and visualization so you can focus on thinking about your data. To make programming in R easier, there’s RStudio, a visual interface for writing R code to crunch numbers and draw graphs. You can loosely think of RStudio as an Excel-like program: one window where you view your data, run calculations, and see your graphs. If you are looking to learn data analysis, RStudio and the R programming language are must-have tools. Above all else, they can make the process of learning data analysis easier.
Get started today!
Getting started is usually the most intimidating part. But fear not: there are many exceptional free resources, including Try R, Swirl, and Udacity’s own Data Analysis with R. All of these resources teach how to use the R programming language, and the last two will teach you statistics and data analysis in addition to R.
One encouraging thing about R, especially when you’re getting started, is that just a few commands can lead to powerful insights. Recently, I looked at some student data to see how long students spend on an education website. Here’s what I did.
    # Load the data
    minutesPerDay <- read.csv("minutes_per_day.csv", header=TRUE,
                              colClasses=c("integer", "Date", "integer"))

    # How much time do students spend on the site?
    summary(minutesPerDay$minutes_on_site)

    [Output]
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       1.00   18.00   30.00   39.99   50.00  351.00
One student spent a whopping 5 hours and 51 minutes on the site and another student spent just 1 minute on the site. You don’t need to understand all of the coding syntax such as “colClasses” or the “$” symbol yet, but you’ll pick it up as you continue learning and experimenting!
Focus on learning the process and techniques of working with data.
Every programming language has its own idiosyncrasies, which can lead to a lot of frustration when coding. It’s easy to get bogged down in the syntax of a programming language, so you should focus on learning the skills of data analysis. R lets you do this because the language is well documented and because many users have created packages that make data analysis easier. This lets you ask questions about your data and learn how to solve problems with it. The syntax will change between languages, but the concepts and ideas for working with data will not.
Once you learn how to load data and do some basic tasks in R, you can focus on learning more about data manipulation, machine learning, and data visualization. You need to learn how to gain insight from data by understanding the structure of the data set and the distributions and relationships of the variables. There are many textbooks and examples of using the R programming language in each of these domains. The R programming language also has many user-created packages, which simplify the process of working with data. Here are some recommended packages that can help you learn more about the skills for working with data.
|Skill|Recommended packages|
|---|---|
|Data Manipulation|dplyr, tidyr|
|Machine Learning|caret, randomForest, gbm, kernlab, rpart|
|Data Visualization|ggplot2, RColorBrewer, scales, ggvis|
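Before installing any of these packages, the functions built into R already cover the first step described above: looking at the structure of a data set, the distribution of a variable, and how two variables relate. The sketch below is only an illustration, not part of the original analysis. The `toy` data frame is invented so the code runs on its own, and the `student_id` column name is an assumption (the post only shows the `date` and `minutes_on_site` columns).

```r
# Toy stand-in for the student data. All values and dates are invented
# for illustration; student_id is an assumed column name.
toy <- data.frame(
  student_id = c(1L, 2L, 3L, 1L, 2L, 3L),
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                   "2020-01-02", "2020-01-02", "2020-01-02")),
  minutes_on_site = c(12L, 30L, 45L, 20L, 50L, 8L)
)

# Structure: column names, column types, and a preview of the values
str(toy)

# Distribution of one variable: numeric summary and a histogram
summary(toy$minutes_on_site)
hist(toy$minutes_on_site, main = "Minutes on site", xlab = "Minutes")

# Relationship between two variables: average minutes for each date
avg_by_date <- tapply(toy$minutes_on_site, toy$date, mean)
print(avg_by_date)
```

The last step is the same question the packages in the table answer, just asked with base R.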
Here’s a quick example using the same student data but this time using the dplyr and ggplot2 packages. These packages allow you to aggregate and visualize data more easily than if you used the base commands that come with R. I would also argue that the code is more semantic and easier to read. See if you can figure out what the code is doing.
    library(dplyr)

    # Compute the average time on the site for each day
    avgMinutesPerDay <- minutesPerDay %>%
      group_by(date) %>%
      summarise(avg_minutes = mean(minutes_on_site))

    library(ggplot2)

    # Create a graph of average minutes on the site over time
    ggplot(avgMinutesPerDay, aes(x = date, y = avg_minutes)) +
      geom_line() +
      geom_point()
The visualization reveals that on average, students spent less time on the site towards the beginning of the course. Also, some event may have happened in early November when the average time on the site hit its maximum value. These observations need to be contextualized with relevant knowledge about the course’s structure (orientation, midterms, deadlines, etc.) and with the number of students contributing to each point on the graph, but you can easily see how packages can give you the power to dive deeper into asking important questions about your data.
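One way to get the second piece of context, the number of students behind each point, is to compute a count alongside the average. This is a minimal sketch that assumes the dplyr package is installed. The `toy` data frame is invented so the example is self-contained, and `student_id` is an assumed column name, since the post does not name the first column of the real file.

```r
library(dplyr)

# Toy stand-in for the student data. All values are invented for
# illustration; student_id is an assumed column name.
toy <- data.frame(
  student_id = c(1L, 2L, 3L, 1L),
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                   "2020-01-02")),
  minutes_on_site = c(12L, 30L, 45L, 50L)
)

# For each day, compute the average minutes and the number of
# distinct students who contributed to that average
dailySummary <- toy %>%
  group_by(date) %>%
  summarise(
    avg_minutes = mean(minutes_on_site),
    n_students = n_distinct(student_id)
  )

print(dailySummary)
```

In this toy data the second day has the higher average, but it rests on a single student, which is exactly the kind of caveat a count column makes visible.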
Experiment and play with data!
Find a data set and start applying what you learn! You can grab a data set online (many governments and nonprofits publish data) or ask a co-worker or manager if they have data that they are trying to understand.
If you ever get stuck, you can refer to the documentation for R or for a user-created package. The documentation has examples that you can copy, paste, and run to figure out what the code does. If you’re still scratching your head about how to work with your data, you can turn to Cookbook for R, R-bloggers, or Stack Overflow to find curated examples, blog posts, and explanations.
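The documentation is available from inside R itself. The commands below are a short sketch of the usual ways to reach it. The function names looked up here (`summary`, `read.csv`, `mean`) and the search phrase are just examples; swap in whatever you are stuck on.

```r
# Open the help page for a function (two equivalent forms)
?summary
help("read.csv")

# Run the examples from a function's help page
example("mean")

# Search the installed documentation for a phrase
help.search("standard deviation")

# List the help topics available in a package (here, the built-in stats package)
help(package = "stats")
```

In RStudio the help pages open in the Help pane, so you can read them next to your code.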
Data analysis can seem overwhelming at first, but your journey into learning data analysis doesn’t need to be so stressful. You can get started today by learning the basics of the R programming language. Then, you can choose a skill you want to learn (summarizing data sets, correlation, or random forests). And finally, you can put your skills into practice by working with data. As you work with more data, you will come to see yourself as a proficient R programmer and data analyst.