Languages and Libraries for Machine Learning


In our previous post 5 Skills You Need to Become a Machine Learning Engineer, we identified the key skills you need to succeed in this field. Now, we’re going to address one of the most common questions that comes up from students interested in Machine Learning: Which programming language(s) do I need to know?

The answer may surprise you. It doesn’t really matter!

As long as you’re familiar with the Machine Learning libraries and tools available in your chosen language, the language itself isn’t as important. A variety of Machine Learning libraries are available in different programming languages. Depending on your role within a company, and the task you’re trying to accomplish, certain languages, libraries and tools can be more effective than others.


4 Ways to Pick Your First Programming Language

If you haven’t picked your first programming language, the programming world is your oyster. Yet with evangelists for every language telling you their language is the best, choosing one to start with can be incredibly overwhelming. We’ve looked at the data for the top ten programming languages in the US (based on IEEE Spectrum data) to help you pick the best language to start with based on your priorities in lifestyle, location, and career potential.

Python is a popular, well-paid language that's versatile enough to be used in many different applications, while JavaScript is used widely across the country and can be a good choice if you don't want to relocate for a job. Although some newer programming languages, such as Swift, are not included, you shouldn't discount the growth of their popularity. Career opportunities in iOS development using Swift, similar to Android development using Java, will increase as the field of mobile app development continues to expand.


How Do I Start Learning Data Analysis?

As an instructor in the Data Analyst Nanodegree, I know how important it is to have a learning plan when starting something new. When I first started learning about data analysis and data science three years ago, I came across the following roadmap of skills, and I couldn't help but feel overwhelmed. Where do I start? What programming language should I learn first? And why are the names of animals included in the list of skills (I'm looking at you, Python, Pandas, and Pig)?

Road to Data Scientist (source: Map of Data Science Skills to Learn, via Nirvacana; credit: Swami Chandrasekaran)

Learning about data analysis shouldn’t feel so overwhelming and difficult to the point of discouragement. So I’m here to share my advice for getting started in data analysis.

For starters, you will want to use a programming language so that you can record your work and share it with others. R is one programming language well-suited for data analysis and statistics. It's a language that lets the computer do the heavy lifting of computation and visualization so you can focus on thinking about your data. To make programming easier in R, there's RStudio, a visual interface for writing code to crunch numbers and draw graphs with the R programming language. You can think of RStudio as an Excel-like program. If you are looking to learn data analysis, RStudio and the R programming language are must-have tools. Above all else, they can make the process of learning data analysis easier.

Get started today!

Getting started is usually the most intimidating part. But fear not—you’re covered with many exceptional free resources, including Try R, Swirl, and Udacity’s own Data Analysis with R. All of these resources teach how to use the R programming language, and the last two will teach you statistics and data analysis in addition to R.

One encouraging thing about R, especially when you’re getting started, is that just a few commands can lead to powerful insights. Recently, I looked at some student data to see how long students spend on an education website. Here’s what I did.

# Load the data
minutesPerDay <- read.csv("minutes_per_day.csv", header=T,
                          colClasses=c("integer",
                                       "Date",
                                       "integer"))
# How much time do students spend on the site?
summary(minutesPerDay$minutes_on_site)

[Output]
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00   18.00   30.00   39.99   50.00  351.00

One student spent a whopping 5 hours and 51 minutes on the site and another student spent just 1 minute on the site. You don’t need to understand all of the coding syntax such as “colClasses” or the “$” symbol yet, but you’ll pick it up as you continue learning and experimenting!
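If you are curious anyway, here's a minimal sketch of what those two pieces do, using the same minutesPerDay data frame loaded above:

# The "$" operator pulls one column out of a data frame as a plain vector
head(minutesPerDay$minutes_on_site)

# colClasses told read.csv how to parse each column (integer, Date, integer),
# which is why the date column comes back as a real Date rather than text
class(minutesPerDay$date)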

Focus on learning the process and techniques of working with data.

Every programming language has its own idiosyncrasies, which can lead to a lot of frustration when coding. It's easy to get bogged down in the syntax of a programming language, so you should focus on learning the skills of data analysis. R lets you do this because the language is well-documented and because many users have created packages to make data analysis easier. This lets you ask questions about your data so you can learn how to solve problems with the data. The syntax will change between languages, but the concepts and ideas for working with data will not.

Once you learn how to load data and do some basic tasks in R, you can focus on learning more about data manipulation, machine learning, and data visualization. You need to learn how to gain insight from data by understanding the structure of the data set and the distributions and relationships of the variables. There are many textbooks and examples of using the R programming language in each of these domains. The R programming language also has many user-created packages, which simplify the process of working with data. Here are some recommended packages that can help you learn more about the skills for working with data.

Domain               R Packages
Data Manipulation    dplyr, tidyr
Machine Learning     caret, randomForest, gbm, kernlab, rpart
Data Visualization   ggplot2, RColorBrewer, scales, ggvis
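If you want to try any of these, they can all be installed from CRAN with a single command (a quick sketch, not part of the original post):

# Install a few of the packages above from CRAN (you only need to do this once)
install.packages(c("dplyr", "tidyr", "ggplot2", "caret"))

# Then load the ones you want to use in each session
library(dplyr)
library(ggplot2)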

Here’s a quick example using the same student data but this time using the dplyr and ggplot2 packages. These packages allow you to aggregate and visualize data more easily than if you used the base commands that come with R. I would also argue that the code is more semantic and easier to read. See if you can figure out what the code is doing.

library(dplyr)
# Compute the average time on the site for each day
avgMinutesPerDay <- minutesPerDay %>%
    group_by(date) %>%
    summarise(avg_minutes = mean(minutes_on_site))

library(ggplot2)
# Create a graph of average minutes on the site over time
qplot(data=avgMinutesPerDay, x=date, y=avg_minutes) +
    geom_line() +
    geom_point()

Avg minutes spent on a website

The visualization reveals that on average, students spent less time on the site towards the beginning of the course. Also, some event may have happened in early November when the average time on the site hit its maximum value. These observations need to be contextualized with relevant knowledge about the course’s structure (orientation, midterms, deadlines, etc.) and with the number of students contributing to each point on the graph, but you can easily see how packages can give you the power to dive deeper into asking important questions about your data.
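On that last point, here's a hedged sketch of how the dplyr summary above could be extended to also count how many students contribute to each daily average (column names taken from the earlier example):

library(dplyr)
# Compute the daily average along with the number of students behind it
dailySummary <- minutesPerDay %>%
    group_by(date) %>%
    summarise(avg_minutes = mean(minutes_on_site),
              n_students  = n())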

Experiment and play with data!

Find a data set and start applying what you learn! You can grab a data set online (many government agencies and nonprofits publish data) or ask a co-worker or manager if they have data that they are trying to understand.

If you ever get stuck, you can refer to the documentation for R or a user-created package. The documentation will have examples that you can copy, paste, and run to figure out what the code does. If you're still scratching your head about how to work with your data, you can turn to Cookbook for R, R-bloggers, or Stack Overflow to find curated examples, blog posts, and explanations.

Data analysis can seem overwhelming at first, but your journey into learning data analysis doesn’t need to be so stressful. You can get started today by learning the basics of the R programming language. Then, you can choose a skill you want to learn (summarizing data sets, correlation, or random forests). And finally, you can put your skills into practice by working with data. As you work with more data, you will come to see yourself as a proficient R programmer and data analyst.

How to Choose Between Learning Python or R First

If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. If you’re not exactly sure which to start learning first, you’re reading the right article.

When it comes to data analysis, both Python and R are simple (and free) to install and relatively easy to get started with. If you’re a newcomer to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.

Should you learn Python or R? via udacity.com

 

Luckily, you can’t really go wrong with either.

The Case for R

R has a long and trusted history and a robust supporting community in the data industry. Together, those facts mean that you can rely on online support from others in the field if you need assistance or have questions about using the language. Plus, there are plenty of publicly released packages, more than 5,000 in fact, that you can download to use in tandem with R to extend its capabilities to new heights. That makes R great for conducting complex exploratory data analysis. R also integrates well with other computer languages like C++, Java, and C.
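As one small illustration of that interoperability, here's a sketch using the widely used Rcpp package (my own choice of example; the post doesn't name a specific package, and this assumes Rcpp plus a C++ compiler are installed):

library(Rcpp)

# Compile a tiny C++ function and call it directly from R
cppFunction('
  double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); i++) total += x[i] * x[i];
    return total;
  }
')

sumSquares(c(1, 2, 3))   # returns 14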

When you need to do heavy statistical analysis or graphing, R’s your go-to. Common mathematical operations like matrix multiplication work straight out of the box, and the language’s array-oriented syntax makes it easier to translate from math to code, especially for someone with no or minimal programming background.
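For example, here's a quick sketch (with made-up example values) of the kind of matrix math that works in base R with no extra packages:

# Two small example matrices
A <- matrix(1:6, nrow = 2)          # a 2 x 3 matrix
B <- matrix(1:6, nrow = 3)          # a 3 x 2 matrix

A %*% B                             # matrix multiplication
t(A)                                # transpose
solve(matrix(c(2, 0, 0, 2), 2))     # matrix inverse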

The Case for Python

Python is a general-purpose programming language that can pretty much do anything you need it to: data munging, data engineering, data wrangling, website scraping, web app building, and more. It’s simpler to master than R if you have previously learned an object-oriented programming language like Java or C++.

In addition, because Python is an object-oriented programming language, it’s easier to write large-scale, maintainable, and robust code with it than with R. Using Python, the prototype code that you write on your own computer can be used as production code if needed.

Although Python doesn't have as comprehensive a set of packages and libraries available to data professionals as R, the combination of Python with tools like pandas, NumPy, SciPy, scikit-learn, and seaborn will get you pretty darn close. The language is also slowly becoming more useful for tasks like machine learning and basic-to-intermediate statistical work (formerly just R's domain).

Choosing Between Python and R

Here are a few guidelines for determining whether to begin your data language studies with Python or with R.

Personal preference

Choose the language to begin with based on your personal preference: whichever comes more naturally to you and is easier to grasp from the get-go. To give you a sense of what to expect, mathematicians and statisticians tend to prefer R, whereas computer scientists and software engineers tend to favor Python. The best news is that once you learn to program well in one language, it's pretty easy to pick up others.

Project selection

You can also make the Python vs. R call based on a project you know you’ll be working on in your data studies. If you’re working with data that’s been gathered and cleaned for you, and your main focus is the analysis of that data, go with R. If you have to work with dirty or jumbled data, or to scrape data from websites, files, or other data sources, you should start learning, or advancing your studies in, Python.

Collaboration

Once you have the basics of data analysis under your belt, another criterion for evaluating which language to further your skills in is what language your teammates are using. If you’re all literally speaking the same language, it’ll make collaboration—as well as learning from each other—much easier.

Job market

Job postings calling for Python skills and those calling for R skills have both increased over the last few years.

Graph via r4stats

That said, as you can see, Python has started to overtake R in data jobs. Thanks to the expansion of the Python ecosystem, tools for nearly every aspect of computing are readily available in the language. In addition, since Python can be used to develop web applications, it lets companies take advantage of crossover between their Python developers and their data science teams. That's a major boon given the shortage of data experts in the current marketplace.

The Bottom Line

In general, you can't go wrong whether you choose to learn Python first or R first for data analysis. Each language has its pros and cons for different scenarios and tasks. In addition, there are libraries for using Python from R, and vice versa, so learning one won't preclude you from being able to learn and use the other. Perhaps the best solution is to use the above guidelines to decide which of the two languages to begin with, then fortify your skill set by learning the other one.
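As a quick sketch of that interoperability from the R side (assuming the reticulate package and a Python installation with NumPy, neither of which the post names explicitly):

library(reticulate)

# Call Python's NumPy from inside an R session
np <- import("numpy")
np$mean(c(2, 4, 6))   # returns 4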

Is your brain warmed up enough yet? Get to it!

Data Visualization Case Study: How Bad Is Your Commute?


Hi, I’m Chris and I teach Data Analysis with R!

Data Analysis with R is a mouthful. Whether you know a lot about analyzing data or are looking to share numbers in a presentation, I want to show you how easy it can be to make sense of data and communicate it to an audience. Let’s start with a question.

How long does it take you to get to work?

My Udacious colleagues travel all across the Bay Area, with commute times as long as 2 hours and as short as 2 minutes. With your commute in mind, can you think about which cities have some of the worst commutes in the world? Go ahead, jot down some cities.

IBM conducted a Commuter Pain Survey (2010) and determined which cities had the most commuting woes, such as terrible traffic and crowded buses. Researchers ranked 20 cities on a "pain index" based on a range of factors. One way of making sense of this data is by looking at a table like the one below. How does your list of cities compare?

City            Pain Index
Amsterdam       25
Beijing         99
Berlin          24
Buenos Aires    50
Houston         17
Johannesburg    97
London          36
Los Angeles     25
Madrid          48
Melbourne       17
Mexico City     99
Milan           52
Montreal        23
Moscow          84
New Delhi       81
New York        19
Paris           36
Sao Paulo       75
Stockholm       15
Toronto         32

While this table delivers the data, our brains digest data visualizations, like bar charts and scatter plots, much faster than tables of numbers. If you have ever scanned an Excel spreadsheet, you know what I mean. Here is the same data in a visualization.

The World's Worst Commutes
Made in R with colorblind-safe colors. I actually like this better.
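If you're wondering how a chart like that might be put together, here's a rough sketch in ggplot2 using a few of the cities from the table (my own illustration, not the code behind the original figure):

library(ggplot2)

# A handful of the cities and pain index values from the table above
commutePain <- data.frame(
    city       = c("Beijing", "Mexico City", "Johannesburg", "Moscow", "New Delhi"),
    pain_index = c(99, 99, 97, 84, 81)
)

# Reorder the cities by pain index so the worst commutes end up at the top
ggplot(commutePain, aes(x = reorder(city, pain_index), y = pain_index)) +
    geom_col() +
    coord_flip() +
    labs(x = "City", y = "Commuter Pain Index")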

By reordering the data, the visualization draws the eye towards the worst cities at the top of the diagram and the less “painful” cities towards the bottom of the chart. Additionally, color serves to draw the eye to particular cities: Beijing, Mexico City, and Berlin. Color might seem like an odd addition to the bar chart, so let me connect it to another visualization.

 

The chart above shows the percentage of drivers who would spend more time working if their commute times were considerably shorter. Notice that even though Berlin's pain index is about a quarter of Beijing's and Mexico City's (from the first diagram), the same percentage of drivers would prefer to work. Now that's dedication! If you want to dedicate time to creating visualizations like this one, then join me and members of Facebook's Data Science Team in Data Analysis with R. I know commuting can be a pain, but making sense of data should be easy.

Check out the course

Chris Saden,
Course Instructor, Data Analysis with R

Facebook Meme Hunting with Data Analysis with R

Exploratory Data Analysis (EDA) is an approach to data analysis that focuses on summarizing and visualizing the important characteristics of a data set. It can be used to get a quick, basic understanding of the data.

We can explore and visualize interesting questions such as, “When do memes pop up in social networks?” Below, Facebook data scientist Lada Adamic explains how she uses EDA techniques to follow a meme’s Facebook presence over time.

As Lada explains, a meme is an idea that replicates itself. In a social network, you may see a meme suggesting that you repost it to all of your friends.

In this example, Lada is interested in the Moneybags meme, which has popped up on Facebook regularly over the years, and is specifically adapted to Facebook by suggesting that readers copy and paste to share with their friends.

Lada wants to know how this Facebook-specific meme keeps recurring over time. A quick glance at a plot of the meme’s occurrence over time shows that it seems to disappear in between spikes of activity.


Check out the full lesson here! Lada uses some EDA techniques to see what is happening: plotting meme occurrences over time on a log scale instead of a linear scale reveals that the meme does not disappear entirely between spikes. Instead, it persists in low numbers over time before being replicated in variations and becoming popular once again.
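As a rough sketch of that technique (using a small made-up data set standing in for the real Facebook counts, which aren't included in the post):

library(ggplot2)

# Toy monthly meme counts: big spikes separated by low-level activity
memeCounts <- data.frame(
    date        = seq(as.Date("2010-01-01"), by = "month", length.out = 12),
    occurrences = c(5, 3, 4, 2000, 150, 12, 6, 4, 9000, 400, 30, 10)
)

# A log scale reveals the low counts that a linear scale flattens to zero
ggplot(memeCounts, aes(x = date, y = occurrences)) +
    geom_line() +
    scale_y_log10()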

For more on data investigation, check out Data Analysis with R, a self-paced course where you’ll learn with the Facebook data science team how to investigate, visualize and summarize data using R.

Check out the course

Rockin’ Data Visualization Project!

Are top musicians born or are they a product of large cities? Data Analysis with R student, Stefan Z., decided to address this question and many others while exploring the geography of American music.

First, he collected 5,280 of the "hottest" songs using The Echo Nest's API, or application programming interface. If you're not familiar with APIs, think of them as a way of requesting information from a service. Next, he added location data for each artist's birthplace using the Google Maps API.

From there, Stefan combined his song and artist data with US census data, and finally, he used MongoDB's geospatial queries to find nearby artists and identify clusters of top musicians.

Clusters of top musicians

Stefan didn't stop there. He researched how to create maps using a blog post and this Stack Overflow question. With a little more code, he created this map.

Birthplace of top musicians alongside cities with large populations

Red dots represent cities with populations above 500,000, and yellow dots represent the birthplace of a top musician. What observations do you have about the map?

You can check whether your observations match up with Stefan's, and see the rest of his work, in his full Geography of American Music report.

Want to learn by doing? Exciting projects like this one are within your reach! You can visualize data and learn the building blocks of code with the R programming language in Data Analysis with R.

Chris Saden, Course Instructor, Data Analysis with R