In our previous post, “5 Skills You Need to Become a Machine Learning Engineer,” we identified the key skills you need to succeed in this field. Now, we’re going to address one of the most common questions we hear from students interested in Machine Learning: Which programming language(s) do I need to know?

The answer may surprise you. It doesn’t really matter!

As long as you’re familiar with the Machine Learning libraries and tools available in your chosen language, the language itself isn’t as important. A variety of Machine Learning libraries are available in different programming languages. Depending on your role within a company, and the task you’re trying to accomplish, certain languages, libraries and tools can be more effective than others.

R

R is a purpose-built language for statistical computing, and is a clear winner for large-scale data mining, visualization and reporting. You have easy access to a huge collection of packages (through the CRAN repository) that enable you to apply almost all kinds of Machine Learning algorithms, statistical tests and analysis procedures. The language itself has an elegant, albeit esoteric, syntax for expressing relationships, transforming data and performing parallelized operations.

A recent poll conducted by KDnuggets found R to be the most popular language for analytics, data mining and data science tasks in 2015, although Python has been gaining ground over the past few years.

Figure: KDnuggets 2015 poll: Primary programming language for Analytics, Data Mining, Data Science tasks

MATLAB

MATLAB is very popular in academia because of its ability to operationalize complex mathematical expressions, its rich support for algebra, calculus and symbolic computation, and its large collection of toolboxes for disciplines ranging from digital signal processing to computational biology. It is often used for prototyping new Machine Learning algorithms and, in certain cases, for producing complete solutions. It does carry a hefty license fee for commercial use, but can still be worthwhile because it can drastically reduce research and development effort. Octave is a free alternative to MATLAB with an almost identical syntax, but it has only a limited set of toolboxes and a less mature IDE.

Python

Even though Python is a more general-purpose programming and scripting language, it is gaining popularity among data scientists and Machine Learning engineers. Unlike R or MATLAB, data processing and scientific computing idioms are not built into the language itself, but libraries like NumPy, SciPy and pandas offer equivalent functionality in an arguably more approachable syntax.
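To give a feel for what those idioms look like, here is a minimal sketch using NumPy and pandas. The measurements and the variable names (heights_cm, weights_kg, bmi) are made up purely for illustration; only the library calls themselves are real.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for this example.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
weights_kg = np.array([50.0, 60.0, 70.0, 80.0])

# NumPy applies arithmetic element-wise to whole arrays, so no explicit loop is needed.
bmi = weights_kg / (heights_cm / 100.0) ** 2

# pandas adds labeled columns, filtering and summary statistics on top of the arrays.
df = pd.DataFrame({"height_cm": heights_cm, "weight_kg": weights_kg, "bmi": bmi})
mean_bmi = df["bmi"].mean()
tall = df[df["height_cm"] > 165]  # boolean mask selects matching rows

print(df)
print("Mean BMI:", round(mean_bmi, 2))
print("Rows taller than 165 cm:", len(tall))
```

If you have used R data frames or MATLAB matrices, the whole-array arithmetic and the boolean-mask filtering should feel familiar.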

Specialized Machine Learning libraries such as scikit-learn, Theano and TensorFlow give you the ability to train a variety of Machine Learning models, potentially using distributed computing infrastructure. Most of the performance-critical code for these libraries is still typically written in C/C++ or even Fortran, with the Python packages serving as wrappers or APIs (the same is true with many R packages).
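As a rough illustration of how little code it takes to train a model, here is a sketch using scikit-learn and its bundled iris dataset. The choice of logistic regression, the 25% test split and the random seed are our own assumptions for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple classifier; max_iter is raised so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() reports the fraction of held-out samples that were classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

The Python you write here is only the thin layer on top; the heavy numerical work happens in the compiled code underneath.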

But the biggest advantage is that the Python ecosystem makes it really easy to put together a complex end-to-end product or service, such as a web application using Django or Flask, or a desktop application using PyQt, or even an autonomous robotic agent using ROS.

This versatility is the reason we primarily use Python in the Machine Learning Engineer Nanodegree program!

Java

Java is many software engineers’ language of choice because of its clean and consistent implementation of object-oriented programming, and its platform independence through the JVM. It sacrifices brevity and flexibility for clarity and reliability, which makes it popular for implementing critical enterprise software systems. In order to maintain that same level of reliability and to avoid writing messy interfaces, companies that have been using Java may prefer to stick to it for their Machine Learning needs.

Besides libraries and tools that are useful for analysis and prototyping (e.g. Weka), there are some great options for building large-scale distributed learning systems in Java, such as Spark+MLlib, Mahout, H2O and Deeplearning4j. These libraries/frameworks play well with industry-standard data processing and storage systems such as Hadoop/HDFS, making them easier to integrate.

C/C++

C/C++ is ideal for low-level software such as operating system components and networking protocols, where computational speed and memory efficiency are critical. For these same reasons, it is also a popular choice for implementing the guts of Machine Learning procedures. However, its lack of idiomatic abstractions for data processing and the added overhead of manual memory management can make it unsuitable for beginners, and burdensome for developing complete end-to-end systems.

In the case of embedded systems such as smart cars, devices and sensors, it may be necessary to use C/C++. In other situations, it might be a matter of convenience due to existing infrastructure and application-specific code. In either case, there is no dearth of Machine Learning libraries available in C/C++, e.g. LIBSVM, Shark and mlpack.

Enterprise Solutions

Besides these languages and libraries, there are several commercial products used for statistical modeling and business analytics that apply Machine Learning models within a more managed data processing environment. These products, including RapidMiner, IBM SPSS, SAS+JMP and Stata, aim to provide a reliable, end-to-end solution for data analysis, and often expose a programmable API and/or scripting syntax.

Another recent development in this space is the proliferation of cloud-based Machine-Learning-as-a-Service platforms, such as Amazon Machine Learning, Google Prediction, DataRobot, IBM Watson and Microsoft Azure Machine Learning. They help you scale your learning solutions to process large amounts of data and quickly experiment with different models. As long as you have a solid foundation in Machine Learning, working with a new product or platform is just like learning to use a new tool.

Pro Tip: One important consideration when choosing a language/library is the balance between execution time and development time. An extremely fast learning pipeline that churns data in minutes may be worthless if it takes months to develop. It is more important to be able to build and test prototypes quickly, because your first try is almost guaranteed to fail.

This is why most companies look for Machine Learning engineers who are effective and experienced at using whatever tools/languages/libraries they are comfortable with. It is common practice to prototype algorithms in high-level languages like Python or R, and then port solutions over to Java or C/C++ for production, if needed.
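To make the point about prototyping speed concrete, here is a sketch of how a quick comparison of a few candidate models might look in Python with scikit-learn. The dataset (scikit-learn's bundled wine data), the three models and the 5-fold setting are all assumptions picked for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A small example dataset that ships with scikit-learn.
X, y = load_wine(return_X_y=True)

# Candidate models to try; swapping one in or out is a one-line change.
candidates = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
}

# Estimate each model's accuracy with 5-fold cross-validation.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.2f}")

# Pick the candidate with the highest mean cross-validated accuracy.
best = max(scores, key=scores.get)
print("Best candidate in this quick experiment:", best)
```

An experiment like this takes minutes to write and run, which is exactly what you want when your first few ideas are likely to be thrown away. Once a direction looks promising, that is the moment to worry about porting it to something faster.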

Further reading:

Best Programming Language for Machine Learning – Jason Brownlee, Machine Learning Mastery

Kagglers’ Favourite Tools – Ben Hamner, Kaggle (detailed forum post)

Python, Machine Learning, and Language Wars – Sebastian Raschka, Michigan State University

Primary programming language for Analytics, Data Mining, Data Science tasks – Gregory Piatetsky, KDnuggets