Starting Data Science with Julia
Ku Wee Kiat, Research Computing, NUS Information Technology
This series provides an introduction to Julia for Data Science. This article covers the benefits of Julia and some resources for performing machine learning and other related tasks with Julia.
What & Why?
- high-level, general purpose programming language
- high-performance
- just-in-time (JIT) compiled, and can be as fast as C
- more focused on parallelism than Python
- mostly used for numerical analysis
- can call Python, R, C and Fortran libraries
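As a rough illustration of the parallelism point in the list above, the sketch below splits a sum across several tasks using only Julia's standard library. The function name parallel_sum_of_squares and the chunking scheme are illustrative choices made for this example, not something prescribed by Julia. Start Julia with several threads (for example, julia --threads 4) if you want the tasks to actually run in parallel.

```julia
# Sum of squares computed by several tasks, one per chunk of the input.
# `parallel_sum_of_squares` is a hypothetical helper written for this sketch.
function parallel_sum_of_squares(xs; ntasks = Threads.nthreads())
    isempty(xs) && return zero(eltype(xs))
    # Split the input into at most `ntasks` contiguous chunks.
    chunks = Iterators.partition(xs, cld(length(xs), ntasks))
    # Start one task per chunk; the scheduler spreads tasks over the available threads.
    tasks = map(chunks) do chunk
        Threads.@spawn sum(x -> x^2, chunk)
    end
    # Wait for every task and combine the partial sums.
    return sum(fetch.(tasks))
end

total = parallel_sum_of_squares(1:1000)
println(total)
```

With a single thread the code still works; it simply runs the one chunk as a single task.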
Disadvantages of Julia
- Array index starting at 1
- Not as mature as Python
- Not as many packages available
- Unpolished documentation
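For readers coming from Python, the 1-based indexing mentioned in the list above looks roughly like this. The variable names are made up for this sketch.

```julia
scores = [10, 20, 30]

first_score = scores[1]      # the first element lives at index 1, not 0
last_score  = scores[end]    # `end` stands for the last valid index

# `eachindex` avoids hard-coding the starting index in loops.
for i in eachindex(scores)
    println(i, " => ", scores[i])
end
```

Accessing scores[0] raises a BoundsError rather than wrapping around or returning the first element.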
What this means is that, for data cleaning and wrangling, you are no longer limited to single-process execution or weaker parallel implementations. You can use Julia instead and take advantage of near-native performance and better parallelisation for your data manipulation needs.
If you are a heavy user of pandas DataFrames in Python, Julia's DataFrames.jl should feel fairly familiar, even though it is not an exact copy. There is also Queryverse.jl, a set of Julia packages for Data Science. It includes Query.jl for data manipulation, VegaLite.jl for data visualisation and DataVoyager.jl for interactive exploration. It also provides methods for saving your data as CSV, Feather, Excel, SPSS, Stata, SAS and Parquet files.
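To give a feel for how close this is to pandas, here is a minimal sketch of filtering and grouping with DataFrames.jl. It assumes the DataFrames.jl package is installed; the column names and values are made up for illustration.

```julia
using DataFrames   # assumes the DataFrames.jl package is installed
using Statistics   # standard library, provides `mean`

# Made-up data for illustration only.
df = DataFrame(team = ["a", "b", "a", "b"], score = [60, 40, 80, 90])

# Keep rows with a score above 50 (similar to boolean-mask filtering in pandas).
passed = filter(:score => >(50), df)

# Mean score per team (similar to `groupby(...).mean()` in pandas).
team_means = combine(groupby(passed, :team; sort = true), :score => mean => :mean_score)

println(team_means)
```

The source => function => target form used in combine is the general pattern DataFrames.jl uses for column transformations.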
After that, you can use the machine learning libraries you are familiar with in Python (e.g. Scikit-Learn) from within Julia. Alternatively, you can use MLJ, a machine learning framework for Julia.
IBM has created a Julia package called AutoMLPipeline.jl, which makes it easy to discover optimal structures for machine learning regression and classification. It takes advantage of Julia's built-in macro programming features to symbolically process and manipulate pipeline expressions. The package treats the entire machine learning workflow as a series of steps forming a pipeline, and then treats the whole process as a pipeline optimisation problem (POP). POP requires simultaneous optimisation of the pipeline structure and parameter adaptation of its components. Do check out the official repository for more information and examples.
In subsequent articles we will be covering each step of a Data Science or Machine Learning pipeline with Julia. Also look out for Julia workshops that will be held in the future. In the meantime, you can try it out yourself on your local machine or on NUS HPC.
Using Julia on NUS HPC
There are a few versions of Julia available on HPC:
$ module avail julia
You can run Julia with Jupyter notebook on NUS HPC. Please drop an email to dataengineering@nus.edu.sg to request the guide for Interactive Julia on NUS HPC.