Airflow Kubernetes Executors on Google Kubernetes Engine

Introduction: In this post, I’ll document the use of the Kubernetes Executor on a relatively large Airflow cluster (over 200 data pipelines and growing). The Kubernetes Executor in my setup solved/improved...

Big Data Exploration With Dask On Kubernetes

Dask is a parallel analytical computing library that implements much of the pandas API, built to aid online (as opposed to batch) “big data” analytics. When a dataset is...

Word Embeddings with Word2Vec

Word embeddings are a type of word representation that encodes general semantic relationships. As most machine learning techniques do not accept text as direct input, text data must be transformed into...

Explain PCA and 2 ways to calculate it

There are lots of resources about Principal Component Analysis (PCA), yet it is one of those abstract topics that many ML engineers fail to fully understand. One perhaps is able to use...

Exploratory data analysis on NSW Roads Fines

Data exploration of the Australia NSW Fines dataset: the dataset describes the fines issued in NSW, Australia, between January 2012 and November 2017. The dataset is available to...

Classifying Caltech 101 categories of images

This is a short article about how to build and train VGG19 on an image classification task. The main purpose of this mini project is to walk through all verbose part...

Explain L1 and L2 regularisation in Machine Learning

What is regularisation? When training a machine learning model, there is typically an objective function that needs to be optimised. This objective function is often referred to as loss...

Numpy reshape and transpose

Almost everyone who has worked with NumPy must have worked with multi-dimensional arrays or even higher-dimensional tensors. The two methods reshape and transpose are inevitably used to manipulate the...

Starting a development blog

Why do I blog? Blogging is time-consuming and there is no guarantee I can stay motivated to produce new content. However, I find blogging is the missing manual I...