https://sean.lane.sh/posts/2016/05/PySpark-and-Latent-Dirichlet-Allocation/