7.6 kmeans train

20211016 The input dataset for this package is provided as a csv file with a column for each variable, such as age and income for a dataset of people or, for the iris dataset, sepal width, sepal length, petal width, and petal length. Each row of the dataset records one observation of those variables. The data is assumed to have been normalised (e.g., converted to numbers in the range -1 to 1) so that no variable dominates any other when calculating distances. Otherwise the difference in income (e.g., $50,000 and $60,000 differ by 10,000) swamps the difference in age (e.g., 50 and 60 differ by 10) when we add distances together, as we do in the k-means algorithm.
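As a rough illustration of the normalisation assumed above, the following Python sketch rescales each column linearly to the range -1 to 1. The function name scale_column and the sample values are hypothetical and are not part of the kmeans package, which expects the csv to be normalised before training.

```python
def scale_column(values, low=-1.0, high=1.0):
    """Linearly rescale a list of numbers to the range [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no distance information: map it to the midpoint.
        return [(low + high) / 2 for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

# Hypothetical sample columns on very different scales.
ages = [50, 60, 40]
incomes = [50_000, 60_000, 40_000]

print(scale_column(ages))     # [0.0, 1.0, -1.0]
print(scale_column(incomes))  # [0.0, 1.0, -1.0]
```

After rescaling, a difference of 10 years and a difference of $10,000 contribute equally to a distance, because both columns span the same range.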

The output from train is also a csv file with k rows and a header row. Each of the k rows corresponds to a “discovered” or “fit” cluster. This is the trained model. Each cluster is represented by its central point (its centroid), calculated as the “mean” or “average” of each of the variables in the dataset across all the points/people in that cluster.
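The calculation of a cluster's central point can be sketched in a few lines of Python. This is only an illustration of the per-variable mean described above; the function name centroid and the sample points are made up for the example and do not come from the package.

```python
def centroid(points):
    """Return the per-variable mean of a list of equal-length numeric rows."""
    n = len(points)
    # zip(*points) regroups the rows into columns, one per variable.
    return [sum(column) / n for column in zip(*points)]

# Three hypothetical normalised observations assigned to the same cluster.
cluster = [
    [0.25, -0.5],
    [0.5, -0.25],
    [0.75, 0.0],
]
print(centroid(cluster))  # [0.5, -0.25]
```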

We can use these centroids to place (predict) new observations (people) into one of the k clusters.
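Placing a new observation amounts to finding the nearest centroid. The sketch below assumes Euclidean distance, which is the usual choice for k-means. The function name predict and the centroid values are hypothetical, not output from the package.

```python
import math

def predict(observation, centroids):
    """Return the index of the centroid closest to the observation."""
    distances = [math.dist(observation, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids for k = 3 over two normalised variables.
centroids = [[-0.5, -0.5], [0.0, 0.5], [0.75, -0.25]]
print(predict([0.6, -0.1], centroids))  # 2
```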

The parameter k needs to be provided up front: we need to guess a suitable value for the number of clusters. There are other tools that can assist in deciding on a good value for k.

ml train kmeans [options] <k> 
     -i <file.csv> --input=<file.csv>      Load training data csv file.
     -o <file.csv> --output=<file.csv>     Save the model to csv file.
     -m <file.mp4> --movie=<file.mp4>      Save the movie file.
                   --view                  Popup a movie of the clustering.

If no csv file is given with -i (--input=), the csv data (with a header row) is read from standard input. This allows the command to be part of a pipeline of commands, with the training data piped from another operation.

The default output is a csv of the centres, with a cluster label appended to each row, and a header row with the cluster label column named label.
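To make the shape of this output concrete, here is a Python sketch that formats a list of centres as csv with a trailing label column. The variable names x and y, the centre values, and the use of 0, 1, ... as labels are assumptions for illustration only. The exact labels and number formatting produced by the command may differ.

```python
import csv
import io

def centres_to_csv(header, centres):
    """Format centres as csv text with a trailing 'label' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(list(header) + ["label"])
    for label, centre in enumerate(centres):
        # Assumption: clusters are labelled by their row position.
        writer.writerow(list(centre) + [label])
    return buffer.getvalue()

# Hypothetical centres for two variables named x and y.
print(centres_to_csv(["x", "y"], [[0.5, -0.25], [-0.5, 0.75]]), end="")
```

With these placeholder values the sketch prints a header line x,y,label followed by one line per centre.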

If -o (--output=) is provided then the model is written as a csv file to the named file.

A -v (--view) will cause a movie of the iterations of the algorithm to be displayed. Each step of the algorithm may move the centre points, perhaps only slightly, as the algorithm converges on a fit.
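What each frame of such a movie would show can be sketched as a single iteration of the standard k-means procedure: assign every point to its nearest centre, then move each centre to the mean of its points. This is a generic sketch, not the package's implementation, and the function name kmeans_step and the data are hypothetical.

```python
import math

def kmeans_step(points, centres):
    """One iteration: assign points to their nearest centre, then recompute centres."""
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
        clusters[nearest].append(p)
    new_centres = []
    for old, members in zip(centres, clusters):
        if members:
            new_centres.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # Assumption: an empty cluster keeps its previous centre.
            new_centres.append(list(old))
    return new_centres

points = [[-1.0, -1.0], [-0.5, -1.0], [0.5, 1.0], [1.0, 1.0]]
centres = [[-1.0, 0.0], [1.0, 0.0]]
for step in range(3):
    moved = kmeans_step(points, centres)
    # The largest distance any centre moved in this step.
    shift = max(math.dist(a, b) for a, b in zip(centres, moved))
    print(step, moved, round(shift, 3))
    centres = moved
```

In this small example the centres move once and then stay put, so the shift drops to zero from the second step: the algorithm has converged.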

The -m (--movie=) option saves the generated movie of the iterations of the algorithm to an mp4 file. This can be combined with -v (--view) to also display the movie.

ml train kmeans 3 -i iris.csv -o model.csv -m movie.mp4 --view

If no csv output is specified then the output is always to the terminal, irrespective of whether an mp4 is also output or whether --view is requested.

The output might look something like:

$ ml train kmeans 3 -i iris.csv

Note that the algorithm initialises its starting point (the first k centres) randomly, and so the model that is built on each run may be quite different.
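The effect of random initialisation can be illustrated with a short Python sketch. It assumes the starting centres are drawn from the observations themselves, which is one common approach. How the package actually picks its starting centres, and whether it offers a seed option, is not stated here, so the function initial_centres and its seed parameter are hypothetical.

```python
import random

def initial_centres(points, k, seed=None):
    """Pick k distinct observations at random to serve as starting centres."""
    rng = random.Random(seed)
    return rng.sample(points, k)

points = [[-1.0, -1.0], [-0.5, -1.0], [0.0, 0.0], [0.5, 1.0], [1.0, 1.0]]

# Without a seed the starting centres may differ from run to run.
print(initial_centres(points, 2))

# Fixing the seed makes the choice repeatable.
print(initial_centres(points, 2, seed=42) == initial_centres(points, 2, seed=42))  # True
```

Different starting centres can lead the iterations to settle on different final clusters, which is why repeated runs may give different models.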

Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0