7.7 kmeans predict

20211015 Having performed a cluster analysis, we have effectively fitted a model to the data. The model can now be used to “predict,” or in our case assign, each point to a cluster. The predict command labels each point in a supplied dataset (a csv file) based on the “model,” which is also saved as a csv file.

ml predict kmeans [options] <csvfile>
     -m <model.csv>  --model=<model.csv>   Read model from file or else STDIN.
     -o <file.csv>   --output=<file.csv>   Save the output predictions to file.

If no model file is supplied (--model=model.csv, containing a header row, the centres representing the model, and a label for each centre) then the model is read from standard input. This allows the command to be part of a pipeline, with the model piped from the train command. The cluster label is assumed to be in a column named label and the remaining columns are the centres.
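
To make the model layout concrete, here is a minimal Python sketch of how such a file could be parsed. It is not the code used by ml. The function names read_model and open_model and the centre values are hypothetical; the only details taken from the text above are the header row, the column named label, and the fallback to standard input.

```python
import csv
import io
import sys


def open_model(path=None):
    """Return the named model file, or standard input when no path is given."""
    return open(path, newline="") if path else sys.stdin


def read_model(stream):
    """Parse a model csv: one row per cluster, a `label` column plus centre columns."""
    reader = csv.DictReader(stream)
    # Every column other than `label` is treated as a centre coordinate.
    features = [name for name in reader.fieldnames if name != "label"]
    centres = {}
    for row in reader:
        centres[row["label"]] = [float(row[name]) for name in features]
    return features, centres


# Hypothetical model text standing in for model.csv or piped input.
example = io.StringIO(
    "sepal_length,sepal_width,label\n"
    "5.0,3.4,1\n"
    "6.3,2.9,2\n"
)
features, centres = read_model(example)
print(features)
print(centres)
```

Calling read_model(open_model()) would read the same layout from standard input, which is the situation that arises when the model is piped in.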

The output is in csv format with a header row. The last column, named label, identifies the nearest centre to each point.

$ ml predict kmeans iris.csv -m model.csv
sepal_length,sepal_width,petal_length,petal_width,label
5.1,3.5,1.4,0.2,1
4.9,3.0,1.4,0.2,1
4.7,3.2,1.3,0.2,1
4.6,3.1,1.5,0.2,1
...
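
The assignment of each point to its nearest centre can be sketched in a few lines of Python. This is an illustration only: it assumes Euclidean distance, which is the usual choice for k-means but is not stated above, and the centres are made-up values rather than output from ml. The function name nearest_label is likewise hypothetical.

```python
import math


def nearest_label(point, centres):
    """Return the label of the centre closest to `point` (Euclidean distance assumed)."""
    return min(centres, key=lambda label: math.dist(point, centres[label]))


# Hypothetical centres keyed by cluster label; not values produced by ml.
centres = {
    "1": [5.0, 3.4, 1.5, 0.2],
    "2": [5.9, 2.8, 4.4, 1.4],
    "3": [6.9, 3.1, 5.7, 2.1],
}

# The first two rows of the sample output above, without their labels.
points = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
]
labels = [nearest_label(p, centres) for p in points]
print(labels)
```

With these illustrative centres both points fall into cluster 1, because the first centre is much closer to them than the other two.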


Copyright © 1995-2021 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0