7.12 K-Means Example: Wine Dataset

UNDER DEVELOPMENT 20211229 We can illustrate the application of K-Means to a new dataset. The sample dataset contains the results of a chemical analysis of wines and is available from the UCI Machine Learning Repository.

First we download the data from the repository:

wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data -O wine.data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names -O wine.names

The data is in CSV format but has no header row, so we use some command line tools to add one. First we extract the variable names from the names file.

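grep -E '^[[:space:]]+[0-9][0-9]?\)' wine.names |  # numbered attribute lines, e.g. "1) Alcohol"
  cut -d')' -f2 |                 # drop the "N)" prefix
  tr 'A-Z' 'a-z' |                # convert to lowercase
  awk '{printf("%s,", $1)}' |     # keep the first word of each name, comma separated
  sed 's|,$||' |                  # remove the trailing comma
  awk '{print}' > wine.header     # make sure the line ends with a newline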
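To see what this pipeline produces without downloading anything, here is a minimal sketch run on a made-up three-line excerpt written in the style of wine.names. The file name sample.names and its contents are assumptions for illustration only, and the sketch uses paste to join the names as a shorter alternative to the printf and sed steps above.

```shell
# Work in a temporary directory so nothing in the current directory is touched.
tmp=$(mktemp -d)

# Hypothetical excerpt in the style of wine.names (not the real file).
printf ' \t1) Alcohol\n \t2) Malic acid\n\t10)Color intensity\n' > "$tmp/sample.names"

header=$(grep -E '^[[:space:]]+[0-9][0-9]?\)' "$tmp/sample.names" |
  cut -d')' -f2 |        # drop the "N)" prefix
  tr 'A-Z' 'a-z' |       # convert to lowercase
  awk '{print $1}' |     # keep only the first word of each name
  paste -sd, -)          # join the names with commas on a single line

echo "$header"
```

Only the first word of each attribute name is kept, so a name like "Malic acid" becomes malic. That keeps the header free of spaces at the cost of less descriptive column names.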

Prepend the header to the data file after removing the first column, which is the wine class variable. K-Means is unsupervised and does not use the class label.

cut -d"," -f2- wine.data | cat wine.header - > wine.csv
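A quick sanity check after this step is to confirm that the header and every data row have the same number of fields. The sketch below does this on tiny stand-in files created in a temporary directory. Their contents are made up for illustration, and with the real files one would run the awk line directly on wine.csv.

```shell
# Work in a temporary directory with miniature stand-ins for the real files.
tmp=$(mktemp -d)
echo 'alcohol,malic,ash' > "$tmp/wine.header"
printf '1,14.23,1.71,2.43\n1,13.2,1.78,2.14\n' > "$tmp/wine.data"

# Same step as above: drop the class column, then prepend the header.
cut -d"," -f2- "$tmp/wine.data" | cat "$tmp/wine.header" - > "$tmp/wine.csv"

# List the distinct field counts; a well-formed file yields a single value.
fields=$(awk -F, '{print NF}' "$tmp/wine.csv" | sort -u)
echo "$fields"
```

If more than one value is printed, the header and the data rows disagree on the number of columns, which usually means the header extraction picked up too many or too few names.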

The dataset is now in CSV format with a header row:

$ head -4 wine.csv

alcohol,malic,ash,alcalinity,magnesium,total,flavanoids,nonflavanoid,proanthocyanins,color,hue,...
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185

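Notice the differing scales of the variables: in the rows above the last column is in the thousands while nonflavanoid is below 1. K-Means is based on distances between observations, so variables with larger scales will dominate the clustering unless the data is rescaled.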
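One way to make the difference in scale concrete is to report the minimum and maximum of each column. The following is a minimal sketch using awk on a three-column sample copied from the rows shown above. The file name sample.csv is an assumption, and the real wine.csv has more columns and rows.

```shell
# Work in a temporary directory with a small sample of the data.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
alcohol,magnesium,nonflavanoid
14.23,127,.28
13.2,100,.26
13.16,101,.3
EOF

# Track the minimum and maximum of every column, then print one line per column.
ranges=$(awk -F, '
  NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; n = NF; next }
  {
    for (i = 1; i <= n; i++) {
      v = $i + 0                                  # force numeric comparison
      if (NR == 2 || v < min[i]) min[i] = v
      if (NR == 2 || v > max[i]) max[i] = v
    }
  }
  END { for (i = 1; i <= n; i++) printf "%s %g %g\n", name[i], min[i], max[i] }
' "$tmp/sample.csv")

echo "$ranges"
```

Each output line gives a column name followed by its minimum and maximum, which makes it easy to spot variables whose ranges differ by orders of magnitude.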



Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0