The synthetic database presented here can be used to test the ability of the algorithms to track concept drift.

The fact that the data is simulated provides detailed knowledge about the structure underlying it, enabling a more thorough evaluation of the results. This is particularly important in online clustering, where it allows evaluating not only the final result produced by the algorithm but also the intermediate results. With detailed knowledge of the models that generated the data, it is possible to accurately assess the performance of the algorithms through all the intermediate states.

The database is composed of several data sets with concept drift and also contains information about the temporal evolution of the models that generated the data.

The data sets have been generated by Gaussian distributions whose mean and/or covariance change over time. The database provides both the simulated data and the Gaussians that generated it, thus enabling an accurate evaluation of the partitions through time.
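To illustrate the kind of process behind these data sets, the sketch below draws samples from a 2-D Gaussian whose mean drifts linearly over time. This is a minimal illustration, not the code that generated the database; the drift rate, dimensionality, and (fixed, diagonal) covariance are made-up values.

```python
import random

random.seed(0)

n_steps = 1000
samples, means = [], []
for t in range(n_steps):
    # Hypothetical linear drift: the cluster mean moves along a straight line.
    mean = (0.01 * t, 0.005 * t)
    std = (1.0, 1.0)  # standard deviations; a diagonal covariance kept fixed here
    samples.append((random.gauss(mean[0], std[0]),
                    random.gauss(mean[1], std[1])))
    means.append(mean)
```

Because the generating mean at every step is stored alongside each sample, the partition produced by an algorithm at any intermediate step can be compared against the true model at that step.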

Each data set is made up of three files. The first file contains one sample per row, with the last column indicating the cluster number. The other two files contain, in the corresponding rows, the mean and the covariance matrix that generated that sample.

The database has been made available under a Creative Commons license, and we encourage other researchers to evaluate their own EC algorithms on it.

Naming Convention

-------------

The data sets are named using the following convention: first the number of clusters followed by the letter `C`, then the number of dimensions of the data set followed by the letter `D`, then the number of samples (`k` is used for thousands), and finally a word roughly describing the clusters' movement.

For example, `3C2D2400Spiral` is a data set with 3 clusters in 2 dimensions and 2400 samples, where spiral-like movements are present.
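Since all names follow this pattern, they can be parsed mechanically. The helper below is an illustrative sketch (the function name and regular expression are ours, not part of the database):

```python
import re

def parse_dataset_name(name):
    """Split a name like '3C2D2400Spiral' into (clusters, dims, samples, movement)."""
    m = re.fullmatch(r"(\d+)C(\d+)D(\d+)(k?)(\w+)", name)
    clusters, dims, n, kilo, movement = m.groups()
    n_samples = int(n) * (1000 if kilo else 1)  # 'k' means thousands of samples
    return int(clusters), int(dims), n_samples, movement
```

For instance, `parse_dataset_name("1C2D1kLinear")` yields 1 cluster, 2 dimensions, 1000 samples, and a `Linear` movement.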


| Data set name | Short name |

|-----------------|------------|

| `1C2D1kLinear` | `a` |

| `4C2D800Linear` | `b` |

| `4C2D3200Linear`| `c` |

| `3C2D2400Spiral`| `d` |

| `4C3D20kLinear` | `e` |

| `5C5D1kLinear` | `f` |

| `2C3D4kHelix` | `g` |

| `2C2D200kHelix` | `h` |

| `4C2D4kStatic` | `i` |


Data Format

-------------

Each data set is made up of **three files**:

- `SamplesFile_{short name}_{long name}.csv`
- `MeanFile_{short name}_{long name}.csv`
- `VarsFile_{short name}_{long name}.csv`

The first file contains one sample per row, with the last column indicating the cluster number. The other two files contain, in the corresponding rows, the mean and the covariance matrix that generated that sample.
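Reading a samples file therefore amounts to splitting off the last CSV column as the cluster label. The sketch below demonstrates this on a tiny two-row file it writes itself; the file contents and the temporary path are made up for illustration, following the naming pattern above.

```python
import csv
import os
import tempfile

# Write a tiny illustrative SamplesFile: two 2-D samples, last column = cluster.
tmp = tempfile.mkdtemp()
samples_path = os.path.join(tmp, "SamplesFile_d_3C2D2400Spiral.csv")
with open(samples_path, "w", newline="") as f:
    csv.writer(f).writerows([[0.1, 0.2, 1], [1.5, -0.3, 2]])

# Load it back: every column except the last is a coordinate of the sample.
with open(samples_path) as f:
    rows = [list(map(float, r)) for r in csv.reader(f)]
features = [r[:-1] for r in rows]    # sample coordinates
labels = [int(r[-1]) for r in rows]  # last column is the cluster number
```

The `MeanFile` and `VarsFile` can be read the same way; row *i* of each holds the mean and (flattened) covariance of the Gaussian that generated sample *i*.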

License

-------

The database is available under Creative Commons Attribution-ShareAlike 4.0 International license.

We encourage other researchers to evaluate their own EC algorithms on it.