
Commit 8c4f70f0 by Jorge Suárez de Lis

Update README.md

parent 04236a5d
@@ -10,56 +10,61 @@ The synthetic database presented here can be used to test the ability of the alg
The fact that the data is simulated provides detailed knowledge about
the structure underlying the data, enabling a more thorough evaluation of the
results. This is particularly important in online clustering for evaluating not only the
final result provided by the algorithm, but also the
intermediate results.
By having detailed knowledge about the models that generated the data
it is possible to accurately assess the performance of the algorithms
through all the intermediate states.
The database is composed of several data sets with concept drift,
and it also contains information about the temporal
evolution of the models that generated the data.
The data sets have been generated by Gaussian distributions
whose mean and/or covariance change over time. In the database both the
simulated data and the Gaussians that generated it are provided,
hence enabling an accurate evaluation of the partitions through time.
Naming Convention
-------------
The data sets are named using the following naming convention: first
the number of clusters followed by the letter `C`, then the number of dimensions of the data
set followed by the letter `D`, then the number of samples (`k` denotes thousands), and finally
a word roughly describing the clusters' movement.
For example, `3C2D2400Spiral` is a data set with 3 clusters in 2 dimensions and 2400 samples whose clusters follow spiral-like movements.
| Data set name | Short name |
|-----------------|------------|
| `1C2D1kLinear` | `a` |
| `4C2D800Linear` | `b` |
| `4C2D3200Linear`| `c` |
| `3C2D2400Spiral`| `d` |
| `4C3D20kLinear` | `e` |
| `5C5D1kLinear` | `f` |
| `2C3D4kHelix` | `g` |
| `2C2D200kHelix` | `h` |
| `4C2D4kStatic` | `i` |
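
The naming convention above can be decoded mechanically. The following is a minimal Python sketch, not part of the database itself: the `DataSetName` type, the `parse_data_set_name` function, and the regular expression are illustrative assumptions derived only from the convention described in this section.

```python
import re
from typing import NamedTuple


class DataSetName(NamedTuple):
    """Fields encoded in a data set name (hypothetical helper type)."""
    clusters: int
    dimensions: int
    samples: int
    movement: str


# <clusters>C<dimensions>D<samples>[k]<movement>, e.g. 3C2D2400Spiral
_NAME_RE = re.compile(r"^(\d+)C(\d+)D(\d+)(k?)([A-Za-z]+)$")


def parse_data_set_name(name: str) -> DataSetName:
    """Split a data set name into its parts; a trailing 'k' means thousands."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid data set name: {name!r}")
    clusters, dimensions, samples, kilo, movement = match.groups()
    sample_count = int(samples) * (1000 if kilo else 1)
    return DataSetName(int(clusters), int(dimensions), sample_count, movement)
```

For instance, `parse_data_set_name("2C2D200kHelix")` yields 2 clusters, 2 dimensions, 200000 samples, and the movement `Helix`.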
Data Format
-------------
Each data set is made up of **three files**:
- `SamplesFile_{short name}_{long name}.csv`
- `MeanFile_{short name}_{long name}.csv`
- `VarsFile_{short name}_{long name}.csv`
In the first file, each row contains one sample of the data set, and the last column holds the cluster number.
In the other two files, the corresponding rows contain the mean and the covariance matrix of the Gaussian that generated that sample.
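
As an illustration of the layout described above, the sketch below builds the three file names for a data set and reads them with Python's standard `csv` module. It rests on several assumptions that the README does not state: the files are comma-separated, have no header row, and contain only numeric values. The function names `data_set_paths`, `read_rows`, and `load_data_set` are hypothetical, and the rows of the mean and covariance files are returned as flat lists of floats without interpreting how the covariance matrix is laid out.

```python
import csv
from pathlib import Path


def data_set_paths(directory, short_name, long_name):
    """Return the paths of the three files that make up one data set."""
    base = Path(directory)
    suffix = f"{short_name}_{long_name}.csv"
    return {kind: base / f"{kind}File_{suffix}" for kind in ("Samples", "Mean", "Vars")}


def read_rows(path):
    """Read a CSV file as rows of floats (assumes no header, comma-separated)."""
    with open(path, newline="") as handle:
        return [[float(value) for value in row] for row in csv.reader(handle) if row]


def load_data_set(directory, short_name, long_name):
    """Load samples, cluster labels, means, and covariance rows of a data set."""
    paths = data_set_paths(directory, short_name, long_name)
    sample_rows = read_rows(paths["Samples"])
    points = [row[:-1] for row in sample_rows]       # all columns but the last
    labels = [int(row[-1]) for row in sample_rows]   # last column: cluster number
    means = read_rows(paths["Mean"])
    covariances = read_rows(paths["Vars"])           # layout left uninterpreted
    return points, labels, means, covariances
```

With the files in a local folder, `load_data_set("data", "d", "3C2D2400Spiral")` would read `SamplesFile_d_3C2D2400Spiral.csv` and its two companion files; the folder name `data` is only an example.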
License
-------
The database is available under the Creative Commons Attribution-ShareAlike 4.0 International license.
We encourage other researchers to evaluate their own EC algorithms on it.
> Maintained by David Gonzalez Marquez