Source Code and Datasets for submission "Toward Random Walk Based Clustering of Variable-Order Networks"
Note: This folder does not contain a single main script for all experiments; each experiment has its own script, as explained below.
The experiment scripts use some third-party code: the files BuildRulesFast.py and BuildNetwork.py generate the relevant subsequences of an input dataset and the corresponding Von networks. These files are minor modifications of the ones available at https://github.com/xyjprc/hon (last checked June 2021).
- Python, version >=3 (experiments made with version 3.6.9)
- HeapDict Python library (command pip install HeapDict) (experiments made with version 1.0.1)
- Infomap. Users must have a console command infomap (experiments made with version 1.3.0).
- LFR Benchmark. Source code is given in folder 'LFRBenchmark'. Also requires gcc (experiments made with version 7.5.0).
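The requirements above can be checked before launching any experiment. The following is a minimal sketch, not part of the submitted scripts: the function name check_requirements is hypothetical, and the import name heapdict is assumed from the HeapDict package. It only tests that each tool is present, not which version is installed.

```python
import importlib.util
import shutil
import sys


def check_requirements():
    """Hypothetical helper: report which prerequisites appear to be available."""
    return {
        # Python 3 or later is required.
        "python>=3": sys.version_info[0] >= 3,
        # Assumption: the HeapDict package is importable as 'heapdict'.
        "heapdict": importlib.util.find_spec("heapdict") is not None,
        # Infomap must be reachable as a console command named 'infomap'.
        "infomap": shutil.which("infomap") is not None,
        # gcc is needed to compile the LFR benchmark.
        "gcc": shutil.which("gcc") is not None,
    }


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(name, "OK" if found else "MISSING")
```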
To run the test experiment reported in Section III, first go to the folder 'LFRBenchmark' and run the command make
See the ReadMe.txt in the folder 'LFRBenchmark' for more details.
We corrected a small bug that occurs when compiling the program with gcc version >=7.5.0 (it is unrelated to the network generation itself). The original code is available at https://sites.google.com/site/andrealancichinetti/files/binary_networks.tar.gz (last checked June 2021).
The three datasets used in the paper are available here. An explanation of each dataset is given in Section VI of the paper.
- maritime_sequences.csv: Maritime sequences dataset (default dataset used in the scripts below).
- 2011Q1_SEQ.zip: Airport sequences dataset (compressed).
- trajectories_PoliceStation.zip: Taxis sequences dataset (compressed).
The structure of input files is described at the end of the document.
In order to reproduce the results given in Fig. 3 (page 5), run
python3 TestCasesClustering.py
The NMI similarity values used to make Fig. 3 correspond to the following columns in the output:
- 10th Col. (nmi_2o_ns) : 2-Von input where different codes are assigned to representations of the same location within a given cluster
- 14th Col. (nmi_2o_unif) : 2-Von input where a unique code is assigned to representations of the same location within a given cluster
- 18th Col. (nmi_agg) : Min 2-Von input
- 22nd Col. (nmi_fon) : 2-Fon input
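The column positions above are 1-based. As a minimal sketch of how the four NMI columns could be pulled out of the captured output, the helper below assumes one result per row of delimited text; the separator, the helper name extract_nmi and the handling of non-numeric rows are assumptions, not taken from TestCasesClustering.py.

```python
# 1-based column positions listed in this README, keyed by column name.
NMI_COLUMNS = {"nmi_2o_ns": 10, "nmi_2o_unif": 14, "nmi_agg": 18, "nmi_fon": 22}


def extract_nmi(rows, sep=","):
    """Hypothetical helper: collect the NMI values from delimited text rows.

    Assumption: each result row has at least 22 fields separated by `sep`.
    Rows that are too short or not numeric (e.g. a header) are skipped.
    """
    values = {name: [] for name in NMI_COLUMNS}
    for row in rows:
        fields = row.strip().split(sep)
        if len(fields) < max(NMI_COLUMNS.values()):
            continue
        try:
            # Convert 1-based positions to 0-based list indices.
            parsed = {name: float(fields[col - 1]) for name, col in NMI_COLUMNS.items()}
        except ValueError:
            continue
        for name, value in parsed.items():
            values[name].append(value)
    return values
```

The resulting lists, one per network type, can then be passed to any plotting tool to draw the boxplots.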
The boxplots are made using the ggplot2 package for R.
Warning: The script creates some temporary files that are not removed at the end. However, they are overwritten at each execution.
In order to reproduce the results given in col. 'Acc +- 2sd' of Table I (page 8), run
python3 HONModelsAccuracy.py
To change the dataset used, open file HONModelsAccuracy.py and change the variable filename.
Using the default value will launch the experiments on the Maritime dataset (file maritime_sequences.csv).
All results are printed inside the Python console.
In order to reproduce the results given in Fig. 4, Table I and Table II (page 8), run
python3 HONModelsClustering.py
To change the dataset used, open file HONModelsClustering.py and change the variable filename.
Using the default value will launch the experiments on the Maritime dataset (file maritime_sequences.csv).
Network specific outputs are printed inside the Python console.
Node specific statistics (e.g. number of clusters per location) are written to file ./clusters_stats.csv.
The reported variables are id,NET1_nd,NET1_nc,NET2,.. where NETi is either the Von2, Agg Von2 or Fon2 network.
- id: id of the location
- NETi_nd: number of representations of the location in network NETi
- NETi_nc: number of clusters found for the location in network NETi
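As an illustration of how these statistics can be summarized, the sketch below averages every column whose name ends in _nc. It assumes the file is comma-separated with a header row matching the variable names above; the function name and the example column names in the usage are hypothetical.

```python
import csv
import io


def mean_clusters_per_location(csv_text):
    """Hypothetical helper: mean of each '<network>_nc' column.

    Assumption: comma-separated text whose first row is the header.
    """
    totals, counts = {}, {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            # Only the per-network cluster counts are averaged.
            if name and name.endswith("_nc") and value not in (None, ""):
                totals[name] = totals.get(name, 0.0) + float(value)
                counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


if __name__ == "__main__":
    with open("clusters_stats.csv") as handle:
        print(mean_clusters_per_location(handle.read()))
```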
The cumulative plots in Fig. 4 are made using the ggplot2 package for R.
The clusters found for each location are printed in file ./clusters.csv.
The reported variables are id;NETi where NETi is either the Von2, Agg Von2 or Fon2 network.
- id: id of the location
- NETi: list of ids of clusters (int) the location belongs to in network NETi
The input file containing the sequences should have the following format:
ID1 L1 L2 L3 ...
ID2 L5 L6 L2 ...
ID3 L2 L4 ...
...
The first element of each line is the id of the sequence. You can also drop this column and pass is_line_id=False when calling the function readSequenceFile().
The rest of the line is the sequence of successively visited locations Lx; any string can be used to identify a location.
The separating character (variable sep) can be changed in each experiment script.
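The format above can be illustrated with a small parser. This is a sketch only and is not the repository's readSequenceFile(): the function name parse_sequences and its return type are hypothetical, and only the parameters sep and is_line_id mirror the names mentioned above.

```python
def parse_sequences(lines, sep=" ", is_line_id=True):
    """Illustrative parser returning a list of (sequence_id, locations) pairs.

    When is_line_id is False, every field is a location and the sequence id
    is generated from the position of the sequence (an assumption).
    """
    sequences = []
    for line in lines:
        # Drop empty fields produced by repeated separators or line endings.
        fields = [field for field in line.strip().split(sep) if field]
        if not fields:
            continue  # ignore blank lines
        if is_line_id:
            seq_id, locations = fields[0], fields[1:]
        else:
            seq_id, locations = str(len(sequences)), fields
        sequences.append((seq_id, locations))
    return sequences


if __name__ == "__main__":
    example = ["ID1 L1 L2 L3", "ID2 L5 L6 L2", "ID3 L2 L4"]
    for seq_id, locations in parse_sequences(example):
        print(seq_id, locations)
```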