Source Code and Datasets for submission "Toward Random Walk Based Clustering of Variable-Order Networks"

Note: This folder does not contain a single main script for all experiments. Instead, it provides separate scripts, which are explained below.

The experiment scripts use some third-party code: the files BuildRulesFast.py and BuildNetwork.py are used to generate the relevant subsequences of an input dataset and the corresponding Von networks. These files are minor modifications of the ones available at https://github.com/xyjprc/hon (last checked June 2021).

Dependencies and Setup

Python, version >=3 (experiments made with version 3.6.9)
HeapDict Python library (install with pip install HeapDict; experiments made with version 1.0.1)
Infomap. The console command infomap must be available (experiments made with version 1.3.0).
LFR Benchmark. Source code is given in folder 'LFRBenchmark'. Also requires gcc (experiments made with version 7.5.0).
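
Before running the experiments, it can help to confirm that the tools listed above are reachable. The following is a minimal sketch, not part of the repository: the function name check_dependencies is hypothetical, and it only tests for presence, not for the specific versions used in the experiments.

```python
import importlib.util
import shutil
import sys


def check_dependencies():
    """Return a dict mapping each dependency name to True if it looks available."""
    return {
        # Python 3 or later is required.
        "python>=3": sys.version_info >= (3,),
        # The HeapDict package is imported as the module `heapdict`.
        "heapdict": importlib.util.find_spec("heapdict") is not None,
        # Infomap must be callable as the console command `infomap`.
        "infomap": shutil.which("infomap") is not None,
        # gcc is needed to compile the LFR benchmark.
        "gcc": shutil.which("gcc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any entry reported as MISSING should be installed before launching the scripts described below.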

To run the test experiment reported in Section III, go to the folder 'LFRBenchmark' and run the command make. See the ReadMe.txt in the folder 'LFRBenchmark' for more details. We corrected a small bug that occurs when compiling the program with gcc version >=7.5.0 (it is not related to the network generation itself). The original code is available at https://sites.google.com/site/andrealancichinetti/files/binary_networks.tar.gz (last checked June 2021).

Datasets

The three datasets used in the paper are available here. An explanation of each dataset is given in Section VI of the paper.

maritime_sequences.csv: Maritime sequences dataset (default dataset used in the scripts below).
2011Q1_SEQ.zip: Airport sequences dataset (compressed).
trajectories_PoliceStation.zip: Taxis sequences dataset (compressed).

The structure of input files is described at the end of the document.

Reproducing the experiments

LFR Test Cases Clustering (Results in Section IV)

In order to reproduce the results given in Fig. 3 (page 5), run

    python3 TestCasesClustering.py

The NMI similarity values used to make Fig. 3 correspond to the following columns in the output:

10th Col. (nmi_2o_ns): 2-Von input where different codes are assigned to representations of the same location within a given cluster
14th Col. (nmi_2o_unif): 2-Von input where a unique code is assigned to representations of the same location within a given cluster
18th Col. (nmi_agg): Min 2-Von input
22nd Col. (nmi_fon): 2-Fon input

The boxplots are made using the ggplot2 package for R.
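
To prepare the data for plotting, the four NMI columns can be pulled out of the script's output. The sketch below is an illustration only: it assumes the output has been captured as delimited text lines, and the separator (here ";") and the helper name extract_nmi are assumptions, not taken from the repository. The column positions are the 1-based ones listed above.

```python
import csv

# 1-based column positions listed above, mapped to their names.
NMI_COLUMNS = {"nmi_2o_ns": 10, "nmi_2o_unif": 14, "nmi_agg": 18, "nmi_fon": 22}


def extract_nmi(lines, sep=";"):
    """Collect the NMI values of each listed column from delimited text lines."""
    values = {name: [] for name in NMI_COLUMNS}
    for row in csv.reader(lines, delimiter=sep):
        for name, col in NMI_COLUMNS.items():
            try:
                values[name].append(float(row[col - 1]))  # 1-based -> 0-based
            except (IndexError, ValueError):
                # Skip short rows and non-numeric cells such as a header line.
                continue
    return values
```

The resulting lists can then be written out in whatever layout the plotting tool expects.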

Warning: The script creates some temporary files that are not removed at the end. However, they are overwritten at each execution.

Models Accuracy (Results in Section VII)

In order to reproduce the results given in col. 'Acc +- 2sd' of Table I (page 8), run

    python3 HONModelsAccuracy.py

To change the dataset used, open file HONModelsAccuracy.py and change the variable filename. Using the default value will launch the experiments on the Maritime dataset (file maritime_sequences.csv). All results are printed inside the Python console.

Networks and Clustering Comparison (Results in Section VII)

In order to reproduce the results given in Fig. 4, Table I and Table II (page 8), run

    python3 HONModelsClustering.py

To change the dataset used, open file HONModelsAccuracy.py and change the variable filename. Using the default value will launch the experiments on the Maritime dataset (file maritime_sequences.csv).

Network-specific outputs are printed inside the Python console. Node-specific statistics (e.g. the number of clusters per location) are written to the file ./clusters_stats.csv. The reported variables are id,NET1_nd,NET1_nc,NET2,.. where NETi is either the Von2, Agg Von2 or Fon2 network.

id : id of the location
NETi_nd : number of representations of the location in network NETi
NETi_nc : number of clusters found for the location in network NETi

The cumulative plots in Fig. 4 are made using the ggplot2 package for R.
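
As a rough illustration of what a cumulative plot summarizes, the sketch below computes, for one NETi_nc column, the fraction of locations that belong to at most k clusters. It is not the authors' plotting code. It assumes that ./clusters_stats.csv is comma-separated with a header row, and the helper names and the example column name "Von2_nc" are hypothetical.

```python
import csv
from collections import Counter


def cumulative_cluster_counts(rows, column):
    """Return [(k, fraction of locations with at most k clusters)], sorted by k."""
    counts = Counter(int(row[column]) for row in rows)
    total = sum(counts.values())
    cumulative, running = [], 0
    for k in sorted(counts):
        running += counts[k]
        cumulative.append((k, running / total))
    return cumulative


def load_stats(path="clusters_stats.csv"):
    """Read the statistics file, assuming a comma-separated header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

For example, cumulative_cluster_counts(load_stats(), "Von2_nc") would give the points of one curve, provided the file actually uses that column name.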

The clusters found for each location are printed in file ./clusters.csv. The reported variables are id;NETi where NETi is either the Von2, Agg Von2 or Fon2 network.

id : id of the location
NETi : list of ids of clusters (int) the location belongs to in network NETi

Using different Datasets of Sequences

The input file containing the sequences should have the following format:

    ID1 L1 L2 L3 ...
    ID2 L5 L6 L2 ...
    ID3 L2 L4 ...
    ...

The first element of each line is the id of the sequence. You can also drop this column and set is_line_id=False when the function readSequenceFile() is called. The rest of the line is the sequence of successively visited locations Lx. Any string can be used to identify a location. The separating character (variable sep) can be changed in each experiment script.
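
The format described above can be parsed in a few lines. The following sketch is a hypothetical helper written for illustration; it is not the repository's readSequenceFile(), whose exact signature and return type may differ. Only the parameter names sep and is_line_id are borrowed from the text, and using the line index as an id when is_line_id=False is an assumption.

```python
def read_sequences(lines, sep=" ", is_line_id=True):
    """Parse lines of the form 'ID L1 L2 ...' into a list of (id, locations) pairs.

    Hypothetical helper for illustration; it is not the repository's
    readSequenceFile().
    """
    sequences = []
    for number, line in enumerate(lines):
        fields = [f for f in line.strip().split(sep) if f]
        if not fields:
            continue  # skip blank lines
        if is_line_id:
            seq_id, locations = fields[0], fields[1:]
        else:
            seq_id, locations = str(number), fields  # fall back to the line index
        sequences.append((seq_id, locations))
    return sequences
```

Passing an open file object as lines works as well, since the function only iterates over text lines.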
