Co-authorship Network of Google Scholar

I investigated the properties of the Google Scholar co-authorship network. The nodes of this network are Google Scholar profiles. Two nodes are linked if at least one of the two authors lists the other as a co-author. I focus only on the connected component that contains Paul Erdos, so we are dealing with an unweighted, undirected, connected graph. The list of co-authors of a given author is shown at the bottom right of their Google Scholar profile page. Two authors can therefore be co-authors in real life without listing each other. For example, Terence Tao lists zero co-authors, but he is listed as a co-author by Emmanuel Candes, so he is not an isolated node and he belongs to the connected component I focus on. As a result, the number of edges in this graph is likely a considerable underestimate.
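
The edge rule and the restriction to one connected component described above can be sketched as follows. This is only an illustration on toy data: the function names and the profile ids are hypothetical, and it is not the code from `collect.py`.

```python
from collections import deque

def build_undirected_graph(listed_coauthors):
    """Turn one-sided co-author listings into an undirected adjacency map.

    listed_coauthors maps a profile id to the ids that profile lists.
    """
    adjacency = {}
    for author, coauthors in listed_coauthors.items():
        adjacency.setdefault(author, set())  # keep profiles with empty lists as nodes
        for other in coauthors:
            if other == author:
                continue  # ignore self-listings
            adjacency[author].add(other)
            # A one-sided listing is enough to create the edge.
            adjacency.setdefault(other, set()).add(author)
    return adjacency

def connected_component(adjacency, start):
    """Return the set of nodes reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Toy listings with hypothetical ids: "b" lists nobody but is listed by "a".
listed = {"a": ["b"], "b": [], "c": ["a"], "d": [], "e": ["f"]}
graph = build_undirected_graph(listed)
component = connected_component(graph, "a")
```

In the toy data, `b` plays the role of a profile with an empty co-author list that still ends up in the component because someone else lists it.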

Basic properties of the network

Number of nodes: 969,559
Number of edges: 3,873,577
Maximal Erdos number (the maximal distance from Paul Erdos to some other node): 11
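
The maximal Erdos number is the largest breadth-first-search distance from the Erdos node. A minimal sketch on a toy graph, assuming the graph is stored as an adjacency map (the node labels and the function name are hypothetical):

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances

# Toy graph: a path erdos - a - c - d with an extra leaf b attached to erdos.
toy = {
    "erdos": ["a", "b"],
    "a": ["erdos", "c"],
    "b": ["erdos"],
    "c": ["a", "d"],
    "d": ["c"],
}
erdos_numbers = bfs_distances(toy, "erdos")
max_erdos_number = max(erdos_numbers.values())
```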

My objective was to find out whether the co-authorship graph is decently connected. There are reasons to believe that, for example, the physicists versus non-physicists or the Chinese versus non-Chinese partition would split the graph into poorly connected parts. To test this, I look for a partition of small conductance; the minimum over all partitions is the conductance h of the graph (a.k.a. the Cheeger constant). A small conductance means the presence of two poorly connected communities. Determining the conductance is an NP-hard problem, so trying to implement an exact algorithm is a hopeless idea. First, I implemented two relatively standard approaches.
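
For concreteness, here is a small sketch of the conductance of a single partition. It assumes the volume-based definition (cut edges divided by the smaller of the two degree sums), which is the one Cheeger's inequality for the normalized Laplacian refers to; the source does not state its definition explicitly, so treat this as an assumption. Under this definition, a set cut off by a single edge has conductance 1 divided by its total degree, so the reported 0.0133 would correspond to a total degree of about 75 inside the 37-community.

```python
def conductance(adjacency, part):
    """Cut edges between part and its complement over the smaller volume."""
    part = set(part)
    cut_edges = 0
    volume_inside = 0
    volume_total = 0
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        volume_total += degree
        if node in part:
            volume_inside += degree
            cut_edges += sum(1 for n in neighbours if n not in part)
    smaller_volume = min(volume_inside, volume_total - volume_inside)
    if smaller_volume == 0:
        raise ValueError("both parts must contain at least one edge endpoint")
    return cut_edges / smaller_volume

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
barbell = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
bridge_cut = conductance(barbell, {0, 1, 2})   # 1 cut edge, volume 7 on each side
single_node = conductance(barbell, {0})        # 2 cut edges, volume 2
```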

Spectral partition

It is known that sorting the vertices by their coordinates in the eigenvector corresponding to the second-smallest eigenvalue of the Laplacian matrix of the graph, and then searching only among partitions of the form "vertices 1..k in one part, vertices k+1..n in the other" (in sorted order), yields a decent partition.
The Eigen and Spectra packages were used to compute the eigenvalues and eigenvectors. In our case, the method finds a partition with conductance 0.0133 and part sizes 37 and 969,522. It turns out that there is a community of 37 people connected to the outside world by just one edge. That's not exactly what I was looking for. If we require each part to contain at least 10,000 vertices, we get a partition with conductance 0.141 and part sizes 605,255 and 364,304.
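
The sweep over sorted vertices can be sketched in pure Python on a toy graph. This is not the Eigen/Spectra implementation from the repository: it approximates the second eigenvector of the unnormalized Laplacian L = D - A with a simple power iteration, which is an assumption made to keep the example dependency-free, and the `min_part_size` parameter is a hypothetical name for the size restriction mentioned above. The graph is assumed to be connected.

```python
import math

def fiedler_vector(adjacency, iterations=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L = D - A."""
    nodes = list(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    degrees = [len(adjacency[node]) for node in nodes]
    shift = 2 * max(degrees)  # eigenvalues of L lie in [0, 2 * max degree]
    x = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        # y = (shift * I - L) x, computed row by row
        y = []
        for i, node in enumerate(nodes):
            total = (shift - degrees[i]) * x[i]
            for neighbour in adjacency[node]:
                total += x[index[neighbour]]
            y.append(total)
        mean = sum(y) / n
        y = [value - mean for value in y]  # project out the constant eigenvector
        norm = math.sqrt(sum(value * value for value in y))
        x = [value / norm for value in y]
    return {node: x[index[node]] for node in nodes}

def sweep_cut(adjacency, scores, min_part_size=1):
    """Best prefix cut of the vertices sorted by score, by volume-based conductance."""
    order = sorted(adjacency, key=lambda node: scores[node])
    total_volume = sum(len(neighbours) for neighbours in adjacency.values())
    inside = set()
    volume = 0
    cut = 0
    best_phi, best_k = float("inf"), 0
    for k, node in enumerate(order[:-1], start=1):
        neighbours = adjacency[node]
        already_inside = sum(1 for n in neighbours if n in inside)
        cut += len(neighbours) - 2 * already_inside  # update the cut incrementally
        volume += len(neighbours)
        inside.add(node)
        if min(k, len(order) - k) < min_part_size:
            continue
        phi = cut / min(volume, total_volume - volume)
        if phi < best_phi:
            best_phi, best_k = phi, k
    return best_phi, set(order[:best_k])

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
barbell = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
phi, part = sweep_cut(barbell, fiedler_vector(barbell))
```

On the toy graph the sweep recovers the cut through the bridge. For a graph with about a million vertices, a sparse eigensolver such as the one the source uses is the practical choice; the power iteration here is only meant to show the idea.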

Since the community of 37 people may strongly affect the second eigenvector of the Laplacian matrix, I also excluded the 37-community and recalculated the second eigenvector. After manually setting the coordinates that correspond to the vertices of the 37-community to zero, I got a partition with conductance 0.0579 and part sizes 3,228 and 966,331. This is again not exactly what I wanted to see. If we limit the search to partitions in which both parts contain more than 10,000 vertices, we get a partition with conductance 0.250 and part sizes 318,612 and 650,947.
There is also a nice property called Cheeger's inequality, which bounds the conductance h in terms of the second-smallest eigenvalue λ₂ of the normalized Laplacian matrix of the graph: λ₂/2 ≤ h ≤ sqrt(2λ₂).

The first two computed eigenvalues are 1.6e-15 and 0.00878701452518. The first value should be exactly zero; it differs slightly because of numerical error. In fact, all eigenvalues of the normalized Laplacian matrix of a graph are non-negative, and zero is always an eigenvalue, with eigenvector (sqrt(d₁), sqrt(d₂), ..., sqrt(d_n)), where d_i are the degrees of the vertices. If the graph is connected (which is our case), all other eigenvalues are positive.
So, using Cheeger's inequality, we have 0.00439 < h < 0.133.
The partition that cuts off the 37-community gives a better upper bound on the conductance h, so 0.00439 < h < 0.0133.
When the 37-community is excluded, the second eigenvalue of the normalized Laplacian matrix of the new graph is 0.0093441, so Cheeger's inequality gives the following bounds for the conductance of the co-authorship graph without the 37-community: 0.00467 < h_-37 < 0.137, which is not much different from the bounds for the entire graph. Excluding 37 random people from the network should not change the second eigenvalue of the normalized Laplacian matrix significantly. However, the 37-community accounted for large components of the second eigenvector, so I was not sure in advance whether excluding it would change the second eigenvalue a lot.
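
The bounds quoted above follow from the standard form of Cheeger's inequality for the normalized Laplacian, λ₂/2 ≤ h ≤ sqrt(2λ₂). The arithmetic can be checked with a few lines (the function name is hypothetical; the eigenvalues and the conductance 0.0133 are the ones reported above):

```python
import math

def cheeger_bounds(lambda_2):
    """Lower and upper bound on the conductance from the second eigenvalue."""
    return lambda_2 / 2, math.sqrt(2 * lambda_2)

# Whole graph, using the second eigenvalue reported above.
low, high = cheeger_bounds(0.00878701452518)

# Any explicit partition gives an upper bound too, here the 37-community cut.
best_upper = min(high, 0.0133)

# Graph without the 37-community.
low_without_37, high_without_37 = cheeger_bounds(0.0093441)

print(f"{low:.5f} < h < {high:.3f}")
print(f"{low_without_37:.5f} < h_-37 < {high_without_37:.3f}")
```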

Annealing

Another relatively standard general approach is simulated annealing. The initial partition is random. One annealing step picks a random vertex and tries to move it to the other part of the partition. The diameter of the network of states in this process is 10⁶, which is not very good for annealing, but since there are about 2^(10⁶) possible partitions, there is not much we can do to keep both the diameter of the network of possible states and the number of states adjacent to a given state relatively small. 10¹⁰ annealing steps gave a partition with conductance 0.0422 and part sizes 482,538 and 487,021.
So far, the best-known partition is the one that cuts off the 37-community, with conductance 0.0133, obtained with spectral partitioning. The best-known partition in which neither part is small has conductance 0.0422 and was obtained with annealing.
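
The annealing step described above (pick a random vertex, try to move it to the other part) can be sketched as follows. The source does not specify the acceptance rule or the cooling schedule, so the Metropolis rule, the linear cooling, and all parameter names and values here are assumptions; this is not the code from `Annealing.cpp`.

```python
import math
import random

def anneal_partition(adjacency, steps=20000, start_temperature=0.5, seed=0):
    """Search for a low-conductance two-way partition by single-vertex moves."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    total_volume = sum(len(adjacency[n]) for n in nodes)
    side = {node: rng.random() < 0.5 for node in nodes}  # random initial partition
    volume = sum(len(adjacency[n]) for n in nodes if side[n])
    cut = sum(1 for n in nodes for m in adjacency[n] if side[n] != side[m]) // 2

    def phi(cut_edges, volume_inside):
        smaller = min(volume_inside, total_volume - volume_inside)
        return cut_edges / smaller if smaller > 0 else float("inf")

    current = phi(cut, volume)
    best_value, best_side = current, dict(side)
    for step in range(steps):
        temperature = start_temperature * (1 - step / steps) + 1e-9  # linear cooling
        node = rng.choice(nodes)
        degree = len(adjacency[node])
        same = sum(1 for m in adjacency[node] if side[m] == side[node])
        # Moving the node cuts its same-side edges and heals its cross edges.
        new_cut = cut + same - (degree - same)
        new_volume = volume - degree if side[node] else volume + degree
        candidate = phi(new_cut, new_volume)
        accept = candidate <= current or (
            rng.random() < math.exp((current - candidate) / temperature)
        )
        if accept:
            side[node] = not side[node]
            cut, volume, current = new_cut, new_volume, candidate
            if current < best_value:
                best_value, best_side = current, dict(side)
    return best_value, {n for n in nodes if best_side[n]}

# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
barbell = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
value, part = anneal_partition(barbell)
```

The cut size and the volume are updated incrementally, so each step costs time proportional to the degree of the chosen vertex. Some such shortcut is needed for a run of 10¹⁰ steps to be feasible.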

The degree distribution of the graph

The plot shows the probability of finding a vertex of a given degree for the GS co-authorship graph and for an Erdos-Renyi random graph with the same number of vertices and edges. As one can see, the GS co-authorship graph has a huge number of leaves (vertices of degree 1): they make up 35% of all profiles in the connected component I am focusing on. There is also an increase in the density of vertices at degree 20, which I can't explain.
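
To put the 35% of leaves in perspective, the sketch below computes an empirical degree distribution and compares it with the Erdos-Renyi baseline. It assumes that the degrees of a large sparse Erdos-Renyi graph are well approximated by a Poisson distribution with mean 2m/n; the function names are hypothetical and this is not the code from `illustrator.py`.

```python
import math
from collections import Counter

def degree_distribution(adjacency):
    """Fraction of vertices having each degree."""
    counts = Counter(len(neighbours) for neighbours in adjacency.values())
    n = len(adjacency)
    return {degree: count / n for degree, count in sorted(counts.items())}

def poisson_pmf(k, mean):
    """Probability that a Poisson variable with the given mean equals k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# Node and edge counts reported in "Basic properties of the network".
nodes, edges = 969_559, 3_873_577
mean_degree = 2 * edges / nodes          # about 7.99
leaf_probability_er = poisson_pmf(1, mean_degree)

# Toy check of the empirical distribution: a star with three leaves.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
star_distribution = degree_distribution(star)
```

Under this approximation a random graph with the same mean degree would have well under 1% of its vertices at degree 1, which is why the 35% of leaves stands out in the comparison.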

