Skip to content

Paul566/coauthorGraph

Repository files navigation

Co-authorship Network of Google Scholar

I investigated the properties of a co-authorship network of Google Scholar. The nodes of this network are Google Scholar profiles. Two nodes are linked if at least one of the authors has specified the other as their co-author. Furthermore, I am focusing only on one connected component that contains Paul Erdos in it. So, we are dealing with an unweighted undirected connected graph. The list of co-authors for a given author is located at the right-bottom side of their Google Scholar profile page. So, it is possible for two authors to be co-authors in real life but don't list each other as co-authors. For example, Terence Tao has zero co-authors. But he is listed as a co-author of Emmanuel Candes, so he is not an isolated node and he belongs to the connected component I focus on. So, the number of edges in this graph is likely considerably underestimated.

Basic properties of the network

Number of nodes: 969,559
Number of edges: 3,873,577
Maximal Erdos number (the maximal distance from Paul Erdos to some other node): 11

My objective was to find out if the coauthorship graph is decently connected. There are reasons to believe that, for example, the physicists - non-physicists or Chinese - non-Chinese partition would split the graph into poorly connected parts. Let's try to minimize the conductance (a.k.a. the Cheeger constant) of the graph h. A small conductance means the presence of two relatively large poorly connected communities. Determining the conductance is an NP-hard problem, so trying to implement an exact algorithm is a hopeless idea. First, I have implemented two relatively standard approaches.

Spectral partition

It is known that if we sort the vertices with respect to the coordinates of the eigenvector that corresponds to the second-smallest eigenvalue of the Laplacian matrix of a graph, and search among the partitions of the form "the one part is vertices 1..k, another part is vertices k+1..n in the sorted array of vertices", we would get a decent partition.
Eigen and Spectra packages were used to compute the eigenvalues and the eigenvectors. In our case, the method finds a partition with conductance 0.0133 and the sizes of parts 37 and 969522. It turned out that there is a community of 37 people that are connected with an outside world with just one edge. That's not exactly what I was looking for. If we demand the sizes of parts be at least 10,000, we get a partition of conductance 0.141 and sizes of the parts 605255 and 364304.

Since the community of 37 people may strongly affect the second eigenvector of the Laplacian matrix, I also tried to exclude the 37-community and recalculate the second eigenvector of the Laplacian matrix. Manually setting the coordinates of the vector that would correspond to the vertices from the 37-community to zeros, I got a partition of the conductance 0.0579 and sizes of the parts 3228 and 966331. This is again not exactly what I wanted to see. If we limit the search to the partitions with the sizes of the parts greater than 10,000, we would get a partition of conductance 0.250 and sizes of the parts 318612 and 650947.
There is also a nice property called Cheeger's inequality that allows one to set bounds for the conductance h using the second-smallest eigenvalue of the normalized Laplacian matrix of the graph.

The first and the second computed eigenvalues are 1.6e-15 and 0.00878701452518. The first value is meant to be exactly zero, but it differs a bit due to a numerical error. In fact, all eigenvalues of the normalized Laplacian matrix of a graph are non-negative and there is a zero eigenvalue that corresponds to (sqrt(d1), sqrt(d2), ... sqrt(dn)) eigenvector, where di are the degrees of the vertices. If the graph is connected (which is our case), all other eigenvalues are positive.
So, using Cheeger's inequality we have 0.00439 < h < 0.133.
The 37-community instance of partition gives a new upper bound for conductance h, and 0.00439 < h < 0.0133.
When the 37-community is excluded, the second eigenvalue of the normalized Laplacian matrix of the new graph is 0.0093441, so, using Cheeger's inequality, we get the following bounds for the conductance of the co-authorship graph without the 37-community: 0.00467 < h-37 < 0.137, which is not much different from the bound on the conductance of the entire graph. It is worth noting that the exclusion of 37 random people from the network shouldn't affect the second eigenvalue of the normalized Laplacian matrix significantly, but since the 37 people had large components of the eigenvector corresponding to them, I was not sure if the exclusion of the 37-community would change the second eigenvalue a lot.

Annealing

Another relatively standard general approach is to anneal the problem. The initial partition is random. The step of annealing is to pick a random vertex and try to move it to another part of the partition. The diameter of the network of states in this process is 106, which is not very good for annealing, but since there are about 210 6 possible partitions, there is not much we can do to keep both the diameter of the network of possible states and the number of the adjacent states for a given state relatively small. 1010 steps of annealing gave a partition with conductance 0.0422 and sizes of the parts 482538 and 487021.
So far, the best-known instance of partition is cutting off the 37-community, with conductance 0.0133, obtained with spectral partitioning. The best-known instance of partition with the sizes of the parts that are not small has conductance 0.0422 obtained with annealing.

The degree distribution of the graph

This is the plot of the probability of finding a vertex of a given degree for the GS coauthorship graph and Erdos-Renyi random graph with the same number of vertices and edges. As one can see, the GS coauthorship graph has a huge amount of leaves, in fact, 35% of all profiles (in the connected component I am focusing on). There is also an increase in the density of vertices at degree 20, which I can't explain.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published