LiScraper.py is a single-pass script meant to extract a list of profiles from within your own LinkedIn network, according to the results of a search query, e.g. data scientist AND Singapore. This script was used to fulfill the use case described in the following Medium article. Please remember that LinkedIn's User Agreement expressly prohibits crawlers and scraping activity on its website; use this resource at your own risk.
Also note that LinkedIn will not display profiles beyond 3rd-degree connections. As such, be aware of biases arising from the inspection paradox when attempting to make inferences about the population of Data Scientists (or other professionals of interest) using your own social network.
Within the script, the following parameters can be changed according to your requirements:
pd.json
Supply your LinkedIn credentials in this json file with the following structure:
{
    "username": "username",
    "pd": "password"
}
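For reference, a minimal sketch of how these credentials might be loaded, using the standard json module; note that the password key is "pd", not "password":

import json

# Load LinkedIn credentials from pd.json (structure shown above)
with open("pd.json") as f:
    creds = json.load(f)

username = creds["username"]
password = creds["pd"]  # the password is stored under the key "pd"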
results.csv
Profiles will be written to this file in .csv format, one profile per row. You do not have to create this file before running the script.
dataKey.json
This will store the unique profile links that have already been scraped. The scraper might halt for various reasons during execution, and this store keeps track of scraped profiles in case the run needs to be restarted. You do not have to create this file before running the script.
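The exact layout of dataKey.json is internal to the script, but a restart-safe store of this kind might work along these lines (illustrative sketch only):

import json
import os

# Load the set of previously scraped profile links, if the store exists
scraped = set()
if os.path.exists("dataKey.json"):
    with open("dataKey.json") as f:
        scraped = set(json.load(f))

def mark_scraped(link):
    # Record a link so a restarted run knows to skip it
    scraped.add(link)
    with open("dataKey.json", "w") as f:
        json.dump(sorted(scraped), f)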
self.searchQuery
This will contain the search query for profiles of interest.
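For example, to target the profiles mentioned in the introduction:

self.searchQuery = "data scientist AND Singapore"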
self.completedPages
In case the scraper halts midway, this variable enables you to pick up at the page where you left off. For example, specify self.completedPages = 9 if the scraper crashes on page 10 of your results.
self.totalPages
This determines the total number of pages that your scraper will evaluate. By default, LinkedIn does not display more than 100 pages of results for any search query.
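Together, these two variables might drive a resumable loop like the following sketch (scrapePage is a hypothetical per-page method, not necessarily the name used in the script):

# Resume at the page after the last completed one
for page in range(self.completedPages + 1, self.totalPages + 1):
    self.scrapePage(page)       # hypothetical per-page routine
    self.completedPages = page  # a crash here can resume at page + 1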
self.expo
This is a sample of 1000 interval times in units of seconds, drawn from an exponential distribution with scale parameter 2 (i.e. mean of 1 click every 2 seconds). It is meant to simulate the distribution of gap times between human clicks in normal browser interactions. Change the scale parameter to simulate your preferred distribution of interval times.
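Such a sample can be generated with NumPy's exponential sampler, with one gap drawn from it before each click:

import random
import time

import numpy as np

# 1000 inter-click gap times (seconds) with mean 2, as described above
expo = np.random.exponential(scale=2, size=1000)

# Before each browser action, sleep for a randomly chosen gap
time.sleep(random.choice(expo))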
A Pipfile is supplied with this package to install dependencies within a Python virtual environment. In Terminal, run pipenv install to install the dependencies and pipenv shell to enter the virtual environment.
Li-Scraper uses ChromeDriver to emulate browser interactions, so make sure you have it installed. On a Mac, the easiest way is to run brew install --cask chromedriver in Terminal (brew cask install chromedriver on older versions of Homebrew). Otherwise, please refer to the ChromeDriver documentation: https://chromedriver.chromium.org/getting-started
Once inside the virtual environment, run python Li-Scraper.py (or, without entering the shell, pipenv run python Li-Scraper.py).
Results are written to the file results.csv, one profile per row. Because each profile may have multiple entries for education and experience, the data for each of these parameters is written as a json object array. For experience, the structure is:
{
    "data": [
        {
            "position": "Position 1",
            "company": "Company A",
            "duration": "X1 yrs Y1 mos"
        },
        {
            "position": "Position 2",
            "company": "Company B",
            "duration": "X2 yrs Y2 mos"
        }
    ]
}
And for education:

{
    "data": [
        {
            "school": "School 1",
            "certification": "Bachelor's P1:::Domain Q1:::Merit R1"
        },
        {
            "school": "School 2",
            "certification": "Master's P2:::Domain Q2"
        }
    ]
}
The field certification contains up to 3 sub-parameters: Certificate (Bachelor's/Master's, etc.), Domain (Biology/Computer Science, etc.) and Merit (First Class Honours/4.0 GPA, etc.). These sub-parameters are delimited by a triple colon (:::).
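As an illustration, a profile row might be unpacked downstream like this; the column names experience and education are assumptions, so substitute whatever headers appear in your results.csv:

import json

import pandas as pd

df = pd.read_csv("results.csv")

# Parse the embedded json arrays for the first profile (column names assumed)
experience = json.loads(df.loc[0, "experience"])["data"]
education = json.loads(df.loc[0, "education"])["data"]

# Split the triple-colon-delimited certification field into its sub-parameters
for entry in education:
    parts = entry["certification"].split(":::")
    certificate = parts[0]
    domain = parts[1] if len(parts) > 1 else None
    merit = parts[2] if len(parts) > 2 else None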
The scraper works by emulating actual browser interactions, so it will often fail if the page has not updated to the correct state and does not yet contain the HTML elements being looked for. In particular, if the message Finding next icon... is repeated multiple times in Terminal, the driver is attempting to scroll down to the bottom of the page but the browser is not responding for some reason. In this situation, it suffices to scroll down the page manually to help the driver along. Otherwise, if the scraper crashes, it does not hurt to restart the execution: specify the starting page with self.completedPages, and the file dataKey.json will ensure that the scraper picks up where it left off.
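If you prefer not to intervene manually, a programmatic nudge with Selenium (which drives ChromeDriver) achieves the same scroll; this is a sketch, not code from the script itself:

from selenium import webdriver

driver = webdriver.Chrome()

# Scroll to the bottom of the page so the pagination controls render
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")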
The XPath selectors are kept in a json file that maps the parameters of interest, i.e. Name, Experience and Education, to the HTML tags that contain them. Considerable manual effort was invested in identifying these tags and their corresponding selectors. Do not be surprised if LinkedIn changes the identity of the HTML tags, in which case you will have to update the XPath selectors in this file manually.
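The file name (xpaths.json) and keys below are purely illustrative, but a selector file of this kind is typically consumed like so:

import json

from selenium import webdriver
from selenium.webdriver.common.by import By

# File name and keys here are assumptions; use those defined by the script
with open("xpaths.json") as f:
    xpaths = json.load(f)

driver = webdriver.Chrome()
name = driver.find_element(By.XPATH, xpaths["name"]).text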