
com.web.crawler

A simple web crawler built with Selenium, Java, and Gradle.

Dependencies:

All dependencies are handled by Gradle. The chromedriver executable is included in the project.

Prerequisites:

A recent version of Chrome must be installed on the machine (version 75 at the time the project was compiled). A recent Java version is also required (12 was used).

How to configure:

I have provided a config.properties file where you can set the following options:

  • url: base URL for the crawler. Default: https://www.google.com/
  • statusCode: HTTP status code that identifies good (unbroken) links. Default: 200
  • asset: the type of links to search for on each page: JS, IMAGES, CSS, or PAGES. Default: PAGES
  • maxLinksPerPage: maximum number of good links each parent page should retrieve. Default: 2
  • maxDepth: maximum depth of the traversal. Default: 3
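Putting the keys above together, a config.properties file using the stated defaults would look like this (key names and values are taken directly from the list above):

```properties
# Base URL the crawler starts from
url=https://www.google.com/
# Status code that marks a link as good (unbroken)
statusCode=200
# Link type to search for: JS, IMAGES, CSS, or PAGES
asset=PAGES
# Per-page cap on good links retrieved
maxLinksPerPage=2
# Maximum traversal depth
maxDepth=3
```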

How to run the project:

To run from the terminal, simply execute `gradle run`.
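To illustrate how `maxDepth` and `maxLinksPerPage` bound the traversal, here is a minimal, stdlib-only sketch of that kind of bounded breadth-first crawl. This is a hypothetical illustration, not the project's actual code: a static link map stands in for the web, whereas the real crawler fetches pages with Selenium.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class BoundedCrawlSketch {
    // Breadth-first traversal capped by maxDepth and maxLinksPerPage,
    // mirroring the two bounds described in config.properties.
    static List<String> crawl(Map<String, List<String>> links, String start,
                              int maxDepth, int maxLinksPerPage) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<String[]> queue = new ArrayDeque<>(); // entries are {url, depth}
        queue.add(new String[]{start, "0"});
        seen.add(start);
        while (!queue.isEmpty()) {
            String[] entry = queue.remove();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            visited.add(url);
            if (depth >= maxDepth) continue;       // stop descending past maxDepth
            int taken = 0;
            for (String child : links.getOrDefault(url, List.of())) {
                if (taken == maxLinksPerPage) break; // per-page link cap
                if (seen.add(child)) {
                    queue.add(new String[]{child, String.valueOf(depth + 1)});
                    taken++;
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c", "d"),
                "b", List.of("e"),
                "c", List.of());
        // With maxDepth = 2 and maxLinksPerPage = 2, "d" is skipped by the
        // per-page cap and "e" is visited but not expanded.
        System.out.println(crawl(web, "a", 2, 2)); // prints [a, b, c, e]
    }
}
```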

How to build and run a JAR:

To build an executable JAR, first run `gradle clean`, then run the fatJar Gradle task with `gradle fatJar`. This creates an executable JAR at build/libs/crawler-1.0-SNAPSHOT.jar. To run it, open a terminal, cd to the project directory, and run `java -jar build/libs/crawler-1.0-SNAPSHOT.jar`. Run it from the project root and do not move the JAR, as doing so will break the bundled resources.

Output

The program's output is written to the console using the slf4j/log4j logging libraries.
