GitHub - elasticjavajk/webcrawler: This is to crawl any provided website URL, limited to scraping the internal links and media used.

#Goal:#

To create an app to crawl http://wiprodigital.com. The crawling should be limited to the internal links wit in the domain, i.e,wiprodigital and should also display the media/images used in the crawled page.

#Technology:# The technology stack used to implement this application are as below

~ jsoup ~ java ~ log4j ~ junit & mockito ~ Maven

#Build & Excecution:#

+Prerequisites required to build and run this application is

Java 1.8 installed and setup on the local box
Maven installed and setup on the local box
CheckOut the project from github
Run the cmd in console 'mvn clean install' to build and package the jar
On successfull execution of the above step, the jar file should be available in target folder as 'webcrawler-1.0.jar'
Run the jar from the target folder by using cmd 'java -jar webcrawler-1.0.jar > crawloutput.txt '

On successfully completing the execution, the output file will have the list of links along with the media/images in a tree structure in the crawloutput.txt file, which will be available parallel to the jar file.

#Improvements:#

Implement the presentation layer to display the spilled output
Provide the output in a json response format to the client

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.settings		.settings
src		src
.classpath		.classpath
.gitignore		.gitignore
.project		.project
README.MD		README.MD
dependency-reduced-pom.xml		dependency-reduced-pom.xml
output.txt		output.txt
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages