diskeu/WEB_SCRAPER_SCRATCH

Web Scraper Built Completely From Scratch

I decided to build a web scraper entirely from scratch, using only Python's standard library and no third-party dependencies. The request sender, HTTP parser, and HTML parser are all implemented by hand.

Request Sender

The request sender is built using only the ssl and socket modules. It supports:

•	HTTPS connections
•	chunked transfer encoding

The request sender is split into:

•	a request builder
•	a request sender
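
As a rough illustration of the builder/sender split, here is a minimal sketch using only ssl and socket. The function names (build_request, send_request, decode_chunked) are hypothetical and are not taken from this repository; the actual code may be organized differently.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Request builder: assemble a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close so we can read until EOF
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_request(host: str, request: bytes, port: int = 443) -> bytes:
    """Request sender: send the bytes over TLS and return the raw response."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            tls_sock.sendall(request)
            chunks = []
            while True:
                data = tls_sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)


def decode_chunked(body: bytes) -> bytes:
    """Decode a body that was sent with Transfer-Encoding: chunked."""
    decoded = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ";extension" suffixes.
        size = int(body[pos:line_end].split(b";")[0], 16)
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        start = line_end + 2
        decoded += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends each chunk
    return bytes(decoded)
```

A typical call would be send_request("example.com", build_request("example.com")), with decode_chunked applied to the body only when the response headers announce chunked encoding. This sketch ignores trailers and assumes the whole body is already in memory.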

HTTP Parser

The HTTP parser consists of:

•	a file for the actual parsing logic
•	a file for defining and managing parser attributes
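
To make the split between parsing logic and parser attributes concrete, here is a minimal sketch. The HttpResponse class and parse_response function are hypothetical names chosen for illustration, and both parts live in one block here rather than in two files.

```python
from dataclasses import dataclass, field


@dataclass
class HttpResponse:
    """Attributes filled in by the parser (assumed layout, for illustration)."""
    version: str = ""
    status_code: int = 0
    reason: str = ""
    headers: dict = field(default_factory=dict)
    body: bytes = b""


def parse_response(raw: bytes) -> HttpResponse:
    """Split a raw response into status line, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line, e.g. "HTTP/1.1 200 OK"; the reason phrase is optional.
    version, status_code, *reason = lines[0].split(" ", 2)
    response = HttpResponse(
        version, int(status_code), reason[0] if reason else "", body=body
    )

    # Header names are case-insensitive, so store them lowercased.
    for line in lines[1:]:
        name, _, value = line.partition(":")
        response.headers[name.strip().lower()] = value.strip()
    return response
```

With this shape, a caller can check response.headers.get("transfer-encoding") to decide whether the body still needs chunked decoding. Repeated headers overwrite each other in this sketch, which a fuller parser would need to handle.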

HTML Parser

The HTML parser is the core of the project. It includes:

•	two tokenizer versions (tokenizer_v1_0_0 is the stable one)
•	a token class
•	a tree constructor for building the DOM
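
The way a tokenizer, a token class, and a tree constructor fit together can be sketched as follows. This is a deliberately simplified illustration, not the code in tokenizer_v1_0_0: the names Token, Node, tokenize, and build_tree are assumptions, attributes are dropped, and comments containing ">" are not handled.

```python
from dataclasses import dataclass, field

# Elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


@dataclass
class Token:
    kind: str  # "start", "end", or "text"
    data: str  # tag name or text content


@dataclass
class Node:
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def tokenize(html: str) -> list:
    """Turn an HTML string into a flat list of tokens."""
    tokens, buffer, in_tag = [], "", False
    for ch in html:
        if ch == "<" and not in_tag:
            if buffer.strip():
                tokens.append(Token("text", buffer.strip()))
            buffer, in_tag = "", True
        elif ch == ">" and in_tag:
            parts = buffer.split()
            if buffer.startswith("/"):
                tokens.append(Token("end", buffer[1:].strip().lower()))
            elif parts and not buffer.startswith("!"):
                # Keep only the tag name; attributes are ignored in this sketch.
                tokens.append(Token("start", parts[0].rstrip("/").lower()))
            buffer, in_tag = "", False
        else:
            buffer += ch
    if buffer.strip() and not in_tag:
        tokens.append(Token("text", buffer.strip()))
    return tokens


def build_tree(tokens: list) -> Node:
    """Build a DOM-like tree using a stack of currently open elements."""
    root = Node("#document")
    stack = [root]
    for token in tokens:
        if token.kind == "text":
            stack[-1].children.append(Node("#text", token.data))
        elif token.kind == "start":
            node = Node(token.data)
            stack[-1].children.append(node)
            if token.data not in VOID_TAGS:
                stack.append(node)
        elif token.kind == "end":
            # Pop back to the matching open element, if there is one.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == token.data:
                    del stack[i:]
                    break
    return root
```

Keeping the tokenizer and the tree constructor separate means the token list can be inspected on its own before any tree is built. Unmatched end tags are silently ignored here.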

The tokenizer features:

•	a debug mode (prints the current state of all internal buffers)
•	a pretty printer to visualize the generated DOM tree
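
A pretty printer for a DOM tree can be as small as a recursive function that indents by depth. The sketch below assumes a hypothetical Node class with name, text, and children fields; the node type and output format in this repository may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def pretty(node: Node, depth: int = 0) -> str:
    """Render the tree as indented lines, one node per line."""
    indent = "  " * depth
    # Text nodes show their quoted content; elements show their tag name.
    label = repr(node.text) if node.name == "#text" else f"<{node.name}>"
    lines = [indent + label]
    for child in node.children:
        lines.append(pretty(child, depth + 1))
    return "\n".join(lines)


tree = Node("html", children=[
    Node("body", children=[
        Node("p", children=[Node("#text", "Hello")]),
    ]),
])
print(pretty(tree))
```

Returning a string instead of printing directly keeps the function easy to test and lets the caller decide where the output goes.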

Hope you enjoy it!

About

Web scraper built completely from scratch in Python
