lukasanukvari/parsify

Parsify

Stop writing a separate parser script for every website. With Parsify, a single script of a few lines plus a configuration file is enough to adapt your parser to different websites.

Installation

pip install parsify

Usage

Make sure you have your configuration file (usually handbook.json) ready.

import parsify as pf


# Create Parsify engine
ngn = pf.Engine(handbook='handbook.json')

# Run a single step by passing its name as an argument.
# The step must belong to Engine.current_parser
# (by default, the first parser in the Handbook).
# The step should not rely on any "dynamic_variables" when run on its own with this method.
step_result = ngn.stepshot(step='get_products')
# print(step_result)

# Parse a single website (must be configured in "handbook.json")
# Provide scope name as an argument
scope_result = ngn.scopeshot(parser='example.com')
# print(scope_result)

# Run all the parsers that are configured in "handbook.json"
final_result = ngn.parse()
# print(final_result)

Handbook Tutorial

Required Fields

  • The Handbook file should start with a "parser" key whose value is an array of parsers.
  • Each parser in the array should have two keys:
    • "scope" - String: Name of the parser. Usually the website name, e.g. "example.com".
    • "steps" - Array: Steps to parse.
  • Each step should have at least the following fields:
    • "name" - String: Unique name of the step. This field makes it possible to access the step's results and dynamic variables in subsequent steps (if needed).
    • "chain_id" - Integer: Steps with the same chain id are executed as a sequence on every iteration.
    • "url" - String: Target URL of the request(s) for the current step.
    • "method" - String: Request method for the current step.
    • "output_path" - String: Path to the result data in the response. Use dots for nested data: for example, if the needed result is in response -> "data" -> "products", "output_path" should be "data.products".
    • "output" - Dictionary:
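
The required fields above can be sketched as a minimal handbook built in Python. This is an illustration only, not Parsify's own code: the URL, the "GET" method value, and the chain id are made-up placeholders, the "output" dictionary is left empty because its contents are not described here, and resolve_output_path is a hypothetical helper that shows how a dotted "output_path" maps onto a nested response.

```python
import json
from functools import reduce


def resolve_output_path(response, output_path):
    """Follow a dot-separated path such as "data.products" through nested dicts.

    Hypothetical helper for illustration; it is not part of Parsify's API.
    """
    return reduce(lambda node, key: node[key], output_path.split("."), response)


# Minimal handbook with only the required fields (all values are placeholders).
handbook = {
    "parser": [
        {
            "scope": "example.com",
            "steps": [
                {
                    "name": "get_products",
                    "chain_id": 1,
                    "url": "https://example.com/api/products",
                    "method": "GET",
                    "output_path": "data.products",
                    "output": {},  # contents not specified in this section
                }
            ],
        }
    ],
}

# Serialized form, suitable for saving as handbook.json.
handbook_json = json.dumps(handbook, indent=2)

# Made-up response shaped like response -> "data" -> "products".
sample_response = {"data": {"products": [{"id": 1}, {"id": 2}]}}
step = handbook["parser"][0]["steps"][0]
products = resolve_output_path(sample_response, step["output_path"])
```

With this sample response, the path "data.products" selects the inner list of products, which mirrors the "output_path" example in the field description.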

License

Distributed under the MIT License. See the LICENSE file for more information.

Contact

Luka Sosiashvili - @lukasanukvari - luksosiashvili@gmail.com

Project Link: https://github.com/lukasanukvari/parsify

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

About

Parsing package for massive multi-step parsing.
