xhujerr/Sumid

Sumid

Motivation:

This program serves as software support for my research project, which focuses on searching for patterns in URLs: I try to find, describe, and quantify them. The program can be categorized as a crawler (spider). The following articles describe experiments with the program:

Program operation - in short

  • Take an input URL from the linklist.

  • Expand it into a tree.

  • Try to find one or more patterns in the input URL.

  • Try to iterate the pattern (there may be about 1000 iterations per URL).

  • Perform an operation (log/download).

  • The program is intended to run for a very long time (about a week). For that reason it is written in a multithreaded manner, with very chatty logging.
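The "find a pattern and iterate it" steps above can be illustrated with a minimal sketch. It assumes the simplest possible pattern, a numeric counter in the URL, and is not taken from sumid.py; the function name `iterate_numeric_pattern` and the example URL are hypothetical.

```python
import re


def iterate_numeric_pattern(url, count=5):
    """Yield `count` candidate URLs by incrementing the last run of digits.

    Hypothetical helper for illustration only: it treats the last number
    in the URL as a counter and ignores every other kind of pattern.
    """
    matches = list(re.finditer(r"\d+", url))
    if not matches:
        return  # no numeric pattern found, so there is nothing to iterate
    last = matches[-1]
    start = int(last.group())
    width = len(last.group())  # preserve zero padding, e.g. 007 -> 008
    for step in range(1, count + 1):
        number = str(start + step).zfill(width)
        yield url[:last.start()] + number + url[last.end():]


urls = list(iterate_numeric_pattern("http://example.com/gallery/img007.jpg", count=3))
for candidate in urls:
    print(candidate)  # a real run would log or download each candidate here
```

In a real run the iteration count would be much higher (the list above mentions about 1000 per URL), and each candidate would be logged or downloaded rather than printed.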

Usage

  • The essential configuration is in the file sumid.ini. Fine-tuning is done in settings.py.

  • Set up at least the linklist parameter.

  • It is also a good idea to set up WorkDir and LogDir.

  • Currently the script probably won't work under Windows because of problems with filesystem paths.
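As a rough illustration of the configuration described above, the sketch below reads the three parameters with Python's standard `configparser`. The section name `[sumid]`, the sample values, and the `load_settings` helper are assumptions made for this example; the actual layout of sumid.ini may differ.

```python
import configparser
from pathlib import Path

# Hypothetical ini content; only the parameter names come from the README.
SAMPLE_INI = """
[sumid]
linklist = links.txt
WorkDir = work
LogDir = logs
"""


def load_settings(ini_text):
    """Parse ini text and return the three paths as pathlib.Path objects."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["sumid"]  # assumed section name
    return {
        "linklist": Path(section["linklist"]),
        # Option lookups are case-insensitive in configparser by default.
        "work_dir": Path(section.get("WorkDir", fallback=".")),
        "log_dir": Path(section.get("LogDir", fallback=".")),
    }


settings = load_settings(SAMPLE_INI)
print(settings["work_dir"] / "output.log")
```

Building paths with `pathlib.Path` instead of string concatenation is one common way to avoid platform-specific separator problems of the kind mentioned for Windows.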

Files and contents

Sumid is more of a toolbox than a single program, and it would be good to split it into several smaller pieces. Its current state is the result of the program evolving to satisfy concrete needs rather than being designed as a product. I am currently exploring the scrapy framework in order to move the core functionality into it.

  • sumid.py - the main program, containing the four classes of the producer/consumer pipeline. Each producer runs in a separate thread.
  • comptree.py - implements a tree structure for exploring web resources.
  • linklist.py - takes care of the input data.
  • miscutil.py - holds settings, debugging, and some miscellaneous functionality.
  • bow.py - implements a bag of words. It analyses URLs and looks for the words with the highest frequencies.
  • sls.py - adapter for the pydigg library. Used to collect links for further experiments.
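To make the bag-of-words idea from bow.py concrete, here is a minimal sketch that counts word frequencies across a list of URLs. It is not the code from bow.py; the function name `url_bag_of_words`, the tokenization rule (runs of ASCII letters), and the example URLs are assumptions.

```python
import re
from collections import Counter


def url_bag_of_words(urls):
    """Count how often each alphabetic word occurs across the given URLs."""
    counts = Counter()
    for url in urls:
        # Assumed tokenization: lowercase the URL and keep runs of letters.
        words = re.findall(r"[a-z]+", url.lower())
        counts.update(words)
    return counts


bag = url_bag_of_words([
    "http://example.com/gallery/photo001.jpg",
    "http://example.com/gallery/photo002.jpg",
    "http://example.org/archive/photo.png",
])

# List the most frequent words first.
for word, frequency in bag.most_common(5):
    print(word, frequency)
```

A fuller analysis would probably drop scheme and top-level-domain tokens such as "http" and "com", since they occur in nearly every URL and carry little information.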

About

Script used for mass items downloading
