This is a CLI tool built for crawling a given web page and downloading all available .pdf files on the page. It comes with handy features, like options that let you select which kind of command to execute at any given time. To improve performance, all retrieved pdf links are cached on the system disk using the file system and invalidated after 24 hours. Secondly, the tool utilises the Node.js thread pool by executing the majority of I/O tasks asynchronously without blocking. In addition, the tool spawns Node.js clusters based on the number of cores of the underlying system, which is also a means to improve concurrency and parallel processing. **Note:** this is for demo purposes only and does not fully cover all the edge cases of a production-ready tool.
1. Generate a list of all available pdf files on a web page and save them as a JSON object for the given url.
2. Download all the retrieved pdf files from the page and save them in a downloads directory at the root of the tool folder; this can be done either concurrently or synchronously.
3. Retrieve a pdf file present on the web page based on a search string and download it.
4. Merge the pdf files into a single file.
5. Translate the pdf file to a target locale.
This project is built using pnpm, so you need to install it globally for some of the scripts, like link. In the project directory, you can run:
Runs the app in development mode by generating the build files in watch mode.
This is helpful for regenerating the JavaScript files on each code change or edit.
Runs the unit tests for the majority of the utility functions.
Builds the tool into the build folder.
npm run link:cli or pnpm link. This script requires the pnpm package manager to be installed on your system first, but you can change it to use npm or yarn instead.
This registers and links the crawler-rpa tool to your PATH environment variable globally, so that you can easily use the tool by typing crawler-rpa <url> [search] [-options]
This unregisters and unlinks the tool, making it unavailable for use.
This invalidates the cache by removing it from the system disk.
This deletes all the downloaded pdf files from the system.
- Clone the repository:
  git clone https://github.com/Onesco/crawler-rpa.git
- Install the dependencies:
  npm install or pnpm install
In order to use the tool after you have cloned it and installed all the dependencies for the project, you need to ensure you have pnpm installed globally; you can just run npm install -g pnpm. This is necessary for the linking stage of the project, which enables you to run the tool on your machine.
- Register and link the cli tool:
  npm run link:cli or pnpm link:cli. This stage makes use of pnpm, so you need to have pnpm installed globally, or better still modify the script to use either npm or yarn instead.
After that you can run:
crawler-rpa -h to get the help menu of the tool and see all the available commands and options.
crawler-rpa -u https://www.google.com/search?q=cardiovascular+risk+factors+pdf&rlz=1C5CHFA_enNG1080NG1081&oq=cardiovascular+risk+factors+pdf&gs_lcrp=EgZjaHJvbWUqBggAEEUYOzIGCAAQRRg7MgYIARBFGDwyBggCEEUYPDIGCAMQRRg80gEIMzA0NWowajeoAgCwAgA&sourceid -conc
This will go through the Google page for this search link and download all the available pdf files.
- -u for short, or --url, is for the url to search; it is optional, so you can remove it and just run crawler-rpa <url link>.
- -conc for short, or --concurrent, ensures that for each found pdf file the download is handled by a thread while the search for the next one continues asynchronously. It is optional; if provided it improves performance, but if not provided the download of the pdf files will only occur after all the pdf files on the page have been seen.
- -s for short, or --search, is used to provide a search keyword, which allows us to get the pdf file[s] for the parent node in which the search keyword is found.
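The difference between the default mode and --concurrent can be sketched with plain promises. This is a simplified model assuming a stand-in download function; in the real tool each download is handed to the thread pool.

```javascript
// Default mode: each file is downloaded only after the previous one finishes
// (and, in the tool, only after the whole page has been scanned).
async function downloadSequential(links, download) {
  const results = [];
  for (const link of links) {
    results.push(await download(link));
  }
  return results;
}

// --concurrent mode: every download starts as soon as its link is known,
// and we wait for all of them together.
async function downloadConcurrent(links, download) {
  return Promise.all(links.map((link) => download(link)));
}
```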
% crawler-rpa -h
Usage: crawler-rpa [options] [command] <url> [search]
this crawler requires a url argument for a website in order to process it and retrieve all available pdf files on the
page, and also download them either after all pdf files have been retrieved or concurrently as each pdf file is seen.
It also has an internal caching mechanism that saves all retrieved links to disk with a ttl of 24 hours, after which
a further query to the provided website will invalidate the cache and make a fresh query for the given url.
It also has an option that allows you to merge a list of pdf files into one by providing the file paths to the pdf files
Options:
-u --url the url of the site to process
-s --search <string> the string to search for pdf files on the site
-conc --concurrent a boolean flag that, when provided, executes both the retrieval of pdf files
and the download of the pdf files concurrently
-t --ttl <number> the time to live (in seconds) of retrieved pdf files links from a given web page
(default: 86_400 seconds that is 24 hours)
-sa --save-as <string> json file name under which to save the cached retrieved pdf links in the cached-crawled-pdf-links
directory; if absent, defaults to pdf-links
-f --from <string> the locale of the pdf file (default "auto"). It can be any of the following
supported locales: en,nl,kr,
-ta --translated-as <string> the name you want to save the translated file as; if not provided, it defaults to
old-filename-<to>.pdf
-h, --help display help for command
Commands:
merge <files> [merge-as] This command will merge a list of pdf files into a single pdf file. The pdf files
must be provided as the first argument, as a string separated by a comma delimiter.
The second argument is optional and, when provided, will be the name of the merged
file
translate [options] <file> <to> translate PDF file text only to a provided locale, based on the Google Translation
API. The supported locales to translate to are listed at
<https://cloud.google.com/translate/docs/languages>
crawler-rpa merge [options]
this command is used for merging pdf files
run crawler-rpa merge -h to see the help menu at the merge command level
crawler-rpa merge "first-pdf-file-path.pdf, second-pdf-file-path.pdf, third-pdf-file-path.pdf" "merge-output-name"
% crawler-rpa merge -h
Usage: crawler-rpa merge [options] <files> [merge-as]
This command will merge a list of pdf files into a single pdf file. The pdf files must be provided as the first
argument, as a string separated by a comma delimiter. The second argument is optional and, when provided, will be the
name of the merged file
Arguments:
files string of file paths to the pdf files to be merged, separated by a comma delimiter [","], for example
"myfirstpdffile.pdf,mysecondpdffile.pdf"
merge-as the name you want to merge the files to; if not provided, it defaults to a string of all the file
names delimited by "__" between each name
Options:
-h, --help display help for command
crawler-rpa translate [options]
this command is used for translating pdf files
run crawler-rpa translate -h to see the help menu at the translate command level
crawler-rpa translate "first-pdf-file-path.pdf" "fr" -ta "my-first-pdf-file-in-fr"
% crawler-rpa translate -h
Usage: crawler-rpa translate [options] <file> <to>
translate PDF file text only to a provided locale, based on the Google Translation API. The supported locales to
translate to are listed at <https://cloud.google.com/translate/docs/languages>
Arguments:
file string path to the pdf file to be translated, e.g. "mysecondpdffile.pdf"
to the target locale to translate the pdf file to (default "en"). It can be any of the
following supported locales: en,nl,pt; see
<https://cloud.google.com/translate/docs/languages> for supported locales
Options:
-f --from <string> the locale of the pdf file (default "auto"). It can be any of the following supported
locales: en,nl,kr,
-ta --translated-as <string> the name you want to save the translated file as; if not provided, it defaults to
old-filename-<to>.pdf
-h, --help display help for command