jiasheng-li089/COSC431_Assignment01
This file describes how to compile this project, how to run it, the implementation of this project, and the functionality of each source file.

For your information:
- The inverted index file was generated on the x64 architecture. It cannot be used on other architectures (such as ARM/aarch64) because of endianness and implementation details. Ideally, the inverted index file should only be used on 64-bit Linux on x64; that is the only setup I have verified.
- Any document that contains at least one of the query words is regarded as a match.
- The search command uses weighted TF.IDF ranking to calculate scores.

1. How to compile this project.

In short, one command compiles the whole project (including all three commands):

```
make clean && make all
```

The `make` command uses the GNU g++ compiler to compile the source files into executable files. So far I have only verified it with GNU g++ 13.3.0. In theory, other versions of GNU g++ may also work, but this is not guaranteed.

After compiling, three executable files are generated in the newly created `build` directory: `parser`, `indexer`, and `searcher`.
- `parser`: parses all the documents and their content from the file `wsj.xml`.
- `indexer`: creates the inverted index file based on `wsj.xml`.
- `searcher`: searches for words in the generated inverted index file. When a query contains multiple words, any document that contains at least one of them is regarded as a match.

ATTENTION!!! DO NOT uncomment the following line in `common.h`:

```
// #define __DEBUG__ 1
```

Otherwise, the generated executables will print a large amount of debug information to the standard output.

2. How to run this project.
The most straightforward way to run the three generated executables is:

```
parser <path to wsj.xml>
indexer <path to wsj.xml> <path to generated index file>
echo query1 query2 query3 | searcher <path to generated index file>
```

Explanation of the arguments for these commands:
- <path to wsj.xml>: for `parser` and `indexer`, the absolute or relative path to `wsj.xml`.
- <path to generated index file>: for `indexer`, the absolute or relative path to the target file where the inverted index will be written.
- <path to generated index file>: for `searcher`, the absolute or relative path to the generated inverted index file. This argument is optional; if it is not provided, `searcher` uses `index.idx` in the current directory as the default inverted index file.

3. The implementation details of this project.

The `indexer` generates the inverted index file based on `wsj.xml`. The format of the inverted index file is as follows:
- Meta information
- L1 index
- L2 index
- Vocabularies
- Document records
- Document names
- Posting lists

Except for the vocabulary and document name sections, the structures of the other sections are defined in the source file `index/index_data.hpp`. When `searcher` runs, it loads the meta information first. Because the L1 index is small enough, it is loaded into memory together with the meta information. Then `searcher` uses multiple threads to load the vocabulary, the document records, and the document names concurrently, based on the size of each section (if the L1 index were too large, it could be loaded this way, too). After that, `searcher` accepts queries from the standard input. The search process is as follows:

Step 1: If there are multiple query terms, create a new thread for each term. Otherwise, just search on the current thread.
Step 2: Binary search the in-memory L1 index to find either the posting location directly or the range of the L2 index to inspect.
If the posting location is found, go to step 5. Otherwise, go to step 3.
Step 3: Load the relevant part of the L2 index into memory based on the location and size obtained in step 2.
Step 4: Binary search the partial L2 index to find the posting location. If it is not found, return an empty posting list.
Step 5: Load the relevant part of the posting list into memory based on the results of the steps above and return it.
Step 6: If there are multiple query terms, merge the posting lists from each thread, convert each entry in the posting lists into a search result, and calculate their scores.
Step 7: Sort the search results by score in descending order.
Step 8: Print the search results to the standard output.

3.1 Optimization.

To improve search performance, I used several tricks:
1. Since the bottleneck is disk I/O, I implemented an efficient file reader that avoids copying data from kernel space to user space while reading files.
2. When initially loading the vocabulary, document records, and document names, I used multithreading to load them concurrently.
3. When searching for several words, I used multithreading to search for each word in parallel.
4. To reduce the overhead of rebuilding data structures in memory, I shifted that work into the index-building step. While building the index, all the data for each section is dumped from memory directly; while searching, this data is reinterpreted as pointers to the corresponding structures.

However, there may be a problem when an inverted index file generated on one platform is loaded on another platform, because of endianness (this implementation depends heavily on raw offsets).

4. The functionality of the source files.

`parser_main.cpp`: the entry file of the `parser`.
`indexer_main.cpp`: the entry file of the `indexer`.
`search_main.cpp`: the entry file of the `searcher`.
`parser/parser.cpp`: the main logic of the `parser`.
`parser/doc_record.cpp`: the parsed document records.
`index/index_data.hpp`: the data structures of the sections in the inverted index file.
`index/index_builder.cpp`: the main logic of the `indexer`.
`index/index_loader.cpp`: the main logic of the `searcher`.
`io/e_file_reader.cpp`: an efficient file reader that avoids copying data from kernel space to user space while reading files.
`io/index_file_writer.cpp`: a file writer that writes data into each section of the inverted index file.
`utils/mm_chunk.cpp`: memory chunk management that allocates memory to store the unique words during parsing.