jiasheng-li089/COSC431_Assignment01
This file describes how to compile this project, how to run it, the implementation of this project, and the functionality of each source file.

For your information:
- The inverted index file was generated on the x64 architecture. It cannot be used on other architectures (such as ARM/aarch64) because of endianness and implementation details. Ideally, the inverted index file should only be used on 64-bit Linux on x64; that is the only setup I have verified.
- Any document that contains at least one of the query words is regarded as a match.
- The search command uses weighted TF.IDF ranking to calculate scores.

1. How to compile this project.

In short, one command compiles the whole project (including all three commands):

```
make clean && make all
```

The `make` command uses the GNU g++ compiler to compile the source files into executable files. So far I have only verified it with GNU g++ 13.3.0. In theory, other versions of GNU g++ may also work, but this is not guaranteed.

After compiling, three executable files are generated in the newly created `build` directory: `parser`, `indexer`, and `searcher`.
- `parser`: parses all the documents and their content from the file `wsj.xml`.
- `indexer`: creates the inverted index file based on `wsj.xml`.
- `searcher`: searches for words in the generated inverted index file. When a query contains multiple words, any document that contains at least one of them is regarded as a match.

ATTENTION!!! DO NOT uncomment the following line in `common.h`:

```
// #define __DEBUG__ 1
```

Otherwise, the generated executables will print a large amount of debug information to the standard output.

2. How to run this project.
The most straightforward way to run the three generated executables is:

```
parser <path to wsj.xml>
indexer <path to wsj.xml> <path to generated index file>
echo query1 query2 query3 | searcher <path to generated index file>
```

Explanation of the arguments for these commands:
- <path to wsj.xml>: for `parser` and `indexer`, the absolute or relative path to `wsj.xml`.
- <path to generated index file>: for `indexer`, the absolute or relative path to the target file where the inverted index will be written.
- <path to generated index file>: for `searcher`, the absolute or relative path to the generated inverted index file. This argument is optional; if it is not provided, `searcher` uses `index.idx` in the current directory as the default inverted index file.

3. The implementation details of this project.

The `indexer` generates the inverted index file based on `wsj.xml`. The format of the inverted index file is as follows:
- Meta information
- L1 index
- L2 index
- Vocabularies
- Document records
- Document names
- Posting lists

Except for the vocabulary and document name sections, the structures of the other sections are defined in the source file `index/index_data.hpp`. When `searcher` runs, it loads the meta information first. Because the L1 index is small enough, it is loaded into memory together with the meta information. Then `searcher` uses multiple threads to load the vocabulary, the document records, and the document names concurrently, based on the size of each section (if the L1 index were too large, it could be loaded this way, too). After that, `searcher` accepts queries from the standard input. The search process is as follows:

Step 1: If there are multiple query terms, create a new thread for each term. Otherwise, just search on the current thread.
Step 2: Binary search the in-memory L1 index to find either the posting location directly or the range of the L2 index to inspect.
If the posting location is found, go to step 5. Otherwise, go to step 3.
Step 3: Load the relevant part of the L2 index into memory based on the location and size obtained in step 2.
Step 4: Binary search the partial L2 index to find the posting location. If it is not found, return an empty posting list.
Step 5: Load the relevant part of the posting list into memory based on the results of the steps above and return it.
Step 6: If there are multiple query terms, merge the posting lists from each thread, convert each entry in the posting lists into a search result, and calculate their scores.
Step 7: Sort the search results by score in descending order.
Step 8: Print the search results to the standard output.

3.1 Optimization.

To improve search performance, I used several tricks:
1. Since the bottleneck is disk I/O, I implemented an efficient file reader that avoids copying data from kernel space to user space while reading files.
2. When initially loading the vocabulary, document records, and document names, I used multithreading to load them concurrently.
3. When searching for several words, I used multithreading to search for each word in parallel.
4. To reduce the overhead of rebuilding data structures in memory, I shifted that work into the index-building step. While building the index, all the data for each section is dumped from memory directly; while searching, this data is reinterpreted as pointers to the corresponding structures.

However, there may be a problem when an inverted index file generated on one platform is loaded on another platform, because of endianness (this implementation depends heavily on raw offsets).

4. The functionality of the source files.

`parser_main.cpp`: the entry file of the `parser`.
`indexer_main.cpp`: the entry file of the `indexer`.
`search_main.cpp`: the entry file of the `searcher`.
`parser/parser.cpp`: the main logic of the `parser`.
`parser/doc_record.cpp`: the parsed document records.
`index/index_data.hpp`: the data structures of the sections in the inverted index file.
`index/index_builder.cpp`: the main logic of the `indexer`.
`index/index_loader.cpp`: the main logic of the `searcher`.
`io/e_file_reader.cpp`: an efficient file reader that avoids copying data from kernel space to user space while reading files.
`io/index_file_writer.cpp`: a file writer that writes data into each section of the inverted index file.
`utils/mm_chunk.cpp`: memory chunk management that allocates memory to store the unique words during parsing.