Skip to content

Latest commit

 

History

History
30 lines (22 loc) · 2.7 KB

File metadata and controls

30 lines (22 loc) · 2.7 KB

DatabaseTechnology

Project for Database Technology class. Partially described in dbtproj.h.

We had to implement 4 algorithms that exist in a typical DBMS. These algorithms were:

  1. MergeSort (external - should work for much more data than the available memory)
  2. MergeJoin
  3. EliminateDuplicates
  4. HashJoin

In MergeJoin we suppose that we have at least 4 blocks of available memory, and no more than 128 records can have the same key. That is because in MergeJoin for each record of file 1, we output all records of same key from file 2. If we reach the end of the current block from file 2, we gotta keep going to find more possible records that satisfy the join condition. This could keep on going forever, so we limit it by saying that we will only read one more block (maximum 128 records that satisfy the join condition - 128 is the default size for a block).

You can generate any data you like randomly, the only condition is the above. No more than 128 records (or block's max size) with the same key, or else the join algorithms will loop infinitely :).

Here are some benchmarks:

Memory available in blocks Size of Input File 1 in blocks Size of Input File 2 in blocks MergeSort for file 1 MergeSort for file 2 MergeJoin Eliminate Duplicates for file 1 HashJoin
400 7664 Valid Records: 799137 16812 Valid Records: 1747161 2.6s 6.1s 9.3s 2.4s 60.5s
4000 Same Same 2.3s 5.3s 8s 2.5s 6.6s
4000 239030 (1.60GB) Valid Records: 24796769 161413 (1.08GB) Valid Records: 16749615 215.6s 113.5s 337.6s 237.4s 1946s
40000 Same Same 158.8s 100.4s 288.5s 180.8s 237.7s

Algorithm implementations in dbtproj.cpp.
Functions related to hash join in hash.cpp.
Functions related to merge join in merge.cpp.
Functions to compare records based on a field in compare.cpp.
Read and write operations in disk_io.cpp.

To compile, you have to add -std=c++11 in your compiler's options/flags you name it.