
/******************************************************************************
* Project 4: Gerp
* README
* Authors: Daniel Jarka (djarka01) and Paul Wang (pwang12)
******************************************************************************/

B. Program Purpose:

The program is a search engine for text files. It takes in commands and,
for a given directory, writes the locations of each appearance of a
queried word to an output file. It supports case-insensitive searches,
where capitalization doesn't matter, and case-sensitive searches, where
only the exact word is matched. Additionally, the program can switch to
a new output file if asked.

C. Acknowledgements:

Nadia the TA

We used the cctype reference:
https://cplusplus.com/reference/cctype/

D. Files:

 main.cpp:
      The main driver file, which creates an instance of the gerp class
      and executes the program

 README:
      this file

 Makefile:
      The build file that compiles the working C++ files into a runnable
      executable

 gerp.h:
      The interface for the main program class, gerp. Contains query 
      function and all search commands

 gerp.cpp:
      The implementation of the gerp class

 HashMap.h:
      The interface to the hashmap class; contains public functions to
      insert and retrieve hashmap values and locations

 HashMap.cpp:
      The implementation of the hashmap class

 Sensitivehash.h:
      The interface of the sensitive hash map class. This class is used
      as the hashmap for the case-sensitive buckets inside each bucket
      of the larger hashmap.

 SensitiveHash.cpp:
      The implementation of the sensitive hashmap class

 unit_test.h:
      A unit testing file for the processing functions and for testing
      the initial hashmapping and gerp. Functions were made public for
      testing, so the tests are now all commented out.

 file1.in:
      A testing file used to test basic queries, punctuation, and
      non-alphanumeric characters.
 
 file2.in:
      Testing file used to test the sensitive, insensitive, and new-file
      commands

 file3.in:
      Testing file used to test tricky text like equations and hyperlinks

E. Compile/run:

 Place all of the relevant C++ files listed above and the Makefile in
 one directory, then run "make" in that directory to compile.

 To run:
      The program takes two command-line arguments after the executable
      name: write "./gerp", then the name of the directory to search,
      then the name of the output file that will initially be written to.
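
 For example (the directory and output file names here are illustrative;
 tinyData is the small test directory mentioned in section H):

      make
      ./gerp tinyData output.txt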

F. Architectural Overview:

 To start, the entire program runs through the gerp class. In the gerp
 constructor, we build the hashmap and file directory. 
 
 The hashmap itself contains a pointer to another hashmap in each of its
 buckets. The top-level hashmap is built using the hashmap class, where the 
 buckets contain the sensitive hash class. Both hash map classes contain a 
 bucket struct with information that aids in both indexing and querying
 words.
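
 A minimal sketch of this two-level layout (the struct and field names
 here are illustrative guesses, not the exact declarations in HashMap.h
 and Sensitivehash.h):

      #include <string>
      #include <vector>

      class SensitiveHash;  // second-level map, from Sensitivehash.h

      // One occurrence of a word: which file and which line it is on.
      struct line_loc {
          int file_index;
          int line_num;
      };

      // Second-level bucket: one exact casing and all of its locations.
      struct CaseBucket {
          std::string word;                 // exact-case key, e.g. "Keeps"
          std::vector<line_loc> locations;  // every occurrence of this casing
      };

      // Top-level bucket: one lowercased key and the sensitive hash that
      // separates its casings ("keeps", "Keeps", "KEEPS", ...).
      struct GeneralBucket {
          std::string lower_word;   // lowercased key, e.g. "keeps"
          SensitiveHash *cases;     // hashmap over the exact casings
      };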

 For the file directory data, we used the FSTree class to build an n-ary
 tree that represents the working directories. Each DirNode is a node of
 the FSTree and holds the data needed to traverse the tree and build the
 path to each file. The file information is stored in the gerp class as
 a private member variable: a vector of directory structs. Each struct
 has a directory id, a vector that contains all of the lines in the
 file, and a string that holds the filepath to the file, as sketched
 below.
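
 The per-file record described above might look like this (member names
 are illustrative, not the exact declarations in gerp.h):

      #include <string>
      #include <vector>

      // One entry per file found during the FSTree traversal.
      struct directory {
          int id;                          // the file's directory id
          std::vector<std::string> lines;  // every line of the file, in order
          std::string filepath;            // full path printed with results
      };

      // Held in the gerp class as a private std::vector<directory>.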

 The query loop is located in the gerp class and, depending on the
 command, it will execute a case-sensitive or case-insensitive search.
 For a sensitive search, it first hashes through the hashmap class to
 find the general bucket for that key. In this bucket, the key is sent
 to the sensitive hash, which hashes the key again and maps it to a
 case-unique bucket if one exists. For an insensitive search, the
 program only hashes to the general bucket and returns a pointer to all
 of the case buckets that underlie this general bucket.

G. Data Structures/ADT & Algorithms:

The primary ADTs used in this program are the hash map, the vector, and
the unordered set.

Vectors:

The first use is to store the file location data and the actual lines as
strings. One use of vectors stores directory structs, where each element
is a file with its respective data. Each file also has its own vector,
which stores the lines of that file. To access a line, we go to the
outer vector, index to the file, and then index to the specific line by
using the line number as the index into the inner vector.
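
In code, recovering the text of one occurrence is two constant-time
index operations. A sketch using the structs outlined in section F
(whether line numbers are stored 0- or 1-indexed is an implementation
detail):

      #include <iostream>
      #include <vector>

      void printOccurrence(const std::vector<directory> &files,
                           const line_loc &loc) {
          const directory &d = files[loc.file_index];  // outer vector: file
          std::cout << d.filepath << ":" << loc.line_num << ": "
                    << d.lines[loc.line_num] << "\n";  // inner vector: line
      }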

Both the Hashmap and Sensitive Hash use pointers to buckets to represent
the table. This table is accessed like an array, so it functions like an
array of buckets. These buckets in the Hashmap contain a pointer to a 
sensitive hash.

The sensitive hash is the crux of the hashmap and hash function. Each of
its buckets contains a vector of the location structs for the key that
mapped to that specific bucket. This allows us to store all of the
occurrences of a particular word in a directory without taking up too
much space.

The reason we used a vector is its constant-time access to elements.
Especially with the file location data vector, our lookup time remains
constant. Additionally, to avoid an O(n) insertion cost from shifting
other values, we always inserted elements at the back.

Hashmap:

Our hashmap, while a bit complex, was the most effective way of checking
for instances of a queried word. Our first hashmap uses a hash function
where the key is the lowercased version of the queried word. This maps
to a bucket that contains a hashmap holding all case variants of the
word, and each sensitive case has its own bucket it maps to. In a
sensitive search, the program uses the sensitive hash function with the
original queried word as the key to find the bucket. In an insensitive
search, it simply iterates through the vector and prints out all of the
information in each bucket.
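
A sketch of the two hash passes (std::hash stands in for whatever hash
function the tables actually use, and this body of lowerWord is a
plausible reconstruction, not our exact code):

      #include <cctype>
      #include <functional>
      #include <string>

      // Lowercase a copy of the word; the general table hashes this form.
      std::string lowerWord(std::string word) {
          for (char &c : word)
              c = std::tolower(static_cast<unsigned char>(c));
          return word;
      }

      // First pass: index into the general (case-insensitive) table.
      size_t generalIndex(const std::string &word, size_t tableSize) {
          return std::hash<std::string>{}(lowerWord(word)) % tableSize;
      }

      // Second pass: index into that bucket's sensitive table, keyed on
      // the exact casing of the queried word.
      size_t sensitiveIndex(const std::string &word, size_t tableSize) {
          return std::hash<std::string>{}(word) % tableSize;
      }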

Unordered Sets:

A set is a data structure that stores at most one copy of each element.
We used the unordered variant so that no time is wasted trying to keep
the insertions sorted. A set is useful for making sure a line is not
output twice. In a vector of sets, the vector index represents the file
number, and our program inserts the line number of the respective file
into that file's set. This allows our program to check whether a line
has already been output, even when accessing different buckets.


Algorithms:

InsensitiveSearch():

This function was deceptively tricky. Using the hash function, we can
easily access the right bucket with all the insensitive cases. However,
printing out all the lines is not straightforward. Beyond iterating
through the entire vector of buckets, we also have to make sure that a
line is not printed twice, which means accounting for different
sensitivity cases within the same general case. For example, in
largeGutenberg we had to account for "keeps" when "Keeps" and "keeps"
were on the same line. To do so, we implemented a vector of sets to
track whether a line had been printed yet during the query, inserting
each line location as it is printed. This ensures that even when we are
iterating through separate buckets, no line is repeated.
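
The bookkeeping can be sketched as follows (names are illustrative, and
files, bucket, and printOccurrence refer to the earlier sketches):

      #include <unordered_set>
      #include <vector>

      // One set per file: printed[f] holds the line numbers already
      // output during the current query.
      std::vector<std::unordered_set<int>> printed(files.size());

      // Inside the loop over every case bucket under the general bucket:
      for (const line_loc &loc : bucket.locations) {
          // insert() returns {iterator, bool}; the bool is false when an
          // earlier casing already printed this line on this query.
          if (printed[loc.file_index].insert(loc.line_num).second)
              printOccurrence(files, loc);
      }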


ContainsS():

This function tests whether a key's bucket is contained in our hashmap.
For a sensitive case, this requires the program to check the larger
insensitive bucket and then the sensitive hash buckets as well. The
complexity here is interesting, since there are two "contains" passes
that must run. First, the program checks if a bucket exists for the
lowercased queried word in the general hash map. If it does, the program
must then check if a bucket for the original casing of the queried word
exists in the sensitive hash map. This took some time to get right, and
it is certainly not the most time efficient, but it was necessary for
our design.
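
In outline, the two passes look like this (findGeneral and contains are
assumed helper names, not our actual functions; the types come from the
sketch in section F):

      // Sensitive "contains": both levels must hold a bucket for the word.
      bool ContainsS(const std::string &word) {
          GeneralBucket *general = findGeneral(lowerWord(word));  // pass 1
          if (general == nullptr)
              return false;                      // no lowercase bucket at all
          return general->cases->contains(word); // pass 2: exact casing
      }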

Pros and Cons:

Our implementation and use of embedded hash maps comes with a variety of
pros and cons. Our design is generally memory efficient because we only
store each line occurrence one time. We also use one consistent hash
map, so there is no problem of repeatedly storing multiple instances of
the same element. By design, our querying is extremely fast because the
program is always a few constant-time operations away from directly
accessing a key's bucket and then printing out all of the lines
associated with that key. Our design does come with some redundancies,
however, such as how our "contains" function essentially performs the
same operations as our search functions, only returning a different
value. We believe there is room for further optimization of this design,
but it is currently very functional and quite fast (especially for
querying).

H. Testing:

 1. unit_test
 
 We used the unit testing framework to test the following:

 a. Directory Traversal
      Tested with directories of varying depth and structure, including:
      Empty directories.
      Directories with nested subdirectories.
      Mixed directories containing files and subdirectories.
      Verified that the directory paths and file names were correctly 
      extracted using the direct_traverse function.

 b. Hash Map Initialization
      Checked:
      Words were correctly hashed and inserted into the primary hash map.
      The sensitive hash map correctly stored multiple cases of the same 
      word.
      Collisions in hash map buckets were handled without overwriting 
      existing data.

 c. Struct Data Integrity
      directory struct: Ensured accurate storage of file paths, file 
      indices, and line data.
      line_loc struct: Checked proper linking of line numbers, file 
      indices, and word occurrences.

 d. Edge Cases
      Tested input strings with:
      Non-alphanumeric characters at the edges.
      Case variations (e.g., Word, WORD, word).
      Words appearing on the same line in different cases.
      Ensured proper functioning of the stripNonAlphaNum and lowerWord
      functions (the stripping behavior is sketched after this list).
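
 The stripping behavior under test can be sketched as follows, using the
 <cctype> functions cited in section C (the real implementation may
 differ):

      #include <cctype>
      #include <string>

      // Remove non-alphanumeric characters from the front and back of a
      // word; interior characters are left alone.
      std::string stripNonAlphaNum(std::string word) {
          size_t start = 0;
          while (start < word.size() &&
                 !std::isalnum(static_cast<unsigned char>(word[start])))
              start++;
          size_t end = word.size();
          while (end > start &&
                 !std::isalnum(static_cast<unsigned char>(word[end - 1])))
              end--;
          return word.substr(start, end - start);
      }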

2. diff output testing

 To confirm the overall functionality of our gerp implementation, we used
 a variety of testing methods that mainly compared our program's outputs
 against the reference. 

 a. tinyData Input
 We initially used the tinyData directory to query both case-sensitive
 and case-insensitive words to ensure functionality on a small sample
 size. We also tested words that were not present in the files, present
 in multiple files and lines, and repeated multiple times on the same
 line, for both case-sensitive and case-insensitive searches.
 
 b. largeGutenberg Input
 After confirming functionality on tinyData, we ran gerp on largeGutenberg
 to test our program's handling of hundreds of thousands of lines, and on
 input files with repetitive content to test hash map efficiency.
 We also used gerp_perf to measure runtime and RAM usage.

 c. Collision and Linear Probing
 Because we used a linear probing model for our hashmap, we used cout
 statements to track the behavior of insertion and retrieval when a key
 would hash to an index that was already filled with a different key
 (the probing loop is sketched after this list).

 d. Input files
 Finally, we constructed a few input files to test how our program handled
 file input for different queries. The specific content and nature of these
 testing files is described further above.
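
 For reference, the probing loop we traced looks roughly like this (a
 minimal outline; table stands for the class's array of bucket pointers):

      // Linear probing: on a collision, step forward one slot at a time,
      // wrapping at the end, until the key or an empty slot is found.
      size_t probe(const std::string &key, size_t tableSize) {
          size_t index = std::hash<std::string>{}(key) % tableSize;
          while (table[index] != nullptr && table[index]->key != key)
              index = (index + 1) % tableSize;
          return index;  // slot holding key, or first empty slot
      }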

I. Time Spent:

 22 hours
