
/******************************************************************************
* Project 3: Zap
* README
* Author: Daniel Jarka djarka01
******************************************************************************/

B. Program Purpose:

 Zap is a compression program for compressing and decompressing text
 files using the Huffman coding algorithm. The client has no access to the
 underlying logic that compresses and decompresses text files, but they can
 run two commands from the terminal: "zap" for compression and "unzap"
 for decompression. The client must additionally specify the input
 file to read from and the output file to which the compressed or
 decompressed text is written.

C. Acknowledgements:

I used the following reference to read about pairs:
https://cplusplus.com/reference/utility/pair/pair/

This reference for priority queues:
https://cplusplus.com/reference/queue/priority_queue/

And this reference for using the peek function:
https://cplusplus.com/reference/istream/istream/peek/

D. Files:

 HuffmanCoder.cpp:
      Implementation of the HuffmanCoder class. Used to compress and
      decompress text files using the Huffman coding algorithm.

 HuffmanCoder.h:
      Interface of the HuffmanCoder class.

 main.cpp:
      Driver file which can be interacted with by the user to create an
      instance of HuffmanCoder.

 unit_test.h:
      A unit testing file for the HuffmanCoder functions. Because these
      functions are private, the unit tests are commented out; they were
      run with the functions temporarily made public.

 Makefile:
      File used to run the "make" command which compiles and builds
      the program.

 README: 
      This file.

 empty.txt:
      Input file used to test program behavior when opening an empty file.
 
 empty_nline.txt:
      Input file used to test program behavior when reading a file with a
      single new line character.

 single_char.txt:
      Input file used to test program behavior when reading a file that
      only contains a single character.

 all_conll_english.txt:
      Input file used to test program behavior for large input.

 banana.txt:
      Input file used to test basic program behavior.

 banana_apple.txt:
      Input file used to test basic program behavior.

 ecoli.coli:
      Input file used to test program behavior for large non-.txt input.

 hi.txt:
      Input file used to test basic program behavior.

 sentences.txt:
      Input file used to test basic program behavior.

 works_of_shakespeare.txt:
      Input file used to test program behavior for large input.

E. Compile/run:

 - Compile using
        make
 - Run the executable with
        ./zap [zap | unzap] inputFile outputFile

F. Data Structure:

 The primary ADTs of this implementation are a List, a Priority Queue,
 and a Binary Tree. To implement these ADTs, an array, a std::vector, and
 a Huffman tree data structure were used, respectively.

 I used a standard array to store character frequencies and
 the binary encodings for those characters. I decided on using an array
 because we know that there are 256 total ASCII encodings. Because every
 character corresponds to an integer value from 0 to 255, every index in
 the array is in one-to-one correspondence with the ASCII value of a
 character. There is no reason to add or remove elements from
 the array, so the important operations lie in accessing elements and
 manipulating their values. For a standard array, this occurs in constant
 time. An array of this form would also be useful for a problem to count
 how many characters do not appear in a text file, which uses an algorithm
 with very similar logic to counting frequencies. Additionally, this type
 of array would be useful for counting the traffic of the stores
 in a mall. Since the number of stores remains constant, an algorithm
 could count the people coming in/out of each store and save that
 information in an array.
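
 As a minimal sketch of this idea (not the exact code in HuffmanCoder.cpp,
 and with hypothetical names), counting frequencies with a 256-entry array
 could look like this:

  #include <array>
  #include <istream>

  // Minimal sketch: count how often each ASCII character appears in a stream.
  std::array<int, 256> countFrequencies(std::istream &input) {
      std::array<int, 256> freqs{};  // zero-initialized, one slot per ASCII value
      char c;
      while (input.get(c)) {
          // The character's ASCII value is its index in the array.
          freqs[static_cast<unsigned char>(c)]++;
      }
      return freqs;
  }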

 I used a std::vector data structure to create my priority queue, which in
 turn was used to assist in the process of building my huffman tree.
 The huffman coding algorithm finds the two character nodes with the
 smallest frequencies and joins them together with a parent node whose
 frequency is the sum of its children's frequencies. Then, the next two
 character nodes with the smallest frequencies (now including the parent)
 are joined together by a new parent. This continues until there are
 no more nodes to join together, leaving behind a huffman binary tree.
 The priority queue is useful for this problem because it is implemented
 as a min heap, where the top element is always the minimum value in
 the queue. This trivializes the issue of finding the smallest-frequency
 character nodes and joining them with a parent node. A priority queue
 data structure can also be useful in an emergency room setting where
 patients need to be tracked by level of priority. Someone near death
 would be much higher on the queue than someone with a broken finger.
 Additionally, a priority queue can also be used for a store that has
 different tiers of membership. "Gold" membership status receives higher
 priority on the queue than "Standard" membership status.
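
 As a small illustration of that min-heap behavior (a hypothetical example,
 not the project's actual node type), std::greater flips std::priority_queue
 from its default max-heap ordering, so top() always returns the entry with
 the smallest frequency:

  #include <functional>
  #include <iostream>
  #include <queue>
  #include <utility>
  #include <vector>

  int main() {
      // Min heap of (frequency, character) pairs: std::greater reverses the
      // default max-heap ordering, so the smallest frequency is always on top.
      std::priority_queue<std::pair<int, char>,
                          std::vector<std::pair<int, char>>,
                          std::greater<std::pair<int, char>>> pq;
      pq.push({3, 'a'});   // frequencies taken from "banana"
      pq.push({2, 'n'});
      pq.push({1, 'b'});
      std::cout << pq.top().second << "\n";  // prints 'b', the least frequent
      return 0;
  }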

 I used a binary tree data structure to encode and decode text using the
 huffman coding algorithm. With a built tree, the path traversals to each
 leaf create the unique character encodings for each character contained
 in the tree. These traversals are also used to create a tree serialization
 which allows for a decoding algorithm to know what the huffman tree is
 supposed to look like after decoding. When decoding, after building a 
 huffman tree from a serialized string, the tree is then used to
 reconstruct the original text by traversing the tree according to the
 binary text (accessing the left child when reading "0" or a right child
 when reading "1" until a leaf node is reached). Binary trees have a lot
 of very useful implementations for other problems. They can be implemented
 as BSTs which are very useful for storing and finding sorted information.
 They can be used for dictionaries to quickly find the node containing
 that word and its corresponding definition. 
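
 A brief sketch of that decoding traversal, assuming an illustrative Node
 type rather than the actual HuffmanCoder interface:

  #include <string>

  // Illustrative Huffman tree node: leaves hold characters.
  struct Node {
      char value;
      Node *left  = nullptr;
      Node *right = nullptr;
      bool isLeaf() const { return left == nullptr && right == nullptr; }
  };

  // Walk the tree bit by bit: '0' goes left, '1' goes right; emit the
  // character whenever a leaf is reached, then restart from the root.
  std::string decode(const Node *root, const std::string &bits) {
      std::string text;
      const Node *curr = root;
      for (char bit : bits) {
          curr = (bit == '0') ? curr->left : curr->right;
          if (curr->isLeaf()) {
              text += curr->value;
              curr = root;
          }
      }
      return text;
  }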

 The huffman coding algorithm is evidently the most important algorithm of
 this implementation. Its complexity under the hood is difficult to examine
 by simply looking at my implementation, so it's useful to discuss the
 process. First, a priority queue is populated with nodes that correspond
 to each character in the text. Next, the two smallest nodes are
 iteratively joined together with a parent node until only one (root) node
 remains. The time complexity of this algorithm is O(n log n), where n is
 the number of nodes created and each insertion into (or removal from) the
 priority queue costs O(log n).
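
 A sketch of that build loop, again assuming an illustrative Node type and
 a min-heap priority queue keyed on frequency rather than the exact types
 in HuffmanCoder.cpp:

  #include <queue>
  #include <vector>

  struct Node {
      int frequency;
      char value;
      Node *left;
      Node *right;
  };

  // Orders Node pointers so the priority queue behaves as a min heap on frequency.
  struct NodeGreater {
      bool operator()(const Node *a, const Node *b) const {
          return a->frequency > b->frequency;
      }
  };

  // Repeatedly join the two least frequent nodes under a new parent until a
  // single node remains; that node is the root of the finished Huffman tree.
  Node *buildTree(std::priority_queue<Node *, std::vector<Node *>, NodeGreater> &pq) {
      while (pq.size() > 1) {
          Node *first = pq.top();  pq.pop();
          Node *second = pq.top(); pq.pop();
          Node *parent = new Node{first->frequency + second->frequency, '\0',
                                  first, second};
          pq.push(parent);
      }
      return pq.empty() ? nullptr : pq.top();
  }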

 Encoding characters is also quite important, where the huffman code for
 each character is generated recursively using the huffman tree. This tree
 is traversed using a depth-first search, assigning 0 for left branches
 and 1 for right branches. These codes are then stored in a fixed-size
 array. The time complexity of generating binary code is O(n) where n is
 the number of leaves in the tree, and the complexity of converting an
 input into binary code is also O(n), where n is the size of the input.
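
 A hedged sketch of that depth-first code generation, with hypothetical
 names and the same kind of fixed-size array described above:

  #include <array>
  #include <string>

  struct Node {
      char value;
      Node *left  = nullptr;
      Node *right = nullptr;
      bool isLeaf() const { return left == nullptr && right == nullptr; }
  };

  // Depth-first traversal: append '0' when descending left and '1' when
  // descending right; record the accumulated path once a leaf is reached.
  void buildCodes(const Node *node, const std::string &path,
                  std::array<std::string, 256> &codes) {
      if (node == nullptr) return;
      if (node->isLeaf()) {
          codes[static_cast<unsigned char>(node->value)] = path;
          return;
      }
      buildCodes(node->left, path + "0", codes);
      buildCodes(node->right, path + "1", codes);
  }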

G. Testing:

 1. unit_test
 
 I used the unit_test framework to test each of the individual functions
 used during my encoding and decoding processes. I wrote unit tests for
 functionality and edge cases, as I wanted to make sure that my functions
 produced the outputs the rest of my implementation depends on in order
 to work as expected. I wrote getter functions for the private frequency
 and binary code arrays to unit_test the outputs for standard input and 
 single characters. For encoder, I tested building a tree using standard 
 input, using a single type of character, using the provided banana.txt 
 file, and using a single whitespace. I then tested the binary conversion 
 of a standard input, a single character type input, and a file input. I 
 also tested the serialization of a normal input and a single character 
 type input. For decoder, I tested the deserialization of a standard input,
 of a single-leaf input, and an empty-leaf input. I also checked that
 serializing after deserializing reproduced the original string. Finally,
 I tested decoding a binary string on a normal input and a single character
 type input, as well as printing this decoded text to ostream. 

 2. diff test

 I used several testing files I created to compare my program's edge and
 general case behavior with the reference implementation. I wrote some
 details of what each testing file specifically tests for above,
 but in general I checked cases for an empty file, a file with a single
 character, a file with a whitespace, and then some general behavior with
 the provided files. My actual method of testing this program was by 
 sending the program's "zap" outputs to stdout and diff testing these 
 outputs with the reference. I also diff tested the output for "bad"
 inputs: when the user enters a wrong command, an incorrect number of
 arguments, or an input file that does not exist. Additionally, I captured
 the program's output when attempting to zap "bad_zap_file" and diff
 tested that output with the reference. To test "unzap" and decoding
 functionality, I diff tested my output files with the reference's output
 files. I did this two ways: using unzap on my own zapped file and using
 unzap on the reference's zapped file and then diff testing both types of
 output with the reference's unzapped files. 

H. Time Spent:

 15 hours
