"Perhaps the best test of a man's intelligence is his capacity for making a summary." - Lytton Strachey, English writer and critic.
Synopsis is a Discord bot that summarizes conversations and records them for future use.
A published version of this README is also available here. Use this link to check out the demo web app. Check out this video for a working demo.
So, go ahead! Use Synopsis to find your TL;DRs.
Today, especially with the world in the grip of the coronavirus, a considerable share of information exchange happens through online chat conversations.
Such conversations often involve a large number of participants, and a single conversation tends to span a wide range of topics interspersed with irrelevant segments. Proper summarization is therefore necessary, so that people who missed a long (and not necessarily coherent) conversation can read a summary instead of the entire chain of chats, saving a lot of time.
This calls for summarization algorithms that work on scraped data from chat services and yield a summary as required.
Among the many instant messaging platforms available, Discord is one of the most popular. With features such as its server-channel system, high call quality, permission management and tools for integrating bots, Discord has become a major platform for people to collaborate, converse and share ideas.
In this project, we aim to make a Discord bot that effectively summarizes conversations and allows the user to keep a record of these summaries on a dedicated website.
This not only lets the user obtain an automated summary of any given chain of chats, but also gives easy access to the stored summaries for future reference.
- Python 3
- Firebase and Firestore
- Google Cloud for hosting the bot on a virtual machine
- Discord.py for the functionality of the Discord bot
- React for creating the frontend interface
For a detailed description of the current implementation, check this.
Automated summarization of text has been applied to many genres, including various kinds of articles, scientific papers and blogs.
However, compared to these examples, very little work has been done on chat summarization. This is because chats are inherently noisy, unstructured and informal, and involve frequent shifts in topic.
Our current working version uses a basic cosine-similarity model to generate a set of unique words (keywords), then uses these keywords to evaluate the given set of chats and return only the most distinctive sentences as the summary. This basic summarization is extractive in nature. For a detailed description of how this model is implemented, look here.
In Further Ideas, we also attempt to construct a better summarization algorithm based on current research on the topic.
The team behind the project comprises Kunwar Shaanjeet Singh Grover, Vishva Saravanan, Mayank Goel and Alapan Chaudhuri.
- Easy-to-use conversation summarization based on Discord messages
- Sick of scrolling back through thousands of messages to find an important conversation you had? Record the conversation and review it anytime on the web interface, which shows a summary of the conversation along with its keywords.
- Can be added to any required server
- Create a virtual environment and install dependencies:
  $ python3 -m venv .env
  $ . .env/bin/activate
  $ pip3 install -r requirements.txt
- Install the nltk corpus required:
  $ python3 nltkmodules.py
- Export the required environment variables:
  $ export BOT_TOKEN="TOKEN_FOR_DISCORD_BOT"
  $ export BOT_PREFIX="PREFIX_FOR_BOT"
- Run the bot:
  $ python3 main.py
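As a rough illustration of how an entry point might pick up the two exported variables, here is a minimal sketch. The helper name `load_bot_config` is hypothetical and is not taken from the project's `main.py`; only the variable names `BOT_TOKEN` and `BOT_PREFIX` come from the steps above.

```python
import os


def load_bot_config(env=None):
    """Return (token, prefix) read from the environment.

    `load_bot_config` is a hypothetical helper for illustration only.
    Both variables are treated as required, as in the setup steps.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("BOT_TOKEN", "BOT_PREFIX") if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of starting a broken bot.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["BOT_TOKEN"], env["BOT_PREFIX"]
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without exporting anything.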
To enable the record functionality, you need to connect the bot to a Firestore database. Place your serviceAccount.json at firestore/secret.json. This allows the bot to record conversations to the corresponding Firestore database.
The Discord bot works by fetching all the messages between a given starting message ID and ending message ID. It then runs them through the text summarizer we built to obtain keywords and a short summary.
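The selection step can be sketched without any Discord dependency. The `ChatMessage` class and `select_range` function below are hypothetical stand-ins, not the project's code; the sketch relies on the fact that Discord message IDs increase over time, so an ID range corresponds to a time range. Fetching the messages from Discord itself is left out.

```python
from dataclasses import dataclass


@dataclass
class ChatMessage:
    # Hypothetical stand-in for a Discord message object.
    id: int
    author: str
    content: str


def select_range(messages, start_id, end_id):
    """Return messages whose IDs fall in [start_id, end_id], oldest first."""
    # Accept the two IDs in either order.
    low, high = sorted((start_id, end_id))
    picked = [m for m in messages if low <= m.id <= high]
    # Discord IDs grow over time, so sorting by ID gives chronological order.
    return sorted(picked, key=lambda m: m.id)
```

The `content` fields of the selected messages would then be handed to the summarizer.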
The text summarizer is built on the mathematical principle of cosine similarity between non-zero vectors.
For this, we represent each line as a vector of its unique words, quantified by how "important" or frequent they are. We then build a similarity matrix from these vectors and run a graph-based TextRank algorithm on it to rank the lines.
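The idea can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: lines are split on whitespace, vectors hold raw word counts, and the function names, damping factor and iteration count are chosen here for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def textrank_summary(sentences, top_n=2, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration over their similarities."""
    n = len(sentences)
    if n == 0:
        return []
    # Assumption: naive whitespace tokenization with raw counts.
    vectors = [Counter(s.lower().split()) for s in sentences]
    # Similarity matrix with a zero diagonal (no self-links).
    sim = [
        [cosine_similarity(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to sim[j][i].
            rank = sum(
                sim[j][i] / row_sums[j] * scores[j] for j in range(n) if row_sums[j] > 0
            )
            updated.append((1 - damping) / n + damping * rank)
        scores = updated
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(best)]
```

A line that shares no words with the rest of the conversation receives only the baseline score and therefore drops out of the summary first.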
Additional challenges included cleaning and parsing the data so that only relevant keywords remain, which involved removing stopwords along with a manually added list of common words. Discord usernames and special characters such as emojis were also removed.
The summarizer also outputs a list of keywords ranked by frequency. This list is likewise cleaned of stopwords and other common words that do not convey the meaning of a sentence.
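A cleaning pass of this kind might look like the sketch below. The stopword set is a tiny hand-written placeholder for whatever list the project actually loads, and the regular expressions and the name `clean_message` are assumptions made for this example.

```python
import re

# Assumption: a small illustrative set standing in for the real stopword
# list plus the manually added common words.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "lol", "ok"}

# Raw Discord user mentions look like <@123> or <@!123>; plain @name is also dropped.
MENTION_RE = re.compile(r"<@!?\d+>|@\w+")
# Anything that is not a lowercase letter, digit or whitespace (emojis, punctuation).
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def clean_message(text):
    """Return the list of relevant lowercase words in a chat message."""
    text = MENTION_RE.sub(" ", text)
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)
    return [word for word in text.split() if word not in STOPWORDS]
```

For instance, `clean_message("<@!1234> the bot is AWESOME 🎉 lol!!")` keeps only `["bot", "awesome"]`.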
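Frequency-based keyword extraction can be sketched with a counter. The `COMMON_WORDS` set and the function name `top_keywords` are illustrative assumptions, not the project's actual word list or API.

```python
from collections import Counter

# Assumption: placeholder for the stopwords and manually added common words.
COMMON_WORDS = {"the", "a", "is", "and", "to", "of", "it", "we", "i", "you"}


def top_keywords(sentences, limit=5):
    """Return up to `limit` words, most frequent first, skipping common words."""
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.lower().split()
        if word not in COMMON_WORDS
    )
    return [word for word, _ in counts.most_common(limit)]
```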
The output of the text summarization is then stored in a Firebase (Firestore) database, whose contents are served through a ReactJS app.
The web app lets you view the recordings at any time, along with their summaries and keywords.
The following image shows the original conversation thread for the summary pictured above.
Link to the mentioned web-app: Synopsis App.
Now that we have explained how our working version deals with summarization, we would like to elaborate on how we plan to improve the summarization algorithm.
Given a conversation as a data set in the form of a series of chats, we first remove noise such as spelling errors and use text segmentation to formalize the chats to some extent.
Then, we separate the conversation into chunks using topic modeling, applying a similarity index over the resulting topics to segregate the large body of chats.
Once we have identified the primary topic (tag) of a series of chats, we build a semantic space of words. Using a co-occurrence HAL (Hyperspace Analogue to Language) model over this space, we calculate cumulative scores for sentences. Based on these scores, we select sentences and generate the required summary.
- Text Segmentation: We can use Longest Contiguous Messages (LCM) for this.
- Topic Modeling: Latent Dirichlet Allocation can be used for this.
- Summary Generation: HAL space can be used to create a probabilistic model to generate scores by normalizing the count of terms.
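To make the summary generation step more concrete, here is one possible sketch of HAL-style scoring. It is only an interpretation of the plan above: the window size, the way weights are normalized into probabilities, and all function names are assumptions, and the sliding window is allowed to cross sentence boundaries for simplicity.

```python
from collections import defaultdict


def build_hal(tokens, window=3):
    """Build a HAL-style co-occurrence table from a flat token list.

    hal[word][earlier] accumulates (window - distance + 1) for every
    `earlier` token found at most `window` positions before `word`,
    so closer neighbours contribute more weight.
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break
            hal[word][tokens[j]] += window - distance + 1
    return hal


def word_probabilities(hal):
    """Normalize each word's total co-occurrence weight into a probability."""
    totals = defaultdict(float)
    for word, row in hal.items():
        for other, weight in row.items():
            # Credit both words of the pair so early tokens are not ignored.
            totals[word] += weight
            totals[other] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {word: total / grand_total for word, total in totals.items()}


def score_sentences(sentences, window=3):
    """Return one cumulative score per sentence (sum of its word probabilities)."""
    tokenized = [s.lower().split() for s in sentences]
    tokens = [word for sentence in tokenized for word in sentence]
    probs = word_probabilities(build_hal(tokens, window))
    return [sum(probs.get(word, 0.0) for word in sentence) for sentence in tokenized]
```

The highest-scoring sentences would then be selected for the summary. Note that summing favours longer sentences, so a real implementation would probably normalize by length.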





