Text Generation in Python Using Bigrams

This script reads text files, counts adjacent word pairs (bigrams), and generates new text from them, using bigram frequencies to predict each next word.

Purpose

The main purpose of this script is to take one or more text files, compute the frequency of bigrams, and then use these bigrams to generate new, moderately coherent text sequences. This is useful for understanding text construction techniques, random text generation, and experimenting with prediction algorithms.
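The counting and generation steps described above can be illustrated with a minimal sketch. It is not taken from main.py: the function names (`count_bigrams`, `build_successors`, `generate`) are hypothetical, and it assumes the next word is drawn at random, weighted by how often it followed the previous word, and that the starting bigram counts toward the word total.

```python
import random
from collections import Counter, defaultdict


def count_bigrams(words):
    """Count how often each adjacent word pair occurs."""
    return Counter(zip(words, words[1:]))


def build_successors(bigram_counts):
    """Map each word to the words seen after it, with their counts."""
    successors = defaultdict(Counter)
    for (first, second), count in bigram_counts.items():
        successors[first][second] += count
    return successors


def generate(successors, start, num_words, rng=random):
    """Extend the starting bigram one word at a time (assumed: frequency-weighted)."""
    output = list(start)
    while len(output) < num_words:
        candidates = successors.get(output[-1])
        if not candidates:  # the last word never had a follower in the source
            break
        words, weights = zip(*candidates.items())
        output.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(output)


words = "the cat sat on the mat and the cat ran".split()
counts = count_bigrams(words)
print(counts.most_common(1))  # the most frequent bigram and its count
print(generate(build_successors(counts), ("the", "cat"), num_words=6))
```

Generation stops early when it reaches a word that never had a follower in the source, so the output can be shorter than requested.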

Prerequisites

Ensure you have Python installed on your system. This script is confirmed to run with Python 3.9.

The script processes all .txt files within a specified directory (data by default).
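As a rough illustration of how the input directory might be read, the sketch below gathers the words from every .txt file in a directory. The helper name `read_words`, the UTF-8 encoding, and the whitespace splitting are assumptions, not details confirmed by main.py.

```python
from pathlib import Path


def read_words(input_dir="data"):
    """Return the whitespace-separated words from every .txt file in input_dir."""
    words = []
    # Sort the paths so the word order is the same on every run.
    for path in sorted(Path(input_dir).glob("*.txt")):
        words.extend(path.read_text(encoding="utf-8").split())
    return words
```

Because the files are concatenated into one word list, the last word of one file and the first word of the next form a bigram in this sketch. Counting bigrams per file would avoid that.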

Usage

Run the script from the command line with the following options:

❯ python3 main.py -h
usage: main.py [-h] [--input-dir INPUT_DIR] [--initial-bigram INITIAL_BIGRAM] [--no-clean] [--top-bigrams TOP_BIGRAMS] [--print-bigram] [--num-words NUM_WORDS] [--top-bigrams-to-choose TOP_BIGRAMS_TO_CHOOSE]

Generate text from .txt files using bigrams.

options:
  -h, --help            show this help message and exit
  --input-dir, -d INPUT_DIR
                        Path to the directory with text files. Default is 'data'.
  --initial-bigram, -i INITIAL_BIGRAM
                        Comma-separated words to start generation. They need to form a bigram that is present in the source text.
  --no-clean, -n        Disable word cleaning (stripping punctuation and converting to lowercase).
  --top-bigrams, -t TOP_BIGRAMS
                        Print top N most common bigrams found in the source text.
  --print-bigram, -p    Print the randomly chosen bigram among the top N bigrams.
  --num-words, -w NUM_WORDS
                        Number of words to generate. Default is 15.
  --top-bigrams-to-choose, -tb TOP_BIGRAMS_TO_CHOOSE
                        Number of top bigrams to consider for choosing starting bigram. Default is 50.
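For readers who want to reproduce a similar interface, here is a sketch of how these options could be declared with argparse. It is reconstructed from the help text only. The actual main.py may differ, and `parse_bigram` and `build_parser` are hypothetical names.

```python
import argparse


def parse_bigram(value):
    """Turn 'The,lawyer' into ('The', 'lawyer'); reject anything but two words."""
    parts = tuple(part.strip() for part in value.split(","))
    if len(parts) != 2 or not all(parts):
        raise argparse.ArgumentTypeError("expected two comma-separated words")
    return parts


def build_parser():
    parser = argparse.ArgumentParser(
        description="Generate text from .txt files using bigrams."
    )
    parser.add_argument("--input-dir", "-d", default="data",
                        help="Path to the directory with text files.")
    parser.add_argument("--initial-bigram", "-i", type=parse_bigram,
                        help="Comma-separated words to start generation.")
    parser.add_argument("--no-clean", "-n", action="store_true",
                        help="Disable word cleaning.")
    parser.add_argument("--top-bigrams", "-t", type=int,
                        help="Print top N most common bigrams.")
    parser.add_argument("--print-bigram", "-p", action="store_true",
                        help="Print the randomly chosen starting bigram.")
    parser.add_argument("--num-words", "-w", type=int, default=15,
                        help="Number of words to generate.")
    parser.add_argument("--top-bigrams-to-choose", "-tb", type=int, default=50,
                        help="Number of top bigrams to choose the start from.")
    return parser


args = build_parser().parse_args(["-i", "The,lawyer", "-n"])
print(args.initial_bigram, args.no_clean, args.num_words)
```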

Examples

Analyzes the text files in the default data directory, prints the top 10 most common bigrams, and disables word cleaning (i.e., keeps punctuation and original casing):

python3 main.py -t 10 -n

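Prints the top 20 bigrams from the input files, then prints the randomly chosen initial bigram used for text generation. Unless --top-bigrams-to-choose is set, that bigram is drawn from the top 50 bigrams, not only from the 20 that are printed: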

python3 main.py -t 20 -p

Starts text generation using the bigram "The lawyer" as the seed. Word cleaning is disabled, so exact casing and punctuation matter when matching the initial bigram:

python3 main.py -i The,lawyer -n
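To show why casing matters for the seed, here is a sketch of the kind of cleaning that --no-clean turns off. The helper names are hypothetical, and the exact rules in main.py may differ. This version strips punctuation only from the ends of each word and lowercases it.

```python
import string


def clean_word(word):
    """Strip leading/trailing punctuation and lowercase the word."""
    return word.strip(string.punctuation).lower()


def clean_words(words):
    """Clean every word and drop any that become empty (e.g. a lone dash)."""
    return [cleaned for cleaned in map(clean_word, words) if cleaned]


print(clean_words(["The", "lawyer,", "--", "don't"]))
```

With cleaning enabled, "The" and "the" count as the same word. With -n they are distinct, so a seed of the,lawyer would not match a source that only contains "The lawyer".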

Generates 25 words of text, selecting the initial bigram randomly from the top 10 most frequent bigrams, and prints the top 10 bigrams from the source text:

python3 main.py -w 25 -tb 10 -t 10
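Picking the starting bigram from the top N can be sketched as follows. The function name `choose_start` is hypothetical, and the uniform choice among the top candidates is an assumption. The script might instead weight the choice by frequency.

```python
import random
from collections import Counter


def choose_start(bigram_counts, top_n=50, rng=random):
    """Pick one bigram uniformly at random from the top_n most frequent."""
    top = [bigram for bigram, _ in bigram_counts.most_common(top_n)]
    if not top:
        raise ValueError("no bigrams found in the source text")
    return rng.choice(top)


counts = Counter({("of", "the"): 5, ("in", "the"): 3, ("the", "lawyer"): 1})
print(choose_start(counts, top_n=2))  # one of the two most frequent bigrams
```

A smaller N keeps the opening close to the most typical phrases in the source, while a larger N gives more varied starts.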
