A Retrieval-Augmented Generation (RAG) system that enables natural language question-answering on financial statements.
This system allows users to:
- Query financial statements using natural language
- Get detailed answers about specific companies' financial data
- Navigate through different years of financial reports
- Access information through an intuitive web interface
- Clone the repository:
    git clone https://github.com/yanhua-wang/FinancialReport_QnA.git
    cd FinancialReport_QnA
- Install the required dependencies:
    pip install -r requirements.txt
- Set up your Together API key:
  - Sign up for a Together API key at https://www.together.ai
  - Create a .env file in the project root directory
  - Add your API key to the .env file: TOGETHER_API_KEY=your_api_key_here
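To check that the key is picked up before launching the app, a minimal standard-library sketch such as the following can help. It is not the project's own loading code (which may rely on a helper library); the function names `read_env_file` and `get_together_api_key` are hypothetical, and the parser only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env-style file into a dict."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without an '=' sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_together_api_key(env_path=".env"):
    """Return the key from the environment, falling back to the .env file."""
    key = os.environ.get("TOGETHER_API_KEY")
    if not key and Path(env_path).is_file():
        key = read_env_file(env_path).get("TOGETHER_API_KEY")
    # Reject a missing key and the unedited placeholder from the instructions.
    if not key or key == "your_api_key_here":
        raise RuntimeError("TOGETHER_API_KEY is missing or still a placeholder")
    return key
```

Calling `get_together_api_key()` from the project root raises a `RuntimeError` if the .env file was never edited, which surfaces the problem earlier than a failed API request would.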
Before running the application or building the RAG index, you need to download the 10-K financial statements for the supported companies and years. These statements must then be organized into a specific directory structure that the system expects.
Supported Tickers: As listed in app.py (e.g., AAPL, AEE, BA, CMCSA, CNP, CRL, D, ED, HWM, VRSN). To change the list, edit sampled_tickers.txt.
Supported Years: As listed in app.py (e.g., 2010 through 2019). To change the range, edit data_downloading.py.
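To get a feel for how many filings this covers, the sketch below enumerates every ticker-year combination from the example values above. The constants are copied from this README for illustration only; the actual lists live in sampled_tickers.txt and data_downloading.py and may differ, and not every company necessarily has a 10-K for every year, so the count is an upper bound.

```python
from itertools import product

# Example values copied from the lists above; the real configuration
# lives in sampled_tickers.txt and data_downloading.py and may differ.
TICKERS = ["AAPL", "AEE", "BA", "CMCSA", "CNP", "CRL", "D", "ED", "HWM", "VRSN"]
YEARS = range(2010, 2020)  # 2010 through 2019 inclusive


def filing_pairs(tickers, years):
    """Return every (ticker, year) combination as a list of tuples."""
    return list(product(tickers, years))


pairs = filing_pairs(TICKERS, YEARS)
print(len(pairs), "ticker-year pairs; first:", pairs[0], "last:", pairs[-1])
```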
- Important: Before running data_downloading.py, you must provide your email address in the EMAIL variable within the data_downloading.py script. This is required by the SEC EDGAR Downloader.
- Run data_downloading.py, data_processing.py, and vector_store_construction.py in this order to download, process, and index the financial data:
    python data_downloading.py
    python data_processing.py
    python vector_store_construction.py
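Because each script depends on the output of the previous one, it can be convenient to run the three steps from a single wrapper that stops at the first failure. The following is a minimal sketch, not part of the repository; `build_commands` and `run_pipeline` are hypothetical helper names, and the script is assumed to be run from the project root.

```python
import subprocess
import sys

# The three scripts named above, in the order they must run.
PIPELINE = [
    "data_downloading.py",
    "data_processing.py",
    "vector_store_construction.py",
]


def build_commands(scripts, python=sys.executable):
    """Turn script names into argument lists for the given interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE):
    """Run each script in order; check=True aborts on the first failure."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` raises `subprocess.CalledProcessError` as soon as one script exits with a non-zero status, so a failed download does not lead to indexing an incomplete dataset.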
- Start the Streamlit application:
    streamlit run app.py
- Open your web browser and navigate to the URL shown in the terminal
- Using the app:
  - Select a company ticker from the dropdown menu
  - Choose the year of the financial report
  - Enter your question in natural language
  - Click "Submit" to get your answer
You can ask questions like:
- "What was the company's revenue in 2018?"
- "How much did the company spend on R&D?"
- "What were the major risks identified in the annual report?"
- "What is the company's current debt-to-equity ratio?"
This project leverages several key technologies and frameworks:
- LlamaIndex: Used for building the RAG (Retrieval-Augmented Generation) system
- ChromaDB: Vector database for storing and retrieving document embeddings
- Streamlit: Web application framework for the user interface
- Together AI: LLM API for generating responses to user queries
- Embedding Model: sentence-transformers/multi-qa-MiniLM-L6-cos-v1 for document embeddings
- LLM Model: meta-llama/Llama-3-70b-chat-hf for generating responses
- SEC EDGAR Downloader: For fetching financial statements and reports
- BeautifulSoup: For parsing and cleaning HTML content from financial documents
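To illustrate how these pieces fit together, here is a toy, standard-library-only sketch of the retrieve-then-generate flow. It does not use the LlamaIndex, ChromaDB, or Together AI APIs: a bag-of-words counter stands in for the embedding model, an in-memory list stands in for the vector store, and the final prompt is printed rather than sent to an LLM. Filtering chunks by ticker and year is an assumption based on the app's dropdowns, and the chunk texts are made-up placeholders, not real filing content.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, ticker, year, top_k=2):
    """Return the top_k chunk texts for one filing, ranked by similarity."""
    query = embed(question)
    # Restrict the search to the selected company and report year.
    candidates = [c for c in chunks if c["ticker"] == ticker and c["year"] == year]
    ranked = sorted(candidates, key=lambda c: cosine(query, embed(c["text"])), reverse=True)
    return [c["text"] for c in ranked[:top_k]]


def build_prompt(question, context_chunks):
    """Combine retrieved context and the question into one LLM prompt."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Placeholder chunks for illustration only; not taken from any real 10-K.
chunks = [
    {"ticker": "AAPL", "year": 2018, "text": "total revenue increased during the fiscal year"},
    {"ticker": "AAPL", "year": 2018, "text": "risk factors include supply chain disruption"},
    {"ticker": "BA", "year": 2018, "text": "total revenue from commercial airplanes"},
]
question = "what was total revenue"
top = retrieve(question, chunks, ticker="AAPL", year=2018, top_k=1)
print(build_prompt(question, top))
```

In the real system the embedding model produces dense vectors and the vector database performs the similarity search, but the overall shape (embed the question, fetch the closest chunks, pass them to the LLM as context) is the same.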
The Streamlit app runs efficiently on a local CPU for querying the current dataset. To fully replicate the project, GPU access is required. Reach out to me if you would like sample data.
Contributions are welcome! Please feel free to submit a Pull Request.