diff --git a/.github/workflows/mkdocs.yml b/.github/workflows/mkdocs.yml new file mode 100644 index 0000000..ef7229d --- /dev/null +++ b/.github/workflows/mkdocs.yml @@ -0,0 +1,17 @@ +name: Deploy MkDocs + +on: + push: + branches: + - main + +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - uses: actions/setup-python@v4 + with: + python-version: 3.x + - run: pip install mkdocs-material + - run: mkdocs gh-deploy --force \ No newline at end of file diff --git a/README.md b/README.md index 3b142df..77af155 100644 --- a/README.md +++ b/README.md @@ -3,6 +3,8 @@ We develop a cascaded voice assistant system that includes ASR, TTS and a ReAct based Agent for reasoning and action taking. +[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue)](https://sentientia.github.io/Aura/) + ## Aura: Demo [![Aura Demo](https://img.youtube.com/vi/cb7w0GVwwF0/0.jpg)](https://www.youtube.com/watch?v=cb7w0GVwwF0) @@ -11,18 +13,28 @@ ReAct based Agent for reasoning and action taking. ![Aura System Architecture](docs/images/aura_system_white.png) +## Documentation + +For detailed documentation, please visit our [Documentation Website](https://sentientia.github.io/Aura/). + +The documentation includes: +- [Installation Guide](https://sentientia.github.io/Aura/installation/) +- [Architecture Overview](https://sentientia.github.io/Aura/architecture/) +- [Agent Documentation](https://sentientia.github.io/Aura/agents/) +- [Action Documentation](https://sentientia.github.io/Aura/actions/) +- [UI Documentation](https://sentientia.github.io/Aura/ui/) +- [Contributing Guide](https://sentientia.github.io/Aura/contributing/) ## Repository Structure ``` . ├── agent/ # Core agent implementation -│ ├── actions/ # Action handlers for different tasks. +│ ├── actions/ # Action handlers for different tasks │ ├── controller/ # Agent state and control logic │ ├── llm/ # Language model integration │ ├── secrets/ # Secure credential storage │ └── agenthub/ # Agent implementations -│ │ ├── ui/ # User interface components │ ├── local_speech_app.py # Speech interface implementation (using gradio) @@ -32,12 +44,12 @@ ReAct based Agent for reasoning and action taking. │ ├── llm_serve/ # Language model serving script │ -├── dst/ # Dialog State Tracking. Has the scripts for finetuning LLMs for DST -| -└── environment.yaml # Conda environment configuration +├── dst/ # Dialog State Tracking. Has the scripts for finetuning LLMs for DST +│ +└── environment.yaml # Conda environment configuration ``` -## Setup +## Quick Setup 1. Create the conda environment: ```bash @@ -62,4 +74,8 @@ ReAct based Agent for reasoning and action taking. python ui/local_speech_app.py ``` - ## Human in the Loop Data: https://docs.google.com/spreadsheets/d/16_DApAlgunmG3pR4f8p9JYjO-v-2m8ZxduN9fZ-AblI/edit?usp=sharing \ No newline at end of file +For more detailed setup instructions, please refer to the [Installation Guide](https://sentientia.github.io/Aura/installation/) in our documentation. + +## Human in the Loop Data + +For human-in-the-loop data, please visit [this Google Sheets document](https://docs.google.com/spreadsheets/d/16_DApAlgunmG3pR4f8p9JYjO-v-2m8ZxduN9fZ-AblI/edit?usp=sharing). \ No newline at end of file diff --git a/docs/actions/answer_action.md b/docs/actions/answer_action.md new file mode 100644 index 0000000..695ee29 --- /dev/null +++ b/docs/actions/answer_action.md @@ -0,0 +1,76 @@ +# Answer Action + +The Answer Action is used by the QA Agent to provide direct answers to questions. 
It is a specialized action for question-answering tasks. + +## Overview + +The Answer Action is implemented in the `AnswerAction` class, which extends the `Action` class. It is designed to provide direct answers to questions without engaging in extended conversations. + +## Capabilities + +The Answer Action can: + +- Provide direct answers to questions +- Format answers for different question types +- Handle multiple-choice questions +- Provide explanations for answers + +## Implementation + +The Answer Action is implemented in the `agent/actions/answer_action.py` file. It uses the following components: + +- **Action Base Class**: Extends the `Action` class defined in `agent/actions/action.py`. +- **Result Processing**: Formats the answer for the user. + +## Usage + +The Answer Action is used by the QA Agent to provide answers to questions. To use the Answer Action: + +1. Create a new instance of the `AnswerAction` class with the appropriate thought and payload: + ```python + from agent.actions.answer_action import AnswerAction + + action = AnswerAction( + thought="I know that the capital of France is Paris", + payload="The capital of France is Paris." + ) + ``` + +2. Execute the action with the current state: + ```python + observation = action.execute(state) + ``` + +3. The observation will contain the answer. + +## Example + +Here's an example of how the Answer Action is used to answer a question: + +1. Agent creates an Answer Action: + ```python + action = AnswerAction( + thought="Based on the search results, I can see that the current president of the United States is Joe Biden", + payload="The current president of the United States is Joe Biden." + ) + ``` + +2. Agent executes the action: + ```python + observation = action.execute(state) + ``` + +3. The action provides the answer "The current president of the United States is Joe Biden." + +4. The observation contains the answer, which is returned to the user. + +## Integration with Other Actions + +The Answer Action is typically used as the final action in a question-answering flow. For example: + +1. User asks a question. +2. Agent uses a Web Search Action to find information to answer the question. +3. Agent processes the search results. +4. Agent uses an Answer Action to provide the answer to the user. + +This combination of actions allows the agent to provide accurate and informative answers to user questions. \ No newline at end of file diff --git a/docs/actions/calendar_action.md b/docs/actions/calendar_action.md new file mode 100644 index 0000000..78d8be2 --- /dev/null +++ b/docs/actions/calendar_action.md @@ -0,0 +1,94 @@ +# Calendar Action + +The Calendar Action is used to manage calendar events. It can create, read, update, and delete events on the user's calendar. + +## Overview + +The Calendar Action is implemented in the `CalendarAction` class, which extends the `Action` class. It is designed to interact with calendar services to manage events. + +## Capabilities + +The Calendar Action can: + +- Create new calendar events +- Delete existing calendar events +- Retrieve calendar events +- Update calendar events + +## Implementation + +The Calendar Action is implemented in the `agent/actions/calendar_action.py` file. It uses the following components: + +- **Action Base Class**: Extends the `Action` class defined in `agent/actions/action.py`. +- **Calendar API Integration**: Uses external calendar APIs to manage events. +- **Result Processing**: Processes and formats calendar operation results for the agent. 
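To make the contract concrete, the sketch below shows roughly what the base interface that `CalendarAction` extends might look like. This is an illustration assembled from how actions are used throughout these docs (a `thought`, a `payload`, and an `execute(state)` method); the actual definition in `agent/actions/action.py` may differ in its details.

```python
from abc import ABC, abstractmethod

class Action(ABC):
    """Illustrative sketch of the base action interface, not the actual source."""

    def __init__(self, thought=None, payload=None):
        self.thought = thought    # the agent's reasoning for taking this action
        self.payload = payload    # action-specific input, e.g. a string or a dict

    @abstractmethod
    def execute(self, state):
        """Run the action against the current state and return an observation."""
        raise NotImplementedError
```

Each concrete action (calendar, chat, email, and so on) then only needs to implement `execute` for its own service.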
+ +## Usage + +The Calendar Action is used by the agent to manage calendar events. To use the Calendar Action: + +1. Create a new instance of the `CalendarAction` class with the appropriate thought and payload: + ```python + from agent.actions.calendar_action import CalendarAction + + action = CalendarAction( + thought="I should create a calendar event for the meeting", + payload={ + "event": "create", + "start_time": "2025-07-10T14:00:00", + "end_time": "2025-07-10T15:00:00", + "title": "Team Meeting", + "description": "Weekly team sync-up meeting" + } + ) + ``` + +2. Execute the action with the current state: + ```python + observation = action.execute(state) + ``` + +3. The observation will contain the result of the calendar operation. + +## Example + +Here's an example of how the Calendar Action is used to create a calendar event: + +1. Agent creates a Calendar Action: + ```python + action = CalendarAction( + thought="I should create a calendar event for the doctor's appointment", + payload={ + "event": "create", + "start_time": "2025-07-15T10:00:00", + "end_time": "2025-07-15T11:00:00", + "title": "Doctor's Appointment", + "description": "Annual check-up with Dr. Smith" + } + ) + ``` + +2. Agent executes the action: + ```python + observation = action.execute(state) + ``` + +3. The action creates a calendar event for the doctor's appointment on July 15, 2025, from 10:00 AM to 11:00 AM. + +4. The observation contains the result of the calendar operation, which might include: + - Confirmation that the event was created + - Details of the created event + - Any errors or warnings that occurred during the operation + +5. The agent processes the observation and creates a new action based on the result, typically a Chat Action to confirm the calendar operation with the user. + +## Integration with Other Actions + +The Calendar Action is often used in conjunction with other actions to provide a complete interaction flow. For example: + +1. User asks to schedule a meeting. +2. Agent uses a Chat Action to gather details about the meeting. +3. Agent uses a Calendar Action to create the meeting event. +4. Agent uses a Chat Action to confirm the meeting details with the user. + +This combination of actions allows the agent to provide a seamless and natural interaction experience for the user when managing calendar events. \ No newline at end of file diff --git a/docs/actions/chat_action.md b/docs/actions/chat_action.md new file mode 100644 index 0000000..3e01ac4 --- /dev/null +++ b/docs/actions/chat_action.md @@ -0,0 +1,75 @@ +# Chat Action + +The Chat Action is used to engage in conversation with the user. It is the most basic action and is used for all text-based interactions. + +## Overview + +The Chat Action is implemented in the `ChatAction` class, which extends the `Action` class. It is designed to handle text-based interactions between the agent and the user. + +## Capabilities + +The Chat Action can: + +- Send text messages to the user +- Process user responses +- Update the conversation history + +## Implementation + +The Chat Action is implemented in the `agent/actions/chat_action.py` file. It uses the following components: + +- **Action Base Class**: Extends the `Action` class defined in `agent/actions/action.py`. +- **State Management**: Updates the state with the conversation history. + +## Usage + +The Chat Action is used by the agent to communicate with the user. To use the Chat Action: + +1. 
Create a new instance of the `ChatAction` class with the appropriate thought and payload: + ```python + from agent.actions.chat_action import ChatAction + + action = ChatAction(thought="I should greet the user", payload="Hello, how can I help you today?") + ``` + +2. Execute the action with the current state: + ```python + observation = action.execute(state) + ``` + +3. The observation will be the user's response to the message. + +## Example + +Here's an example of how the Chat Action is used in a conversation: + +1. Agent creates a Chat Action: + ```python + action = ChatAction(thought="I should ask about the user's preferences", payload="What kind of restaurant are you looking for?") + ``` + +2. Agent executes the action: + ```python + observation = action.execute(state) + ``` + +3. The message "What kind of restaurant are you looking for?" is sent to the user. + +4. The user responds with "I'm looking for an Italian restaurant." + +5. The observation contains the user's response: "I'm looking for an Italian restaurant." + +6. The agent processes the observation and creates a new action based on the user's response. + +## Integration with Other Actions + +The Chat Action is often used in conjunction with other actions to provide a complete interaction flow. For example: + +1. Agent uses a Web Search Action to find information about Italian restaurants. +2. Agent processes the search results. +3. Agent uses a Chat Action to present the information to the user. +4. User responds with a preference. +5. Agent uses a Calendar Action to make a reservation. +6. Agent uses a Chat Action to confirm the reservation with the user. + +This combination of actions allows the agent to provide a seamless and natural interaction experience for the user. \ No newline at end of file diff --git a/docs/actions/contact_action.md b/docs/actions/contact_action.md new file mode 100644 index 0000000..070e4d2 --- /dev/null +++ b/docs/actions/contact_action.md @@ -0,0 +1,78 @@ +# Contact Action + +The Contact Action is used to manage contact information. It can retrieve contact details from the user's address book. + +## Overview + +The Contact Action is implemented in the `ContactAction` class, which extends the `Action` class. It is designed to interact with contact services to retrieve contact information. + +## Capabilities + +The Contact Action can: + +- Retrieve contact information +- Search for contacts by name +- Get recently contacted email addresses +- Format contact details for the agent + +## Implementation + +The Contact Action is implemented in the `agent/actions/contact_action.py` file. It uses the following components: + +- **Action Base Class**: Extends the `Action` class defined in `agent/actions/action.py`. +- **Contact API Integration**: Uses external contact APIs to retrieve contact information. +- **Result Processing**: Processes and formats contact information for the agent. + +## Usage + +The Contact Action is used by the agent to retrieve contact information. To use the Contact Action: + +1. Create a new instance of the `ContactAction` class with the appropriate thought: + ```python + from agent.actions.contact_action import ContactAction + + action = ContactAction(thought="I should get the contact information for John Doe") + ``` + +2. Execute the action with the current state: + ```python + observation = action.execute(state) + ``` + +3. The observation will contain the contact information. 
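Putting these steps together with the contact-to-email flow described under Integration below, a hypothetical combined snippet might look like the following. How the `State` object is obtained and the exact shape of the returned contact details are assumptions here; the payload keys match those documented for the Email Action.

```python
from agent.actions.contact_action import ContactAction
from agent.actions.email_action import EmailAction

state = ...  # the current State, obtained from the agent controller

# Look up the contact first...
contact_action = ContactAction(thought="I should get the contact information for John Doe")
contact_info = contact_action.execute(state)  # e.g. name, email address, phone number

# ...then use the retrieved address to send an email.
email_action = EmailAction(
    thought="Now that I have John Doe's address, I can send the email",
    payload={
        "to": "john.doe@example.com",  # in practice, parsed from contact_info
        "subject": "Following up",
        "content": "Hi John,\n\nJust following up on our conversation.\n\nBest regards,\nAura",
    },
)
result = email_action.execute(state)
```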
+ +## Example + +Here's an example of how the Contact Action is used to retrieve contact information: + +1. Agent creates a Contact Action: + ```python + action = ContactAction(thought="I should get the contact information for John Doe") + ``` + +2. Agent executes the action: + ```python + observation = action.execute(state) + ``` + +3. The action retrieves contact information for John Doe from the user's address book. + +4. The observation contains the contact information, which might include: + - Name + - Email address + - Phone number + - Address + - Other contact details + +5. The agent processes the observation and creates a new action based on the contact information, typically a Chat Action to present the contact information to the user or an Email Action to send an email to the contact. + +## Integration with Other Actions + +The Contact Action is often used in conjunction with other actions to provide a complete interaction flow. For example: + +1. User asks to send an email to John Doe. +2. Agent uses a Contact Action to retrieve John Doe's email address. +3. Agent uses an Email Action to send the email to John Doe. +4. Agent uses a Chat Action to confirm that the email was sent. + +This combination of actions allows the agent to provide a seamless and natural interaction experience for the user when working with contacts. \ No newline at end of file diff --git a/docs/actions/email_action.md b/docs/actions/email_action.md new file mode 100644 index 0000000..9b48514 --- /dev/null +++ b/docs/actions/email_action.md @@ -0,0 +1,90 @@ +# Email Action + +The Email Action is used to send and manage emails. It can compose and send emails on behalf of the user. + +## Overview + +The Email Action is implemented in the `EmailAction` class, which extends the `Action` class. It is designed to interact with email services to send emails. + +## Capabilities + +The Email Action can: + +- Compose and send emails +- Format email content +- Handle email recipients +- Set email subjects + +## Implementation + +The Email Action is implemented in the `agent/actions/email_action.py` file. It uses the following components: + +- **Action Base Class**: Extends the `Action` class defined in `agent/actions/action.py`. +- **Email API Integration**: Uses external email APIs to send emails. +- **Result Processing**: Processes and formats email operation results for the agent. + +## Usage + +The Email Action is used by the agent to send emails. To use the Email Action: + +1. Create a new instance of the `EmailAction` class with the appropriate thought and payload: + ```python + from agent.actions.email_action import EmailAction + + action = EmailAction( + thought="I should send an email to confirm the meeting", + payload={ + "to": "recipient@example.com", + "subject": "Meeting Confirmation", + "content": "Hello,\n\nThis is to confirm our meeting tomorrow at 2 PM.\n\nBest regards,\nAura" + } + ) + ``` + +2. Execute the action with the current state: + ```python + observation = action.execute(state) + ``` + +3. The observation will contain the result of the email operation. + +## Example + +Here's an example of how the Email Action is used to send an email: + +1. 
Agent creates an Email Action: + ```python + action = EmailAction( + thought="I should send an email to the team about the project update", + payload={ + "to": "team@example.com", + "subject": "Project Update - July 2025", + "content": "Hello Team,\n\nHere is the latest update on our project:\n\n- Feature A is complete\n- Feature B is in progress\n- Feature C is scheduled for next week\n\nPlease let me know if you have any questions.\n\nBest regards,\nAura" + } + ) + ``` + +2. Agent executes the action: + ```python + observation = action.execute(state) + ``` + +3. The action sends an email to team@example.com with the subject "Project Update - July 2025" and the specified content. + +4. The observation contains the result of the email operation, which might include: + - Confirmation that the email was sent + - Details of the sent email + - Any errors or warnings that occurred during the operation + +5. The agent processes the observation and creates a new action based on the result, typically a Chat Action to confirm the email operation with the user. + +## Integration with Other Actions + +The Email Action is often used in conjunction with other actions to provide a complete interaction flow. For example: + +1. User asks to send an email to the team. +2. Agent uses a Chat Action to gather details about the email. +3. Agent uses an Email Action to send the email. +4. Agent uses a Chat Action to confirm that the email was sent. + +This combination of actions allows the agent to provide a seamless and natural interaction experience for the user when sending emails. \ No newline at end of file diff --git a/docs/actions/index.md b/docs/actions/index.md new file mode 100644 index 0000000..3af5639 --- /dev/null +++ b/docs/actions/index.md @@ -0,0 +1,59 @@ +# Actions Overview + +Aura provides a flexible action framework that allows agents to perform various tasks to fulfill user requests. This page provides an overview of the available actions and how they work. + +## Action Framework + +The action framework is built around the concept of a base action that provides common functionality, with specific action implementations extending this base action to provide specialized behavior. + +### Base Action + +The base action provides common functionality such as: + +- Execution logic +- State management +- Result formatting + +All specific action implementations extend this base action. + +## Available Actions + +Aura currently provides the following action implementations: + +### [Chat Action](chat_action.md) + +The Chat Action is used to engage in conversation with the user. It is the most basic action and is used for all text-based interactions. + +### [Web Search Action](web_search_action.md) + +The Web Search Action is used to search the web for information. It can be used to find answers to questions, get up-to-date information, or research topics. + +### [Calendar Action](calendar_action.md) + +The Calendar Action is used to manage calendar events. It can create, read, update, and delete events on the user's calendar. + +### [Email Action](email_action.md) + +The Email Action is used to send and manage emails. It can compose and send emails on behalf of the user. + +### [Contact Action](contact_action.md) + +The Contact Action is used to manage contact information. It can retrieve contact details from the user's address book. + +### [Answer Action](answer_action.md) + +The Answer Action is used by the QA Agent to provide direct answers to questions. 
It is a specialized action for question-answering tasks. + +## Action Selection + +The appropriate action is selected by the agent based on the current state of the conversation and the user's request. The agent uses its reasoning capabilities to determine which action is most appropriate for the current situation. + +## Extending with New Actions + +The action framework is designed to be extensible, allowing for new action implementations to be added as needed. To create a new action: + +1. Create a new class that extends the `Action` class. +2. Implement the `execute` method to define the action's behavior. +3. Register the action in the agent's `step` method. + +For more details on creating new actions, see the [Contributing](../contributing.md) guide. \ No newline at end of file diff --git a/docs/actions/web_search_action.md b/docs/actions/web_search_action.md new file mode 100644 index 0000000..d14d757 --- /dev/null +++ b/docs/actions/web_search_action.md @@ -0,0 +1,83 @@ +# Web Search Action + +The Web Search Action is used to search the web for information. It can be used to find answers to questions, get up-to-date information, or research topics. + +## Overview + +The Web Search Action is implemented in the `WebSearchAdvancedAction` class, which extends the `Action` class. It is designed to perform web searches and return the results to the agent. + +## Capabilities + +The Web Search Action can: + +- Perform Google searches +- Perform Wikipedia searches +- Process and format search results +- Handle different types of search queries + +## Implementation + +The Web Search Action is implemented in the `agent/actions/web_search_advanced_action.py` file. It uses the following components: + +- **Action Base Class**: Extends the `Action` class defined in `agent/actions/action.py`. +- **Search API Integration**: Uses external search APIs to perform searches. +- **Result Processing**: Processes and formats search results for the agent. + +## Usage + +The Web Search Action is used by the agent to find information on the web. To use the Web Search Action: + +1. Create a new instance of the `WebSearchAdvancedAction` class with the appropriate thought and payload: + ```python + from agent.actions.web_search_advanced_action import WebSearchAdvancedAction + + action = WebSearchAdvancedAction( + thought="I should search for information about Italian restaurants", + payload={"google_search_query": "best Italian restaurants downtown", "wikipedia_search_query": "Italian cuisine"} + ) + ``` + +2. Execute the action with the current state: + ```python + observation = action.execute(state) + ``` + +3. The observation will contain the search results. + +## Example + +Here's an example of how the Web Search Action is used to find information: + +1. Agent creates a Web Search Action: + ```python + action = WebSearchAdvancedAction( + thought="I should search for information about the weather", + payload={"google_search_query": "weather forecast today", "wikipedia_search_query": "weather"} + ) + ``` + +2. Agent executes the action: + ```python + observation = action.execute(state) + ``` + +3. The action performs a Google search for "weather forecast today" and a Wikipedia search for "weather". + +4. The observation contains the search results, which might include: + - Current weather conditions + - Weather forecast for the day + - General information about weather patterns + - Links to weather-related resources + +5. 
The agent processes the observation and creates a new action based on the search results, typically a Chat Action to present the information to the user. + +## Integration with Other Actions + +The Web Search Action is often used in conjunction with other actions to provide a complete interaction flow. For example: + +1. User asks about the weather. +2. Agent uses a Web Search Action to find weather information. +3. Agent processes the search results. +4. Agent uses a Chat Action to present the weather information to the user. + +This combination of actions allows the agent to provide accurate and up-to-date information to the user. \ No newline at end of file diff --git a/docs/agents/chat_agent.md b/docs/agents/chat_agent.md new file mode 100644 index 0000000..e594555 --- /dev/null +++ b/docs/agents/chat_agent.md @@ -0,0 +1,86 @@ +# Chat Agent + +The Chat Agent is a general-purpose agent for conversational interactions. It is designed to engage in natural conversations with users, understand their requests, and take appropriate actions to fulfill those requests. + +## Overview + +The Chat Agent is implemented in the `ChatAgent` class, which extends the `BaseAgent` class. It is designed to work in the UI mode, where it interacts with users through a conversational interface. + +## Capabilities + +The Chat Agent can perform the following actions: + +- **Chat**: Engage in conversation with the user. +- **Web Search**: Search the web for information. +- **Calendar**: Manage calendar events. +- **Contact**: Manage contact information. +- **Email**: Send and manage emails. + +## Implementation + +The Chat Agent is implemented in the `agent/agenthub/chat_agent/ChatAgent.py` file. It uses the following components: + +- **Prompt Template**: Defined in `agent/agenthub/chat_agent/prompts.py`. +- **LLM Integration**: Uses the OpenAI Chat Completion API through `agent/llm/openai_chat_completion.py`. +- **Action Framework**: Uses the action classes defined in `agent/actions/`. + +## Usage + +The Chat Agent is used in the UI mode, which is the default mode of operation for the Aura system. To use the Chat Agent: + +1. Initialize the controller in UI mode: + ```python + from agent.controller.controller import Controller + from agent.controller.modes import Mode + + controller = Controller(operation_mode=Mode.UI) + ``` + +2. Add user input to the state: + ```python + controller.add_user_input("Hello, how can you help me?") + ``` + +3. Get the next action and observation: + ```python + action, observation = controller.get_next_chat_action() + ``` + +4. The action will be a `ChatAction` or another action type, and the observation will be the result of executing that action. + +## Prompt Format + +The Chat Agent uses a structured prompt format to communicate with the language model. The prompt includes: + +- **System Prompt**: Defines the agent's role, capabilities, and response format. +- **Conversation History**: The history of the conversation between the user and the agent. +- **Dialog State**: The current state of the dialog, including information gathered from the user. +- **Action-Observation History**: The history of actions taken by the agent and the observations from those actions. + +The agent's response is expected to include: + +- **Thought**: The agent's reasoning about the current state and what action to take. +- **Action**: The type of action to take (chat, web_search, calendar, contact, email). 
+- **Payload**: The content of the action, such as the message to send to the user or the search query to execute. + +## Example + +Here's an example of how the Chat Agent processes a user request: + +1. User input: "I need to find a good Italian restaurant in downtown." + +2. Agent thought: "The user is looking for restaurant recommendations. I should ask for more specific information about their preferences." + +3. Agent action: "chat" + +4. Agent payload: "I'd be happy to help you find an Italian restaurant in downtown. Do you have any specific preferences for price range or type of Italian cuisine?" + +5. User input: "I'm looking for a mid-range restaurant with good pasta dishes." + +6. Agent thought: "Now I have more specific information. I should search for Italian restaurants in downtown that match these criteria." + +7. Agent action: "web_search" + +8. Agent payload: {"google_search_query": "mid-range Italian restaurant downtown good pasta dishes", "wikipedia_search_query": "Italian restaurants downtown"} + +9. After receiving the search results, the agent processes them and responds to the user with recommendations. \ No newline at end of file diff --git a/docs/agents/index.md b/docs/agents/index.md new file mode 100644 index 0000000..d070c29 --- /dev/null +++ b/docs/agents/index.md @@ -0,0 +1,46 @@ +# Agents Overview + +Aura provides a flexible agent framework that allows for different agent implementations to be used for different use cases. This page provides an overview of the available agents and how they work. + +## Agent Framework + +The agent framework is built around the concept of a base agent that provides common functionality, with specific agent implementations extending this base agent to provide specialized behavior. + +### Base Agent + +The base agent provides common functionality such as: + +- State management +- Action selection +- Response generation + +All specific agent implementations extend this base agent. + +## Available Agents + +Aura currently provides the following agent implementations: + +### [Chat Agent](chat_agent.md) + +The Chat Agent is a general-purpose agent for conversational interactions. It is designed to engage in natural conversations with users, understand their requests, and take appropriate actions to fulfill those requests. + +### [QA Agent](qa_agent.md) + +The QA Agent is a specialized agent for question-answering tasks. It is designed to answer specific questions from users, using web search when necessary to find the most up-to-date information. + +## Agent Selection + +The appropriate agent is selected based on the mode of operation: + +- **UI Mode**: Uses the Chat Agent for interactive conversations. +- **QA Evaluation Mode**: Uses the QA Agent for question-answering tasks. + +## Extending with New Agents + +The agent framework is designed to be extensible, allowing for new agent implementations to be added as needed. To create a new agent: + +1. Create a new class that extends the `BaseAgent` class. +2. Implement the `step` method to define the agent's behavior. +3. Register the agent in the controller for the appropriate mode of operation. + +For more details on creating new agents, see the [Contributing](../contributing.md) guide. \ No newline at end of file diff --git a/docs/agents/qa_agent.md b/docs/agents/qa_agent.md new file mode 100644 index 0000000..9214224 --- /dev/null +++ b/docs/agents/qa_agent.md @@ -0,0 +1,89 @@ +# QA Agent + +The QA Agent is a specialized agent for question-answering tasks. 
It is designed to answer specific questions from users, using web search when necessary to find the most up-to-date information. + +## Overview + +The QA Agent is implemented in the `QAAgent` class, which extends the `BaseAgent` class. It is designed to work in the QA Evaluation mode, where it answers specific questions without engaging in extended conversations. + +## Capabilities + +The QA Agent can perform the following actions: + +- **Answer**: Provide a direct answer to the user's question. +- **Web Search**: Search the web for information to help answer the question. + +## Implementation + +The QA Agent is implemented in the `agent/agenthub/qa_agent/QAAgent.py` file. It uses the following components: + +- **Prompt Template**: Defined in `agent/agenthub/qa_agent/qa_prompt.py`. +- **LLM Integration**: Uses the OpenAI Chat Completion API through `agent/llm/openai_chat_completion.py`. +- **Action Framework**: Uses the action classes defined in `agent/actions/`. + +## Usage + +The QA Agent is used in the QA Evaluation mode. To use the QA Agent: + +1. Initialize the controller in QA Evaluation mode: + ```python + from agent.controller.controller import Controller + from agent.controller.modes import Mode + + controller = Controller(operation_mode=Mode.QA_EVAL) + ``` + +2. Run the QA evaluation with a specific question: + ```python + history, instructions = controller.qa_eval({ + "instruction": "What is the capital of France?", + "additional_instruction": None + }) + ``` + +3. The history will contain the trajectory of actions and observations, and the final answer will be in the last observation. + +## Prompt Format + +The QA Agent uses a structured prompt format to communicate with the language model. The prompt includes: + +- **System Prompt**: Defines the agent's role, capabilities, and response format. +- **Question**: The specific question to be answered. +- **Action-Observation History**: The history of actions taken by the agent and the observations from those actions. +- **Instructions**: Any additional instructions for the agent, such as whether to answer in the current step or continue searching. + +The agent's response is expected to include: + +- **Thought**: The agent's reasoning about the question and how to answer it. +- **Action**: The type of action to take (answer, web_search). +- **Payload**: The content of the action, such as the answer to the question or the search query to execute. + +## Example + +Here's an example of how the QA Agent processes a question: + +1. Question: "What is the capital of France?" + +2. Agent thought: "I know that the capital of France is Paris." + +3. Agent action: "answer" + +4. Agent payload: "The capital of France is Paris." + +For more complex questions that require up-to-date information: + +1. Question: "Who is the current president of the United States?" + +2. Agent thought: "This is a time-sensitive question, so I should use web search to get the most up-to-date information." + +3. Agent action: "web_search" + +4. Agent payload: {"google_search_query": "who is the current president of the United States", "wikipedia_search_query": "United States President"} + +5. After receiving the search results, the agent processes them and provides the answer: + +6. Agent thought: "Based on the search results, I can see that the current president of the United States is [Name]." + +7. Agent action: "answer" + +8. Agent payload: "The current president of the United States is [Name]." 
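As a worked illustration of the Usage API above, here is a minimal sketch of evaluating the QA Agent over a small batch of questions. Reusing a single controller across questions and reading the answer out of `history[-1]` are assumptions for illustration; the docs above only state that the final answer appears in the last observation of the trajectory.

```python
from agent.controller.controller import Controller
from agent.controller.modes import Mode

# Questions to evaluate; time-sensitive ones should trigger web search.
questions = [
    "What is the capital of France?",
    "Who is the current president of the United States?",
]

controller = Controller(operation_mode=Mode.QA_EVAL)  # assumed reusable across questions

for question in questions:
    history, instructions = controller.qa_eval({
        "instruction": question,
        "additional_instruction": None,
    })
    final_observation = history[-1]  # assumed: the last trajectory entry carries the answer
    print(f"Q: {question}\nA: {final_observation}\n")
```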
\ No newline at end of file diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..6309334 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,93 @@ +# Architecture + +Aura is built with a modular architecture that allows for easy extension and customization. This page provides an overview of the system architecture and how the different components interact with each other. + +## System Overview + +![Aura System Architecture](images/aura_system_white.png) + +The Aura system consists of several key components: + +1. **User Interface**: The entry point for user interactions, which can be text or speech-based. +2. **Speech Processing**: Handles speech recognition (ASR) and speech synthesis (TTS). +3. **Agent Controller**: Manages the state of the conversation and coordinates between different components. +4. **Agent Hub**: Contains different agent implementations for different use cases. +5. **Action Framework**: Provides a set of actions that the agent can take to fulfill user requests. +6. **Dialog State Tracking**: Keeps track of the conversation context to enable more natural interactions. +7. **Language Model Integration**: Connects to external language models for natural language understanding and generation. + +## Component Details + +### User Interface + +The user interface is implemented using Gradio, which provides a web-based interface for interacting with the system. The UI supports both text and speech inputs, and can provide responses in both text and speech formats. + +### Speech Processing + +The speech processing component includes: + +- **Automatic Speech Recognition (ASR)**: Converts speech to text using models like Whisper or OWSM. +- **Text-to-Speech (TTS)**: Converts text to speech using models like ESPnet. + +### Agent Controller + +The agent controller is responsible for: + +- Managing the state of the conversation +- Coordinating between different components +- Handling the flow of information between the user and the agent + +### Agent Hub + +The agent hub contains different agent implementations: + +- **Chat Agent**: A general-purpose agent for conversational interactions. +- **QA Agent**: A specialized agent for question-answering tasks. + +### Action Framework + +The action framework provides a set of actions that the agent can take: + +- **Chat Action**: Engage in conversation with the user. +- **Web Search Action**: Search the web for information. +- **Calendar Action**: Manage calendar events. +- **Email Action**: Send and manage emails. +- **Contact Action**: Manage contact information. + +### Dialog State Tracking + +The dialog state tracking component keeps track of the conversation context, including: + +- User preferences +- Previous interactions +- Current conversation state + +### Language Model Integration + +The language model integration component connects to external language models for: + +- Natural language understanding +- Natural language generation +- Reasoning about user requests + +## Data Flow + +1. The user interacts with the system through the UI, providing either text or speech input. +2. If the input is speech, it is converted to text by the ASR component. +3. The text is passed to the agent controller, which updates the conversation state. +4. The agent controller passes the updated state to the appropriate agent in the agent hub. +5. The agent decides on an action to take based on the current state. +6. The action is executed by the action framework. +7. 
The result of the action is passed back to the agent controller.
8. The agent controller updates the state and generates a response.
9. If speech output is enabled, the response is converted to speech by the TTS component.
10. The response is presented to the user through the UI.

## Extending the Architecture

The modular architecture of Aura makes it easy to extend and customize:

- **New Agents**: Add new agent implementations to the agent hub.
- **New Actions**: Add new action implementations to the action framework.
- **New Models**: Integrate new language models or speech processing models.
- **New UI Components**: Add new UI components for different interaction modes.
\ No newline at end of file
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 0000000..0e8d768
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,124 @@

# Contributing

Thank you for your interest in contributing to Aura! This guide will help you get started with contributing to the project.

## Getting Started

1. Fork the repository on GitHub.
2. Clone your fork locally:
   ```bash
   git clone https://github.com/your-username/Aura.git
   cd Aura
   ```
3. Create a new branch for your changes:
   ```bash
   git checkout -b feature/your-feature-name
   ```
4. Make your changes.
5. Commit your changes:
   ```bash
   git commit -m "Add your commit message here"
   ```
6. Push your changes to your fork:
   ```bash
   git push origin feature/your-feature-name
   ```
7. Create a pull request on GitHub.

## Development Environment

Follow the [Installation](installation.md) guide to set up your development environment.

## Project Structure

Familiarize yourself with the [project structure](index.md#repository-structure) before making changes.

## Adding New Agents

To add a new agent:

1. Create a new directory in `agent/agenthub/` for your agent.
2. Create a new class that extends the `BaseAgent` class.
3. Implement the `step` method to define the agent's behavior.
4. Register the agent in the controller for the appropriate mode of operation.

Example:

```python
from agent.agenthub.base_agent import BaseAgent
from agent.controller.modes import Mode  # needed for the Mode defaults below
from agent.controller.state import State
from agent.actions.action import Action

class MyNewAgent(BaseAgent):
    def __init__(self, mode=Mode.UI, io_mode=Mode.TEXT_2_TEXT_CASCADED):
        super().__init__()
        self.mode = mode
        self.io_mode = io_mode

    def step(self, state: State) -> Action:
        # Implement your agent's behavior here
        pass
```

## Adding New Actions

To add a new action:

1. Create a new file in `agent/actions/` for your action.
2. Create a new class that extends the `Action` class.
3. Implement the `execute` method to define the action's behavior.
4. Register the action in the agent's `step` method.

Example:

```python
from agent.actions.action import Action
from agent.controller.state import State

class MyNewAction(Action):
    def __init__(self, thought=None, payload=None):
        super().__init__(thought, payload)

    def execute(self, state: State):
        # Implement your action's behavior here
        pass
```

## Documentation

When adding new features or making changes, please update the documentation accordingly:

1. Update the relevant Markdown files in the `docs/` directory.
2. Add new Markdown files for new features if necessary.
3. Update the `mkdocs.yml` file to include new documentation pages.

## Testing

Before submitting a pull request, please test your changes:

1.
Run the existing tests: + ```bash + python -m unittest discover + ``` +2. Add new tests for your changes if necessary. + +## Code Style + +Please follow the existing code style: + +- Use 4 spaces for indentation. +- Use descriptive variable and function names. +- Add docstrings to classes and functions. +- Follow PEP 8 guidelines. + +## Pull Request Process + +1. Ensure your code follows the code style guidelines. +2. Update the documentation if necessary. +3. Add tests for your changes if necessary. +4. Ensure all tests pass. +5. Submit your pull request with a clear description of the changes. + +## License + +By contributing to Aura, you agree that your contributions will be licensed under the project's license. \ No newline at end of file diff --git a/docs/docs_development.md b/docs/docs_development.md new file mode 100644 index 0000000..37494ae --- /dev/null +++ b/docs/docs_development.md @@ -0,0 +1,93 @@ +# Documentation Development + +This guide explains how to develop and maintain the Aura documentation website. + +## Overview + +The Aura documentation is built using [MkDocs](https://www.mkdocs.org/) with the [Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) theme. The documentation is hosted on GitHub Pages and is automatically deployed when changes are pushed to the main branch. + +## Local Development + +To develop the documentation locally, follow these steps: + +1. Install the required dependencies: + ```bash + pip install mkdocs mkdocs-material + ``` + +2. Clone the repository: + ```bash + git clone https://github.com/Sentientia/Aura.git + cd Aura + ``` + +3. Start the local development server: + ```bash + mkdocs serve + ``` + +4. Open your browser and navigate to `http://localhost:8000` to see the documentation. + +5. Make changes to the Markdown files in the `docs/` directory and see the changes reflected in real-time. + +## Documentation Structure + +The documentation is organized as follows: + +- `docs/index.md`: The homepage of the documentation. +- `docs/installation.md`: Installation instructions. +- `docs/architecture.md`: System architecture overview. +- `docs/agents/`: Documentation for the agent components. +- `docs/actions/`: Documentation for the action components. +- `docs/ui.md`: Documentation for the user interface. +- `docs/contributing.md`: Contributing guidelines. +- `docs/docs_development.md`: This guide for documentation development. + +## Adding New Pages + +To add a new page to the documentation: + +1. Create a new Markdown file in the appropriate directory. +2. Add the page to the navigation in `mkdocs.yml`: + ```yaml + nav: + - Home: index.md + - ... + - Your New Page: your_new_page.md + ``` + +## Deployment + +The documentation is automatically deployed to GitHub Pages when changes are pushed to the main branch. The deployment is handled by a GitHub Actions workflow defined in `.github/workflows/mkdocs.yml`. + +To manually deploy the documentation: + +1. Build the documentation: + ```bash + mkdocs build + ``` + +2. Deploy to GitHub Pages: + ```bash + mkdocs gh-deploy + ``` + +## Best Practices + +When writing documentation, follow these best practices: + +- Use clear and concise language. +- Include code examples where appropriate. +- Use headings to organize content. +- Include links to related documentation. +- Use images and diagrams to illustrate complex concepts. +- Keep the documentation up-to-date with the codebase. + +## Troubleshooting + +If you encounter issues with the documentation: + +1. 
Check the MkDocs logs for errors. +2. Verify that the Markdown syntax is correct. +3. Ensure that all links and images are valid. +4. Check that the navigation in `mkdocs.yml` is correctly formatted. \ No newline at end of file diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..fdc808d --- /dev/null +++ b/docs/index.md @@ -0,0 +1,51 @@ +# Aura: Agent for Understanding, Reasoning and Automation + +Welcome to the official documentation for Aura, a cascaded voice assistant system that includes Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and a ReAct-based Agent for reasoning and action taking. + +## Overview + +Aura is designed to be a comprehensive voice assistant system that can understand natural language, reason about user requests, and take appropriate actions. The system is built with a modular architecture that allows for easy extension and customization. + +## Demo + +[![Aura Demo](https://img.youtube.com/vi/cb7w0GVwwF0/0.jpg)](https://www.youtube.com/watch?v=cb7w0GVwwF0) + +## System Architecture + +![Aura System Architecture](images/aura_system_white.png) + +## Key Features + +- **Modular Architecture**: Easily extend and customize the system with new components. +- **Multiple Agents**: Choose between different agent implementations for different use cases. +- **Action Framework**: Execute various actions like web search, calendar management, email, and contact management. +- **Speech Interface**: Interact with the system using natural speech with accent-adaptive ASR. +- **Dialog State Tracking**: Keep track of conversation context for more natural interactions. + +## Getting Started + +To get started with Aura, check out the [Installation](installation.md) guide. + +## Repository Structure + +``` +. +├── agent/ # Core agent implementation +│ ├── actions/ # Action handlers for different tasks +│ ├── controller/ # Agent state and control logic +│ ├── llm/ # Language model integration +│ ├── secrets/ # Secure credential storage +│ └── agenthub/ # Agent implementations +│ +├── ui/ # User interface components +│ ├── local_speech_app.py # Speech interface implementation (using gradio) +│ └── requirements.txt # UI dependencies +│ +├── accent_adaptive_asr/ # Accent-adaptive speech recognition including finetuning +│ +├── llm_serve/ # Language model serving script +│ +├── dst/ # Dialog State Tracking. Has the scripts for finetuning LLMs for DST +│ +└── environment.yaml # Conda environment configuration +``` diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 0000000..1c8e508 --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,76 @@ +# Installation + +This guide will walk you through the process of setting up Aura on your local machine. + +## Prerequisites + +Before installing Aura, make sure you have the following prerequisites installed: + +- Python 3.8 or higher +- Conda (recommended for environment management) +- Git + +## Setup Steps + +1. Clone the repository: + ```bash + git clone https://github.com/Sentientia/Aura.git + cd Aura + ``` + +2. Create the conda environment: + ```bash + conda env create -f environment.yaml + ``` + +3. Activate the conda environment: + ```bash + conda activate aura + ``` + +4. Set the Python path: + ```bash + export PYTHONPATH=$PYTHONPATH:$(pwd) + ``` + +5. 
Set LLM-related environment variables:
   - `LLM_API_KEY`: API key for the language model
   - `LLM_API_BASE`: Base URL for the language model API
   - `LLM_MODEL`: Model identifier for the language model

   Example:
   ```bash
   export LLM_API_KEY="your-api-key"
   export LLM_API_BASE="https://api.openai.com/v1"
   export LLM_MODEL="gpt-4"
   ```

6. Set up secrets (required for tool use):

   Secrets are used to communicate with external APIs. Follow the format in `agent/secrets_example` and rename the directory to `agent/secrets`.

   You will need to:
   - Set up a Google Cloud Platform account
   - Grant the necessary permissions
   - Get the credential.json file
   - Get a SerpAPI key for web search

## Running the Application

Launch the Gradio app:
```bash
python ui/local_speech_app.py
```

This will start a local web server that you can access in your browser.

## Troubleshooting

If you encounter any issues during installation, check the following:

1. Make sure all environment variables are set correctly
2. Ensure that the conda environment is activated
3. Check that all dependencies are installed correctly
4. Verify that the secrets directory is set up properly

If you continue to experience issues, please open an issue on the [GitHub repository](https://github.com/Sentientia/Aura/issues).
\ No newline at end of file
diff --git a/docs/ui.md b/docs/ui.md
new file mode 100644
index 0000000..b783ae0
--- /dev/null
+++ b/docs/ui.md
@@ -0,0 +1,86 @@

# User Interface

Aura provides a user-friendly interface for interacting with the system. This page describes the UI components and how to use them.

## Overview

The Aura UI is implemented using Gradio, which provides a web-based interface for interacting with the system. The UI supports both text and speech inputs, and can provide responses in both text and speech formats.

## Components

The UI consists of the following components:

- **Input Section**: Where users can provide input through text or speech.
- **Output Section**: Where the system's responses are displayed.
- **Conversation History**: A record of the conversation between the user and the system.
- **Model Selection**: Options to select different ASR, TTS, and LLM models.
- **Settings**: Configuration options for the system.

## Usage

To launch the UI, run the following command:

```bash
python ui/local_speech_app.py
```

This will start a local web server that you can access in your browser.

## Speech Interface

The speech interface allows users to interact with the system using natural speech. To use the speech interface:

1. Click the microphone button to start recording.
2. Speak your request or question.
3. Click the stop button to stop recording.
4. The system will transcribe your speech, process your request, and provide a response.
5. If speech output is enabled, the system will also speak the response.

## Text Interface

The text interface allows users to interact with the system using text. To use the text interface:

1. Type your request or question in the input field.
2. Press Enter or click the submit button.
3. The system will process your request and provide a response.

## Model Selection

The UI allows users to select different models for ASR, TTS, and LLM:

- **ASR Models**: Options include Whisper, OWSM, and other speech recognition models.
- **TTS Models**: Options include ESPnet and other speech synthesis models.
+- **LLM Models**: Options include different language models for natural language understanding and generation. + +To select a model: + +1. Click the dropdown menu for the desired component (ASR, TTS, or LLM). +2. Select the desired model from the list. +3. The system will use the selected model for future interactions. + +## Settings + +The UI provides various settings to customize the system's behavior: + +- **Speech Input**: Enable or disable speech input. +- **Speech Output**: Enable or disable speech output. +- **Conversation History**: Clear the conversation history. +- **Debug Mode**: Enable or disable debug information. + +To access the settings: + +1. Click the settings button in the UI. +2. Adjust the settings as desired. +3. Click the save button to apply the changes. + +## Example Interaction + +Here's an example of how to interact with the system through the UI: + +1. User speaks or types: "What's the weather like today?" +2. System processes the request and responds: "I'll check the weather for you. Where are you located?" +3. User speaks or types: "New York City" +4. System uses the Web Search Action to find weather information for New York City. +5. System responds with the current weather conditions and forecast for New York City. + +This interaction demonstrates how the system can engage in a natural conversation with the user, gather necessary information, and provide helpful responses. \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..976f5b1 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,50 @@ +site_name: Aura Documentation +site_description: Documentation for Aura - Agent for Understanding, Reasoning and Automation +site_author: Sentientia +repo_url: https://github.com/Sentientia/Aura +repo_name: Sentientia/Aura + +theme: + name: material + palette: + primary: indigo + accent: indigo + features: + - navigation.tabs + - navigation.sections + - navigation.top + - search.suggest + - search.highlight + - content.tabs.link + +nav: + - Home: index.md + - Installation: installation.md + - Architecture: architecture.md + - Agents: + - Overview: agents/index.md + - Chat Agent: agents/chat_agent.md + - QA Agent: agents/qa_agent.md + - Actions: + - Overview: actions/index.md + - Chat Action: actions/chat_action.md + - Web Search Action: actions/web_search_action.md + - Calendar Action: actions/calendar_action.md + - Email Action: actions/email_action.md + - Contact Action: actions/contact_action.md + - Answer Action: actions/answer_action.md + - UI: ui.md + - Contributing: contributing.md + - Documentation Development: docs_development.md + +markdown_extensions: + - pymdownx.highlight + - pymdownx.superfences + - pymdownx.tabbed + - pymdownx.tasklist + - admonition + - toc: + permalink: true + +plugins: + - search