prompt2code/ARCHITECTURE.md

252 lines
12 KiB
Markdown
Raw Normal View History

2024-07-01 22:49:56 +03:00
# Devika Architecture
Devika is an advanced AI software engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve a given objective. This document provides a detailed technical overview of Devika's system architecture and how the various components work together.
## Table of Contents
1. [Overview](#overview)
2. [Agent Core](#agent-core)
3. [Agents](#agents)
- [Planner](#planner)
- [Researcher](#researcher)
- [Coder](#coder)
- [Action](#action)
- [Runner](#runner)
- [Feature](#feature)
- [Patcher](#patcher)
- [Reporter](#reporter)
- [Decision](#decision)
4. [Language Models](#language-models)
5. [Browser Interaction](#browser-interaction)
6. [Project Management](#project-management)
7. [Agent State Management](#agent-state-management)
8. [Services](#services)
9. [Utilities](#utilities)
10. [Conclusion](#conclusion)
## Overview
At a high level, Devika consists of the following key components:
- **Agent Core**: Orchestrates the overall AI planning, reasoning and execution process. Communicates with various sub-agents.
- **Agents**: Specialized sub-agents that handle specific tasks like planning, research, coding, patching, reporting etc.
- **Language Models**: Leverages large language models (LLMs) like Claude, GPT-4, GPT-3 for natural language understanding and generation.
- **Browser Interaction**: Enables web browsing, information gathering, and interaction with web elements.
- **Project Management**: Handles organization and persistence of project-related data.
- **Agent State Management**: Tracks and persists the dynamic state of the AI agent across interactions.
- **Services**: Integrations with external services like GitHub, Netlify for enhanced capabilities.
- **Utilities**: Supporting modules for configuration, logging, vector search, PDF generation etc.
Let's dive into each of these components in more detail.
## Agent Core
The `Agent` class serves as the central engine that drives Devika's AI planning and execution loop. Here's how it works:
1. When a user provides a high-level prompt, the `execute` method is invoked on the Agent.
2. The prompt is first passed to the Planner agent to generate a step-by-step plan.
3. The Researcher agent then takes this plan and extracts relevant search queries and context.
4. The Agent performs web searches using Bing Search API and crawls the top results.
5. The raw crawled content is passed through the Formatter agent to extract clean, relevant information.
6. This researched context, along with the step-by-step plan, is fed to the Coder agent to generate code.
7. The generated code is saved to the project directory on disk.
8. If the user interacts further with a follow-up prompt, the `subsequent_execute` method is invoked.
9. The Action agent determines the appropriate action to take based on the user's message (run code, deploy, write tests, add feature, fix bug, write report etc.)
10. The corresponding specialized agent is invoked to perform the action (Runner, Feature, Patcher, Reporter).
11. Results are communicated back to the user and the project files are updated.
Throughout this process, the Agent Core is responsible for:
- Managing conversation history and project-specific context
- Updating agent state and internal monologue
- Accumulating context keywords across agent prompts
- Emulating the "thinking" process of the AI through timed agent state updates
- Handling special commands through the Decision agent (e.g. git clone, browser interaction session)
## Agents
Devika's cognitive abilities are powered by a collection of specialized sub-agents. Each agent is implemented as a separate Python class. Agents communicate with the underlying LLMs through prompt templates defined in Jinja2 format. Key agents include:
### Planner
- Generates a high-level step-by-step plan based on the user's prompt
- Extracts focus area and provides a summary
- Uses few-shot prompting to provide examples of the expected response format
### Researcher
- Takes the generated plan and extracts relevant search queries
- Ranks and filters queries based on relevance and specificity
- Prompts the user for additional context if required
- Aims to maximize information gain while minimizing number of searches
### Coder
- Generates code based on the step-by-step plan and researched context
- Segments code into appropriate files and directories
- Includes informative comments and documentation
- Handles a variety of languages and frameworks
- Validates code syntax and style
### Action
- Determines the appropriate action to take based on the user's follow-up prompt
- Maps user intent to a specific action keyword (run, test, deploy, fix, implement, report)
- Provides a human-like confirmation of the action to the user
### Runner
- Executes the written code in a sandboxed environment
- Handles different OS environments (Mac, Linux, Windows)
- Streams command output to user in real-time
- Gracefully handles errors and exceptions
### Feature
- Implements a new feature based on user's specification
- Modifies existing project files while maintaining code structure and style
- Performs incremental testing to verify feature is working as expected
### Patcher
- Debugs and fixes issues based on user's description or error message
- Analyzes existing code to identify potential root causes
- Suggests and implements fix, with explanation of the changes made
### Reporter
- Generates a comprehensive report summarizing the project
- Includes high-level overview, technical design, setup instructions, API docs etc.
- Formats report in a clean, readable structure with table of contents
- Exports report as a PDF document
### Decision
- Handles special command-like instructions that don't fit other agents
- Maps commands to specific functions (git clone, browser interaction etc.)
- Executes the corresponding function with provided arguments
Each agent follows a common pattern:
1. Prepare a prompt by rendering the Jinja2 template with current context
2. Query the LLM to get a response based on the prompt
3. Validate and parse the LLM's response to extract structured output
4. Perform any additional processing or side-effects (e.g. save to disk)
5. Return the result to the Agent Core for further action
Agents aim to be stateless and idempotent where possible. State and history is managed by the Agent Core and passed into the agents as needed. This allows for a modular, composable design.
## Language Models
Devika's natural language processing capabilities are driven by state-of-the-art LLMs. The `LLM` class provides a unified interface to interact with different language models:
- **Claude** (Anthropic): Claude models like claude-v1.3, claude-instant-v1.0 etc.
- **GPT-4/GPT-3** (OpenAI): Models like gpt-4, gpt-3.5-turbo etc.
- **Self-hosted models** (via [Ollama](https://ollama.com/)): Allows using open-source models in a self-hosted environment
The `LLM` class abstracts out the specifics of each provider's API, allowing agents to interact with the models in a consistent way. It supports:
- Listing available models
- Generating completions based on a prompt
- Tracking and accumulating token usage over time
Choosing the right model for a given use case depends on factors like desired quality, speed, cost etc. The modular design allows swapping out models easily.
## Browser Interaction
Devika can interact with webpages in an automated fashion to gather information and perform actions. This is powered by the `Browser` and `Crawler` classes.
The `Browser` class uses Playwright to provide high-level web automation primitives:
- Spawning a browser instance (Chromium)
- Navigating to a URL
- Querying DOM elements
- Extracting page content as text, Markdown, PDF etc.
- Taking a screenshot of the page
The `Crawler` class defines an agent that can interact with a webpage based on natural language instructions. It leverages:
- Pre-defined browser actions like scroll, click, type etc.
- A prompt template that provides examples of how to use these actions
- LLM to determine the best action to take based on current page content and objective
The `start_interaction` function sets up a loop where:
1. The current page content and objective is passed to the LLM
2. The LLM returns the next best action to take (e.g. "CLICK 12" or "TYPE 7 machine learning")
3. The Crawler executes this action on the live page
4. The process repeats from the updated page state
This allows performing a sequence of actions to achieve a higher-level objective (e.g. research a topic, fill out a form, interact with an app etc.)
## Project Management
The `ProjectManager` class is responsible for creating, updating and querying projects and their associated metadata. Key functions include:
- Creating a new project and initializing its directory structure
- Deleting a project and its associated files
- Adding a message to a project's conversation history
- Retrieving messages for a given project
- Getting the latest user/AI message in a conversation
- Listing all projects
- Zipping a project's files for export
Project metadata is persisted in a SQLite database using SQLModel. The `Projects` table stores:
- Project name
- JSON-serialized conversation history
This allows the agent to work on multiple projects simultaneously and retain conversation history across sessions.
## Agent State Management
As the AI agent works on a task, we need to track and display its internal state to the user. The `AgentState` class handles this by providing an interface to:
- Initialize a new agent state
- Add a state to the current sequence of states for a project
- Update the latest state for a project
- Query the latest state or entire state history for a project
- Mark the agent as active/inactive or task as completed
Agent state includes information like:
- Current step or action being executed
- Internal monologue reflecting the agent's current "thoughts"
- Browser interactions (URL visited, screenshot)
- Terminal interactions (command executed, output)
- Token usage so far
Like projects, agent states are also persisted in the SQLite DB using SQLModel. The `AgentStateModel` table stores:
- Project name
- JSON-serialized list of states
Having a persistent log of agent states is useful for:
- Providing real-time visibility to the user
- Auditing and debugging agent behavior
- Resuming from interruptions or failures
## Services
Devika integrates with external services to augment its capabilities:
- **GitHub**: Performing git operations like clone/pull, listing repos/commits/files etc.
- **Netlify**: Deploying web apps and sites seamlessly
The `GitHub` and `Netlify` classes provide lightweight wrappers around the respective service APIs.
They handle authentication, making HTTP requests, and parsing responses.
This allows Devika to perform actions like:
- Cloning a repo given a GitHub URL
- Listing a user's GitHub repos
- Creating a new Netlify site
- Deploying a directory to Netlify
- Providing the deployed site URL to the user
Integrations are done in a modular way so that new services can be added easily.
## Utilities
Devika makes use of several utility modules to support its functioning:
- `Config`: Loads and provides access to configuration settings (API keys, folder paths etc.)
- `Logger`: Sets up logging to console and file, with support for log levels and colors
- `ReadCode`: Recursively reads code files in a directory and converts them into a Markdown format
- `SentenceBERT`: Extracts keywords and semantic information from text using SentenceBERT embeddings
- `Experts`: A collection of domain-specific knowledge bases to assist in certain areas (e.g. webdev, physics, chemistry, math)
The utility modules aim to provide reusable functionality that is used across different parts of the system.
## Conclusion
Devika is a complex system that combines multiple AI and automation techniques to deliver an intelligent programming assistant. Key design principles include:
- Modularity: Breaking down functionality into specialized agents and services
- Flexibility: Supporting different LLMs, services and domains in a pluggable fashion
- Persistence: Storing project and agent state in a DB to enable pause/resume and auditing
- Transparency: Surfacing agent thought process and interactions to user in real-time
By understanding how the different components work together, we can extend, optimize and scale Devika to take on increasingly sophisticated software engineering tasks. The agent-based architecture provides a strong foundation to build more advanced AI capabilities in the future.