init devika repo

2024-07-01 22:49:56 +03:00
commit f0b94ab9bd
164 changed files with 8016 additions and 0 deletions
--- a/docs/architecture/ARCHITECTURE.md
+++ b/docs/architecture/ARCHITECTURE.md
@@ -0,0 +1,251 @@
+# Devika Architecture
+
+Devika is an advanced AI software engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve a given objective. This document provides a detailed technical overview of Devika's system architecture and how the various components work together.
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Agent Core](#agent-core)
+3. [Agents](#agents)
+   - [Planner](#planner)
+   - [Researcher](#researcher) 
+   - [Coder](#coder)
+   - [Action](#action)
+   - [Runner](#runner)
+   - [Feature](#feature)
+   - [Patcher](#patcher)
+   - [Reporter](#reporter)
+   - [Decision](#decision)
+4. [Language Models](#language-models)
+5. [Browser Interaction](#browser-interaction) 
+6. [Project Management](#project-management)
+7. [Agent State Management](#agent-state-management)
+8. [Services](#services)
+9. [Utilities](#utilities)
+10. [Conclusion](#conclusion)
+
+## Overview
+
+At a high level, Devika consists of the following key components:
+
+- **Agent Core**: Orchestrates the overall AI planning, reasoning and execution process. Communicates with various sub-agents.
+- **Agents**: Specialized sub-agents that handle specific tasks like planning, research, coding, patching, reporting etc.  
+- **Language Models**: Large language models (LLMs) such as Claude, GPT-4, and GPT-3.5 provide natural language understanding and generation.
+- **Browser Interaction**: Enables web browsing, information gathering, and interaction with web elements.
+- **Project Management**: Handles organization and persistence of project-related data. 
+- **Agent State Management**: Tracks and persists the dynamic state of the AI agent across interactions.
+- **Services**: Integrations with external services like GitHub, Netlify for enhanced capabilities.
+- **Utilities**: Supporting modules for configuration, logging, vector search, PDF generation etc.
+
+Let's dive into each of these components in more detail.
+
+## Agent Core
+
+The `Agent` class serves as the central engine that drives Devika's AI planning and execution loop. Here's how it works:
+
+1. When a user provides a high-level prompt, the `execute` method is invoked on the Agent. 
+2. The prompt is first passed to the Planner agent to generate a step-by-step plan.
+3. The Researcher agent then takes this plan and extracts relevant search queries and context.
+4. The Agent performs web searches using the Bing Search API and crawls the top results.
+5. The raw crawled content is passed through the Formatter agent to extract clean, relevant information.
+6. This researched context, along with the step-by-step plan, is fed to the Coder agent to generate code.
+7. The generated code is saved to the project directory on disk.
+8. If the user interacts further with a follow-up prompt, the `subsequent_execute` method is invoked.
+9. The Action agent determines the appropriate action to take based on the user's message (run code, deploy, write tests, add a feature, fix a bug, write a report, etc.).
+10. The corresponding specialized agent is invoked to perform the action (Runner, Feature, Patcher, Reporter).
+11. Results are communicated back to the user and the project files are updated.
+
+Throughout this process, the Agent Core is responsible for:
+- Managing conversation history and project-specific context
+- Updating agent state and internal monologue 
+- Accumulating context keywords across agent prompts
+- Emulating the "thinking" process of the AI through timed agent state updates
+- Handling special commands through the Decision agent (e.g. git clone, browser interaction session)
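
The first-prompt flow (steps 1 to 7 above) can be sketched as a simple pipeline. This is a minimal sketch, not Devika's actual code: the names `plan`, `extract_queries`, `search_and_format`, `generate_code`, and `ExecutionTrace` are hypothetical, and the stubbed return values stand in for real LLM and search calls.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionTrace:
    """Collects what each stage produced, standing in for agent state updates."""
    prompt: str
    plan: list[str] = field(default_factory=list)
    queries: list[str] = field(default_factory=list)
    context: str = ""
    files: dict[str, str] = field(default_factory=dict)


# Hypothetical stubs: a real system would call an LLM or a search API here.
def plan(prompt: str) -> list[str]:
    return [f"Understand the request: {prompt}", "Write the code"]


def extract_queries(steps: list[str]) -> list[str]:
    return [step.lower() for step in steps[:1]]


def search_and_format(queries: list[str]) -> str:
    return "\n".join(f"(notes for: {q})" for q in queries)


def generate_code(steps: list[str], context: str) -> dict[str, str]:
    return {"main.py": "print('hello')\n"}


def execute(prompt: str) -> ExecutionTrace:
    """Run plan -> research -> code in order, recording each result."""
    trace = ExecutionTrace(prompt=prompt)
    trace.plan = plan(prompt)
    trace.queries = extract_queries(trace.plan)
    trace.context = search_and_format(trace.queries)
    trace.files = generate_code(trace.plan, trace.context)
    return trace
```

Keeping each stage a plain function that takes the previous stage's output makes the order of the loop explicit and lets any stage be swapped for a real implementation.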
+
+## Agents
+
+Devika's cognitive abilities are powered by a collection of specialized sub-agents. Each agent is implemented as a separate Python class. Agents communicate with the underlying LLMs through prompt templates defined in Jinja2 format. Key agents include:
+
+### Planner
+- Generates a high-level step-by-step plan based on the user's prompt
+- Extracts focus area and provides a summary
+- Uses few-shot prompting to provide examples of the expected response format
+
+### Researcher
+- Takes the generated plan and extracts relevant search queries 
+- Ranks and filters queries based on relevance and specificity
+- Prompts the user for additional context if required
+- Aims to maximize information gain while minimizing the number of searches
+
+### Coder
+- Generates code based on the step-by-step plan and researched context
+- Segments code into appropriate files and directories
+- Includes informative comments and documentation
+- Handles a variety of languages and frameworks
+- Validates code syntax and style
+
+### Action
+- Determines the appropriate action to take based on the user's follow-up prompt
+- Maps user intent to a specific action keyword (run, test, deploy, fix, implement, report)
+- Provides a human-like confirmation of the action to the user
+
+### Runner
+- Executes the written code in a sandboxed environment 
+- Handles different OS environments (Mac, Linux, Windows)
+- Streams command output to the user in real time
+- Gracefully handles errors and exceptions
+
+### Feature
+- Implements a new feature based on the user's specification
+- Modifies existing project files while maintaining code structure and style
+- Performs incremental testing to verify that the feature works as expected
+
+### Patcher
+- Debugs and fixes issues based on the user's description or error message
+- Analyzes existing code to identify potential root causes
+- Suggests and implements a fix, with an explanation of the changes made
+
+### Reporter
+- Generates a comprehensive report summarizing the project
+- Includes high-level overview, technical design, setup instructions, API docs etc.
+- Formats report in a clean, readable structure with table of contents
+- Exports report as a PDF document
+
+### Decision
+- Handles special command-like instructions that don't fit other agents
+- Maps commands to specific functions (git clone, browser interaction etc.)
+- Executes the corresponding function with provided arguments
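
Mapping a command to a function and calling it with the provided arguments can be sketched with a dispatch table. The handler names, the `decision` dictionary shape, and the returned strings below are assumptions for illustration only; they are not taken from Devika's code.

```python
from typing import Callable


def git_clone(url: str) -> str:
    # Hypothetical handler: a real one would invoke git here.
    return f"would clone {url}"


def browser_interaction(objective: str) -> str:
    # Hypothetical handler: a real one would start a browser session.
    return f"would browse for: {objective}"


# Dispatch table from command keyword to the function that handles it.
HANDLERS: dict[str, Callable[..., str]] = {
    "git_clone": git_clone,
    "browser_interaction": browser_interaction,
}


def dispatch(decision: dict) -> str:
    """Look up the function named in the decision and call it with its arguments."""
    name = decision["function"]
    handler = HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"unknown function: {name}")
    return handler(**decision.get("args", {}))
```

Adding a new command under this design means adding one function and one table entry, with no change to `dispatch` itself.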
+
+Each agent follows a common pattern:
+1. Prepare a prompt by rendering the Jinja2 template with current context
+2. Query the LLM to get a response based on the prompt
+3. Validate and parse the LLM's response to extract structured output
+4. Perform any additional processing or side-effects (e.g. save to disk)
+5. Return the result to the Agent Core for further action
+
+Agents aim to be stateless and idempotent where possible. State and history are managed by the Agent Core and passed into the agents as needed. This allows for a modular, composable design.
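
The render, query, validate, and return pattern can be sketched as follows. To keep the example dependency-free, `string.Template` stands in for a Jinja2 template, and `fake_llm` is a hypothetical model stub; the prompt text and the `summary` key are also assumptions.

```python
import json
from string import Template
from typing import Callable

# string.Template stands in for a Jinja2 template so the sketch needs only the standard library.
PROMPT = Template("Summarize the goal in JSON with a 'summary' key.\nGoal: $goal")


def run_agent(goal: str, llm: Callable[[str], str]) -> dict:
    prompt = PROMPT.substitute(goal=goal)      # 1. render the prompt
    raw = llm(prompt)                          # 2. query the model
    try:
        parsed = json.loads(raw)               # 3. parse the response
    except json.JSONDecodeError as exc:
        raise ValueError("response is not valid JSON") from exc
    if "summary" not in parsed:                # 3. validate the structure
        raise ValueError("response is missing 'summary'")
    return parsed                              # 5. return to the caller


def fake_llm(prompt: str) -> str:
    # Hypothetical model: echoes the last line of the prompt as the summary.
    return json.dumps({"summary": prompt.splitlines()[-1]})
```

Because `run_agent` receives the model as an argument and holds no state of its own, the same call with the same inputs yields the same result, which matches the stateless design goal described above.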
+
+## Language Models
+
+Devika's natural language processing capabilities are driven by state-of-the-art LLMs. The `LLM` class provides a unified interface to interact with different language models:
+
+- **Claude** (Anthropic): Claude models like claude-v1.3, claude-instant-v1.0 etc.
+- **GPT-4/GPT-3.5** (OpenAI): Models like gpt-4, gpt-3.5-turbo etc.
+- **Self-hosted models** (via [Ollama](https://ollama.com/)): Allows using open-source models in a self-hosted environment
+
+The `LLM` class abstracts out the specifics of each provider's API, allowing agents to interact with the models in a consistent way. It supports:
+- Listing available models
+- Generating completions based on a prompt
+- Tracking and accumulating token usage over time
+
+Choosing the right model for a given use case depends on factors like desired quality, speed, cost etc. The modular design allows swapping out models easily.
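
A unified interface of this kind might look like the sketch below. The class name `UnifiedLLM`, the backend functions, and the whitespace-based token count are assumptions for illustration; real providers report token usage through their own APIs.

```python
from typing import Callable


# Hypothetical provider backends: each takes a prompt and returns text.
def _claude_backend(prompt: str) -> str:
    return f"[claude] {prompt}"


def _openai_backend(prompt: str) -> str:
    return f"[openai] {prompt}"


class UnifiedLLM:
    """One entry point that routes to a provider and tracks token usage."""

    def __init__(self) -> None:
        self._backends: dict[str, Callable[[str], str]] = {
            "claude": _claude_backend,
            "openai": _openai_backend,
        }
        self.tokens_used = 0

    def list_models(self) -> list[str]:
        return sorted(self._backends)

    def complete(self, model: str, prompt: str) -> str:
        if model not in self._backends:
            raise ValueError(f"unsupported model: {model}")
        response = self._backends[model](prompt)
        # Whitespace word count is a crude stand-in for a real tokenizer.
        self.tokens_used += len(prompt.split()) + len(response.split())
        return response
```

Callers only ever see `list_models` and `complete`, so swapping one provider for another is a change to the backend table rather than to the agents.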
+
+## Browser Interaction
+
+Devika can interact with webpages in an automated fashion to gather information and perform actions. This is powered by the `Browser` and `Crawler` classes.
+
+The `Browser` class uses Playwright to provide high-level web automation primitives:
+- Spawning a browser instance (Chromium)
+- Navigating to a URL
+- Querying DOM elements 
+- Extracting page content as text, Markdown, PDF etc.
+- Taking a screenshot of the page
+
+The `Crawler` class defines an agent that can interact with a webpage based on natural language instructions. It leverages:
+- Pre-defined browser actions like scroll, click, type etc.
+- A prompt template that provides examples of how to use these actions
+- An LLM that chooses the next action based on the current page content and objective
+
+The `start_interaction` function sets up a loop where:
+1. The current page content and objective are passed to the LLM
+2. The LLM returns the next best action to take (e.g. "CLICK 12" or "TYPE 7 machine learning")
+3. The Crawler executes this action on the live page
+4. The process repeats from the updated page state
+
+This loop lets Devika chain actions toward a higher-level objective (e.g. researching a topic, filling out a form, or interacting with an app).
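
The action strings returned by the LLM need to be parsed before they can be executed. The sketch below handles the two example forms mentioned above, `CLICK 12` and `TYPE 7 machine learning`. The function names are hypothetical, the list of actions stands in for successive LLM replies, and the "page" is faked with log messages rather than a live browser.

```python
def parse_action(text: str) -> tuple[str, int, str]:
    """Split 'CLICK 12' or 'TYPE 7 machine learning' into (verb, element id, argument)."""
    parts = text.strip().split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError(f"malformed action: {text!r}")
    verb, element_id = parts[0].upper(), int(parts[1])
    argument = parts[2] if len(parts) == 3 else ""
    if verb not in {"CLICK", "TYPE"}:
        raise ValueError(f"unsupported verb: {verb}")
    return verb, element_id, argument


def interact(actions: list[str], max_steps: int = 10) -> list[str]:
    """Replay model-chosen actions against a fake page, logging each step."""
    log = []
    for text in actions[:max_steps]:
        verb, element_id, argument = parse_action(text)
        if verb == "CLICK":
            log.append(f"clicked element {element_id}")
        else:
            log.append(f"typed {argument!r} into element {element_id}")
    return log
```

The `max_steps` cap is one simple way to keep such a loop from running indefinitely when the model never reaches its objective.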
+
+## Project Management
+
+The `ProjectManager` class is responsible for creating, updating and querying projects and their associated metadata. Key functions include:
+
+- Creating a new project and initializing its directory structure
+- Deleting a project and its associated files
+- Adding a message to a project's conversation history
+- Retrieving messages for a given project
+- Getting the latest user/AI message in a conversation
+- Listing all projects
+- Zipping a project's files for export
+
+Project metadata is persisted in a SQLite database using SQLModel. The `Projects` table stores:
+- Project name
+- JSON-serialized conversation history
+
+This allows the agent to work on multiple projects simultaneously and retain conversation history across sessions.
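
Storing a JSON-serialized conversation history per project can be sketched with the standard-library `sqlite3` module, used here in place of SQLModel so the example has no third-party dependencies. The class name, table name, column names, and message fields are assumptions for illustration.

```python
import json
import sqlite3


class ProjectStore:
    """Keeps one row per project with its conversation history as JSON text."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS projects ("
            "name TEXT PRIMARY KEY, message_stack_json TEXT NOT NULL)"
        )

    def create_project(self, name: str) -> None:
        self.conn.execute(
            "INSERT INTO projects (name, message_stack_json) VALUES (?, ?)",
            (name, json.dumps([])),
        )
        self.conn.commit()

    def get_messages(self, name: str) -> list[dict]:
        row = self.conn.execute(
            "SELECT message_stack_json FROM projects WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return json.loads(row[0])

    def add_message(self, name: str, from_devika: bool, message: str) -> None:
        # Read-modify-write of the JSON column; fine for a single writer.
        messages = self.get_messages(name)
        messages.append({"from_devika": from_devika, "message": message})
        self.conn.execute(
            "UPDATE projects SET message_stack_json = ? WHERE name = ?",
            (json.dumps(messages), name),
        )
        self.conn.commit()
```

Passing a file path instead of `:memory:` would keep the history across sessions.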
+
+## Agent State Management
+
+As the AI agent works on a task, we need to track and display its internal state to the user. The `AgentState` class handles this by providing an interface to:
+
+- Initialize a new agent state 
+- Add a state to the current sequence of states for a project
+- Update the latest state for a project
+- Query the latest state or entire state history for a project
+- Mark the agent as active/inactive or task as completed
+
+Agent state includes information like:
+- Current step or action being executed
+- Internal monologue reflecting the agent's current "thoughts"
+- Browser interactions (URL visited, screenshot)
+- Terminal interactions (command executed, output)
+- Token usage so far
+
+Like projects, agent states are also persisted in the SQLite DB using SQLModel. The `AgentStateModel` table stores:
+- Project name
+- JSON-serialized list of states
+
+Having a persistent log of agent states is useful for:
+- Providing real-time visibility to the user
+- Auditing and debugging agent behavior
+- Resuming from interruptions or failures
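
A state record and its per-project history could be modeled roughly as below. The field names mirror the kinds of information listed above but are illustrative, not Devika's actual schema, and the history is kept in memory rather than in the database.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class State:
    """One snapshot of what the agent is doing; field names are illustrative."""
    step: str
    internal_monologue: str = ""
    browser_url: str = ""
    terminal_command: str = ""
    token_usage: int = 0
    completed: bool = False


@dataclass
class StateLog:
    """Ordered list of snapshots for one project."""
    project: str
    states: list[State] = field(default_factory=list)

    def add(self, state: State) -> None:
        self.states.append(state)

    def update_latest(self, **changes) -> None:
        if not self.states:
            raise IndexError("no state to update")
        for key, value in changes.items():
            if not hasattr(self.states[-1], key):
                raise AttributeError(key)
            setattr(self.states[-1], key, value)

    def latest(self) -> State:
        return self.states[-1]

    def to_json(self) -> str:
        # The serialized list is what would be stored in the database row.
        return json.dumps([asdict(s) for s in self.states])
```

Appending snapshots instead of overwriting a single record is what makes the history usable for auditing and for resuming after a failure.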
+
+## Services
+
+Devika integrates with external services to augment its capabilities:
+
+- **GitHub**: Performing git operations like clone/pull, listing repos/commits/files etc.
+- **Netlify**: Deploying web apps and sites seamlessly
+
+The `GitHub` and `Netlify` classes provide lightweight wrappers around the respective service APIs. They handle authentication, making HTTP requests, and parsing responses.
+
+This allows Devika to perform actions like:
+- Cloning a repo given a GitHub URL
+- Listing a user's GitHub repos 
+- Creating a new Netlify site
+- Deploying a directory to Netlify 
+- Providing the deployed site URL to the user
+
+Integrations are done in a modular way so that new services can be added easily.
+
+## Utilities
+
+Devika makes use of several utility modules to support its functioning:
+
+- `Config`: Loads and provides access to configuration settings (API keys, folder paths etc.) 
+- `Logger`: Sets up logging to console and file, with support for log levels and colors
+- `ReadCode`: Recursively reads code files in a directory and converts them into a Markdown format
+- `SentenceBERT`: Extracts keywords and semantic information from text using SentenceBERT embeddings
+- `Experts`: A collection of domain-specific knowledge bases to assist in certain areas (e.g. webdev, physics, chemistry, math)
+
+The utility modules aim to provide reusable functionality that is used across different parts of the system.
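
As an illustration of what a `ReadCode`-style utility does, the sketch below walks a directory and renders each code file as a Markdown section. The function name, the extension filter, and the heading layout are assumptions and may differ from Devika's implementation.

```python
import os

CODE_EXTENSIONS = {".py", ".js", ".html", ".css"}  # assumed filter, not from the source


def code_to_markdown(root: str) -> str:
    """Walk root and emit each code file as a Markdown heading plus fenced block."""
    fence = "`" * 3
    sections = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for filename in sorted(filenames):
            if os.path.splitext(filename)[1] not in CODE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8") as handle:
                body = handle.read()
            relative = os.path.relpath(path, root)
            sections.append(f"### {relative}\n\n{fence}\n{body.rstrip()}\n{fence}")
    return "\n\n".join(sections)
```

Rendering the project this way gives an LLM prompt a single text blob in which every file is labeled with its path.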
+
+## Conclusion
+
+Devika is a complex system that combines multiple AI and automation techniques to deliver an intelligent programming assistant. Key design principles include:
+
+- Modularity: Breaking down functionality into specialized agents and services
+- Flexibility: Supporting different LLMs, services and domains in a pluggable fashion  
+- Persistence: Storing project and agent state in a DB to enable pause/resume and auditing
+- Transparency: Surfacing the agent's thought process and interactions to the user in real time
+
+By understanding how the different components work together, we can extend, optimize and scale Devika to take on increasingly sophisticated software engineering tasks. The agent-based architecture provides a strong foundation to build more advanced AI capabilities in the future.
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -0,0 +1,16 @@
+## System Architecture
+
+Devika's system architecture consists of the following key components:
+
+1. **User Interface**: A web-based chat interface for interacting with Devika, viewing project files, and monitoring the agent's state.
+2. **Agent Core**: The central component that orchestrates the AI planning, reasoning, and execution process. It communicates with various sub-agents and modules to accomplish tasks.
+3. **Large Language Models**: Devika leverages state-of-the-art language models like **Claude**, **GPT-4**, and **Local LLMs via Ollama** for natural language understanding, generation, and reasoning.
+4. **Planning and Reasoning Engine**: Responsible for breaking down high-level objectives into actionable steps and making decisions based on the current context.
+5. **Research Module**: Utilizes keyword extraction and web browsing capabilities to gather relevant information for the task at hand.
+6. **Code Writing Module**: Generates code based on the plan, research findings, and user requirements. Supports multiple programming languages.
+7. **Browser Interaction Module**: Enables Devika to navigate websites, extract information, and interact with web elements as needed.
+8. **Knowledge Base**: Stores and retrieves project-specific information, code snippets, and learned knowledge for efficient access.
+9. **Database**: Persists project data, agent states, and configuration settings.
+
+Read [ARCHITECTURE.md](https://github.com/stitionai/devika/Docs/architecture/ARCHITECTURE.md) for the detailed architecture of Devika.
+Read [UNDER_THE_HOOD.md](https://github.com/stitionai/devika/Docs/architecture/UNDER_THE_HOOD.md) for the detailed working of Devika.
--- a/docs/architecture/UNDER_THE_HOOD.md
+++ b/docs/architecture/UNDER_THE_HOOD.md
@@ -0,0 +1,50 @@
+## Under The Hood
+
+Let's dive deeper into some of the key components and techniques used in Devika:
+
+### AI Planning and Reasoning
+
+Devika employs advanced AI planning and reasoning algorithms to break down high-level objectives into actionable steps. The planning process involves the following stages:
+
+1. **Objective Understanding**: Devika analyzes the given objective or task description to understand the user's intent and requirements.
+2. **Context Gathering**: Relevant context is collected from the conversation history, project files, and knowledge base to inform the planning process.
+3. **Step Generation**: Based on the objective and context, Devika generates a sequence of high-level steps to accomplish the task.
+4. **Refinement and Validation**: The generated steps are refined and validated to ensure their feasibility and alignment with the objective.
+5. **Execution**: Devika executes each step in the plan, utilizing various sub-agents and modules as needed.
+
+The reasoning engine continually evaluates progress and adjusts the plan based on new information or feedback received during execution.
+
+### Keyword Extraction
+
+To enable focused research and information gathering, Devika employs keyword extraction techniques. The process involves the following steps:
+
+1. **Preprocessing**: The input text (objective, conversation history, or project files) is preprocessed by removing stop words, tokenizing, and normalizing the text.
+2. **Keyword Identification**: Devika uses the BERT (Bidirectional Encoder Representations from Transformers) model to identify important keywords and phrases from the preprocessed text. BERT's pre-training on a large corpus allows it to capture semantic relationships and understand the significance of words in the given context.
+3. **Keyword Ranking**: The identified keywords are ranked based on their relevance and importance to the task at hand. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) and TextRank are used to assign scores to each keyword.
+4. **Keyword Selection**: The top-ranked keywords are selected as the most relevant and informative for the current context. These keywords are used to guide the research and information gathering process.
+
+By extracting contextually relevant keywords, Devika can focus its research efforts and retrieve pertinent information to assist in the task completion.
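
The preprocessing, ranking, and selection steps can be illustrated with a small TF-IDF scorer. This sketch covers only the TF-IDF part of the description; it does not use BERT or TextRank, and the stop-word list, function names, and smoothing formula are assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def top_keywords(document: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank the document's terms by TF-IDF against the corpus and return the top k."""
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    counts = Counter(doc_tokens)
    corpus_tokens = [set(tokenize(text)) for text in corpus]
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(term in tokens for tokens in corpus_tokens)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
        scores[term] = tf * idf
    # Sort by score descending, then alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that are frequent in the prompt but rare in the reference corpus score highest, which is why they make useful search keywords.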
+
+### Browser Interaction
+
+Devika incorporates browser interaction capabilities to navigate websites, extract information, and interact with web elements. The browser interaction module leverages the Playwright library to automate web interactions. The process involves the following steps:
+
+1. **Navigation**: Devika uses Playwright to navigate to specific URLs or perform searches based on the keywords or requirements provided.
+2. **Element Interaction**: Playwright allows Devika to interact with web elements such as clicking buttons, filling forms, and extracting text from specific elements.
+3. **Page Parsing**: Devika parses the HTML structure of the web pages visited to extract relevant information. It uses techniques like CSS selectors and XPath to locate and extract specific data points.
+4. **JavaScript Execution**: Playwright enables Devika to execute JavaScript code within the browser context, allowing for dynamic interactions and data retrieval.
+5. **Screenshot Capture**: Devika can capture screenshots of the web pages visited, which can be useful for visual reference or debugging purposes.
+
+The browser interaction module empowers Devika to gather information from the web, interact with online resources, and incorporate real-time data into its decision-making and code generation processes.
+
+### Code Writing
+
+Devika's code writing module generates code based on the plan, research findings, and user requirements. The process involves the following steps:
+
+1. **Language Selection**: Devika identifies the programming language specified by the user or infers it based on the project context.
+2. **Code Structure Generation**: Based on the plan and language-specific patterns, Devika generates the high-level structure of the code, including classes, functions, and modules.
+3. **Code Population**: Devika fills in the code structure with specific logic, algorithms, and data manipulation statements. It leverages the research findings, code snippets from the knowledge base, and its own understanding of programming concepts to generate meaningful code.
+4. **Code Formatting**: The generated code is formatted according to the language-specific conventions and best practices to ensure readability and maintainability.
+5. **Code Review and Refinement**: Devika reviews the generated code for syntax errors, logical inconsistencies, and potential improvements. It iteratively refines the code based on its own analysis and any feedback provided by the user.
+
+Devika's code writing capabilities enable it to generate functional and efficient code in various programming languages, taking into account the specific requirements and context of each project.
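
The syntax-checking part of the review step can be sketched for Python output using the standard-library `ast` module. This is one possible approach under the assumption that generated files arrive as a mapping of file name to source text; the function names are hypothetical, and other languages would need their own parsers.

```python
import ast


def check_python_syntax(source: str) -> list[str]:
    """Return a list of syntax problems; an empty list means the code parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


def review(files: dict[str, str]) -> dict[str, list[str]]:
    """Check every generated .py file and keep only the ones with problems."""
    report = {}
    for name, source in files.items():
        if name.endswith(".py"):
            problems = check_python_syntax(source)
            if problems:
                report[name] = problems
    return report
```

A non-empty report could be fed back into the next generation prompt so that the model can repair the files it got wrong.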