SemanticAgent: taking PandasAI to the Next Level

SemanticAgent: taking PandasAI to the Next Level

At PandasAI, we are constantly striving to push the boundaries of data analysis and provide our users with the most advanced and powerful tools available. Our PandasAI library has revolutionized the way businesses approach data, enabling them to extract valuable insights with unprecedented ease and efficiency. However, we are not ones to rest on our laurels. Recognizing the ever-evolving needs of the modern data landscape, we are thrilled to introduce the SemanticAgent, a groundbreaking extension that takes PandasAI to new heights of accuracy, interpretability, and explainability.

The SemanticAgent is a cutting-edge tool that enhances the capabilities of our renowned PandasAI library by adding a semantic layer to its outputs. While the standard Agent has proven invaluable, the SemanticAgent takes things a step further by generating a JSON query that can then be used to produce Python or SQL code. This approach ensures that the results are not only accurate but also easily understandable and traceable, empowering users to gain deeper insights into their data and make more informed decisions.

In this article, we will delve into the world of the SemanticAgent, exploring its inner workings and demonstrating how it can elevate your data analysis experience to unprecedented levels. We'll guide you through practical examples, provide you with valuable insights, and equip you with the knowledge necessary to leverage the full potential of this game-changing tool.

The Power of Explainability

While PandasAI has already revolutionized the way businesses approach data analysis, the SemanticAgent takes things to a whole new level by introducing a crucial element: explainability. By generating a JSON query that serves as an intermediary step, the SemanticAgent makes the analysis process more transparent and easier to understand.

Instead of relying solely on complex queries, the SemanticAgent breaks down the analysis into digestible components, allowing users to gain a deeper understanding of what the system is doing and why. This level of explainability is invaluable in today's data-driven world, where trustworthiness and interpretability are paramount.

Moreover, the SemanticAgent's approach enhances the accuracy of results by reducing the potential for errors and ambiguities. By generating a JSON query that can be easily verified and validated, the SemanticAgent ensures that the analysis aligns with the user's intentions, minimizing the risk of misinterpretations or erroneous conclusions.

Streamlining the Workflow

The SemanticAgent operates in a two-step process: schema generation and JSON query generation. Let's explore each of these steps in detail:

  1. Schema Generation
    The first step in the SemanticAgent's workflow is schema generation. This process involves structuring the data into a schema that the agent can use to generate JSON queries. By default, the SemanticAgent automatically generates a schema based on the dataframes provided, streamlining the process and eliminating the need for manual schema creation.However, for those who prefer a more customized approach, the SemanticAgent also offers the flexibility to provide a custom schema. This feature is particularly useful when working with complex data structures or when specific schema configurations are required.
  2. JSON Query Generation
    The second step in the SemanticAgent's workflow is JSON query generation. Based on the schema, the SemanticAgent generates a structured JSON query that can be used to produce the Python or SQL code required for execution.

Example of a JSON query generated by the SemanticAgent:

{
  "type": "number",
  "dimensions": [],
  "measures": ["Salaries.avg_salary"],
  "timeDimensions": [],
  "filters": [],
  "order": []
}

This JSON query is then interpreted by the SemanticAgent and converted into executable Python or SQL code, enabling users to perform their desired analysis with unparalleled accuracy and transparency.

Diving Deeper: Understanding the Schema and Query Structures

To fully unlock the potential of the SemanticAgent, it's essential to understand the structure and components of both the schema and the JSON query. Let's take a closer look at each of these elements.

    • name: The identifier for the measure.
    • type: The type of aggregation (e.g., count, avg, sum, max, min).
    • sql: The column or expression in SQL to compute the measure.
    • name: The identifier for the dimension.
    • type: The data type (e.g., string, date).
    • sql: The column or expression in SQL to reference the dimension.
    • name: The name of the related table.
    • join_type: The type of join (e.g., left, right, inner).
    • sql: The SQL expression to perform the join.
    • type: Can be "number", "pie", "bar", "line", or "table".
    • dimensions: An array of dimension identifiers.
    • measures: An array of measure identifiers.
    • filters: An array of filter conditions.
    • order: An array of ordering specifications.

Understanding the Query Structure
The JSON query is a structured representation of the request, specifying what data to retrieve and how to process it. It consists of the following fields:

a. Type
The type of query determines the format of the result, such as a single number, a table, or a chart.Example:

{
  "type": "number",
  ...
}

b. Dimensions
Columns used to group the data. In an SQL GROUP BY clause, these would be the columns listed.Example:

{
  ...,
  "dimensions": ["Department"]
}

c. Measures
Columns used to calculate data, typically involving aggregate functions like sum, average, count, etc.Example:

{
  ...,
  "measures": ["Salaries.avg_salary"]
}

d. Time Dimensions
Columns used to group the data by time, often involving date functions. Each timeDimensions entry specifies a time period and its granularity. The dateRange field allows various formats, including specific dates such as ["2022-01-01", "2023-03-31"], relative periods like "last week", "last month", "this month", "this week", "today", "this year", and "last year".Example:

{
  ...,
  "timeDimensions": [
    {
      "dimension": "Sales.time_period",
      "dateRange": ["2023-01-01", "2023-03-31"],
      "granularity": "day"
    }
  ]
}

e. Filters
Conditions to filter the data, equivalent to SQL WHERE clauses. Each filter specifies a member, an operator, and a set of values. The operators allowed include: "equals", "notEquals", "contains", "notContains", "startsWith", "endsWith", "gt" (greater than), "gte" (greater than or equal to), "lt" (less than), "lte" (less than or equal to), "set", "notSet", "inDateRange", "notInDateRange", "beforeDate", and "afterDate".Example:

{
  ...,
  "filters": [
    {
      "member": "Ticket.category",
      "operator": "notEquals",
      "values": ["null"]
    }
  ]
}

f. Order
Columns used to order the data, equivalent to SQL ORDER BY clauses. Each entry in the order array specifies an identifier and the direction of sorting. The direction can be either "asc" for ascending or "desc" for descending order.Example:

{
  ...,
  "order": [
    {
      "id": "Contratti.contract_count",
      "direction": "asc"
    }
  ]
}

Understanding the Schema Structure

The schema in the SemanticAgent is a comprehensive representation of the data, including tables, columns, measures, dimensions, and relationships between tables. It consists of the following key components:

a. Measures
Measures are the quantitative metrics used in the analysis, such as sums, averages, counts, and more. Each measure is defined by the following properties:Example:

{
  "name": "avg_salary",
  "type": "avg",
  "sql": "Salary"
}

b. Dimensions
Dimensions are the categorical variables used to slice and dice the data. Each dimension is defined by the following properties:Example:

{
  "name": "Department",
  "type": "string",
  "sql": "Department"
}

c. Joins
Joins define the relationships between tables, specifying how they should be connected in queries. Each join is defined by the following properties:Example:

{
  "name": "Salaries",
  "join_type": "left",
  "sql": "Employees.EmployeeID = Salaries.EmployeeID"
}

When these components come together, they form a complete query that the SemanticAgent can interpret and execute. This level of granularity and transparency not only enhances the accuracy of the results but also ensures that users can fully understand and trace the analysis process, further solidifying PandasAI's position as a leader in the data analysis domain.

Putting the SemanticAgent into Action

Now that we've explored the inner workings of the SemanticAgent, let's dive into a practical example that showcases its capabilities. Imagine you have a dataset containing employee information, including their names, departments, and salaries. With the SemanticAgent, you can easily analyze this data and gain valuable insights.

Here's an example of how you can create a SemanticAgent instance and query it:

from pandasai.semantic_agent import SemanticAgent
import pandas as pd

# Load the data
employees_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Marketing", "IT", "Marketing", "Finance"]
})

salaries_df = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500]
})

# Create the SemanticAgent instance
agent = SemanticAgent([employees_df, salaries_df])

# Query the SemanticAgent
query = agent.chat("What is the average salary in the Marketing department?")

In this example, we first load the employee and salary data into separate dataframes. We then create an instance of the SemanticAgent, passing in these dataframes.

Next, we can query the SemanticAgent by using the chat method and providing a natural language question: "What is the average salary in the Marketing department?"

The SemanticAgent will process this query, generate the corresponding JSON query, and convert it into executable code to retrieve the desired result.

But the SemanticAgent's capabilities go far beyond simple queries. You can also perform more complex analyses, such as filtering data based on specific conditions, grouping data by multiple dimensions, and ordering results in ascending or descending order.

For example, let's say you want to find the top three departments with the highest average salaries, ordered from highest to lowest. Here's how you can achieve this with the SemanticAgent:

query = {
    "type": "table",
    "dimensions": ["Department"],
    "measures": ["Salaries.avg_salary"],
    "timeDimensions": [],
    "filters": [],
    "order": [
        {
            "measure": "Salaries.avg_salary",
            "direction": "desc"
        }
    ],
    "limit": 3
}

result = agent.query(query)
print(result)

Unleashing the Full Potential with Advanced Queries

While the simple example demonstrated the ease of use and power of the SemanticAgent, it is capable of handling much more complex queries and analyses. By leveraging the full structure of the JSON query, you can unlock a wide range of advanced capabilities, enabling you to extract even deeper insights from your data.

Let's explore a more advanced scenario where we want to analyze ticket data from a customer support system. Suppose we have the following dataframes:

tickets_df = pd.DataFrame({
    "TicketID": [1, 2, 3, 4, 5],
    "Category": ["Billing", "Technical", "Billing", "Technical", "General"],
    "Priority": ["High", "Low", "Medium", "High", "Low"],
    "CreatedDate": ["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05"]
})

agents_df = pd.DataFrame({
    "AgentID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emily", "Michael", "Sarah", "David"],
    "Team": ["Support", "Sales", "Support", "Sales", "Support"]
})

ticket_assignments_df = pd.DataFrame({
    "TicketID": [1, 2, 3, 4, 5],
    "AgentID": [1, 2, 3, 4, 5],
    "AssignedDate": ["2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05", "2023-05-06"]
})

In this scenario, we want to analyze the distribution of tickets across different categories and priorities, as well as the average time it takes to assign a ticket to an agent, grouped by the agent's team.

To achieve this, we can define a custom schema and pass it to the SemanticAgent:

schema = [
    {
        "name": "Tickets",
        "table": "Tickets",
        "measures": [
            {
                "name": "count",
                "type": "count",
                "sql": "TicketID"
            }
        ],
        "dimensions": [
            {
                "name": "Category",
                "type": "string",
                "sql": "Category"
            },
            {
                "name": "Priority",
                "type": "string",
                "sql": "Priority"
            },
            {
                "name": "CreatedDate",
                "type": "datetime",
                "sql": "CreatedDate"
            }
        ],
        "joins": [
            {
                "name": "TicketAssignments",
                "join_type": "left",
                "sql": "Tickets.TicketID = TicketAssignments.TicketID"
            }
        ]
    },
    {
        "name": "Agents",
        "table": "Agents",
        "dimensions": [
            {
                "name": "Team",
                "type": "string",
                "sql": "Team"
            }
        ],
        "joins": [
            {
                "name": "TicketAssignments",
                "join_type": "left",
                "sql": "Agents.AgentID = TicketAssignments.AgentID"
            }
        ]
    },
    {
        "name": "TicketAssignments",
        "table": "TicketAssignments",
        "measures": [
            {
                "name": "avg_assignment_time",
                "type": "avg_date_diff",
                "sql": "AssignedDate - CreatedDate"
            }
        ],
        "dimensions": [
            {
                "name": "TicketID",
                "type": "string",
                "sql": "TicketID"
            },
            {
                "name": "AgentID",
                "type": "string",
                "sql": "AgentID"
            },
            {
                "name": "AssignedDate",
                "type": "datetime",
                "sql": "AssignedDate"
            }
        ]
    }
]

agent = SemanticAgent([tickets_df, agents_df, ticket_assignments_df], schema=schema)

With the custom schema defined, we can now construct a JSON query to perform the desired analysis:

query = {
    "type": "table",
    "dimensions": ["Tickets.Category", "Tickets.Priority", "Agents.Team"],
    "measures": ["Tickets.count", "TicketAssignments.avg_assignment_time"],
    "timeDimensions": [],
    "filters": [],
    "order": [
        {
            "id": "Tickets.count",
            "direction": "desc"
        }
    ]
}

result = agent.query(query)
print(result)

This query will produce a tabular result showing the ticket count and average assignment time, grouped by category, priority, and the agent's team. The results will be ordered by the ticket count in descending order.

By combining the power of the SemanticAgent with custom schemas and advanced JSON queries, you can perform complex analyses tailored to your specific needs, unlocking a world of possibilities for extracting valuable insights from your data.

Check out the official docs for more: https://docs.pandas-ai.com/semantic-agent