Understanding the Data Map

Learn how Perplexity automatically generates a Data Map for your Snowflake or Databricks data to improve query accuracy

Written by Emilio Morales

When you connect Snowflake or Databricks to Perplexity, Computer generates a Data Map of your data warehouse or lakehouse. The Data Map captures key information about your data model — important tables and columns, common query patterns, and relationships between objects — so that Computer can translate natural-language questions into accurate queries.

Think of it as a map of your data environment that helps Computer understand what lives where and how it's typically used. Once generated, the Data Map continues to improve over time: it learns from user feedback, can be edited directly by admins, and is versioned so changes can always be reviewed and rolled back.

How it works

Once Data Map generation is initiated, Computer explores your data model using your connected account's permissions. This process examines your schemas, tables, views, and historical usage patterns to build a comprehensive understanding of your data.

The Data Map is stored securely in a per-organization, versioned repository accessible only to members of your organization. Every change — whether from regeneration, admin edits, or self-learning updates — is recorded so it can be reviewed and rolled back.

Generating the Data Map

There is one Data Map per organization, shared by every member of that organization. When an admin runs generation, it's on behalf of the whole organization, not a single user.

Snowflake

Data Map generation for Snowflake is initiated by an org admin from the Snowflake connector settings. Perplexity ships two Snowflake connectors, and admins generate the Data Map from whichever one their organization has configured:

  • Snowflake (key-pair or PAT) — queries run as the configured service account.

  • Snowflake (User OAuth) — queries run under the Snowflake identity of the admin who initiates generation.

Whichever identity is used must be able to read Snowflake's account-usage views (see Requirements below). If those grants are missing, generation fails up front with a clear permissions error rather than producing a partial Data Map.

Databricks

For Databricks, generation is initiated from the connector settings using the initiator's Databricks OAuth identity. Computer enumerates the catalogs, schemas, and tables that this user can see in Unity Catalog and reads Databricks system tables for usage signals. Whatever the initiator can see in Unity Catalog defines what ends up in the Data Map.

Supplementary context (Snowflake and Databricks)

You can add Supplementary context — upload files or add notes describing your data (e.g., what key tables represent, business definitions, common query patterns) — to help Computer interpret your data more accurately. Supplementary context is fed into every generation run and is not affected by regeneration, so you can keep adding to it over time without worrying about losing it.

View Knowledge

After generation completes, the Generate data map button on the connector modal becomes View knowledge. Clicking View knowledge opens the Data Map Editor, where admins can browse, edit, and manage everything Computer has learned about your data. Regenerate data map is also available from the connector modal once an initial Data Map exists — see Regenerating the Data Map for what it does and what's preserved.

How Long Does It Take?

Generating a Data Map can take up to 90 minutes, depending on the size and complexity of your data warehouse or lakehouse. You don't need to keep the page open — the process runs in the background and the View knowledge button will appear on the connector modal once it completes.

The Data Map Editor

The Data Map Editor is the admin-facing view of your Data Map, accessible from the organization admin tools. It separates Snowflake and Databricks knowledge into distinct sections, and the underlying business context, table clusters, and query patterns are organized as files you can read and edit directly.

From the editor, admins can:

  • Browse the full Data Map — business context, table clusters, common query patterns.

  • Edit files directly. Saved edits are applied to the live Data Map immediately and become the new ground truth; they do not need to go through the review pipeline.

  • Review and act on AI-proposed changes from the self-learning pipeline. Admins can approve the proposal (changes are applied to the Data Map) or reject it (the proposal is discarded). Modifying a proposal in place before approving isn't supported today — admins who want a different outcome can reject and then make the edit themselves.

  • View version history of any file and roll back if needed.

Direct edits persist through normal use, but they live alongside the auto-generated content and are not preserved if you regenerate the Data Map for that warehouse (see Regenerating the Data Map below).

Self-Learning From Feedback

The Data Map gets better the more your team uses it. When a user corrects the data agent in a thread — for example, "use fct_queries instead of query_events for query counts" or "exclude type = 'internal' from query volume metrics" — Computer captures that feedback and uses it to improve the Data Map for everyone.

The pipeline is designed to be safe, reviewable, and shared across the organization:

1. Capturing feedback

When a user gives feedback in a Data Scientist thread, Computer logs a structured correction: the file it should affect, the section, the proposed change, and the thread context that produced it. The Data Map itself is never edited live from a user thread; feedback always lands in this log first.
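As a rough illustration of what one logged correction might contain, the sketch below models it as a small Python record appended to a log. The class name Correction, the helper append_to_log, the field names, and the file name are all hypothetical, chosen only to mirror the four pieces of information listed above; this is not Perplexity's actual schema or storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Correction:
    """One logged piece of feedback (hypothetical shape, not the real schema)."""
    file: str             # Data Map file the correction should affect
    section: str          # section within that file
    proposed_change: str  # the change the user asked for
    thread_context: str   # excerpt of the thread that produced it


def append_to_log(log: list, correction: Correction) -> None:
    """Record the correction in the log; the Data Map itself is not touched."""
    log.append(json.dumps(asdict(correction)))


log = []
append_to_log(log, Correction(
    file="query_patterns.md",  # hypothetical file name
    section="Query counts",
    proposed_change="Use fct_queries instead of query_events for query counts",
    thread_context="User corrected the table choice in a Data Scientist thread",
))
```

The point of the sketch is the separation of concerns: the only side effect of feedback is a new log entry, which a later step can review.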

2. Daily compaction into a proposed update

Once a day, Computer reviews the corrections logged in the previous 24 hours for each organization and produces a single consolidated proposed update to the Data Map:

  • Multiple corrections to the same area are merged into one edit.

  • Conflicting corrections (e.g., one says "always include cron jobs," another says "always exclude cron jobs") are set aside for human review rather than auto-resolved.

  • Each correction is routed to the right warehouse — a Snowflake-specific correction won't bleed into the Databricks Data Map and vice versa.

  • Corrections that can't be confidently merged or routed are flagged for an admin to look at instead of being applied silently.

The result is surfaced in the Data Map Editor as a single proposal admins can review.
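The compaction rules above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual pipeline: it assumes each correction has already been reduced to a hypothetical rule/value pair, treats "same area" as the same warehouse, file, and section, and treats two different values for the same rule as a conflict. The function name compact and all dictionary keys are illustrative.

```python
from collections import defaultdict

KNOWN_WAREHOUSES = {"snowflake", "databricks"}


def compact(corrections):
    """Roll one day's corrections into a single proposal (illustrative only)."""
    proposal = {"edits": [], "conflicts": [], "flagged": []}
    by_area = defaultdict(list)

    for c in corrections:
        if c.get("warehouse") not in KNOWN_WAREHOUSES:
            proposal["flagged"].append(c)  # cannot be routed: leave for an admin
        else:
            by_area[(c["warehouse"], c["file"], c["section"])].append(c)

    for (warehouse, file, section), group in by_area.items():
        values_by_rule = defaultdict(set)
        for c in group:
            values_by_rule[c["rule"]].add(c["value"])

        merged = {}
        for rule, values in values_by_rule.items():
            if len(values) > 1:
                # Contradictory instructions: set aside for human review.
                proposal["conflicts"].append(
                    {"warehouse": warehouse, "file": file, "section": section,
                     "rule": rule, "values": sorted(values)})
            else:
                merged[rule] = next(iter(values))

        if merged:
            # All compatible corrections to this area become one edit.
            proposal["edits"].append(
                {"warehouse": warehouse, "file": file, "section": section,
                 "changes": merged})
    return proposal


day = [
    {"warehouse": "snowflake", "file": "metrics", "section": "query volume",
     "rule": "internal_traffic", "value": "exclude"},
    {"warehouse": "snowflake", "file": "metrics", "section": "query volume",
     "rule": "count_table", "value": "fct_queries"},
    {"warehouse": "databricks", "file": "jobs", "section": "cron",
     "rule": "cron_jobs", "value": "include"},
    {"warehouse": "databricks", "file": "jobs", "section": "cron",
     "rule": "cron_jobs", "value": "exclude"},
    {"warehouse": None, "file": "unknown", "section": "misc",
     "rule": "revenue_column", "value": "net_revenue"},
]
proposal = compact(day)
```

In this example the two Snowflake corrections collapse into one edit, the contradictory cron-job instructions land in conflicts, and the correction with no identifiable warehouse is flagged rather than applied. Deciding whether two free-text corrections really conflict takes more judgment than a set comparison; the sketch only shows how the outcomes are kept apart.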

3. Admin review

Admins approve the proposal (the changes are applied to the Data Map) or reject it (the proposal is discarded). Approval is what "deploys" the changes — the next data question your team asks will use the updated Data Map. There's no separate publish step.

This human-in-the-loop pattern is intentional: it lets Computer learn continuously from real usage while keeping admins in control of what's trusted as ground truth.

Who can do what?

  • Any user in a Data Scientist thread can give feedback that feeds into the next day's proposed update.

  • Org admins can browse and edit Data Map files directly, and approve or reject the daily proposed updates in the Data Map Editor.

  • There is one Data Map per organization — every member of the organization queries against the same shared Data Map. There is no per-user Data Map.

Regenerating the Data Map

Admins can run Regenerate data map at any time from the connector modal. Today, regeneration is a rebuild from scratch for the warehouse you regenerate:

  • The Data Map for that warehouse is fully replaced with a fresh result. Manual admin edits to that warehouse's Data Map are not preserved.

  • The other warehouse's Data Map is left untouched — regenerating Snowflake does not affect Databricks, and vice versa.

  • Supplementary context is preserved and re-applied to the new run.

  • Pending feedback (corrections logged that day but not yet rolled up into a proposed update) is preserved. Feedback that's already been approved and applied to the Data Map is part of what gets replaced.

  • The full version history is preserved, so prior versions of the Data Map remain inspectable.

Because regeneration replaces a warehouse's Data Map and any admin edits to it, it should be run intentionally. A safer regeneration that preserves admin edits — and a separate explicit "full reset" — are on the roadmap but not in the product today.
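To make the preserved-versus-replaced rules concrete, here is a toy Python model of regeneration. The state layout, the function name regenerate, and the file names are assumptions made for illustration; the real storage is not described in this article.

```python
from copy import deepcopy


def regenerate(state, warehouse, fresh_map):
    """Return the state after regenerating one warehouse (illustrative model)."""
    new_state = deepcopy(state)
    # Keep the outgoing Data Map inspectable in version history.
    new_state["history"].append((warehouse, deepcopy(state["data_maps"][warehouse])))
    # Full replacement: admin edits and applied feedback in the old map are gone.
    new_state["data_maps"][warehouse] = fresh_map
    # Untouched: the other warehouse's map, supplementary context, pending feedback.
    return new_state


state = {
    "data_maps": {
        "snowflake": {"tables.md": "auto-generated + admin edit"},
        "databricks": {"tables.md": "auto-generated"},
    },
    "supplementary_context": ["glossary.pdf"],
    "pending_feedback": ["exclude type = 'internal' from query volume metrics"],
    "history": [],
}
after = regenerate(state, "snowflake", {"tables.md": "fresh auto-generated"})
```

After the call, the Snowflake entry holds only the fresh result, the Databricks entry is unchanged, and the previous Snowflake map, including the admin edit, survives only as a history entry.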

Requirements

Snowflake

The identity used to generate the Data Map — the service account for key-pair / PAT auth, or the initiating admin's Snowflake user for OAuth auth — must be able to read both of these views in the SNOWFLAKE.ACCOUNT_USAGE schema:

  • SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY

  • SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY (Snowflake Enterprise Edition or higher)

Important: ACCOUNT_USAGE is admin-only by default. The most common cause of generation failing is a role that can read your regular databases but doesn't have access to ACCOUNT_USAGE. The fix is to grant IMPORTED PRIVILEGES on the SNOWFLAKE database.

If you haven't already granted this access during initial setup, run the following as ACCOUNTADMIN:

GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE <your_role>;

Then verify from the connecting role:

SELECT 1 FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY LIMIT 1;
SELECT 1 FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY LIMIT 1;

If ACCESS_HISTORY isn't available (Snowflake Standard Edition), generation falls back to QUERY_HISTORY alone. The Data Map will still work but with somewhat less precise signal on column lineage and table-access counts.
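The fail-fast and fallback behavior described in this section can be summarized as a small decision function. This is a sketch only: the function name choose_usage_sources is hypothetical, and it assumes you already know which views the connecting role can read (for example, from running the two verification queries above).

```python
QUERY_HISTORY = "SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY"
ACCESS_HISTORY = "SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY"


def choose_usage_sources(readable_views):
    """Pick the usage views to read, given which ones the role can query."""
    if QUERY_HISTORY not in readable_views:
        # Mirrors the documented behavior: fail up front, no partial Data Map.
        raise PermissionError(
            f"Role cannot read {QUERY_HISTORY}; grant IMPORTED PRIVILEGES "
            "on the SNOWFLAKE database.")
    if ACCESS_HISTORY in readable_views:
        return [QUERY_HISTORY, ACCESS_HISTORY]
    # ACCESS_HISTORY unavailable (Standard Edition): use query history alone.
    return [QUERY_HISTORY]


standard_edition = choose_usage_sources({QUERY_HISTORY})
enterprise_edition = choose_usage_sources({QUERY_HISTORY, ACCESS_HISTORY})
```

Calling the function with an empty set raises PermissionError, which corresponds to the missing-grant case that the GRANT statement above is meant to fix.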

For full setup instructions, see Connecting Perplexity with Snowflake.

Databricks

Databricks generation uses the initiator's OAuth identity and inherits their Unity Catalog permissions — no extra grants are required. A few practical things to know:

  • Unity Catalog is required. Computer reads Databricks system tables for query history, which depend on Unity Catalog. Workspaces operating only on hive_metastore will fail the access check during generation.

  • A SQL warehouse must be running when generation is initiated. If no warehouse is running, start one in Databricks first.

  • Whatever the initiator can see in Unity Catalog defines what ends up in the Data Map. If a catalog or schema is hidden from that user, Computer can't include it.

For full setup instructions, see Connecting Perplexity with Databricks.