Context/Background:
- E-commerce companies accumulate large amounts of data spread across many databases.
- Many potential users of this data (data analysts, business stakeholders) may not have expertise in SQL.
- Directly writing complex SQL queries can be time-consuming and error-prone.
Problems/Challenges:
- How to enable non-technical users to easily access and analyze data?
- How to bridge the gap between natural language questions and structured SQL queries?
- How to ensure the accuracy and efficiency of generated SQL queries?
Forces/Considerations/Trade-offs:
- Accuracy vs. Speed: Large Language Models (LLMs) can generate SQL, but their accuracy can vary. It’s crucial to balance generating correct queries with a fast user experience.
- Data Freshness: SQL query logs used to build the retrieval index can become outdated. It’s essential to consider how to keep the system up-to-date with changes in data schemas and usage patterns.
- Complexity of Questions: Natural language can be ambiguous. The system needs to handle a wide range of question complexity, from simple aggregations to complex joins and filters.
Solution:
Pinterest’s solution is a Text-to-SQL system that leverages LLMs and vector embeddings.
Solution (Detailed Steps):
- Offline Vector Index Creation (a code sketch follows these steps):
- SQL Query Logs: Historical SQL queries are collected.
- Summarization: These logs, along with table metadata, are summarized using an LLM.
- Embedding Model: An embedding model (e.g., SentenceBERT) converts these summaries into numerical vectors.
- Vector Store: Embeddings are stored in a vector database (e.g., FAISS), enabling efficient similarity search.
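As a rough illustration of these offline steps, here is a minimal Python sketch assuming sentence-transformers and FAISS; the sample query logs, the summarization stub, and the model name are placeholders, since the exact models and prompts are not specified here.

```python
# Minimal sketch of the offline indexing flow (illustrative, not the production setup).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def summarize_query(sql: str, table_metadata: str) -> str:
    """Placeholder for the LLM summarization step: in practice this would call an
    LLM with the query text and table metadata and return a natural-language summary."""
    return f"Query over {table_metadata}: {sql[:200]}"

# 1. Collect historical SQL queries plus metadata for the tables they touch (sample data).
query_logs = [
    ("SELECT country, COUNT(*) FROM orders GROUP BY country",
     "orders(order_id, country, amount)"),
    ("SELECT u.id, SUM(o.amount) FROM users u JOIN orders o ON u.id = o.user_id GROUP BY u.id",
     "users(id, name), orders(order_id, user_id, amount)"),
]

# 2. Summarize each query with an LLM (stubbed above).
summaries = [summarize_query(sql, meta) for sql, meta in query_logs]

# 3. Embed the summaries with a sentence-embedding model (stand-in for SentenceBERT).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(summaries, normalize_embeddings=True)

# 4. Store the embeddings in a FAISS index for similarity search.
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))
faiss.write_index(index, "query_summaries.faiss")
```

Normalizing the embeddings and using an inner-product index makes the search equivalent to cosine similarity, a common default for sentence embeddings.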
- Online Text2SQL (sketched in code after these steps):
- User Input: The user enters a natural language question and optionally specifies relevant tables.
- Embedding: The user’s question is converted into a vector using the same embedding model.
- Similarity Search: The vector store is queried to find the most similar historical queries or table summaries.
- Table Selection: The system selects the top-N most relevant tables based on the search results.
- Text2SQL Prompt: A prompt is assembled from the selected tables’ schemas, the user’s question, and optionally the similar historical queries.
- SQL Generation: The LLM generates the final SQL query.
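The online flow might look roughly like the following, reusing the index and embedder from the offline sketch. The OpenAI client, model name, prompt wording, and the naive table-selection heuristic are illustrative assumptions rather than the described production design.

```python
# Minimal sketch of the online Text2SQL flow (illustrative assumptions throughout).
import faiss
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("query_summaries.faiss")

# Summaries in the same order they were indexed, plus schema text per table (sample data).
summaries = [
    "Orders per country, from orders(order_id, country, amount).",
    "Total spend per user, joining users(id, name) with orders(order_id, user_id, amount).",
]
table_schemas = {
    "orders": "orders(order_id BIGINT, user_id BIGINT, country STRING, amount DOUBLE)",
    "users": "users(id BIGINT, name STRING)",
}

def text_to_sql(question: str, top_n: int = 2) -> str:
    # 1. Embed the question with the same model used offline.
    q_vec = embedder.encode([question], normalize_embeddings=True)

    # 2. Similarity search against the historical-query summaries.
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), top_n)
    retrieved = [summaries[i] for i in ids[0]]

    # 3. Naive table selection: keep any table whose name appears in a retrieved
    #    summary (a real system would likely re-rank candidates more carefully).
    tables = [t for t in table_schemas if any(t in s for s in retrieved)]

    # 4. Assemble the Text2SQL prompt from schemas, similar queries, and the question.
    prompt = (
        "You translate analyst questions into SQL.\n\n"
        "Available tables:\n" + "\n".join(table_schemas[t] for t in tables) +
        "\n\nSummaries of similar past queries:\n" + "\n".join(retrieved) +
        f"\n\nQuestion: {question}\nReturn only the SQL query."
    )

    # 5. Generate the final SQL with the LLM.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(text_to_sql("How many orders did we get per country last month?"))
```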
Resulting Consequences:
- Improved Accessibility: Non-technical users can now easily extract insights from data.
- Increased Productivity: Data analysts can quickly generate SQL queries, reducing manual effort.
- Potential for Errors: The generated SQL may not always be correct and can require manual verification or adjustment; a lightweight validation sketch follows this list.
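One way to catch the most obvious errors, sketched here as an assumption rather than as part of the described system, is to check that the generated SQL at least parses before surfacing it; sqlglot is used purely as an example parser, and semantic correctness still needs human review or execution checks.

```python
# Hypothetical guardrail: reject generated SQL that does not even parse.
import sqlglot
from sqlglot.errors import ParseError

def looks_valid(sql: str) -> bool:
    """Return True if the SQL parses. This catches malformed output only;
    it says nothing about whether the right tables, joins, or filters were used."""
    try:
        sqlglot.parse_one(sql)
        return True
    except ParseError:
        return False

candidate_sql = "SELECT country, COUNT(*) FROM orders GROUP BY country"
print(looks_valid(candidate_sql))  # True for well-formed SQL
```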
Related Patterns:
- Retrieval Augmented Generation (RAG): The core pattern used here, combining information retrieval (similarity search) with LLM generation.
- Semantic Search: The use of embeddings for finding relevant information in a large corpus.
- Prompt Engineering: Carefully crafting prompts to guide the LLM to generate accurate SQL queries.
- Active Learning: Potentially, the system could learn from user feedback on generated SQL to improve over time.
- Federated Analytics: The system could be extended to query across multiple data sources.
