LlamaIndex & Databricks Integration Announcement
Mar 20, 2024 5 Minute Read
By Nickhil Nabar

Today we’re happy to introduce an integration between LlamaIndex and Databricks Vector Search, authored by BAM Elevate. Our work allows LLM application developers to leverage the Databricks storage layer for semantically indexing unstructured data while using the full suite of advanced retrieval functionality offered by the LlamaIndex framework. In this post, we’ll dive into how the integration works and why our team decided to build it.

Why We Built It

Most VCs don’t spend their time writing open-source software – so let’s start with why we decided to invest our time here.

  • If we “believe” in open-source, we must contribute to open-source. Period.

  • We can’t predict the future unless we deeply understand the present. The more time we spend building AI applications, the sharper our understanding of the capabilities and limitations of today’s technology. Candidly, a selfless opportunity to contribute to the community is also a selfish opportunity to learn.

  • Generative AI solutions add value directly to our business today. An increasingly loud corner of Silicon Valley has asserted that Generative AI isn’t ready to add value in the enterprise. We vehemently disagree. Improvements in reasoning capabilities, inference speed, and unstructured data infrastructure will certainly pour fuel on the fire – but we already have what we need to build applications that can introduce significant efficiencies into business processes without introducing risk. We’re using our LlamaIndex and Databricks integration to roll out push-based information retrieval workflows (with citations!) into our idea generation and investment diligence processes.

  • The best way to support our portfolio companies is by rolling up our sleeves. We are both investors in and heavy users of Databricks, and we believe that deepening the platform’s interoperability with high-impact frameworks like LlamaIndex will support the adoption of Databricks’ growing generative AI feature set.

What is Databricks Vector Search?

Databricks Vector Search (DVS) is a Databricks-native vector database solution. Vector databases provide the storage and infrastructure layer that underpins many LLM-based applications, particularly retrieval-augmented generation (RAG) applications. RAG apps rely on indexing corpora of unstructured data as embedding vectors and using nearest-neighbor algorithms to search these indexes for information semantically related to a given query.
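To make the retrieval mechanic concrete, here is a toy, framework-free sketch of what a vector index does under the hood: store embedding vectors and return the nearest neighbors of a query vector by cosine similarity. This is illustration only, with made-up three-dimensional embeddings; DVS implements the same idea at scale.

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(index, query, k=2):
    # index: list of (doc_id, embedding) pairs; brute-force scan,
    # sorted by similarity to the query vector, highest first
    scored = [(doc_id, cosine_similarity(emb, query)) for doc_id, emb in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.9, 0.1, 0.0]),
    ("doc_c", [0.0, 0.0, 1.0]),
]
print(nearest_neighbors(index, [1.0, 0.05, 0.0], k=2))
```

Production vector databases replace the brute-force scan with approximate nearest-neighbor structures, but the contract is the same: vectors in, ranked neighbors out.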

Where Databricks’ vector database offering shines is in its interoperability with the rest of the Databricks platform’s governance and orchestration tooling. Organizations that are already using Databricks to manage their data access controls might choose DVS to avoid the need for an independent access management layer on top of their vector storage solution. AI teams that are already producing or consuming embeddings vectors via Databricks Workflows might choose DVS for ease of implementation, deployment, and workflow management.

What is LlamaIndex?

LlamaIndex has emerged as the de facto framework for building RAG applications. It provides developers with abstractions for parsing unstructured data, storing embedding vectors, querying indexes, and evaluating application performance, with an enterprise offering for the parsing and evaluation features.

Where LlamaIndex shines is in its implementation of advanced retrieval methods like re-ranking, summary-based retrieval, and dynamic metadata filtering. These techniques elevate “naive” RAG applications from proof-of-concept prototypes to fully-fledged enterprise products with consistent and reliable performance.
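As a rough illustration of the re-ranking idea (not LlamaIndex's actual implementation), the sketch below fetches a broad candidate set with a cheap first-pass score and then reorders those candidates with a finer-grained relevance function. Both scoring functions here are hypothetical stand-ins: word overlap in place of embedding similarity, and a phrase-match bonus in place of a cross-encoder.

```python
def first_pass(query, corpus, k=3):
    # cheap score: count of shared words (stand-in for embedding similarity)
    q = set(query.lower().split())
    scored = [(doc, len(q & set(doc.lower().split()))) for doc in corpus]
    return [doc for doc, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:k]]

def rerank(query, candidates):
    # finer score: fraction of query words present, plus a bonus for an
    # exact phrase match (stand-in for a more expensive relevance model)
    q = set(query.lower().split())
    def score(doc):
        overlap = len(q & set(doc.lower().split())) / len(q)
        bonus = 1.0 if query.lower() in doc.lower() else 0.0
        return overlap + bonus
    return sorted(candidates, key=score, reverse=True)

corpus = [
    "vector search on databricks",
    "databricks vector search integration with llamaindex",
    "cooking pasta at home",
]
candidates = first_pass("databricks vector search", corpus)
print(rerank("databricks vector search", candidates))
```

The two-stage shape is the point: a fast retriever casts a wide net, and a slower, more discriminating scorer decides what actually reaches the LLM.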

The Integration

We built the llama_index.vector_stores.databricks module, whose core contribution is the DatabricksVectorSearch class (a subclass of llama_index.core.vector_stores.types.BasePydanticVectorStore). Developers can simply add an initialized DatabricksVectorSearch to their LlamaIndex StorageContext and automatically leverage Databricks Vector Search as the storage layer for their apps. Under the hood, the module handles uploading and deleting vector embeddings, managing node metadata, and executing semantic search across the vector database (including hybrid search with metadata filtering).

The module provides support both for Direct Vector Access Indexes (where the developer self-manages the embeddings that are stored by Databricks) and for Delta Sync Indexes (where the embeddings are automatically managed by Databricks). Most LlamaIndex developers will want to use Direct Vector Access Indexes since they are likely using LlamaIndex to customize the parsing, chunking, or modeling logic used to produce embeddings from their unstructured text data.

The snippet below shows how you can integrate LlamaIndex and Databricks Vector Search in a few lines of code. See our example notebook for detailed usage instructions.

%pip install llama-index llama-index-vector-stores-databricks
%pip install databricks-vectorsearch # requires databricks runtime

from databricks.vector_search.client import ( # databricks client dependencies
    VectorSearchIndex,
    VectorSearchClient,
)
from llama_index.core import ( # llamaindex core and integration dependencies
    VectorStoreIndex,
    StorageContext,
)
from llama_index.vector_stores.databricks import DatabricksVectorSearch

client = VectorSearchClient() # initialize your databricks client
client.create_endpoint(...) # create a vector search endpoint
databricks_index = client.create_direct_access_index(...) # create an index on that endpoint

# configure it for llamaindex
databricks_vector_store = DatabricksVectorSearch(
    index=databricks_index,
    ...
)
index = VectorStoreIndex.from_documents(
    <your llamaindex documents here>, # e.g. SimpleDirectoryReader(...).load_data()
    storage_context=StorageContext.from_defaults(vector_store=databricks_vector_store)
)

# use it!
index.as_query_engine().query(...)

Our Team and Our Mission

BAM Elevate is the growth-stage investment arm of Balyasny Asset Management.

Our data science team exists to deliver a competitive advantage to our portfolio companies. We build data and analytics products to support the Go-to-Market, Finance, Talent, and Product functions for our partners. Visit our website for more information on how we can help.

The LlamaIndex/Databricks integration was authored by Nickhil Nabar (Head of Data Science, BAM Elevate) and Alberto Da Costa (Engineering Lead, BAM Elevate) in collaboration with Jerry Liu and Logan Markewich from LlamaIndex.
