Hugging Face Hub is a popular platform for sharing machine learning datasets, models, and other resources.
LanceDB can scan Lance datasets hosted on the Hugging Face Hub directly through hf:// URIs.
Under the hood, the lance-huggingface integration streams Lance datasets from Hugging Face,
so you don't need to download them first.
For ML and AI engineers working in LanceDB, this capability is incredibly useful for quickly exploring
multimodal datasets and reusing Lance datasets shared by others, without writing custom data loaders
or preprocessing pipelines.
The snippets below use the lance-format/laion-1m
dataset published in Lance format. The dataset includes a million image-caption pairs, and the
Lance dataset can package image embeddings alongside the metadata. This makes it useful for
demonstrating LanceDB’s multimodal search capabilities in combination with easy sharing via the
Hugging Face Hub.
The LAION table includes multimodal columns such as:
- image (inline JPEG bytes)
- caption (text)
- img_emb (image embedding vector)
- metadata fields such as url and similarity
Install dependencies
pip install lancedb pillow
Open the dataset with LanceDB
LanceDB can open the dataset directly from the Hub, without needing to download it first.
LanceDB opens tables by name, while Hugging Face datasets are conventionally organized into
splits such as train and test. The LAION dataset is uploaded as a single split named train,
so we connect to the directory that contains the *.lance files and open the table named train.
import lancedb
db = lancedb.connect("hf://datasets/lance-format/laion-1m/data")
table = db.open_table("train")
print(f"Opened table: {table.name}")
print(f"Rows: {len(table)}")
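The connection string above follows a simple pattern: the hf://datasets/ prefix, the repository ID, and the subdirectory that holds the *.lance files. As a minimal sketch, the hypothetical helper below (hf_dataset_uri is not part of LanceDB) assembles that URI from its parts, so the same code can be pointed at other Lance datasets on the Hub. It assumes the repository keeps its tables under a single subdirectory, as lance-format/laion-1m does with data.

```python
def hf_dataset_uri(repo_id: str, subdir: str = "data") -> str:
    """Build an hf:// URI for a dataset repo (hypothetical helper, not a LanceDB API)."""
    repo_id = repo_id.strip("/")
    subdir = subdir.strip("/")
    base = f"hf://datasets/{repo_id}"
    # Append the subdirectory only when one is given.
    return f"{base}/{subdir}" if subdir else base


uri = hf_dataset_uri("lance-format/laion-1m", "data")
print(uri)
```

The resulting string can be passed to lancedb.connect in place of the hard-coded URI.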
Inspect schema and available indexes
Print table.schema to see the schema of the LAION table. Note the image embedding column img_emb,
a fixed-size list of 768 floats (a 768-dimensional vector), and the binary image column, which holds the raw JPEG bytes of each image.
image_path: string
caption: string
NSFW: string
similarity: double
LICENSE: string
url: string
key: string
status: string
error_message: null
width: int64
height: int64
original_width: int64
original_height: int64
exif: string
md5: string
img_emb: fixed_size_list<item: float>[768]
  child 0, item: float
image: binary
When inspecting Lance datasets from Hugging Face, it’s also a good idea to check whether the dataset author included
any pre-built indexes that you can use for search. You can check the available indexes with:
print(table.list_indices())
[
    Index(IvfPq, columns=["img_emb"], name="img_emb_idx"),
    Index(FTS, columns=["caption"], name="caption_idx")
]
In this case, we see that we have an IVF_PQ vector index on the img_emb column, and an FTS index on the caption
column, which means we can directly do vector search on the image embeddings and keyword search on the captions
without needing to build the indexes ourselves!
If you see an empty list, it may be because the dataset author did not include the index files when uploading
to Hugging Face. You can download the dataset locally, and build the indexes yourself. See the indexing guide
for instructions on building different types of indexes with LanceDB.
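A small check like the following can tell you up front whether a column already has an index before you run a search. This is a sketch under stated assumptions: it assumes each entry returned by table.list_indices() exposes name and columns attributes, as suggested by the printed output above. SimpleNamespace objects stand in for the real entries so the snippet runs without a network connection, and indexes_for_column is a hypothetical helper.

```python
from types import SimpleNamespace


def indexes_for_column(indexes, column):
    """Return the names of indexes that cover `column` (hypothetical helper)."""
    return [idx.name for idx in indexes if column in idx.columns]


# Stand-ins for the entries returned by table.list_indices(); the attribute
# names `name` and `columns` are assumed from the printed output above.
indexes = [
    SimpleNamespace(name="img_emb_idx", columns=["img_emb"]),
    SimpleNamespace(name="caption_idx", columns=["caption"]),
]

print(indexes_for_column(indexes, "img_emb"))  # ['img_emb_idx']
print(indexes_for_column(indexes, "url"))  # []
```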
Projection scan
Run a simple scan by projecting relevant columns to get a feel for the dataset. For example, we
can run a search without any filters or input parameters to get a small subset of the data:
rows = (
    table.search()
    .select(["caption", "url", "similarity"])
    .limit(3)
    .to_list()
)
for i, row in enumerate(rows, start=1):
    print(f"{i}. {row['caption']}")
    print(f" url={row['url']}")
    print(f" similarity={row['similarity']}")
We get the first three rows and their metadata printed out, which look like this:
1. Cordelia and Dudley on their wedding day last year
url=https://i.dailymail.co.uk/i/pix/2012/01/05/article-2082728-0EF8956600000578-53_233x315.jpg
similarity=0.2926466464996338
2. Statistics on challenges for automation in 2021
url=https://verloop.io/wp-content/uploads/2021/02/Challenges.jpg
similarity=0.30174341797828674
3. Teacher Gifts / Great gifts for your child's teacher. Don't know what to get? Take a look at these gifts that the teacher in your life will love!
url=https://i.pinimg.com/custom_covers/216x146/550494823141083777_1487893945.jpg
similarity=0.3362061381340027
Scan and filter data
Filtered search is a common pattern to narrow down interesting subsets of the data during early
exploration. Here’s an example:
filtered = (
    table.search()
    .where("height > 600")
    .select(["caption", "url", "width", "height"])
    .limit(3)
    .to_list()
)
for row in filtered:
    print(row["caption"], row["url"], row["width"], row["height"])
This prints out the metadata for large images with height greater than 600 pixels:
Luca Trousers, mustard stripe https://cdn.shopify.com/s/files/1/0151/5333/products/IMG_0791_1024x1024.jpg?v=1585142190 384 766
Baby Blue Fitted Short Sleeve T Shirt 3 https://cdn-img.prettylittlething.com/a/d/d/1/add198cab3ec30a61102437275573f4963642528_cmf6022_3.jpg 384 612
pattern cutting made easy pdf https://i.pinimg.com/736x/7c/6c/a7/7c6ca7361815a8929b3dd6ad34a03ab9.jpg 384 1045
Export image bytes to local files
To work with a subset of the data locally, you can export the image bytes from the table and save them as JPEG files.
from pathlib import Path
sample = (
    table.search()
    .select(["image", "caption"])
    .limit(3)
    .to_list()
)
out_dir = Path("samples")
out_dir.mkdir(exist_ok=True)
for i, row in enumerate(sample):
    out_path = out_dir / f"laion_{i}.jpg"
    with open(out_path, "wb") as f:
        f.write(row["image"])
    print(f"Saved {out_path} | caption={row['caption']}")
You can now preview the images you just exported on your local machine to get a better sense of the data.
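Before writing bytes to disk, it can help to confirm that they look like a JPEG. The sketch below checks for the JPEG start-of-image marker (the bytes FF D8) at the beginning of the data. looks_like_jpeg is a hypothetical helper, and the check only inspects the first two bytes, so it does not prove that the image decodes correctly.

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if `data` begins with the JPEG start-of-image marker."""
    return data[:2] == JPEG_SOI


# Tiny stand-in payloads; real rows would supply row["image"] instead.
print(looks_like_jpeg(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # True
print(looks_like_jpeg(b"\x89PNG\r\n\x1a\n"))  # False
```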
Vector search
You can use LanceDB to run vector search directly on the data on the Hub, without needing to download the dataset
or build your own vector index. This makes it incredibly easy to explore the dataset and iterate on your search queries
before you decide to download a local copy for further experimentation on your end.
# Pick an arbitrary image embedding from the dataset
query_embedding = (
    table.search()
    .select(["img_emb"])
    .limit(1)
    .to_list()[0]["img_emb"]
)
results = (
    table.search(query_embedding, vector_column_name="img_emb")
    .select(["caption", "url", "_distance"])
    .limit(3)
    .to_list()
)
for row in results:
    print(row["_distance"], row["caption"])
| distance | caption |
|---|---|
| 0.17765313386917114 | Cordelia and Dudley on their wedding day last year |
| 0.17765313386917114 | Cordelia and Dudley on their wedding day last year |
| 0.17765313386917114 | Cordelia and Dudley on their wedding day last year |
Note that the LAION dataset is known to contain a lot of duplicate images, so you may see the same image
showing up multiple times in the search results.
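If duplicates crowd out other matches, one option is to request more results than you need and drop the repeats on the client. The following is a minimal sketch of that idea, not a LanceDB feature: dedupe_rows is a hypothetical helper that keeps the first row seen for each value of a chosen key (here the caption), and the sample rows are illustrative stand-ins rather than real query output.

```python
def dedupe_rows(rows, key="caption"):
    """Keep the first row for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value not in seen:
            seen.add(value)
            unique.append(row)
    return unique


# Illustrative stand-in rows shaped like vector search results.
rows = [
    {"caption": "caption A", "_distance": 0.1},
    {"caption": "caption A", "_distance": 0.1},
    {"caption": "caption B", "_distance": 0.2},
]
for row in dedupe_rows(rows):
    print(row["_distance"], row["caption"])
```

Because the input order is preserved, the closest match for each caption is the one that survives.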
Full-text search
Run a full-text search (FTS) query, ranked with BM25, on the caption column (on which we already have an FTS index):
fts_results = (
    table.search("dog running on beach", query_type="fts")
    .select(["caption", "url", "_score"])
    .limit(3)
    .to_list()
)
| caption | url | _score |
|---|---|---|
| running with dog | https://www.doggytastic.com/wp… | 15.73168 |
| Dog Running in Water | https://static.wixstatic.com/m… | 14.756516 |
| Dogs on the run by heidiannemo… | http://ih2.redbubble.net/image… | 14.756516 |
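To build intuition for the _score column, here is a simplified, self-contained sketch of BM25 scoring over a handful of captions taken from this page. It is not LanceDB’s implementation: the whitespace tokenizer, the parameter values k1 = 1.2 and b = 0.75, and the bm25_scores helper are all assumptions chosen for illustration, so the numbers it produces will not match the scores above.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


captions = ["running with dog", "Dog Running in Water", "pattern cutting made easy pdf"]
for caption, score in zip(captions, bm25_scores("dog running on beach", captions)):
    print(f"{score:.3f}  {caption}")
```

In this toy corpus the shorter caption scores higher than the longer one with the same matching terms, which reflects BM25's length normalization, and the caption with no matching terms scores zero.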
Downloading the full dataset
You may hit Hugging Face rate limits when streaming large samples from hf://, even when using a Hugging Face token.
For repeated queries, or queries that operate on the full dataset, we recommend
downloading the dataset and querying it from local disk.
Here’s how to download the entire dataset via the Hugging Face CLI:
huggingface-cli download lance-format/laion-1m --repo-type dataset --local-dir ./laion-1m
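If you keep streaming from hf:// and occasionally hit a rate limit, one common mitigation is to retry the failing call with exponential backoff. The sketch below is generic Python, not a LanceDB or Hugging Face API: with_retries is a hypothetical helper, and catching the broad Exception type is an assumption, because the exact error raised on a rate limit is not specified here.

```python
import time


def with_retries(fn, attempts=4, base_delay=1.0):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the error you observe
            if attempt == attempts - 1:
                raise  # out of retries, surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


print(with_retries(flaky, base_delay=0.01))
```

In practice you would wrap a query, for example with_retries(lambda: table.search().limit(3).to_list()).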
Explore more Lance datasets on Hugging Face
The LanceDB team is actively uploading useful and interesting datasets in Lance format to the Hugging Face Hub
under the lance-format organization. We encourage the Hugging Face
and LanceDB communities to upload their own Lance datasets to the Hub and share them with others!
In the meantime, feel free to check out the Hugging Face Hub to discover more Lance datasets uploaded by the community.