We Added AI Semantic Search to a Static Site — No Vector Database Required
Today, Superdata RobotAI launched AI-powered semantic search. Click the 🔍 button in the bottom-right corner of any page, type your query in natural language — like "I need humanoid dual-arm manipulation grasping dataset" — and the system matches the most relevant results from 43 datasets, 19 data standards, and 11 tools.
Why We Built This
Our navigator indexes 73 entities (datasets + standards + tools), each with rich tags: robot type, task type, data modality, license, institution, etc. Traditional search was keyword-based — search "humanoid robot" and you only get entries whose tags literally contain those words.
But that's not how people search. Someone might query "bipedal locomotion control data", which requires understanding that "bipedal" implies a humanoid robot and that "locomotion" is a task type. Or "large-scale real-world manipulation data", which combines several tags at once. Pure keyword matching falls short on both.
Approach: Text Embeddings + Cosine Similarity
The core idea is simple: concatenate each entity's key information (name, description, type, task, modality, institution) into a text string, then generate a 2048-dimensional vector using Alibaba Cloud Bailian's text-embedding-v4 model (Qwen). User queries are vectorized the same way, and entities are ranked by the cosine similarity between the query vector and each entity vector.
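A minimal sketch of those two steps, under stated assumptions: each entity is a dict with the six fields listed above, the " | " separator and the helper names are hypothetical, and the toy 3-dimensional vectors stand in for real 2048-dimensional embeddings (the embedding API itself is not called here).

```python
import math

# Fields concatenated into the text that gets embedded (names are assumed).
FIELDS = ("name", "description", "type", "task", "modality", "institution")


def entity_text(entity: dict) -> str:
    """Join the non-empty key fields of an entity into one string."""
    return " | ".join(str(entity[f]) for f in FIELDS if entity.get(f))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: list[float], index: list[dict], top_k: int = 3) -> list[tuple[float, str]]:
    """Score every entity against the query and return the best top_k."""
    scored = [(cosine_similarity(query_vec, item["vector"]), item["name"]) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy index with made-up names and 3-dim vectors, for illustration only.
TOY_INDEX = [
    {"name": "entity-a", "vector": [0.9, 0.1, 0.0]},
    {"name": "entity-b", "vector": [0.1, 0.9, 0.0]},
    {"name": "entity-c", "vector": [0.0, 0.2, 0.9]},
]

if __name__ == "__main__":
    for score, name in rank([1.0, 0.2, 0.0], TOY_INDEX):
        print(f"{name}: {score:.3f}")
```

With only 73 entities, a brute-force loop like this scores every entity on each query, which is why no approximate nearest-neighbor index is needed at this scale.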
The entire embedding index is about 1.1MB: 73 entities × 2048 dims × 4 bytes ≈ 600KB of vector data, plus metadata. No Pinecone, no Chroma, no Milvus. Just a single JSON file.
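The size arithmetic can be checked in a few lines. This sketch only covers the raw vector payload at 4 bytes per value; the helper name is hypothetical, and the final file size also depends on how the numbers are serialized in JSON and on the metadata stored alongside them.

```python
def raw_vector_bytes(num_entities: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold all vectors at a fixed width per value."""
    return num_entities * dims * bytes_per_value


if __name__ == "__main__":
    size = raw_vector_bytes(73, 2048)
    print(size)                 # 598016
    print(round(size / 1000))   # 598, i.e. roughly 600KB
```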
Not Enough: Keyword Boost
Pure semantic embeddings have blind spots. A query like "simulation-based navigation training" might semantically drift toward manipulation datasets (because the overall corpus is manipulation-heavy), when what we really want is Habitat, Gibson, iGibson — simulation platforms.
The solution is hybrid retrieval: extract structured keywords from the query (humanoid → robot type field, grasping → task field), match them against entity tags, and boost matches. Embeddings handle semantic generalization, keywords handle precision. They complement each other.
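One way the boost could work, as a sketch rather than the site's actual code: the keyword vocabulary, the tag field names, the additive 0.1 weight, and the candidate entries below are all assumptions made for illustration.

```python
# Hypothetical vocabulary mapping a query term to a (tag field, tag value) pair.
KEYWORD_VOCAB = {
    "humanoid": ("robot_type", "humanoid"),
    "grasping": ("task", "grasping"),
    "navigation": ("task", "navigation"),
    "simulation": ("modality", "simulation"),
}

BOOST_PER_MATCH = 0.1  # assumed weight added for each matching tag


def extract_keywords(query: str) -> list[tuple[str, str]]:
    """Pick out the query terms that appear in the structured vocabulary."""
    tokens = query.lower().replace("-", " ").split()
    return [KEYWORD_VOCAB[t] for t in tokens if t in KEYWORD_VOCAB]


def hybrid_score(similarity: float, tags: dict, keywords: list[tuple[str, str]]) -> float:
    """Embedding similarity plus a fixed boost per keyword found in the entity's tags."""
    matches = sum(1 for field, value in keywords if value in tags.get(field, ()))
    return similarity + BOOST_PER_MATCH * matches


def hybrid_rank(query: str, candidates: list[dict]) -> list[tuple[float, str]]:
    """Re-rank candidates that already carry an embedding similarity score."""
    keywords = extract_keywords(query)
    scored = [(hybrid_score(c["similarity"], c["tags"], keywords), c["name"]) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up candidates: the manipulation entry wins on similarity alone,
# but the simulation entry matches two structured keywords.
CANDIDATES = [
    {"name": "manipulation-dataset", "similarity": 0.60, "tags": {"task": ["grasping"]}},
    {"name": "sim-platform", "similarity": 0.55,
     "tags": {"task": ["navigation"], "modality": ["simulation"]}},
]

if __name__ == "__main__":
    for score, name in hybrid_rank("simulation-based navigation training", CANDIDATES):
        print(f"{name}: {score:.2f}")
```

An additive boost keeps the embedding score as the baseline, so entities with no keyword matches are still ranked by semantic similarity alone.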
Architecture: Zero-Cost Operations
Frontend: Pure static site (GitHub Pages), floating search widget, 200 lines of vanilla JS.
Backend: Alibaba Cloud Function Compute (FC), a single Web Function handling all search requests. The free tier covers 1 million invocations per month — our scale stays well within that.
Embedding API: Alibaba Cloud Bailian text-embedding-v4, 1 million free tokens for new users. Embedding all 73 entities consumed ~15K tokens.
Total monthly cost: $0.
Results
Real query examples:
- "humanoid dual-arm manipulation grasping" → GR-1 ActionNet 88%, Dexora 83%, Humanoid Everyday 82%
- "tactile sensor data" → DIGIT Dataset 85%, TacTip Datasets 78%, tools: TACTO 77%, PyTouch 72%
- "simulation navigation training" → Gibson 67%, iGibson 57%, Isaac Sim 56% (all hitting the Tools section)
Next: AI Assistant
Current search is retrieval-based — returning a ranked list for users to evaluate. The next step is integrating an LLM (Qwen or DeepSeek), so the system doesn't just "find relevant data" but provides structured recommendations: which dataset to use + which tools to pair + which standard to follow.
Try the search box on the site and let us know what you think via the comments section.
Tech Stack: Alibaba Cloud Bailian text-embedding-v4 (2048-dim) · Alibaba Cloud Function Compute · GitHub Pages · vanilla JS. All code open-sourced on GitHub.