Data Ingestion
Upload and process documents from files or URLs into knowledge graphs. The ingestion pipeline routes through MainOrchestrator.ingest_data(), which coordinates the UnifiedIngestionPipeline to extract entities, relationships, and embeddings automatically.
/api/v1/ingest
Ingest data from files (PDF, CSV, TXT, Markdown) or URLs into a specified knowledge graph. Routes through MainOrchestrator.ingest_data() → UnifiedIngestionService → UnifiedIngestionPipeline. The system automatically extracts entities, builds relationships, and generates vector embeddings for semantic search.
Form Data Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| kg_name | string | Yes | Target knowledge graph name |
| file | file | No* | File to upload (PDF, CSV, TXT, MD) |
| doc_url | string | No* | URL to fetch and ingest content from |
| source_name | string | No | Human-readable source name (auto-generated if not provided) |
| custom_fields_json | JSON string | No | Additional metadata as a JSON string (e.g., {"domain": "health", "source_type": "official"}) |
* Either file or doc_url must be provided
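For orientation, here is a minimal request sketch in Python using requests. The base URL, file path, and field values are assumptions for illustration; the form fields match the parameters above.

```python
import json
import requests

BASE_URL = "http://localhost:8000"  # assumed host; adjust to your deployment

# Upload a local PDF into an illustrative "health_kg" knowledge graph.
with open("report.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/v1/ingest",
        files={"file": ("report.pdf", f, "application/pdf")},
        data={
            "kg_name": "health_kg",
            "source_name": "Quarterly health report",
            "custom_fields_json": json.dumps(
                {"domain": "health", "source_type": "official"}
            ),
        },
    )

response.raise_for_status()
job = response.json()
print(job["job_id"], job["source_id"])
```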
Response Fields
| Field | Type | Description |
|---|---|---|
| job_id | string | Unique identifier for tracking the ingestion job |
| message | string | Human-readable status message |
| kg_name | string | Knowledge graph that received the data |
| source_name | string | Name of the ingested source |
| source_id | string | Unique source identifier for future reference |
| entities_count | integer | Number of entities extracted |
| relations_count | integer | Number of relationships created |
| chunks_count | integer | Number of text chunks created |
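For reference, a successful response parsed as a Python dictionary looks like the sketch below; the values are illustrative, not real output.

```python
# Illustrative response payload (all values are made up for demonstration).
job = {
    "job_id": "a1b2c3d4",
    "message": "Ingestion completed",
    "kg_name": "health_kg",
    "source_name": "Quarterly health report",
    "source_id": "src_0001",
    "entities_count": 42,
    "relations_count": 17,
    "chunks_count": 12,
}
```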
Supported File Types
| Format | Extension | Max Size | Notes |
|---|---|---|---|
| PDF | .pdf | 50 MB | Text extraction with layout preservation |
| CSV | .csv | 20 MB | Automatic schema detection |
| Text | .txt | 10 MB | Plain text processing |
| Markdown | .md | 10 MB | Preserves formatting and structure |
Processing Pipeline
Once submitted, each document moves through the following processing stages:
- Document Parsing: Extract text and structure from the file
- Chunking: Split content into semantic chunks
- Entity Extraction: Identify named entities (NER)
- Relation Extraction: Detect relationships between entities
- Embedding Generation: Create vector embeddings for semantic search
- Graph Storage: Store in Neo4j knowledge graph
- Vector Storage: Store embeddings in Qdrant
- Visualization: Generate interactive dashboards
Checking Ingestion Status
Use the source management endpoints to check ingestion progress:
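A minimal status-check sketch follows. The exact source-management route is documented elsewhere, so the GET /api/v1/sources/{source_id} path below is a hypothetical placeholder used only to illustrate the pattern.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host

# Hypothetical source-management lookup; substitute the real route
# from the source management documentation.
source_id = "src_0001"  # returned by the ingest call
status = requests.get(f"{BASE_URL}/api/v1/sources/{source_id}").json()
print(status)
```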
Error Responses
Missing Required Parameters
File Too Large
Unsupported File Type
Invalid URL
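A sketch of handling these failures on the client side; the specific HTTP status codes in the comment are assumptions based on common REST conventions, and the error body shape is not specified here, so the code prints whatever detail the server returns.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host

# Deliberately omit both file and doc_url to trigger a missing-parameter error.
resp = requests.post(f"{BASE_URL}/api/v1/ingest", data={"kg_name": "health_kg"})

if not resp.ok:
    # Typical mappings (assumed): 400/422 missing parameters or invalid URL,
    # 413 file too large, 415 unsupported file type.
    try:
        detail = resp.json()
    except ValueError:
        detail = resp.text
    print(resp.status_code, detail)
```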
Best Practices
- Meaningful names: Use descriptive source names for easier management
- Metadata: Add rich metadata to improve search and filtering
- File size: Split large documents into smaller files for faster processing
- Format choice: Use PDF for documents, CSV for structured data
- URL validation: Ensure URLs are publicly accessible before ingestion
- Batch processing: For multiple files, use asynchronous ingestion (see the sketch after this list)
- Monitor jobs: Track job_id to check completion status
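A sketch of batch ingestion over several files, collecting job_id values for later monitoring; the file names, base URL, and knowledge-graph name are illustrative.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host
files_to_ingest = ["a.pdf", "b.csv", "c.txt"]  # illustrative paths

jobs = []
for path in files_to_ingest:
    with open(path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/api/v1/ingest",
            files={"file": f},
            data={"kg_name": "health_kg", "source_name": path},
        )
    resp.raise_for_status()
    jobs.append(resp.json()["job_id"])

# Track each job_id via the source management endpoints to confirm completion.
print(jobs)
```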
Performance Optimization
| File Size | Typical Processing Time | Tips |
|---|---|---|
| Small (<1 MB) | 10-30 seconds | Ideal for real-time ingestion |
| Medium (1-10 MB) | 30-120 seconds | Use job tracking for status updates |
| Large (10-50 MB) | 2-10 minutes | Consider async processing workflows |
Custom Metadata Schema
You can attach any JSON-serializable metadata to your sources:
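For example, metadata can be passed through custom_fields_json on the basic endpoint; the keys below are illustrative, not a fixed schema.

```python
import json

# Any JSON-serializable structure works; these keys are only examples.
custom_fields = {
    "domain": "health",
    "source_type": "official",
    "published_year": 2024,
    "tags": ["policy", "report"],
}

form_data = {
    "kg_name": "health_kg",
    "custom_fields_json": json.dumps(custom_fields),
}
```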
/api/v1/ingest/unified
Advanced unified ingestion endpoint with JSON-based requests and comprehensive features. Supports multiple processing modes, quality checks, deduplication, and asynchronous processing.
Processing Modes
| Mode | Description | Best For |
|---|---|---|
| auto | Automatic mode selection based on content type | General use when unsure of content characteristics |
| fast | Optimized for speed with basic processing | Large volumes, time-sensitive ingestion |
| comprehensive | Full processing with all features enabled | High-quality content requiring deep analysis |
| domain_specific | Specialized processing for specific domains | Domain-specific content with known characteristics |
Key Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| content | string | No* | Direct text content to ingest |
| document_url | string | No* | URL of document to fetch and ingest |
| kg_name | string | Yes | Target knowledge graph name |
| mode | string | No | Processing mode: auto, fast, comprehensive, domain_specific (default: auto) |
| chunking_strategy | string | No | Chunking strategy to use (default: semantic) |
| chunk_size | integer | No | Target chunk size (100-5000, default: 1000) |
| enable_entity_extraction | boolean | No | Enable LLM entity extraction (default: true) |
| enable_relation_extraction | boolean | No | Enable LLM relation extraction (default: true) |
| enable_quality_checks | boolean | No | Enable content quality validation (default: true) |
| enable_deduplication | boolean | No | Enable duplicate detection (default: true) |
| auto_domain_detection | boolean | No | Auto-detect content domains (default: true) |
| async_processing | boolean | No | Process asynchronously (default: false) |
| priority | string | No | Processing priority: low, normal, high, urgent (default: normal) |
* Either content or document_url must be provided
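A minimal JSON request sketch for the unified endpoint; the base URL and field values are illustrative, and the parameter names come from the table above.

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host

payload = {
    "content": "Vitamin D supports calcium absorption and bone health.",
    "kg_name": "health_kg",
    "mode": "comprehensive",
    "chunk_size": 800,
    "enable_quality_checks": True,
    "enable_deduplication": True,
    "async_processing": False,
    "priority": "high",
}

resp = requests.post(f"{BASE_URL}/api/v1/ingest/unified", json=payload)
resp.raise_for_status()
result = resp.json()
print(result["quality_score"], result["chunks_created"], result["detected_domains"])
```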
Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether ingestion was successful |
| source_info | object | Information about the processed source |
| content_analysis | object | Content analysis results |
| detected_domains | array | Auto-detected content domains |
| chunks_created | integer | Number of chunks created |
| entities_extracted | integer | Number of entities extracted |
| relations_extracted | integer | Number of relations extracted |
| quality_score | number | Content quality score (0-1) |
| duplicates_removed | integer | Number of duplicates removed |
| storage_info | object | Storage locations and status |
| processing_time_ms | number | Total processing time in milliseconds |
JSON vs Form-Data
Use /api/v1/ingest for simple file uploads via form-data, or /api/v1/ingest/unified for advanced control and detailed analytics.
Comparison: Basic vs Unified Ingestion
| Feature | /api/v1/ingest | /api/v1/ingest/unified |
|---|---|---|
| Request Format | Form-data (multipart) | JSON |
| File Upload | Direct file upload | URL or text content |
| Processing Modes | Standard | Auto, Fast, Comprehensive, Domain-specific |
| Quality Checks | Basic | Advanced with scoring |
| Deduplication | No | Yes, configurable |
| Domain Detection | Manual | Automatic |
| Async Processing | Always async | Configurable |
| Response Detail | Basic stats | Comprehensive analytics |
| Priority Control | No | Yes |