
Data Ingestion

Upload and process documents from files or URLs into knowledge graphs. The ingestion pipeline routes through MainOrchestrator.ingest_data(), which coordinates the UnifiedIngestionPipeline to automatically extract entities, relationships, and embeddings.

POST

/api/v1/ingest

Ingest data from files (PDF, CSV, TXT) or URLs into a specified knowledge graph. Routes through MainOrchestrator.ingest_data() → UnifiedIngestionService → UnifiedIngestionPipeline. The system automatically extracts entities, builds relationships, and generates vector embeddings for semantic search.

Ingestion Processing Flow
API Request → MainOrchestrator.ingest_data()
  → UnifiedIngestionService.ingest_file() / ingest_url()
    → UnifiedIngestionPipeline.process()

┌────────────────────────────────────┐
│ Document Processing                │
│ • URL fetching (if URL)            │
│ • PDF extraction (if PDF)          │
│ • CSV parsing (if CSV)             │
│ • Text extraction                  │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Chunking                           │
│ • HierarchicalSemanticChunker      │
│ • Semantic chunking with overlap   │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Entity & Relation Extraction       │
│ • LLMEntityExtractor               │
│ • NER + Relation extraction        │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Knowledge Graph Building           │
│ • KGEntity creation                │
│ • KGRelation creation              │
│ • KGChunk creation                 │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Storage & Embedding                │
│ • SQLHandler → SQLite storage      │
│ • MultiKGTextEmbeddingProcessor    │
│   → Qdrant vector storage          │
│ • KGStorageService → Export files  │
│ • EnhancedKGVisualizer → Visuals   │
└────────────────────────────────────┘

Form Data Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| kg_name | string | Yes | Target knowledge graph name |
| file | file | No* | File to upload (PDF, CSV, TXT) |
| doc_url | string | No* | URL to fetch and ingest content from |
| source_name | string | No | Human-readable source name (auto-generated if not provided) |
| custom_fields_json | JSON string | No | Additional metadata as JSON string (e.g., {"domain": "health", "source_type": "official"}) |

* Either file or doc_url must be provided
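
For example, a minimal upload call using Python's requests library. This is a sketch: the base URL and X-API-Key header follow the status-check example later on this page, and the file name and source name are placeholders.

import requests

# Minimal sketch of a multipart form-data upload to /api/v1/ingest.
# Base URL, API key, and file name are placeholders for your deployment.
with open("report.pdf", "rb") as fh:
    response = requests.post(
        "https://your-api.com/api/v1/ingest",
        headers={"X-API-Key": "your-api-key"},
        data={
            "kg_name": "KG_Universal",
            "source_name": "Quarterly Report",
            "custom_fields_json": '{"domain": "health", "source_type": "official"}',
        },
        files={"file": fh},
    )
print(response.json())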

Response Fields

| Field | Type | Description |
|---|---|---|
| job_id | string | Unique identifier for tracking the ingestion job |
| message | string | Human-readable status message |
| kg_name | string | Knowledge graph that received the data |
| source_name | string | Name of the ingested source |
| source_id | string | Unique source identifier for future reference |
| entities_count | integer | Number of entities extracted |
| relations_count | integer | Number of relationships created |
| chunks_count | integer | Number of text chunks created |

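An illustrative success response assembled from the fields above (all values are made up):

{
  "job_id": "job_abc123",
  "message": "Ingestion completed successfully",
  "kg_name": "KG_Universal",
  "source_name": "Quarterly Report",
  "source_id": "src_789",
  "entities_count": 245,
  "relations_count": 512,
  "chunks_count": 89
}
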
Supported File Types

| Format | Extension | Max Size | Notes |
|---|---|---|---|
| PDF | .pdf | 50 MB | Text extraction with layout preservation |
| CSV | .csv | 20 MB | Automatic schema detection |
| Text | .txt | 10 MB | Plain text processing |
| Markdown | .md | 10 MB | Preserves formatting and structure |

Processing Pipeline

After ingestion, the document goes through several processing stages:

  1. Document Parsing: Extract text and structure from the file
  2. Chunking: Split content into semantic chunks
  3. Entity Extraction: Identify named entities (NER)
  4. Relation Extraction: Detect relationships between entities
  5. Embedding Generation: Create vector embeddings for semantic search
  6. Graph Storage: Store entities and relations via SQLHandler (SQLite)
  7. Vector Storage: Store embeddings in Qdrant
  8. Visualization: Generate interactive dashboards
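
For intuition, here is a stub-level sketch of that stage sequence in Python. Every function below is a stand-in, not the system's real internal API; the actual work happens inside UnifiedIngestionPipeline.

def parse_document(path: str) -> str:
    # Stage 1 (stub): real parsing handles PDF/CSV layout, not just plain text
    with open(path, encoding="utf-8") as fh:
        return fh.read()

def chunk(text: str, size: int = 1000) -> list[str]:
    # Stage 2 (stub): the real chunker is semantic, with overlap
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_entities(chunks: list[str]) -> list[str]:
    # Stages 3-4 (stub): the real extractor is LLM-based (NER + relations)
    return []

def ingest(path: str) -> dict:
    text = parse_document(path)
    chunks = chunk(text)
    entities = extract_entities(chunks)
    # Stages 5-8 (embedding, graph/vector storage, visualization) omitted
    return {"chunks_count": len(chunks), "entities_count": len(entities)}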

Checking Ingestion Status

Use the source management endpoints to check ingestion progress:

# Check source status
curl -X GET https://your-api.com/api/v1/sources/src_789/stats \
  -H "X-API-Key: your-api-key"

# Response shows processing status
{
  "source_id": "src_789",
  "status": "completed",
  "chunks_count": 89,
  "entities_count": 245,
  "relations_count": 512
}

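To block until processing finishes, you can poll that endpoint. A sketch follows; note that the "failed" terminal status is an assumption, since only "completed" appears in the response above.

import time
import requests

# Poll /api/v1/sources/{id}/stats until the source reaches a terminal status.
# "failed" as a terminal value is an assumption; only "completed" is documented.
def wait_for_source(source_id: str, timeout_s: int = 600) -> dict:
    url = f"https://your-api.com/api/v1/sources/{source_id}/stats"
    headers = {"X-API-Key": "your-api-key"}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        stats = requests.get(url, headers=headers).json()
        if stats.get("status") in ("completed", "failed"):
            return stats
        time.sleep(5)  # modest delay between polls
    raise TimeoutError(f"source {source_id} not finished after {timeout_s}s")

print(wait_for_source("src_789"))
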
Error Responses

Missing Required Parameters

{
  "detail": "Either 'file' or 'doc_url' must be provided",
  "status_code": 400
}

File Too Large

{
  "detail": "File size exceeds maximum limit of 50MB",
  "status_code": 413
}

Unsupported File Type

{
  "detail": "File type 'docx' is not supported. Supported formats: pdf, csv, txt, md",
  "status_code": 400
}

Invalid URL

{
  "detail": "Could not fetch content from URL. Please check the URL is accessible.",
  "status_code": 400
}

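Client code can branch on the status codes above; a hedged sketch:

import requests

# Map the documented error responses onto status-code checks.
resp = requests.post(
    "https://your-api.com/api/v1/ingest",
    headers={"X-API-Key": "your-api-key"},
    data={"kg_name": "KG_Universal", "doc_url": "https://example.com/article"},
)
if resp.status_code == 413:
    print("File too large:", resp.json()["detail"])
elif resp.status_code == 400:
    # Missing parameters, unsupported file type, or unreachable URL
    print("Bad request:", resp.json()["detail"])
else:
    resp.raise_for_status()  # surface any other HTTP error
    print("Accepted, job_id:", resp.json()["job_id"])
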
Best Practices

  • Meaningful names: Use descriptive source names for easier management
  • Metadata: Add rich metadata to improve search and filtering
  • File size: Split large documents into smaller files for faster processing
  • Format choice: Use PDF for documents, CSV for structured data
  • URL validation: Ensure URLs are publicly accessible before ingestion
  • Batch processing: For multiple files, use asynchronous ingestion (a batch sketch follows this list)
  • Monitor jobs: Track job_id to check completion status
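
A minimal batch-submission sketch for those last two tips, sequential for clarity; the directory path is a placeholder.

import pathlib
import requests

# Submit each PDF in a directory and collect job IDs for later monitoring.
jobs = []
for path in pathlib.Path("docs").glob("*.pdf"):
    with open(path, "rb") as fh:
        resp = requests.post(
            "https://your-api.com/api/v1/ingest",
            headers={"X-API-Key": "your-api-key"},
            data={"kg_name": "KG_Universal", "source_name": path.stem},
            files={"file": fh},
        )
    jobs.append(resp.json()["job_id"])
print(f"Submitted {len(jobs)} ingestion jobs")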

Performance Optimization

| File Size | Typical Processing Time | Tips |
|---|---|---|
| Small (<1 MB) | 10-30 seconds | Ideal for real-time ingestion |
| Medium (1-10 MB) | 30-120 seconds | Use job tracking for status updates |
| Large (10-50 MB) | 2-10 minutes | Consider async processing workflows |

Custom Metadata Schema

You can attach any JSON-serializable metadata to your sources:

Example Metadata Schemas
// Research Paper
{
  "title": "Deep Learning for NLP",
  "authors": ["Author One", "Author Two"],
  "year": 2024,
  "venue": "ACL 2024",
  "citations": 150,
  "keywords": ["NLP", "transformers"]
}

// News Article
{
  "published_date": "2024-01-15",
  "author": "Jane Journalist",
  "category": "technology",
  "tags": ["AI", "innovation"],
  "read_time_minutes": 5
}

// Legal Document
{
  "case_number": "2024-CV-1234",
  "jurisdiction": "Federal",
  "date_filed": "2024-01-15",
  "parties": ["Plaintiff Corp", "Defendant Inc"],
  "status": "active"
}
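
Attaching a schema like these at ingest time is just a matter of serializing it into the custom_fields_json form field; a sketch using the research-paper schema above (file name and base URL are placeholders):

import json
import requests

# Pass metadata as a JSON string via the custom_fields_json form field.
metadata = {
    "title": "Deep Learning for NLP",
    "authors": ["Author One", "Author Two"],
    "year": 2024,
    "keywords": ["NLP", "transformers"],
}
with open("paper.pdf", "rb") as fh:
    resp = requests.post(
        "https://your-api.com/api/v1/ingest",
        headers={"X-API-Key": "your-api-key"},
        data={
            "kg_name": "KG_Universal",
            "source_name": "Deep Learning for NLP",
            "custom_fields_json": json.dumps(metadata),
        },
        files={"file": fh},
    )
print(resp.json())
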
POST

/api/v1/ingest/unified

Advanced unified ingestion endpoint with JSON-based requests and comprehensive features. Supports multiple processing modes, quality checks, deduplication, and asynchronous processing.

Processing Modes

| Mode | Description | Best For |
|---|---|---|
| auto | Automatic mode selection based on content type | General use when unsure of content characteristics |
| fast | Optimized for speed with basic processing | Large volumes, time-sensitive ingestion |
| comprehensive | Full processing with all features enabled | High-quality content requiring deep analysis |
| domain_specific | Specialized processing for specific domains | Domain-specific content with known characteristics |

Key Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| content | string | No* | Direct text content to ingest |
| document_url | string | No* | URL of document to fetch and ingest |
| kg_name | string | Yes | Target knowledge graph name |
| mode | string | No | Processing mode: auto, fast, comprehensive, domain_specific (default: auto) |
| chunking_strategy | string | No | Chunking strategy to use (default: semantic) |
| chunk_size | integer | No | Target chunk size (100-5000, default: 1000) |
| enable_entity_extraction | boolean | No | Enable LLM entity extraction (default: true) |
| enable_relation_extraction | boolean | No | Enable LLM relation extraction (default: true) |
| enable_quality_checks | boolean | No | Enable content quality validation (default: true) |
| enable_deduplication | boolean | No | Enable duplicate detection (default: true) |
| auto_domain_detection | boolean | No | Auto-detect content domains (default: true) |
| async_processing | boolean | No | Process asynchronously (default: false) |
| priority | string | No | Processing priority: low, normal, high, urgent (default: normal) |

* Either content or document_url must be provided
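
Putting those parameters together, a representative request body (values illustrative):

{
  "kg_name": "KG_Universal",
  "document_url": "https://example.com/large-document.pdf",
  "mode": "comprehensive",
  "chunking_strategy": "semantic",
  "chunk_size": 1000,
  "enable_entity_extraction": true,
  "enable_relation_extraction": true,
  "enable_quality_checks": true,
  "enable_deduplication": true,
  "auto_domain_detection": true,
  "async_processing": false,
  "priority": "normal"
}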

Response Fields

| Field | Type | Description |
|---|---|---|
| success | boolean | Whether ingestion was successful |
| source_info | object | Information about the processed source |
| content_analysis | object | Content analysis results |
| detected_domains | array | Auto-detected content domains |
| chunks_created | integer | Number of chunks created |
| entities_extracted | integer | Number of entities extracted |
| relations_extracted | integer | Number of relations extracted |
| quality_score | number | Content quality score (0-1) |
| duplicates_removed | integer | Number of duplicates removed |
| storage_info | object | Storage locations and status |
| processing_time_ms | number | Total processing time in milliseconds |
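
A trimmed, illustrative response; the source_info, content_analysis, and storage_info objects are deployment-specific, so they are elided here:

{
  "success": true,
  "detected_domains": ["technology"],
  "chunks_created": 42,
  "entities_extracted": 118,
  "relations_extracted": 205,
  "quality_score": 0.87,
  "duplicates_removed": 3,
  "processing_time_ms": 8450
}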

JSON vs Form-Data

The unified ingestion endpoint uses JSON requests for programmatic access and advanced features. Use /api/v1/ingest for simple file uploads via form-data, or /api/v1/ingest/unified for advanced control and detailed analytics.

Comparison: Basic vs Unified Ingestion

| Feature | /api/v1/ingest | /api/v1/ingest/unified |
|---|---|---|
| Request Format | Form-data (multipart) | JSON |
| File Upload | Direct file upload | URL or text content |
| Processing Modes | Standard | Auto, Fast, Comprehensive, Domain-specific |
| Quality Checks | Basic | Advanced with scoring |
| Deduplication | No | Yes, configurable |
| Domain Detection | Manual | Automatic |
| Async Processing | Always async | Configurable |
| Response Detail | Basic stats | Comprehensive analytics |
| Priority Control | No | Yes |

File Upload Example
# Upload a file using unified ingestion
result = client.ingestion.ingest_unified(
    kg_name="KG_Universal",
    file_path="/path/to/document.pdf",
    mode="comprehensive",
    source_name="Research Paper",
    enable_entity_extraction=True,
    enable_relation_extraction=True,
)
print(f"Success: {result.success}")
print(f"Entities extracted: {result.entities_extracted}")

Async Processing Example
# For large documents, use async processing
result = client.ingestion.ingest_unified(
    kg_name="KG_Universal",
    document_url="https://example.com/large-document.pdf",
    mode="comprehensive",
    async_processing=True,
    priority="high",
    webhook_url="https://your-app.com/ingestion-callback",
)

# Returns immediately with a job_id
print(f"Job ID: {result.job_id}")
print(f"Status: {result.message}")