
Data Ingestion

Upload and process documents from files or URLs into knowledge graphs. The ingestion pipeline routes through MainOrchestrator.ingest_data(), which coordinates the UnifiedIngestionPipeline to automatically extract entities, relationships, and embeddings.

POST

/api/v1/ingest

Ingest data from files (PDF, CSV, TXT) or URLs into a specified knowledge graph. Routes through MainOrchestrator.ingest_data() → UnifiedIngestionService → UnifiedIngestionPipeline. The system automatically extracts entities, builds relationships, and generates vector embeddings for semantic search.

Ingestion Processing Flow
API Request → MainOrchestrator.ingest_data()
  → UnifiedIngestionService.ingest_file() / ingest_url()
    → UnifiedIngestionPipeline.process()

┌────────────────────────────────────┐
│ Document Processing                │
│ • URL fetching (if URL)            │
│ • PDF extraction (if PDF)          │
│ • CSV parsing (if CSV)             │
│ • Text extraction                  │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Chunking                           │
│ • HierarchicalSemanticChunker      │
│ • Semantic chunking with overlap   │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Entity & Relation Extraction       │
│ • LLMEntityExtractor               │
│ • NER + Relation extraction        │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Knowledge Graph Building           │
│ • KGEntity creation                │
│ • KGRelation creation              │
│ • KGChunk creation                 │
└────────────────────────────────────┘
                  ↓
┌────────────────────────────────────┐
│ Storage & Embedding                │
│ • SQLHandler → SQLite storage      │
│ • MultiKGTextEmbeddingProcessor    │
│   → Qdrant vector storage          │
│ • KGStorageService → Export files  │
│ • EnhancedKGVisualizer → Visuals   │
└────────────────────────────────────┘

Form Data Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| kg_name | string | Yes | Target knowledge graph name |
| file | file | No* | File to upload (PDF, CSV, TXT) |
| doc_url | string | No* | URL to fetch and ingest content from |
| source_name | string | No | Human-readable source name (auto-generated if not provided) |
| custom_fields_json | JSON string | No | Additional metadata as JSON string (e.g., {"domain": "health", "source_type": "official"}) |

* Either file or doc_url must be provided
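
For example, a minimal upload call using Python's requests library. This is a sketch: the base URL and X-API-Key header follow the status-check example later on this page, and the file name and source name are placeholders.

import requests

# Minimal sketch of a multipart form-data upload to /api/v1/ingest.
# Base URL, API key, and file name are placeholders for your deployment.
with open("report.pdf", "rb") as fh:
    response = requests.post(
        "https://your-api.com/api/v1/ingest",
        headers={"X-API-Key": "your-api-key"},
        data={
            "kg_name": "KG_Universal",
            "source_name": "Quarterly Report",
            "custom_fields_json": '{"domain": "health", "source_type": "official"}',
        },
        files={"file": fh},
    )
print(response.json())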

Response Fields

| Field | Type | Description |
|---|---|---|
| job_id | string | Unique identifier for tracking the ingestion job |
| message | string | Human-readable status message |
| kg_name | string | Knowledge graph that received the data |
| source_name | string | Name of the ingested source |
| source_id | string | Unique source identifier for future reference |
| entities_count | integer | Number of entities extracted |
| relations_count | integer | Number of relationships created |
| chunks_count | integer | Number of text chunks created |

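An illustrative success response assembled from the fields above (all values are made up):

{
  "job_id": "job_abc123",
  "message": "Ingestion completed successfully",
  "kg_name": "KG_Universal",
  "source_name": "Quarterly Report",
  "source_id": "src_789",
  "entities_count": 245,
  "relations_count": 512,
  "chunks_count": 89
}
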
Supported File Types

| Format | Extension | Max Size | Notes |
|---|---|---|---|
| PDF | .pdf | 50 MB | Text extraction with layout preservation |
| CSV | .csv | 20 MB | Automatic schema detection |
| Text | .txt | 10 MB | Plain text processing |
| Markdown | .md | 10 MB | Preserves formatting and structure |

Processing Pipeline

After ingestion, the document goes through several processing stages:

  1. Document Parsing: Extract text and structure from the file
  2. Chunking: Split content into semantic chunks
  3. Entity Extraction: Identify named entities (NER)
  4. Relation Extraction: Detect relationships between entities
  5. Embedding Generation: Create vector embeddings for semantic search
  6. Graph Storage: Store entities and relations via SQLHandler (SQLite)
  7. Vector Storage: Store embeddings in Qdrant
  8. Visualization: Generate interactive dashboards
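
For intuition, here is a stub-level sketch of that stage sequence in Python. Every function below is a stand-in, not the system's real internal API; the actual work happens inside UnifiedIngestionPipeline.

def parse_document(path: str) -> str:
    # Stage 1 (stub): real parsing handles PDF/CSV layout, not just plain text
    with open(path, encoding="utf-8") as fh:
        return fh.read()

def chunk(text: str, size: int = 1000) -> list[str]:
    # Stage 2 (stub): the real chunker is semantic, with overlap
    return [text[i:i + size] for i in range(0, len(text), size)]

def extract_entities(chunks: list[str]) -> list[str]:
    # Stages 3-4 (stub): the real extractor is LLM-based (NER + relations)
    return []

def ingest(path: str) -> dict:
    text = parse_document(path)
    chunks = chunk(text)
    entities = extract_entities(chunks)
    # Stages 5-8 (embedding, graph/vector storage, visualization) omitted
    return {"chunks_count": len(chunks), "entities_count": len(entities)}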

Checking Ingestion Status

Use the source management endpoints to check ingestion progress:

# Check source status
curl -X GET https://your-api.com/api/v1/sources/src_789/stats \
  -H "X-API-Key: your-api-key"

# Response shows processing status
{
  "source_id": "src_789",
  "status": "completed",
  "chunks_count": 89,
  "entities_count": 245,
  "relations_count": 512
}

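To block until processing finishes, you can poll that endpoint. A sketch follows; note that the "failed" terminal status is an assumption, since only "completed" appears in the response above.

import time
import requests

# Poll /api/v1/sources/{id}/stats until the source reaches a terminal status.
# "failed" as a terminal value is an assumption; only "completed" is documented.
def wait_for_source(source_id: str, timeout_s: int = 600) -> dict:
    url = f"https://your-api.com/api/v1/sources/{source_id}/stats"
    headers = {"X-API-Key": "your-api-key"}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        stats = requests.get(url, headers=headers).json()
        if stats.get("status") in ("completed", "failed"):
            return stats
        time.sleep(5)  # modest delay between polls
    raise TimeoutError(f"source {source_id} not finished after {timeout_s}s")

print(wait_for_source("src_789"))
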
Error Responses

Missing Required Parameters

{
  "detail": "Either 'file' or 'doc_url' must be provided",
  "status_code": 400
}

File Too Large

{
  "detail": "File size exceeds maximum limit of 50MB",
  "status_code": 413
}

Unsupported File Type

{
  "detail": "File type 'docx' is not supported. Supported formats: pdf, csv, txt, md",
  "status_code": 400
}

Invalid URL

{
  "detail": "Could not fetch content from URL. Please check the URL is accessible.",
  "status_code": 400
}

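Client code can branch on the status codes above; a hedged sketch:

import requests

# Map the documented error responses onto status-code checks.
resp = requests.post(
    "https://your-api.com/api/v1/ingest",
    headers={"X-API-Key": "your-api-key"},
    data={"kg_name": "KG_Universal", "doc_url": "https://example.com/article"},
)
if resp.status_code == 413:
    print("File too large:", resp.json()["detail"])
elif resp.status_code == 400:
    # Missing parameters, unsupported file type, or unreachable URL
    print("Bad request:", resp.json()["detail"])
else:
    resp.raise_for_status()  # surface any other HTTP error
    print("Accepted, job_id:", resp.json()["job_id"])
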
Best Practices

  • Meaningful names: Use descriptive source names for easier management
  • Metadata: Add rich metadata to improve search and filtering
  • File size: Split large documents into smaller files for faster processing
  • Format choice: Use PDF for documents, CSV for structured data
  • URL validation: Ensure URLs are publicly accessible before ingestion
  • Batch processing: For multiple files, use asynchronous ingestion (a batch sketch follows this list)
  • Monitor jobs: Track job_id to check completion status
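
A minimal batch-submission sketch for those last two tips, sequential for clarity; the directory path is a placeholder.

import pathlib
import requests

# Submit each PDF in a directory and collect job IDs for later monitoring.
jobs = []
for path in pathlib.Path("docs").glob("*.pdf"):
    with open(path, "rb") as fh:
        resp = requests.post(
            "https://your-api.com/api/v1/ingest",
            headers={"X-API-Key": "your-api-key"},
            data={"kg_name": "KG_Universal", "source_name": path.stem},
            files={"file": fh},
        )
    jobs.append(resp.json()["job_id"])
print(f"Submitted {len(jobs)} ingestion jobs")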

Performance Optimization

| File Size | Typical Processing Time | Tips |
|---|---|---|
| Small (<1 MB) | 10-30 seconds | Ideal for real-time ingestion |
| Medium (1-10 MB) | 30-120 seconds | Use job tracking for status updates |
| Large (10-50 MB) | 2-10 minutes | Consider async processing workflows |

Custom Metadata Schema

You can attach any JSON-serializable metadata to your sources:

Example Metadata Schemas
// Research Paper
{
  "title": "Deep Learning for NLP",
  "authors": ["Author One", "Author Two"],
  "year": 2024,
  "venue": "ACL 2024",
  "citations": 150,
  "keywords": ["NLP", "transformers"]
}

// News Article
{
  "published_date": "2024-01-15",
  "author": "Jane Journalist",
  "category": "technology",
  "tags": ["AI", "innovation"],
  "read_time_minutes": 5
}

// Legal Document
{
  "case_number": "2024-CV-1234",
  "jurisdiction": "Federal",
  "date_filed": "2024-01-15",
  "parties": ["Plaintiff Corp", "Defendant Inc"],
  "status": "active"
}
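
Attaching a schema like these at ingest time is just a matter of serializing it into the custom_fields_json form field; a sketch using the research-paper schema above (file name and base URL are placeholders):

import json
import requests

# Pass metadata as a JSON string via the custom_fields_json form field.
metadata = {
    "title": "Deep Learning for NLP",
    "authors": ["Author One", "Author Two"],
    "year": 2024,
    "keywords": ["NLP", "transformers"],
}
with open("paper.pdf", "rb") as fh:
    resp = requests.post(
        "https://your-api.com/api/v1/ingest",
        headers={"X-API-Key": "your-api-key"},
        data={
            "kg_name": "KG_Universal",
            "source_name": "Deep Learning for NLP",
            "custom_fields_json": json.dumps(metadata),
        },
        files={"file": fh},
    )
print(resp.json())
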
POST

/api/v1/ingest/unified

Advanced unified ingestion endpoint with JSON-based requests and comprehensive features. Supports multiple processing modes, quality checks, deduplication, and asynchronous processing.

Processing Modes

| Mode | Description | Best For |
|---|---|---|
| auto | Automatic mode selection based on content type | General use when unsure of content characteristics |
| fast | Optimized for speed with basic processing | Large volumes, time-sensitive ingestion |
| comprehensive | Full processing with all features enabled | High-quality content requiring deep analysis |
| domain_specific | Specialized processing for specific domains | Domain-specific content with known characteristics |

Key Request Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| content | string | No* | Direct text content to ingest |
| document_url | string | No* | URL of document to fetch and ingest |
| kg_name | string | Yes | Target knowledge graph name |
| mode | string | No | Processing mode: auto, fast, comprehensive, domain_specific (default: auto) |
| chunking_strategy | string | No | Chunking strategy to use (default: semantic) |
| chunk_size | integer | No | Target chunk size (100-5000, default: 1000) |
| enable_entity_extraction | boolean | No | Enable LLM entity extraction (default: true) |
| enable_relation_extraction | boolean | No | Enable LLM relation extraction (default: true) |
| enable_quality_checks | boolean | No | Enable content quality validation (default: true) |
| enable_deduplication | boolean | No | Enable duplicate detection (default: true) |
| auto_domain_detection | boolean | No | Auto-detect content domains (default: true) |
| async_processing | boolean | No | Process asynchronously (default: false) |
| priority | string | No | Processing priority: low, normal, high, urgent (default: normal) |

* Either content or document_url must be provided
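
Putting those parameters together, a representative request body (values illustrative):

{
  "kg_name": "KG_Universal",
  "document_url": "https://example.com/large-document.pdf",
  "mode": "comprehensive",
  "chunking_strategy": "semantic",
  "chunk_size": 1000,
  "enable_entity_extraction": true,
  "enable_relation_extraction": true,
  "enable_quality_checks": true,
  "enable_deduplication": true,
  "auto_domain_detection": true,
  "async_processing": false,
  "priority": "normal"
}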

Response Fields

| Field | Type | Description |
|---|---|---|
| success | boolean | Whether ingestion was successful |
| source_info | object | Information about the processed source |
| content_analysis | object | Content analysis results |
| detected_domains | array | Auto-detected content domains |
| chunks_created | integer | Number of chunks created |
| entities_extracted | integer | Number of entities extracted |
| relations_extracted | integer | Number of relations extracted |
| quality_score | number | Content quality score (0-1) |
| duplicates_removed | integer | Number of duplicates removed |
| storage_info | object | Storage locations and status |
| processing_time_ms | number | Total processing time in milliseconds |
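
A trimmed, illustrative response; the source_info, content_analysis, and storage_info objects are deployment-specific, so they are elided here:

{
  "success": true,
  "detected_domains": ["technology"],
  "chunks_created": 42,
  "entities_extracted": 118,
  "relations_extracted": 205,
  "quality_score": 0.87,
  "duplicates_removed": 3,
  "processing_time_ms": 8450
}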

JSON vs Form-Data

The unified ingestion endpoint uses JSON requests for programmatic access and advanced features. Use /api/v1/ingest for simple file uploads via form-data, or /api/v1/ingest/unified for advanced control and detailed analytics.

Comparison: Basic vs Unified Ingestion

| Feature | /api/v1/ingest | /api/v1/ingest/unified |
|---|---|---|
| Request Format | Form-data (multipart) | JSON |
| File Upload | Direct file upload | URL or text content |
| Processing Modes | Standard | Auto, Fast, Comprehensive, Domain-specific |
| Quality Checks | Basic | Advanced with scoring |
| Deduplication | No | Yes, configurable |
| Domain Detection | Manual | Automatic |
| Async Processing | Always async | Configurable |
| Response Detail | Basic stats | Comprehensive analytics |
| Priority Control | No | Yes |

File Upload Example
# Upload a file using unified ingestion
result = client.ingestion.ingest_unified(
    kg_name="KG_Universal",
    file_path="/path/to/document.pdf",
    mode="comprehensive",
    source_name="Research Paper",
    enable_entity_extraction=True,
    enable_relation_extraction=True,
)
print(f"Success: {result.success}")
print(f"Entities extracted: {result.entities_extracted}")

Async Processing Example
# For large documents, use async processing
result = client.ingestion.ingest_unified(
    kg_name="KG_Universal",
    document_url="https://example.com/large-document.pdf",
    mode="comprehensive",
    async_processing=True,
    priority="high",
    webhook_url="https://your-app.com/ingestion-callback",
)

# Returns immediately with a job_id
print(f"Job ID: {result.job_id}")
print(f"Status: {result.message}")