Workspace Indexing

NikCLI’s workspace indexing system intelligently analyzes your codebase to build a searchable knowledge base. It combines file filtering, language detection, importance scoring, and vector embeddings to create an efficient context retrieval system.

How It Works

1. File Discovery & Filtering

The system scans your workspace and applies intelligent filtering:
import { WorkspaceContextManager } from '@nicomatt69/nikcli';

const workspace = new WorkspaceContextManager(process.cwd());

// Automatic filtering applied:
// ✓ Respects .gitignore
// ✓ Excludes node_modules, dist, build
// ✓ Filters by file size (default: 1MB limit)
// ✓ Detects binary files
// ✓ Applies custom rules

await workspace.refreshWorkspaceIndex();
Default Exclusions:
const excludedDirectories = [
  'node_modules',
  'dist',
  'build',
  '.next',
  '.cache',
  '.git',
  'coverage',
  '__pycache__'
];

const excludedExtensions = [
  '.jpg', '.jpeg', '.png', '.gif', '.svg',
  '.pdf', '.zip', '.tar', '.gz',
  '.mp4', '.avi', '.mov',
  '.exe', '.dll', '.so'
];

2. Language & Framework Detection

Automatic detection of languages and frameworks:
// Detected from file extensions
const languageMap = {
  '.ts': 'typescript',
  '.tsx': 'typescript',
  '.js': 'javascript',
  '.jsx': 'javascript',
  '.py': 'python',
  '.go': 'go',
  '.rs': 'rust',
  '.java': 'java',
  // ... 40+ languages supported
};

// Framework detection from package.json
const frameworks = {
  'next': 'Next.js',
  'react': 'React',
  'vue': 'Vue.js',
  'express': 'Express',
  'fastify': 'Fastify',
  // ... many more
};
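As a rough sketch, extension-based detection reduces to a lookup table plus a fallback. The `detectLanguage` helper below is illustrative only, not NikCLI's actual implementation:

```typescript
// Illustrative only: map a file path to a language via its extension.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  '.ts': 'typescript',
  '.tsx': 'typescript',
  '.js': 'javascript',
  '.jsx': 'javascript',
  '.py': 'python',
};

function detectLanguage(filePath: string): string {
  const dot = filePath.lastIndexOf('.');
  const ext = dot >= 0 ? filePath.slice(dot).toLowerCase() : '';
  return LANGUAGE_BY_EXTENSION[ext] ?? 'unknown';
}
```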

3. File Analysis

Each file is analyzed to extract:
interface FileContext {
  path: string;
  content: string;
  size: number;
  modified: Date;
  language: string;
  importance: number; // 0-100 score

  // Extracted metadata
  summary?: string;
  dependencies?: string[]; // import statements
  exports?: string[];      // exported symbols
  functions?: string[];    // function names
  classes?: string[];      // class names
  types?: string[];        // type/interface names
  tags?: string[];         // categorization tags

  // Performance optimization
  hash?: string;          // Content hash for change detection
  embedding?: number[];   // Vector embedding
  lastAnalyzed?: Date;
}

4. Importance Scoring

Files are scored based on multiple factors:
function calculateFileImportance(file: FileContext): number {
  let score = 50; // Base score

  // Path-based scoring
  if (isEntryPoint(file.path)) score += 25;       // index.ts, main.ts
  if (isConfig(file.path)) score += 20;           // package.json, tsconfig.json
  if (inSourceDir(file.path)) score += 15;        // src/, lib/
  if (isTest(file.path)) score -= 10;             // test files lower priority

  // Content-based scoring
  score += Math.min((file.exports?.length ?? 0) * 5, 25);   // Has exports
  score += Math.min((file.functions?.length ?? 0) * 2, 20); // Has functions
  score += Math.min((file.classes?.length ?? 0) * 3, 15);   // Has classes

  // Size-based scoring
  const lines = file.content.split('\n').length;
  if (lines > 100) score += 5;
  if (lines > 500) score += 10;

  return Math.min(100, Math.max(0, score));
}
Importance Categories:
  • 90-100: Entry points, core configuration
  • 70-89: Main source files, important modules
  • 50-69: Regular source files
  • 30-49: Utilities, helpers
  • 0-29: Tests, documentation, generated files
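The category boundaries above can be made concrete with a small helper. The `importanceCategory` function is hypothetical, shown only to illustrate the thresholds:

```typescript
// Hypothetical helper mapping an importance score (0-100) to its category.
function importanceCategory(score: number): string {
  if (score >= 90) return 'entry point / core config';
  if (score >= 70) return 'main source file';
  if (score >= 50) return 'regular source file';
  if (score >= 30) return 'utility / helper';
  return 'test / docs / generated';
}
```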

5. Vector Embedding Generation

Files are chunked and embedded for semantic search:
// Intelligent chunking preserves context
const chunks = intelligentChunking(file.content, file.language);

// Code chunking (TypeScript example):
// - Keeps functions/classes together
// - Respects bracket depth
// - Smart overlap at function boundaries
// - Typically 80-150 lines per chunk
declare function chunkCodeFile(content: string): string[];

// Markdown chunking:
// - Splits by headers
// - Preserves hierarchy
// - Maintains cross-references
// - Minimum 200 chars per section
declare function chunkMarkdownFile(content: string): string[];

// Generate embeddings
for (const [chunkIndex, chunk] of chunks.entries()) {
  const embedding = await unifiedEmbeddingInterface.generateEmbedding(chunk);
  await vectorStore.addDocument({
    id: `${file.path}#${chunkIndex}`,
    content: chunk,
    embedding: embedding.vector,
    metadata: {
      source: file.path,
      language: file.language,
      importance: file.importance,
      chunkIndex,
      totalChunks: chunks.length
    }
  });
}
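As a simplified illustration of the markdown strategy (split on headers, enforce a minimum section size), one might write something like the sketch below. It omits the hierarchy tracking and cross-references the real chunker maintains:

```typescript
// Simplified sketch: split markdown on h1/h2 headers, merging sections
// until they reach a minimum size (default 200 chars, as described above).
function chunkMarkdownSketch(content: string, minChars = 200): string[] {
  // Split before each line starting with "# " or "## ", keeping the header.
  const sections = content.split(/\n(?=#{1,2} )/);
  const chunks: string[] = [];
  let buffer = '';
  for (const section of sections) {
    buffer = buffer ? `${buffer}\n${section}` : section;
    if (buffer.length >= minChars) {
      chunks.push(buffer);
      buffer = '';
    }
  }
  if (buffer) chunks.push(buffer); // trailing short section
  return chunks;
}
```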

Indexing Strategies

Full Workspace Index

Index entire workspace:
import { unifiedRAGSystem } from '@nicomatt69/nikcli';

// Analyze and index full workspace
const analysis = await unifiedRAGSystem.analyzeProject(process.cwd());

console.log({
  indexedFiles: analysis.indexedFiles,
  cost: `$${analysis.embeddingsCost.toFixed(4)}`,
  time: `${analysis.processingTime}ms`,
  vectorDB: analysis.vectorDBStatus
});

// Example output:
// {
//   indexedFiles: 342,
//   cost: '$0.0234',
//   time: '12450ms',
//   vectorDB: 'available'
// }

Selective Indexing

Index specific paths:
import { WorkspaceContextManager } from '@nicomatt69/nikcli';

const workspace = new WorkspaceContextManager();

// Select specific paths to index
await workspace.selectPaths([
  'src/core',
  'src/agents',
  'src/tools',
  'README.md'
]);

// Only selected paths will be indexed
const context = workspace.getContext();
console.log(`Indexed ${context.files.size} files from selected paths`);

Incremental Updates

Only re-index changed files:
// File change detection via hash
const fileHash = generateFileHash(filePath, content);

if (cachedHash !== fileHash) {
  // File changed, re-index
  await analyzeFile(filePath, content);
  updateCache(filePath, fileHash);
} else {
  // File unchanged, use cached analysis
  const cached = getCachedAnalysis(filePath);
}
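A content hash for this kind of change detection can be built with Node's standard `crypto` module. The sketch below hashes path and content together; it is an assumption about, not a copy of, NikCLI's `generateFileHash`:

```typescript
import { createHash } from 'node:crypto';

// Illustrative sketch: deterministic content hash for change detection.
function hashContent(filePath: string, content: string): string {
  return createHash('sha256')
    .update(filePath)
    .update('\0') // separator so path/content boundaries stay unambiguous
    .update(content)
    .digest('hex');
}
```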

Configuration

File Filter Options

import { createFileFilter } from '@nicomatt69/nikcli';

const fileFilter = createFileFilter(process.cwd(), {
  // Respect .gitignore
  respectGitignore: true,

  // Size limits
  maxFileSize: 1024 * 1024, // 1MB per file
  maxTotalFiles: 1000,

  // Include/exclude
  includeExtensions: ['.ts', '.js', '.tsx', '.jsx', '.py'],
  excludeExtensions: ['.test.ts', '.spec.ts'],
  excludeDirectories: ['node_modules', 'dist', 'build'],
  excludePatterns: ['**/*.generated.ts', '**/vendor/**'],

  // Custom rules
  customRules: [
    {
      name: 'priority_configs',
      pattern: /\.(json|yaml|yml|toml)$/,
      type: 'include',
      priority: 10,
      reason: 'Important configuration files'
    },
    {
      name: 'skip_tests',
      pattern: /\.(test|spec)\.(ts|js|tsx|jsx)$/,
      type: 'exclude',
      priority: 8,
      reason: 'Test files have lower priority'
    }
  ]
});

// Check if file should be indexed
const result = fileFilter.shouldIncludeFile(filePath, rootPath);
if (result.allowed) {
  await indexFile(filePath);
}

Chunking Configuration

import { TOKEN_LIMITS } from '@nicomatt69/nikcli';

// Configure chunk sizes
const config = {
  // Token-based chunking
  chunkTokens: TOKEN_LIMITS.RAG?.CHUNK_TOKENS ?? 700,
  overlapTokens: TOKEN_LIMITS.RAG?.CHUNK_OVERLAP_TOKENS ?? 80,

  // Code-specific
  codeChunkMinLines: TOKEN_LIMITS.RAG?.CODE_CHUNK_MIN_LINES ?? 80,
  codeChunkMaxLines: TOKEN_LIMITS.RAG?.CODE_CHUNK_MAX_LINES ?? 150,

  // Markdown-specific
  markdownMinSection: TOKEN_LIMITS.RAG?.MARKDOWN_MIN_SECTION ?? 200,
};

unifiedRAGSystem.updateConfig(config);

Cost Management

// Set indexing cost threshold
unifiedRAGSystem.updateConfig({
  costThreshold: 0.10 // Stop if exceeds $0.10
});

// Estimate costs before indexing
const files = await glob('**/*.{ts,js}');
const estimatedCost = await estimateIndexingCost(files, process.cwd());

if (estimatedCost > 0.10) {
  console.warn(`Estimated cost: $${estimatedCost.toFixed(4)}`);
  console.warn('Consider reducing scope or using selective indexing');
}
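For a back-of-the-envelope estimate, a common heuristic is roughly 4 characters per token. The price used below is an assumed placeholder for illustration, not a NikCLI default:

```typescript
// Rough cost estimate: chars -> tokens (~4 chars/token) -> dollars.
// pricePerMillionTokens is an assumed example value, not a real default.
function estimateEmbeddingCost(
  totalChars: number,
  pricePerMillionTokens = 0.02
): number {
  const tokens = totalChars / 4;
  return (tokens / 1_000_000) * pricePerMillionTokens;
}
```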

Monitoring & Optimization

Index Statistics

const workspace = new WorkspaceContextManager();
const stats = workspace.getPerformanceStats();

console.log({
  totalFiles: stats.totalFiles,
  totalDirectories: stats.totalDirectories,

  cacheStats: {
    hits: stats.cacheStats.hits,
    misses: stats.cacheStats.misses,
    hitRate: `${((stats.cacheStats.hits / (stats.cacheStats.hits + stats.cacheStats.misses)) * 100).toFixed(1)}%`
  },

  cacheSize: {
    semanticSearch: stats.cacheSize.semanticSearch,
    fileContent: stats.cacheSize.fileContent,
    embeddings: stats.cacheSize.embeddings,
    analysis: stats.cacheSize.analysis
  },

  ragAvailable: stats.ragAvailable,
  lastUpdated: stats.lastUpdated
});

Cache Management

// Clear all caches
workspace.clearAllCaches();

// Optimize cache (remove old entries)
await workspace.optimizeCache();

// Manual cache cleanup
setInterval(async () => {
  await workspace.optimizeCache();
}, 3600000); // Every hour

Watch Mode

Monitor file changes and re-index automatically:
// Start watching for changes
workspace.startWatching();

// Files are automatically re-analyzed when changed
// Debounced to 1 second to avoid excessive re-indexing

// Stop watching
workspace.stopWatching();
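The 1-second debounce mentioned above is conceptually just a timer reset on every change event. A generic sketch (not the watcher's actual internals):

```typescript
// Generic debounce: only the last call within `ms` actually fires.
function debounce<T extends (...args: any[]) => void>(fn: T, ms: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Parameters<T>) => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}
```

In the watcher's case, `fn` would be the re-analysis of a changed file and `ms` would be 1000.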

Best Practices

1. Optimize Index Scope

// Instead of indexing everything:
// await unifiedRAGSystem.analyzeProject(process.cwd());

// Index only source code:
const workspace = new WorkspaceContextManager();
await workspace.selectPaths([
  'src',
  'lib',
  'package.json',
  'tsconfig.json',
  'README.md'
]);

2. Use Appropriate Filters

const fileFilter = createFileFilter(process.cwd(), {
  // Include only code files
  includeExtensions: [
    '.ts', '.tsx', '.js', '.jsx', // JavaScript/TypeScript
    '.py',                         // Python
    '.go',                         // Go
    '.rs'                          // Rust
  ],

  // Exclude test files
  excludePatterns: [
    '**/*.test.*',
    '**/*.spec.*',
    '**/__tests__/**',
    '**/__mocks__/**'
  ]
});

3. Leverage Caching

// Enable all caching
process.env.CACHE_RAG = 'true';
process.env.CACHE_AI = 'true';

// Embeddings cached for 24 hours
// Analysis cached for 5 minutes
// File hashes cached for 7 days

4. Monitor Costs

// Track embedding costs
const analysis = await unifiedRAGSystem.analyzeProject(process.cwd());
console.log(`Indexing cost: $${analysis.embeddingsCost.toFixed(4)}`);

// Use local-only mode if needed
unifiedRAGSystem.updateConfig({
  useVectorDB: false,        // Disable vector DB
  useLocalEmbeddings: true,  // Use simple TF-IDF
  hybridMode: false
});

Troubleshooting

High Indexing Costs

# Problem: Indexing costs too high
# Solution: Reduce scope and enable local embeddings

# 1. Selective indexing
nikcli --index-paths "src,lib"

# 2. Use local embeddings
export USE_LOCAL_EMBEDDINGS=true

# 3. Set cost limit
export INDEXING_COST_THRESHOLD=0.05

Large Workspaces

# Problem: Workspace too large
# Solution: Increase limits or use selective indexing

# 1. Increase file limit
export MAX_INDEX_FILES=2000

# 2. Increase file size limit
export MAX_FILE_SIZE_MB=2

# 3. Use selective paths
nikcli --index-paths "src/core,src/agents"

Slow Indexing

// Problem: Indexing takes too long
// Solution: Optimize batch sizes and use caching

unifiedRAGSystem.updateConfig({
  indexingBatchSize: 500,  // Larger batches
  embedBatchSize: 100,     // Parallel embedding generation
});

// Enable caching
process.env.CACHE_RAG = 'true';