Documentation ¶
Index ¶
- Variables
- type CSVExtractor
- type Chunk
- type Chunker
- type ContentExtractor
- type DocxExtractor
- type ExtractorRegistry
- type FileResult
- type HTMLExtractor
- type JSONExtractor
- type ODFExtractor
- type PDFExtractor
- type RTFExtractor
- type SQLExtractor
- type TextExtractor
- type Walker
- type XMLExtractor
- type XlsxExtractor
- type YAMLExtractor
Constants ¶
This section is empty.
Variables ¶
var DefaultIgnoreList = map[string]bool{
	".git":         true,
	"node_modules": true,
	"vendor":       true,
	".idea":        true,
	".vscode":      true,
	"__pycache__":  true,
	".DS_Store":    true,
}
DefaultIgnoreList contains the names of directories (and of stray files such as .DS_Store) to skip during crawling.
var SupportedExtensions = []string{
	".txt", ".md", ".json", ".jsonl", ".csv", ".tsv", ".yaml", ".yml",
	".pdf", ".docx", ".xlsx", ".html", ".htm", ".sql",
	".odt", ".ods", ".odp", ".rtf", ".xml",
}
SupportedExtensions contains file extensions to process.
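A minimal sketch of how a caller might test a path against an extension list like this one. The `supported` slice and the `isSupported` helper are hypothetical and not part of this package; lowercasing the extension before comparing is an assumption, since the documentation does not say whether matching is case-sensitive.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supported mirrors a few entries of SupportedExtensions for illustration only.
var supported = []string{".txt", ".md", ".json", ".csv"}

// isSupported is a hypothetical helper: it reports whether the path's
// extension, lowercased (an assumption), appears in the supported list.
func isSupported(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	for _, s := range supported {
		if s == ext {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"notes/README.MD", "tool.exe", "Makefile"} {
		fmt.Println(p, isSupported(p))
	}
}
```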
Functions ¶
This section is empty.
Types ¶
type CSVExtractor ¶
type CSVExtractor struct {
	Separator rune
}
CSVExtractor handles CSV and TSV by converting rows to labeled strings.
type Chunker ¶
type Chunker struct {
	Size    int // Number of words per chunk
	Overlap int // Number of overlapping words
}
Chunker handles splitting text into overlapping windows.
func NewChunker ¶
NewChunker creates a new Chunker instance.
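To illustrate what overlapping word windows look like, here is a minimal sketch. The `chunkWords` helper is hypothetical and is not the Chunker's actual method; it assumes that each window advances by `size - overlap` words and that the last window may be shorter than `size`.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkWords is a hypothetical helper, not part of this package.
// It splits text into windows of up to size words, where consecutive
// windows share overlap words. It returns nil unless size > overlap >= 0.
func chunkWords(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	words := strings.Fields(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break // the final window reached the end of the text
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkWords("a b c d e f g", 4, 2) {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

With a size of 4 and an overlap of 2, each window repeats the last two words of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.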
type ContentExtractor ¶
ContentExtractor defines the interface for extracting text from various file formats.
type DocxExtractor ¶
type DocxExtractor struct{}
DocxExtractor handles .docx files using nguyenthenguyen/docx.
type ExtractorRegistry ¶
type ExtractorRegistry struct {
	// contains filtered or unexported fields
}
ExtractorRegistry maps file extensions to their respective extractors.
func NewExtractorRegistry ¶
func NewExtractorRegistry() *ExtractorRegistry
NewExtractorRegistry initializes the registry with all supported extractors.
func (*ExtractorRegistry) Get ¶
func (r *ExtractorRegistry) Get(ext string) (ContentExtractor, bool)
func (*ExtractorRegistry) Register ¶
func (r *ExtractorRegistry) Register(ext string, e ContentExtractor)
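The Register and Get pair suggests a map keyed by file extension. The sketch below shows that pattern with stand-in types: `extractor`, `plainText`, `registry`, and `newRegistry` are all hypothetical, the `Extract` method signature is an assumption (the ContentExtractor method set is not shown here), and lowercasing keys is a design choice of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// extractor is a hypothetical stand-in for ContentExtractor.
type extractor interface {
	Extract(data []byte) (string, error)
}

// plainText is a trivial extractor that returns the input unchanged.
type plainText struct{}

func (plainText) Extract(data []byte) (string, error) { return string(data), nil }

// registry maps lowercased extensions (an assumption) to extractors.
type registry struct {
	byExt map[string]extractor
}

func newRegistry() *registry {
	return &registry{byExt: make(map[string]extractor)}
}

// Register associates ext with e, replacing any earlier entry.
func (r *registry) Register(ext string, e extractor) {
	r.byExt[strings.ToLower(ext)] = e
}

// Get returns the extractor for ext and whether one was registered.
func (r *registry) Get(ext string) (extractor, bool) {
	e, ok := r.byExt[strings.ToLower(ext)]
	return e, ok
}

func main() {
	reg := newRegistry()
	reg.Register(".txt", plainText{})
	if e, ok := reg.Get(".TXT"); ok {
		text, _ := e.Extract([]byte("hello"))
		fmt.Println(text)
	}
	_, ok := reg.Get(".bin")
	fmt.Println(ok)
}
```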
type FileResult ¶
FileResult represents a discovered file and its content.
type JSONExtractor ¶
type JSONExtractor struct{}
JSONExtractor handles JSON and JSONL by extracting string values or pretty-printing.
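Extracting string values from arbitrary JSON can be sketched as a recursive walk over the decoded value. The helpers `collectStrings` and `extractStrings` are hypothetical, and visiting object keys in sorted order is an assumption made here only to keep the output deterministic; the order JSONExtractor actually uses is not documented.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// collectStrings is a hypothetical helper that appends every string
// value found in v, descending into arrays and objects. Numbers,
// booleans, and nulls are skipped.
func collectStrings(v interface{}, out *[]string) {
	switch t := v.(type) {
	case string:
		*out = append(*out, t)
	case []interface{}:
		for _, item := range t {
			collectStrings(item, out)
		}
	case map[string]interface{}:
		// Go map iteration order is random, so sort keys for stable output.
		keys := make([]string, 0, len(t))
		for k := range t {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		for _, k := range keys {
			collectStrings(t[k], out)
		}
	}
}

// extractStrings decodes one JSON document and returns its string values.
func extractStrings(data []byte) ([]string, error) {
	var v interface{}
	if err := json.Unmarshal(data, &v); err != nil {
		return nil, err
	}
	var out []string
	collectStrings(v, &out)
	return out, nil
}

func main() {
	vals, err := extractStrings([]byte(`{"b":"two","a":["one",1,{"c":"x"}]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(vals)
}
```

For JSONL input, the same function could be applied to each line separately.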
type ODFExtractor ¶
type ODFExtractor struct{}
ODFExtractor handles OpenDocument files (.odt, .ods, .odp).
type RTFExtractor ¶
type RTFExtractor struct{}
RTFExtractor handles Rich Text Format using J45k4/rtf.
type SQLExtractor ¶
type SQLExtractor struct{}
SQLExtractor handles SQL dumps by extracting comments and INSERT values.
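One simple way to pull comments and INSERT values out of a dump is a line-by-line scan, sketched below. The `extractSQLText` helper and the `insertValues` pattern are hypothetical; the sketch assumes `--` line comments and single-line INSERT statements, and it does not handle block comments or statements that span several lines, which the real SQLExtractor may well support.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// insertValues captures everything after VALUES in a single-line INSERT.
var insertValues = regexp.MustCompile(`(?i)^INSERT\s+INTO\s+.+?\bVALUES\s*(.+?);?$`)

// extractSQLText is a hypothetical helper, not this package's API.
// It keeps the text of "--" comments and the value lists of INSERTs,
// and drops every other line.
func extractSQLText(dump string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "--") {
			if c := strings.TrimSpace(strings.TrimPrefix(line, "--")); c != "" {
				out = append(out, c)
			}
			continue
		}
		if m := insertValues.FindStringSubmatch(line); m != nil {
			out = append(out, m[1])
		}
	}
	return out
}

func main() {
	dump := "-- users table\n" +
		"CREATE TABLE users (id INT, name TEXT);\n" +
		"INSERT INTO users (id, name) VALUES (1, 'Ada');\n"
	for _, s := range extractSQLText(dump) {
		fmt.Println(s)
	}
}
```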
type Walker ¶
type Walker struct {
	IgnoreList map[string]bool
	Registry   *ExtractorRegistry
}
Walker recursively walks a directory and sends file contents to a channel.