DirectoryConnector

Ingest text-based files from a directory by extension.

Usage

DirectoryConnector()

Scans a directory recursively and yields files matching the given extensions as Document objects. Binary files (e.g. PDF) are skipped with a note in metadata; full PDF extraction is left to enrichment plugins.
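The behavior described above can be illustrated with a minimal standalone sketch. It does not use talk_box and is not the library's implementation: `SimpleDoc` and `scan_directory` are hypothetical names, and the handling of binary files (yielding a document with empty content and a `note` key in its metadata) is an assumption about what "skipped with a note in metadata" means.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for the library's Document class."""
    title: str
    content: str
    metadata: dict = field(default_factory=dict)


def scan_directory(path, extensions=None) -> Iterator[SimpleDoc]:
    """Recursively yield one SimpleDoc per file whose suffix is in `extensions`."""
    # Mirror the documented default of [".txt", ".md"].
    wanted = set(extensions or [".txt", ".md"])
    root = Path(path).expanduser()
    # rglob("*") walks the whole tree; sorting keeps the order deterministic.
    for file in sorted(root.rglob("*")):
        if not file.is_file() or file.suffix not in wanted:
            continue
        meta = {"path": str(file)}
        try:
            text = file.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Assumption: undecodable (binary) content is dropped and noted.
            text = ""
            meta["note"] = "binary file skipped"
        yield SimpleDoc(title=file.name, content=text, metadata=meta)
```

Because the function is a generator, files are read one at a time as the caller iterates, so a large tree is never loaded into memory at once.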

Parameters

path: str | Path

Path to the directory to scan.

extensions: list[str] | None = None

File extensions to include (e.g. [".txt", ".md", ".py"]). Defaults to [".txt", ".md"].

name: str = "directory"

Connector name (defaults to "directory").

Examples

import talk_box as tb

connector = tb.DirectoryConnector(
    "~/Documents/work/",
    extensions=[".txt", ".md", ".py"],
)
for doc in connector.scan():
    print(doc.title, len(doc.content))

Methods

Name      Description
scan()    Yield documents from the directory tree.

scan()

Yield documents from the directory tree.

Usage

scan()
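
Since scan() is described as yielding documents, it can presumably be consumed lazily like any Python iterator. The sketch below shows that pattern with `itertools.islice`; `fake_scan` is a hypothetical stand-in used so the example runs without talk_box, and it yields plain strings rather than Document objects.

```python
from itertools import islice
from typing import Iterator


def fake_scan() -> Iterator[str]:
    """Hypothetical stand-in for connector.scan(); yields titles lazily."""
    for i in range(1000):
        yield f"file_{i}.txt"


# Take only the first three items; the remaining 997 are never produced.
first_three = list(islice(fake_scan(), 3))
print(first_three)  # ['file_0.txt', 'file_1.txt', 'file_2.txt']
```

With a real connector, the same call would be `islice(connector.scan(), 3)`, which is useful for previewing a large directory before ingesting all of it.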