How to add TXT and CSV files and customize embedfile?
Hi, I just discovered your Embedfile tool, and this is really huge!
I want to use it, but I have some questions.
For example, if I want to add a text file, I run:
all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.txt mybase.db
Can you tell me the required characteristics of the .txt file (encoding: UTF-8? line breaks: CR+LF? ...)?
If I want to add a CSV file:
all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.csv mybase.db
Can you specify the expected CSV format (separator character, encoding, number of columns, column names, ...)?
Last question: how do I create my own Embedfile.exe that bundles a .gguf model? Can I simply binary-copy embedfile.exe + the .gguf file, or is something else needed?
Thank you for your help, and your tool.
Sincerely,
Lionel.
@lbarasc answering your questions using ChatGPT based on the embedfile source code:
TXT File Input Characteristics
Encoding:
UTF-8 is required. The source uses lines_read(?) internally via sqlite-lines, which assumes UTF-8 encoded text.
Line Breaks:
lines_read handles both LF (\n) and CR+LF (\r\n), so files saved on Windows or on Unix-like systems should both work.
Format:
- One logical entry per line.
- No special header or metadata required.
- Empty lines may be treated as empty strings (not skipped unless handled downstream).
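If you generate the .txt file yourself, the points above can be sketched as follows. This is only a minimal illustration, not part of embedfile: the helper name write_entries and the choice to drop blank entries are assumptions made here so that no empty strings end up being embedded.

```python
def write_entries(path, entries):
    """Write one entry per line as UTF-8 with LF endings; return the number written.

    Hypothetical helper for preparing input, not an embedfile API.
    """
    cleaned = []
    for entry in entries:
        # Collapse whitespace runs (including embedded newlines) into single
        # spaces so that each entry stays on exactly one line.
        line = " ".join(str(entry).split())
        if line:  # skip entries that are empty after cleaning
            cleaned.append(line)
    # newline="\n" stops Python from translating LF to CR+LF on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in cleaned:
            f.write(line + "\n")
    return len(cleaned)
```

The resulting file can then be passed to the import command exactly as in the question, for example as mytest.txt.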
CSV File Input Characteristics
Encoding:
Same as TXT: UTF-8. The CSV virtual table (sqlite-csv) reads from file directly using filename, and there's no transcoding.
Character Separator:
Default: comma (,). The code does not specify a custom separator, so only standard comma-separated CSV is supported.
Header:
Required: "CREATE VIRTUAL TABLE temp.source USING csv(filename=\"%w\", header=yes)". If your CSV lacks a header, import will fail or misinterpret the first row.
Column Names:
- Must be valid SQLite identifiers (letters, digits, underscores).
- Avoid duplicate column names.
- Must match the name passed to --embed COLUMN. Use the exact spelling from the header row; SQLite itself compares column names case-insensitively, but an exact match avoids surprises.
Column Count:
No hard limit, but embedfile uses SELECT * FROM temp.source, so the number of columns must match consistently across all rows.
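Before importing, the CSV requirements listed above can be checked with a short script. The sketch below is an assumption-laden pre-flight check, not something embedfile provides: check_csv is a hypothetical helper, and it only verifies what this answer describes (UTF-8, comma separator, a header row containing the --embed column, unique column names, and a consistent field count).

```python
import csv

def check_csv(path, embed_column):
    """Return a list of problems; an empty list means the file looks usable.

    Hypothetical helper. A file that is not valid UTF-8 raises
    UnicodeDecodeError while it is being read.
    """
    problems = []
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)  # the default delimiter is a comma
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        if embed_column not in header:
            problems.append(f"column {embed_column!r} not found in header {header}")
        if len(set(header)) != len(header):
            problems.append("duplicate column names in header")
        for row in reader:
            # Every data row should have as many fields as the header.
            if len(row) != len(header):
                problems.append(
                    f"line {reader.line_num}: expected {len(header)} fields, got {len(row)}"
                )
    return problems
```

For the command in the question, this would be called as check_csv("mytest.csv", "text"), since "text" is the column named after --embed.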
Creating Your Own embedfile With a Custom Model
You can use a process similar to llamafile:
Option 1: zipalign approach (like llamafile)
zipalign -j0 embedfile model.gguf .args
.args should contain CLI arguments like:
-m
model.gguf
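The .args file shown above puts each argument on its own line. If you prefer to generate it from a script, a minimal sketch could look like this; write_args_file is a hypothetical helper, and the one-argument-per-line layout is assumed from the example above rather than confirmed against the embedfile source.

```python
def write_args_file(path, args):
    """Write each argument on its own line (layout assumed from the example above)."""
    for arg in args:
        # An argument containing a newline would be split into two arguments.
        if "\n" in arg or "\r" in arg:
            raise ValueError(f"argument contains a line break: {arg!r}")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(args) + "\n")
```

Calling write_args_file(".args", ["-m", "model.gguf"]) produces the two-line file shown above, ready to be packed with zipalign.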
Option 2: Environment or CLI flag
You can also just do:
embedfile -m ./my-model.gguf import --embed text input.csv output.db
This is equivalent, but less portable than a self-contained binary.
Summary Table
| Format | Encoding | Line Breaks | Special Notes |
|---|---|---|---|
| .txt | UTF-8 | LF or CR+LF | One entry per line. Used with lines_read() |
| .csv | UTF-8 | LF or CR+LF | Comma-separated, header required. No support for custom delimiters. |
| .json/.ndjson | UTF-8 | LF or CR+LF | Structured parsing via json_each() and lines_read(). |
| .db | SQLite DB | n/a | You must provide --table NAME. Currently not implemented |