How to add txt and csv files and customize embedfile?

#1
by lbarasc - opened

Hi, I just discovered your Embedfile tool, and this is really huge!

I want to use it, but I have some questions about it.

For example: if I want to add a text file, I do:

all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.txt mybase.db

Can you tell me the characteristics of the .txt file (encoding: UTF-8? line break: CR+LF? ...)?

If I want to add a CSV file:

all-minilm-l6-v2.f16.embedfile.exe import --embed text mytest.csv mybase.db

Can you specify the format of the CSV (separator character, encoding, number of columns, column names, ...)?

Last question: how do I create my own Embedfile.exe with an added .gguf? Can I simply binary-copy embedfile.exe + gguf, or is something else needed?

Thank you for your help, and your tool.

Sincerely,

Lionel.

@lbarasc answering your questions using ChatGPT based on the embedfile source code:

TXT File Input Characteristics

Encoding:

UTF-8 is required. The source uses lines_read(?) internally via sqlite-lines, which assumes UTF-8 encoded text.

Line Breaks:

lines_read handles both LF (\n) and CR+LF (\r\n) line endings.

Format:

  • One logical entry per line.
  • No special header or metadata required.
  • Empty lines may be treated as empty strings (not skipped unless handled downstream).
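
If your source text is not already in this shape, a small script can normalize it before import. The following is only a minimal Python sketch and is not part of embedfile: the helper names clean_entries and write_utf8_lines are made up for illustration, and dropping blank lines is an assumption on my side (to avoid embedding empty strings), not something embedfile requires.

```python
def clean_entries(text: str) -> list[str]:
    """Split text into one trimmed entry per line, dropping blank lines."""
    # str.splitlines() accepts LF and CR+LF (and a few rarer separators).
    return [line.strip() for line in text.splitlines() if line.strip()]


def write_utf8_lines(entries: list[str], path: str) -> None:
    """Write one entry per line as UTF-8 with LF line endings."""
    # newline="\n" stops Python from translating LF to CR+LF on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for entry in entries:
            f.write(entry + "\n")


# Example usage (file names and the cp1252 source encoding are assumptions):
#   raw = open("notes_raw.txt", encoding="cp1252").read()
#   write_utf8_lines(clean_entries(raw), "mytest.txt")
```

The resulting mytest.txt can then be passed to the import command shown in the question.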

CSV File Input Characteristics

Encoding:

Same as TXT: UTF-8. The CSV virtual table (sqlite-csv) reads the file directly by filename, with no transcoding.

Character Separator:

Default: comma (,). The code does not specify a custom separator, so only standard comma-separated CSV is supported.

Header:

Required: "CREATE VIRTUAL TABLE temp.source USING csv(filename=\"%w\", header=yes)". If your CSV lacks a header, import will fail or misinterpret the first row.

Column Names:

  • Must be valid SQLite identifiers (letters, digits, underscores).
  • Avoid duplicate column names.
  • Must match the --embed COLUMN name. SQLite itself resolves column names case-insensitively, but using the exact same case is the safest choice.

Column Count:

No hard limit, but embedfile uses SELECT * FROM temp.source, so every row must have the same number of columns as the header.
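
One way to produce a CSV that meets the points above (UTF-8, comma-separated, header row, simple unique column names) is Python's standard csv module. This is a hedged sketch rather than anything taken from embedfile: the function names and the "id"/"text" columns are hypothetical, and the identifier check only mirrors the "letters, digits, underscores" rule stated above.

```python
import csv
import io
import re


def is_simple_identifier(name: str) -> bool:
    """True if name uses only letters, digits and underscores (no leading digit)."""
    return re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) is not None


def rows_to_csv(rows: list[dict[str, str]], fieldnames: list[str]) -> str:
    """Render rows as comma-separated CSV text with a header row."""
    if len(set(fieldnames)) != len(fieldnames):
        raise ValueError("duplicate column names")
    for name in fieldnames:
        if not is_simple_identifier(name):
            raise ValueError(f"column name is not a simple identifier: {name!r}")
    buf = io.StringIO()
    # The csv module quotes fields containing commas or quotes automatically.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def save_csv(csv_text: str, path: str) -> None:
    """Save CSV text as UTF-8 without newline translation."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(csv_text)


# Example usage (column names and file name are assumptions):
#   text = rows_to_csv([{"id": "1", "text": "hello, world"}], ["id", "text"])
#   save_csv(text, "mytest.csv")
```

With a column named text in the header, the --embed text argument from the question would refer to that column.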

Creating Your Own embedfile With a Custom Model

You can use a process similar to llamafile:

Option 1: zipalign approach (like llamafile)

zipalign -j0 embedfile model.gguf .args

.args should contain CLI arguments like:

-m
model.gguf
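
If you generate the .args file from a script, something like the following Python sketch would do. It assumes the file simply holds one argument per line, as the example above suggests; render_args and write_args_file are hypothetical helper names, and this does not run zipalign for you.

```python
def render_args(args: list[str]) -> str:
    """Return .args file content with one argument per line."""
    for arg in args:
        if "\n" in arg or "\r" in arg:
            raise ValueError(f"argument contains a line break: {arg!r}")
    return "".join(arg + "\n" for arg in args)


def write_args_file(args: list[str], path: str = ".args") -> None:
    """Write the arguments to path using LF line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(render_args(args))


# Example usage, matching the arguments shown above:
#   write_args_file(["-m", "model.gguf"])
```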

Option 2: CLI flag

You can also just do:

embedfile -m ./my-model.gguf import --embed text input.csv output.db

This is equivalent, but less portable than a self-contained binary.


Summary Table

Format | Encoding | Line Breaks | Special Notes
.txt | UTF-8 | LF or CR+LF | One entry per line. Used with lines_read().
.csv | UTF-8 | LF or CR+LF | Comma-separated, header required. No support for custom delimiters.
.json/.ndjson | UTF-8 | LF or CR+LF | Structured parsing via json_each() and lines_read().
.db | SQLite DB | n/a | You must provide --table NAME. Currently not implemented.
