Language Support

ContractParser analyzes documents in most widely used languages, including those written in non-Latin and right-to-left scripts.

How multilingual analysis works

Every document is analyzed by Anthropic's Claude, a multilingual model. You select the fields you want extracted (in English), and Claude returns the values in the source language of the document. A French contract's "Date de fin" fills the End Date column; a Japanese contract's "契約満了日" fills the same column.
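As a rough illustration of that mapping, the sketch below builds output rows whose column names are the English field names while the values stay in the source language. The function name to_row, the File column, the file names, and the sample date values are all hypothetical; this is not ContractParser's actual output schema.

```python
from typing import Dict, List

# Hypothetical field list chosen by the user (always written in English).
FIELDS: List[str] = ["End Date"]

def to_row(filename: str, extracted: Dict[str, str]) -> Dict[str, str]:
    """Build one output row: English column names, source-language values."""
    row = {"File": filename}
    for field in FIELDS:
        # Values are copied as-is; nothing is translated.
        row[field] = extracted.get(field, "")
    return row

# Made-up sample values standing in for what might be read from each contract.
rows = [
    to_row("bail_fr.pdf", {"End Date": "31 décembre 2025"}),
    to_row("keiyaku_ja.pdf", {"End Date": "2025年12月31日"}),
]

for row in rows:
    print(row)
```

Both rows share the same End Date column even though one value is French and the other Japanese.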

You can also write your custom prompt in any language. The AI responds in the language you used.

Supported languages

Extraction quality is strong for every language listed below. The list is not exhaustive: if your language isn't here, it likely still works.

Script or region	Languages
Latin	English, Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Danish, Norwegian, Finnish, Polish, Czech, Hungarian, Romanian, Turkish, Vietnamese, Indonesian, Tagalog
Cyrillic	Russian, Ukrainian, Bulgarian, Serbian, Belarusian
Greek	Greek
East Asian	Japanese, Simplified Chinese (简体中文), Traditional Chinese (繁體中文), Korean
South & Southeast Asian	Hindi, Bengali, Tamil, Telugu, Marathi, Urdu, Thai
Middle Eastern	Arabic (العربية), Hebrew (עברית), Persian/Farsi

File formats

All supported formats work with non-English content:

PDF: native text PDFs extract with the highest quality. Scanned or image-only PDFs rely on Claude's vision OCR, which is strong for Latin, Cyrillic, and CJK scripts and somewhat weaker for Arabic handwriting and degraded scans.
Word (.docx, .doc): full Unicode support.
HTML, RTF, plain text: handled natively in UTF-8.
Images (PNG, JPG, GIF, WebP): passed to Claude's vision model.

CSV output

Extracted CSVs are UTF-8 encoded and include a byte-order mark (BOM) so that Microsoft Excel detects the encoding and displays non-Latin characters correctly. Google Sheets, LibreOffice Calc, Apple Numbers, and any text editor also handle them without issue.
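If you load an extracted CSV in a script rather than a spreadsheet, the BOM needs to be stripped or it ends up glued to the first column name. A minimal Python sketch, assuming the file has a header row; the column names and sample bytes here are hypothetical stand-ins for a real download:

```python
import csv
import io
from typing import Dict, List

def read_extracted_csv(raw: bytes) -> List[Dict[str, str]]:
    """Decode UTF-8 CSV bytes, dropping a leading BOM if one is present."""
    # "utf-8-sig" removes the BOM; plain "utf-8" would keep it as U+FEFF.
    text = raw.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text, newline="")))

# Hypothetical sample bytes standing in for a downloaded file.
sample = "\ufeffFile,End Date\r\nkeiyaku_ja.pdf,2025年12月31日\r\n".encode("utf-8")

rows = read_extracted_csv(sample)
print(rows[0]["End Date"])

# For comparison: decoding as plain UTF-8 leaves the BOM on the first header.
first_header_plain = sample.decode("utf-8").split(",")[0]
```

When reading from disk, passing encoding="utf-8-sig" and newline="" to open() has the same effect.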

Caveats worth knowing

Token cost for CJK: Chinese, Japanese, and Korean text consumes about 2× the tokens per character compared to English, because a single CJK character often takes one or more tokens while several Latin letters usually share one. Pricing is per page, so what you pay does not change; our own per-page processing cost is simply higher for dense CJK documents.
Scanned handwriting: Latin-script handwriting extracts well. Cursive Arabic, stylized Asian calligraphy, and very degraded scans may extract less accurately. Enable the Verified tier to have a second AI pass flag suspicious extractions.
Right-to-left scripts (Arabic, Hebrew): values in the CSV are correct. The on-site results table renders correctly in modern browsers.
Mixed-language documents: each field is extracted in whichever language that passage of the document uses.
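The token-cost caveat above can be turned into a back-of-the-envelope estimate. The sketch below weights each CJK character at the approximate 2× figure quoted above and every other non-space character at 1. The function names, the detection of CJK characters by Unicode character name, and the flat weighting are assumptions made for illustration; real tokenizer counts will differ.

```python
import unicodedata

# Approximate per-character factor taken from the caveat above.
CJK_FACTOR = 2.0

# Unicode name prefixes treated as CJK in this sketch (an assumption).
CJK_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")

def is_cjk(ch: str) -> bool:
    """Return True if the character's Unicode name marks it as CJK."""
    return unicodedata.name(ch, "").startswith(CJK_NAME_PREFIXES)

def relative_token_weight(text: str) -> float:
    """Rough token load relative to one English character = 1.0."""
    total = 0.0
    for ch in text:
        if ch.isspace():
            continue  # whitespace is ignored in this rough estimate
        total += CJK_FACTOR if is_cjk(ch) else 1.0
    return total

print(relative_token_weight("End Date"))
print(relative_token_weight("契約満了日"))
```

Under these assumptions a five-character Japanese label weighs more than the seven letters of "End Date", which is the effect the caveat describes.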

Something not working?

If you're analyzing a document in a language not covered above and the results are off, email support@grovestreams.com with the input language and a redacted sample. We triage multilingual edge cases within a day.