GitHub – allenai/olmocr: Toolkit for linearizing PDFs for LLM datasets/training

A toolkit for converting PDFs and other image-based document formats into clean, readable plain text.

Features:
– Convert PDF-, PNG-, and JPEG-based documents into clean Markdown
– Support for equations, tables, handwriting, and complex formatting
– Automatically removes headers and footers
– Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
– Efficient: less than $200 USD per million pages converted (based on a 7B-parameter VLM, so it requires a GPU)
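
To put the cost figure in the feature list in perspective, here is a minimal sketch that turns "less than $200 USD per million pages" into an upper bound for a corpus of a given size. The function name `estimate_max_cost_usd` and the example page counts are hypothetical, chosen only for illustration; the sketch is plain arithmetic and does not call olmocr itself. Actual cost depends on the GPU and how it is priced.

```python
# Upper bound taken from the feature list: under $200 USD per million pages.
COST_PER_MILLION_PAGES_USD = 200.0


def estimate_max_cost_usd(num_pages: int) -> float:
    """Return an upper-bound conversion cost in USD for num_pages pages.

    Hypothetical helper: it scales the quoted per-million-pages figure
    linearly and ignores any fixed setup or hardware costs.
    """
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * COST_PER_MILLION_PAGES_USD / 1_000_000


if __name__ == "__main__":
    # Illustrative corpus sizes, not measurements.
    for pages in (1, 10_000, 2_500_000):
        cost = estimate_max_cost_usd(pages)
        print(f"{pages:>9,} pages: at most ${cost:,.4f}")
```

Under this bound a single page costs at most two hundredths of a cent, so for small corpora the practical constraint is more likely access to a GPU than the per-page cost.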
