GitHub – allenai/olmocr: Toolkit for linearizing PDFs for LLM datasets/training
A toolkit for converting PDFs and other image-based document formats into clean, readable plain text.
Features:
– Convert PDF, PNG, and JPEG based documents into clean Markdown
– Support for equations, tables, handwriting, and complex formatting
– Automatically removes headers and footers
– Convert into text with a natural reading order, even in the presence of figures, multi-column layouts, and insets
– Efficient: less than $200 USD per million pages converted (based on a 7B-parameter VLM, so it requires a GPU)
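To put the quoted price in per-job terms, here is a minimal sketch of the arithmetic. It assumes the "less than $200 USD per million pages" figure scales linearly with page count and treats it as an upper bound; the function name `estimate_max_cost_usd` is hypothetical and not part of olmocr.

```python
# Upper-bound rate quoted in the feature list: under $200 USD per million pages.
# Assumption: cost scales linearly with the number of pages.
MAX_USD_PER_MILLION_PAGES = 200.0


def estimate_max_cost_usd(num_pages: int) -> float:
    """Return an upper-bound cost estimate in USD for converting num_pages pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * MAX_USD_PER_MILLION_PAGES / 1_000_000


if __name__ == "__main__":
    # Print the bound for a few illustrative corpus sizes.
    for pages in (1_000, 50_000, 1_000_000):
        print(f"{pages:>9,} pages: at most ${estimate_max_cost_usd(pages):,.2f}")
```

At that rate a single page costs at most $0.0002, so actual spend will depend mostly on GPU pricing and throughput rather than on document count alone.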