Convert a folder of PDFs and Word documents into Markdown to feed into ChatGPT or Claude for analysis.
Extract text and structure from PowerPoint slides and Excel sheets to build a searchable knowledge base.
Transcribe audio files and convert images with OCR into Markdown for processing by text analysis tools.
Batch-convert mixed inputs (documents, images, audio, YouTube URLs) into a unified Markdown format for indexing.
Azure Document Intelligence credentials are needed only for the optional Azure integration, not for basic conversion.
MarkItDown is a Python tool from Microsoft that converts many kinds of files into Markdown so they can be fed into large language models and other text-analysis pipelines. Markdown is a plain-text format that still preserves structure such as headings, lists, tables, and links. The README notes that mainstream LLMs natively understand Markdown well and that the format is token-efficient: the converted text uses comparatively few tokens, which lowers cost when sending it to a model.

The tool currently supports PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, text-based formats like CSV, JSON, and XML, ZIP files (it iterates over the contents), YouTube URLs, EPubs, and more.

You install it with pip and can pull in only the dependencies you need (for example, just PDF and Word). It can be used from the command line, by pointing it at a file and redirecting the output to a Markdown file, or as a Python library. There is also a Docker image and an optional integration with Azure Document Intelligence for higher-quality conversion. A separate plugin called markitdown-ocr adds OCR for embedded images by calling an OpenAI-compatible vision model.

MarkItDown is a good fit when you want to feed mixed real-world documents into an LLM workflow without writing one parser per file format. The README notes that the output targets text analysis rather than human reading, so it may not be ideal for high-fidelity document conversion. It requires Python 3.10 or higher.
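As an illustration of the library usage and the folder-conversion use case described above, here is a minimal sketch rather than anything taken from the repo. The helper names `convert_folder` and `markitdown_converter`, the suffix list, and the `<name>.md` output naming are assumptions made for this example. The only MarkItDown calls used are `MarkItDown()`, `convert(...)`, and the `text_content` attribute of the result; check the README for the current API before relying on them.

```python
from pathlib import Path
from typing import Callable, Iterable

# Suffixes chosen for illustration only; MarkItDown handles more formats.
DEFAULT_SUFFIXES = (".pdf", ".docx", ".pptx", ".xlsx")


def markitdown_converter() -> Callable[[Path], str]:
    """Return a path -> Markdown function backed by MarkItDown.

    The import is done lazily so the rest of this module can be used
    (and tested) without the package installed.
    """
    from markitdown import MarkItDown  # e.g. pip install 'markitdown[all]'

    md = MarkItDown()
    return lambda path: md.convert(str(path)).text_content


def convert_folder(
    src: Path,
    dst: Path,
    convert: Callable[[Path], str],
    suffixes: Iterable[str] = DEFAULT_SUFFIXES,
) -> list[Path]:
    """Convert every matching file under src, mirroring the tree into dst.

    Each input keeps its full name and gains a .md suffix, so report.pdf
    and report.docx do not overwrite each other.
    """
    wanted = {s.lower() for s in suffixes}
    written: list[Path] = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        out = dst / path.relative_to(src).with_name(path.name + ".md")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(convert(path), encoding="utf-8")
        written.append(out)
    return written
```

With the package installed, a call such as `convert_folder(Path("docs"), Path("markdown"), markitdown_converter())` would write one Markdown file per matching document. Passing the converter in as a plain function keeps the folder walk independent of MarkItDown, so a different backend (or a stub in tests) can be swapped in.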