How to read and convert PDFs to Markdown for better RAG results with LLMs
Markdown is a lightweight, easy-to-read markup language for creating formatted text. Many people are probably familiar with Markdown from GitHub’s README.md files.
Here are some basic examples of Markdown syntax:
# Heading level 1
## Heading level 2
### Heading level 3
This is **bold text**.
This is *italicized text*.
> This text is a quote
This is how to do a link [Link Text](https://www.example.org)
```
This text is code
```
| Header 1 | Header 2 |
|------------|------------|
| table data | table data |
Markdown seems to be establishing itself as a popular format for Large Language Models (LLMs).
Markdown has some important advantages, such as [1]:
- It provides structure for headings, tables, lists, links, and more
- It adds typographic emphasis elements such as bold or italics
- It is easy to write and human-readable
- It is already widely used, for example on GitHub and in Jupyter notebooks
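To make the first advantage concrete, here is a minimal sketch of how heading structure can be used to split a Markdown document into sections, for example as retrieval chunks in a RAG pipeline. The function name `split_by_headings` and the returned dictionary keys are hypothetical and chosen only for illustration. The sketch handles only `#`-style headings and skips heading-like lines inside fenced code blocks.

```python
import re

# Matches ATX-style headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
# Code-fence marker, built this way to avoid a literal fence inside this example.
FENCE = "`" * 3


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown text into sections, one per heading (illustrative helper)."""
    sections = []
    current = {"level": 0, "title": "", "lines": []}
    in_fence = False

    for line in markdown.splitlines():
        # Toggle fence state so "# comment" lines in code are not read as headings.
        if line.strip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)

        if match:
            # Close the previous section before starting a new one.
            if current["title"] or current["lines"]:
                sections.append(current)
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if current["title"] or current["lines"]:
        sections.append(current)

    return [
        {"level": s["level"], "title": s["title"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Each returned section keeps its heading level and title, so a chunk can be stored together with the heading it belongs to. Plain text extracted from a PDF without Markdown structure would not offer such natural split points.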
Markdown is useful not only as a format for LLM input documents; it is also how chatbots like ChatGPT format their…
