Improved RAG Document Processing With Markdown | by Dr. Leon Eversberg | Nov, 2024


How to read and convert PDFs to Markdown for better RAG results with LLMs

Dr. Leon Eversberg
Towards Data Science

Markdown is a lightweight, easy-to-read markup language for creating formatted text. Many people are probably familiar with Markdown from GitHub’s README.md files.

Here are some basic examples of Markdown syntax:

# Heading level 1
## Heading level 2
### Heading level 3

This is **bold text**.

This is *italicized text*.

> This text is a quote

This is how to do a link [Link Text](https://www.example.org)

```
This text is code
```

| Header 1 | Header 2 |
|------------|------------|
| table data | table data |

Markdown seems to be establishing itself as a popular format for Large Language Models (LLMs).

Markdown has some important advantages, such as [1]:

  • It provides structure for headings, tables, lists, links, and more
  • It adds typographic emphasis elements such as bold or italics
  • It is easy to write and human-readable
  • It is already widely used, for example on GitHub and in Jupyter notebooks
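To illustrate why the heading structure matters for RAG, here is a minimal sketch that splits a Markdown document into sections at its headings, so each section could later be embedded as its own chunk. The function name `split_by_headings` and the sample document are hypothetical and not taken from any library. The sketch only handles `#`-style headings and skips lines inside fenced code blocks; it is not a full Markdown parser.

```python
import re

# A code fence marker (three backticks), built this way to keep the example readable.
FENCE = "`" * 3

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING_PATTERN = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown_text):
    """Split Markdown text into a list of (heading, body) tuples.

    Text before the first heading is returned with heading None.
    Lines inside fenced code blocks are never treated as headings.
    """
    sections = []
    current_heading = None
    current_lines = []
    in_code_fence = False

    def flush():
        # Store the section collected so far, skipping an empty preamble.
        if current_heading is not None or any(l.strip() for l in current_lines):
            sections.append((current_heading, "\n".join(current_lines).strip()))

    for line in markdown_text.splitlines():
        if line.strip().startswith(FENCE):
            in_code_fence = not in_code_fence
        match = None if in_code_fence else HEADING_PATTERN.match(line)
        if match:
            flush()
            current_heading = match.group(2).strip()
            current_lines = []
        else:
            current_lines.append(line)

    flush()
    return sections


if __name__ == "__main__":
    sample = "# Title\n\nIntro text.\n\n## Setup\n\nInstall it.\n"
    for heading, body in split_by_headings(sample):
        print(heading, "->", body)
```

Keeping the heading together with its body means each chunk carries its own context, which plain text extracted from a PDF usually lacks.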

Markdown is not only useful in the context of LLMs as input documents, but it is also how chatbots like ChatGPT format their


