
Docling: The Document Alchemist


Why do we still wrestle with documents in 2025?

Work in any data-driven organisation, and you’ll encounter a host of PDFs, Word files, PowerPoints, half-scanned images, handwritten notes, and the occasional surprise CSV lurking in a SharePoint folder. Business and data analysts waste hours converting, splitting, and cajoling those formats into something their Python pipelines will accept. Even the latest generative-AI stacks can choke when the underlying text is wrapped inside graphics or sprinkled across irregular table grids.

Docling was born to solve exactly that pain. Released as an open-source project by IBM Research Zurich and now hosted under the Linux Foundation AI & Data Foundation, the library abstracts parsing, layout understanding, OCR, table reconstruction, multimodal export, and even audio transcription behind one reasonably straightforward API and CLI command.

Although Docling supports HTML, MS Office files, image formats and more, we’ll mostly be looking at using it to process PDF files.

As a data scientist or ML engineer, why should I care about Docling?

Often, the real bottleneck isn’t building the model — it’s feeding it. We spend a large percentage of our time on data wrangling, and nothing kills productivity faster than being handed a critical dataset locked inside a 100-page PDF. This is precisely the problem Docling solves, acting as a bridge from the world of unstructured documents directly to the structured sanity of Markdown, JSON, or a Pandas DataFrame. 

But its power extends beyond data extraction, directly into the area of modern, AI-assisted development. Imagine pointing Docling at an HTML page of API specifications; it translates that complex web layout into clean, structured Markdown: the perfect context to feed directly into AI coding assistants like Cursor, ChatGPT, or Claude.
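As a rough sketch of that workflow (the URL below is a placeholder, and Docling’s converter accepts URLs as well as local paths):

from docling.document_converter import DocumentConverter

# Placeholder URL: point this at the API documentation page you actually care about
docs_url = "https://example.com/api-reference.html"

converter = DocumentConverter()
result = converter.convert(docs_url)

# Clean Markdown, ready to paste into Cursor, ChatGPT, or Claude as context
markdown_context = result.document.export_to_markdown()
print(markdown_context[:1000])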

Where Docling came from

The project originated within IBM’s Deep Search team, which was developing retrieval-augmented generation (RAG) pipelines for long patent PDFs. They open-sourced the core under an MIT license in late 2024 and have been shipping weekly releases ever since. A vibrant community quickly formed around its unified DoclingDocument model, a Pydantic object that keeps text, images, tables, formulas, and layout metadata together so downstream tools like LangChain, LlamaIndex, or Haystack don’t have to guess a page’s reading order.

Today, Docling integrates visual-language models (VLMs), such as SmolDocling, for figure captioning. It also supports Tesseract, EasyOCR, and RapidOCR for text extraction and ships recipes for chunking, serialisation, and vector-store ingestion. In other words: you point it at a folder, and you get Markdown, HTML, CSV, PNGs, JSON, or just a ready-to-embed Python object — no extra scaffolding code required. 
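As a sketch of that chunking route (the import path and chunk interface below follow the project’s documented HybridChunker usage, so treat the exact signature as indicative rather than authoritative):

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

# Convert any supported input; "report.pdf" is just a placeholder path
result = DocumentConverter().convert("report.pdf")

# Split the unified DoclingDocument into embedding-ready chunks
chunker = HybridChunker()
for chunk in chunker.chunk(dl_doc=result.document):
    # Each chunk carries its text (plus provenance metadata) for vector-store ingestion
    print(chunk.text[:80])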

What we’ll do 

To showcase Docling, we’ll first install it and then use it with three different examples that demonstrate its versatility and usefulness as a document parser and processor. Please note that using Docling is quite computationally intensive, so it will be helpful if you have access to a GPU on your system.

However, before we start coding, we need to set up a development environment.

Setting up a development environment

I’ve started using the UV package manager for this now, but feel free to use whichever tools you’re most comfortable with. Note also that I’ll be working under WSL2 Ubuntu for Windows and running my code using a Jupyter Notebook. 

Note that even using UV, the commands below took a couple of minutes to complete on my system, as it’s a pretty hefty set of library installs.

$ uv init docling
Initialized project `docling` at `/home/tom/docling`
$ cd docling
$ uv venv
Using CPython 3.11.10 interpreter at: /home/tom/miniconda3/bin/python
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ source .venv/bin/activate
(docling) $ uv pip install docling pandas jupyter
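Docling pulls in PyTorch through its model dependencies, so once the install finishes you can run a quick, purely optional check from Python (for example, in the notebook we’re about to launch) to see whether those models will be able to use your GPU:

import torch

# True means Docling's layout and table models can run on CUDA; False means CPU-only
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))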

Now type in the command,

(docling) $ jupyter notebook

And you should see a notebook open in your browser. If that doesn’t happen automatically, check the terminal output from the jupyter notebook command; near the bottom you’ll find a URL to copy and paste into your browser to launch the Jupyter Notebook.

Your URL will be different to mine, but it should look something like this:

http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d

Example 1: Convert any PDF or DOCX to Markdown or JSON

The simplest use case is also the one you’ll reach for most of the time: turning a document’s text into Markdown.

For most of our examples, our input PDF will be one I’ve used several times before for different tests. It is a copy of Tesla’s 10-Q SEC filing from September 2023. It is approximately fifty pages long and consists mainly of financial information related to Tesla. The full document is publicly available on the Securities & Exchange Commission (SEC) website, where it can be viewed or downloaded.

Here is an image of the first page of that document for your reference.

Image from Tesla 10-Q PDF

Let’s review the Docling code we need to convert the PDF into Markdown. It sets up the file path for the input PDF, runs it through the DocumentConverter class, and then exports the parsed result to Markdown format so that the content can be more easily read, edited, or analysed.

from pathlib import Path

from docling.document_converter import DocumentConverter

# Location of the input PDF
inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"

doc_path = Path(inpath) / infile

# Run the full conversion pipeline (layout analysis, table structure, OCR where needed)
converter = DocumentConverter()
result = converter.convert(doc_path)

# Export the parsed document to Markdown
markdown_text = result.document.export_to_markdown()

This is the output we get from running the above code (just the first page).

## UNITED STATES SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549 FORM 10-Q

(Mark One)

- x QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the quarterly period ended September 30, 2023

OR

- o TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the transition period from \_\_\_\_\_\_\_\_\_ to \_\_\_\_\_\_\_\_\_

Commission File Number: 001-34756

## Tesla, Inc.

(Exact name of registrant as specified in its charter)

Delaware

(State or other jurisdiction of incorporation or organization)

1 Tesla Road Austin, Texas

(Address of principal executive offices)

## (512) 516-8177

(Registrant's telephone number, including area code)

## Securities registered pursuant to Section 12(b) of the Act:

| Title of each class   | Trading Symbol(s)   | Name of each exchange on which registered   |
|-----------------------|---------------------|---------------------------------------------|
| Common stock          | TSLA                | The Nasdaq Global Select Market             |

Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 ('Exchange Act') during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. Yes x No o

Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes x No o

Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of 'large accelerated filer,' 'accelerated filer,' 'smaller reporting company' and 'emerging growth company' in Rule 12b-2 of the Exchange Act:

Large accelerated filer

x

Accelerated filer

Non-accelerated filer

o

Smaller reporting company

Emerging growth company

o

If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. o

Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes o No x

As of October 16, 2023, there were 3,178,921,391 shares of the registrant's common stock outstanding.

With the rise of AI code editors and LLMs in general, this technique has become far more valuable. The efficacy of LLMs and code editors can be significantly enhanced by providing them with appropriate context, which often means supplying the textual representation of a tool or framework’s documentation, API reference, and coding examples.
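If you’d rather hand that context to an assistant as a file instead of a paste, saving the Markdown to disk is a one-liner with plain pathlib (the filename here is just my choice):

from pathlib import Path

# Write the converted Markdown out so it can be attached or pasted as LLM context
Path("tesla_10q.md").write_text(markdown_text, encoding="utf-8")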

Converting a parsed PDF to JSON is also straightforward: just add these two lines of code. The JSON output can be very large, so the print statement below only shows the first 10,000 characters; adjust it as needed.

json_blob = result.document.model_dump_json(indent=2)

print(json_blob[:10000], "…")
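Because the underlying document model is a Pydantic object, that JSON also round-trips. The sketch below assumes the DoclingDocument class exported by the docling-core package, so treat the import as indicative:

from pathlib import Path
from docling_core.types.doc import DoclingDocument

# Persist the full structured representation to disk ...
out_path = Path("tesla_10q.json")
out_path.write_text(json_blob, encoding="utf-8")

# ... and rebuild the same document later without re-parsing the PDF
reloaded = DoclingDocument.model_validate_json(out_path.read_text(encoding="utf-8"))
print(reloaded.export_to_markdown()[:500])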

Example 2: Extract complex tables from a PDF

PDFs often store tables as isolated text chunks or, worse, as flattened images. Docling’s table-structure model reassembles rows, columns, and spanning cells, giving you either a Pandas DataFrame or a ready-to-save CSV. Our test input PDF contains many tables; page 11, for example, contains the one shown below.

Image from Tesla 10-Q PDF

Let’s see if we can extract that data. The code is slightly more complex than in our first example, but it’s doing more work. The PDF is converted again using Docling’s DocumentConverter class, producing a structured document representation. Then, for each table detected, the code converts the table into a Pandas DataFrame and retrieves its page number from the document’s provenance metadata. If the table comes from page 11, it prints it in Markdown format and breaks the loop (so only the first matching table is shown).

import pandas as pd
from docling.document_converter import DocumentConverter
from time import time
from pathlib import Path

inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"
data_folder = Path(inpath)
input_doc_path = data_folder / infile

doc_converter = DocumentConverter()
start_time = time()
conv_res = doc_converter.convert(input_doc_path)

# Export table from page 11
for table_ix, table in enumerate(conv_res.document.tables):
    page_number = table.prov[0].page_no if table.prov else "Unknown"
    if page_number == 11:
        table_df: pd.DataFrame = table.export_to_dataframe()
        print(f"## Table {table_ix} (Page {page_number})")
        print(table_df.to_markdown())
        break

elapsed = time() - start_time
print(f"Document converted and tables exported in {elapsed:.2f} seconds.")

And the output is not too shabby.

## Table 10 (Page 11)
|    |                                        | Three Months Ended September 30,.2023   | Three Months Ended September 30,.2022   | Nine Months Ended September 30,.2023   | Nine Months Ended September 30,.2022   |
|---:|:---------------------------------------|:----------------------------------------|:----------------------------------------|:---------------------------------------|:---------------------------------------|
|  0 | Automotive sales                       | $ 18,582                                | $ 17,785                                | $ 57,879                               | $ 46,969                               |
|  1 | Automotive regulatory credits          | 554                                     | 286                                     | 1,357                                  | 1,309                                  |
|  2 | Energy generation and storage sales    | 1,416                                   | 966                                     | 4,188                                  | 2,186                                  |
|  3 | Services and other                     | 2,166                                   | 1,645                                   | 6,153                                  | 4,390                                  |
|  4 | Total revenues from sales and services | 22,718                                  | 20,682                                  | 69,577                                 | 54,854                                 |
|  5 | Automotive leasing                     | 489                                     | 621                                     | 1,620                                  | 1,877                                  |
|  6 | Energy generation and storage leasing  | 143                                     | 151                                     | 409                                    | 413                                    |
|  7 | Total revenues                         | $ 23,350                                | $ 21,454                                | $ 71,606                               | $ 57,144                               |
Document converted and tables exported in 33.43 seconds.

To retrieve ALL the tables from a PDF, omit the if page_number == 11 check (and the break statement) from my code.
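Something like this (reusing the conv_res object from the code above; the output folder name is just my choice) would then save every detected table as its own CSV file:

from pathlib import Path

output_dir = Path("extracted_tables")
output_dir.mkdir(exist_ok=True)

# Loop over every table Docling detected, regardless of page, and save each as CSV
for table_ix, table in enumerate(conv_res.document.tables):
    table_df = table.export_to_dataframe()
    csv_path = output_dir / f"table_{table_ix:02d}.csv"
    table_df.to_csv(csv_path, index=False)
    print(f"Saved {csv_path}")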

One thing I have noticed with Docling is that it’s not fast. As shown above, it took almost 34 seconds to convert the 50-page PDF and pull out that single table.

Example 3: Perform OCR on an image

For this example, I scanned a random page from the Tesla 10-Q PDF and saved it as a PNG file. Let’s see how Docling copes with reading that image and converting what it finds into markdown. Here is my scanned image.

Image from Tesla 10-Q PDF

And here is our code. We use Tesseract as our OCR engine (others are available).

from pathlib import Path
import time
import pandas as pd

from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.models.tesseract_ocr_cli_model import TesseractCliOcrOptions


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"

    input_doc_path = Path(inpath) / infile

    # Configure OCR for image input
    image_options = ImageFormatOption(
        ocr_options=TesseractCliOcrOptions(force_full_page_ocr=True),
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )

    converter = DocumentConverter(
        format_options={"image": image_options}
    )

    start_time = time.time()

    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()

Here is our output.

--- Table 1 (Page 1) ---
|                          |   Three Months Ended September J0,. | Three Months Ended September J0,.2022   | Nine Months Ended September J0,.2023   | Nine Months Ended September J0,.2022   |
|:-------------------------|------------------------------------:|:----------------------------------------|:---------------------------------------|:---------------------------------------|
| Cost ol revenves         |                                 181 | 150                                     | 554                                    | 424                                    |
| Research an0 developrent |                                 189 | 124                                     | 491                                    | 389                                    |
|                          |                                  95 |                                         | 2B3                                    | 328                                    |
| Total                    |                                 465 | 362                                     | 1,328                                  | 1,141                                  |

--- Full Document (Markdown) ---
## Note 8 Equity Incentive Plans

## Other Pertormance-Based Grants

("RSUs") und stock optlons unrecognized stock-based compensatian

## Summary Stock-Based Compensation Information

|                          | Three Months Ended September J0,   | Three Months Ended September J0,   | Nine Months Ended September J0,   | Nine Months Ended September J0,   |
|--------------------------|------------------------------------|------------------------------------|-----------------------------------|-----------------------------------|
|                          |                                    | 2022                               | 2023                              | 2022                              |
| Cost ol revenves         | 181                                | 150                                | 554                               | 424                               |
| Research an0 developrent | 189                                | 124                                | 491                               | 389                               |
|                          | 95                                 |                                    | 2B3                               | 328                               |
| Total                    | 465                                | 362                                | 1,328                             | 1,141                             |

## Note 9 Commitments and Contingencies

## Operating Lease Arrangements In Buffalo, New York and Shanghai, China

## Legal Proceedings

Between september 1 which 2021 pald has

Processing completed in 7.64 seconds

If you compare this output to the original image, the results are disappointing. A lot of the text in the image was just missed or garbled. This is where a product like AWS Textract comes into its own, as it excels at extracting text from a wide range of sources. 

However, Docling does provide various options for OCR, so if you receive poor results from one system, you can always switch to another.

I attempted the same task using EasyOCR, but the results weren’t significantly different from those obtained with Tesseract. If you’d like to try it out, here is the code.

from pathlib import Path
import time
import pandas as pd

from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.models.easyocr_model import EasyOcrOptions  # Import EasyOCR options


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"

    input_doc_path = Path(inpath) / infile

    # Configure image pipeline with EasyOCR
    image_options = ImageFormatOption(
        ocr_options=EasyOcrOptions(force_full_page_ocr=True),  # use EasyOCR
        do_table_structure=True,
        table_structure_options={"do_cell_matching": True},
    )

    converter = DocumentConverter(
        format_options={"image": image_options}
    )

    start_time = time.time()

    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()

Summary

The generative-AI boom re-ignited an old truth: garbage in, garbage out. LLMs hallucinate less when they ingest semantically and spatially coherent input. Docling provides that coherence (most of the time) across whatever source formats your stakeholders throw at you, and it does so locally and reproducibly.

Docling has its uses beyond the AI world, though. Consider the vast number of documents stored in locations such as bank vaults, solicitors’ offices, and insurance companies worldwide. If these are to be digitised, Docling may provide some of the solutions for that.

Its biggest weakness is probably the Optical Character Recognition of text within images. I tried using Tesseract and EasyOCR, and the results from both were disappointing. You’ll probably need to use a commercial product like AWS Textract if you want to reliably reproduce text from those types of sources.

It can also be slow. I have a fairly high-spec desktop PC with a GPU, and Docling still took a noticeable amount of time on most of the tasks I set it. However, if your input documents are primarily PDFs, Docling could be a valuable addition to your text-processing toolbox.

I have only scratched the surface of what Docling is capable of, and I encourage you to visit the project’s homepage to learn more.
