Docling: The Document Alchemist | Towards Data Science

By AiNEWS2025
2025-09-13
in Machine Learning


Why do we still wrestle with documents in 2025?

Step into any data-driven organisation, and you’ll encounter a host of PDFs, Word files, PowerPoints, half-scanned images, handwritten notes, and the occasional surprise CSV lurking in a SharePoint folder. Business and data analysts waste hours converting, splitting, and cajoling those formats into something their Python pipelines will accept. Even the latest generative-AI stacks can choke when the underlying text is wrapped inside graphics or sprinkled across irregular table grids.

Docling was born to solve exactly that pain. Released as an open-source project by IBM Research Zurich and now hosted under the Linux Foundation AI & Data Foundation, the library abstracts parsing, layout understanding, OCR, table reconstruction, multimodal export, and even audio transcription behind one reasonably straightforward API and CLI command.

Although Docling can process HTML, MS Office files, image formats, and others, we’ll mostly be using it to process PDF files.

As a data scientist or ML engineer, why should I care about Docling?

Often, the real bottleneck isn’t building the model — it’s feeding it. We spend a large percentage of our time on data wrangling, and nothing kills productivity faster than being handed a critical dataset locked inside a 100-page PDF. This is precisely the problem Docling solves, acting as a bridge from the world of unstructured documents directly to the structured sanity of Markdown, JSON, or a Pandas DataFrame. 

But its power extends beyond data extraction into modern, AI-assisted development. Imagine pointing Docling at an HTML page of API specifications: it translates that complex web layout into clean, structured Markdown, the perfect context to feed directly into AI coding assistants like Cursor, ChatGPT, or Claude.

Where Docling came from

The project originated within IBM’s Deep Search team, which was developing retrieval-augmented generation (RAG) pipelines for long patent PDFs. They open-sourced the core under an MIT license in late 2024 and have been shipping weekly releases ever since. A vibrant community quickly formed around its unified DoclingDocument model, a Pydantic object that keeps text, images, tables, formulas, and layout metadata together so downstream tools like LangChain, LlamaIndex, or Haystack don’t have to guess a page’s reading order.

Today, Docling integrates visual-language models (VLMs), such as SmolDocling, for figure captioning. It also supports Tesseract, EasyOCR, and RapidOCR for text extraction and ships recipes for chunking, serialisation, and vector-store ingestion. In other words: you point it at a folder, and you get Markdown, HTML, CSV, PNGs, JSON, or just a ready-to-embed Python object — no extra scaffolding code required. 

What we’ll do 

To showcase Docling, we’ll first install it and then use it with three different examples that demonstrate its versatility and usefulness as a document parser and processor. Please note that using Docling is quite computationally intensive, so it will be helpful if you have access to a GPU on your system.

However, before we start coding, we need to set up a development environment.

Setting up a development environment

I’ve started using the UV package manager for this now, but feel free to use whichever tools you’re most comfortable with. Note also that I’ll be working under WSL2 Ubuntu for Windows and running my code using a Jupyter Notebook. 

Note that, even with uv, the install commands below took a couple of minutes to complete on my system, as it’s a pretty hefty set of libraries.

$ uv init docling
Initialized project `docling` at `/home/tom/docling`
$ cd docling
$ uv venv
Using CPython 3.11.10 interpreter at: /home/tom/miniconda3/bin/python
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
$ source .venv/bin/activate
(docling) $ uv pip install docling pandas jupyter

Now type in the command,

(docling) $ jupyter notebook

And you should see a notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after running the Jupyter Notebook command. Near the bottom, you will find a URL to copy and paste into your browser to launch the Jupyter Notebook.

Your URL will be different to mine, but it should look something like this:-

http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69d

Example 1: Convert any PDF or DOCX to Markdown or JSON

The simplest use case is also the one you’ll use most often: turning a document’s text into Markdown.

For most of our examples, our input PDF will be one I’ve used several times before for different tests. It is a copy of Tesla’s 10-Q SEC filing document from September 2023. It is approximately fifty pages long and consists mainly of financial information related to Tesla. The full document is publicly available on the Securities & Exchange Commission (SEC) website and can be viewed/downloaded using this link.

Here is an image of the first page of that document for your reference.

Image from Tesla 10-Q PDF

Let’s review the Docling code we need to convert the PDF into Markdown. It sets up the file path for the input PDF, runs a DocumentConverter instance on it, and then exports the parsed result to Markdown so that the content can be more easily read, edited, or analysed.

from pathlib import Path

from docling.document_converter import DocumentConverter

inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"

data_folder = Path(inpath)

doc_path = data_folder / infile

# Parse the PDF; convert() returns a ConversionResult
converter = DocumentConverter()
result    = converter.convert(doc_path)

# Export the parsed document as Markdown and display it
markdown_text = result.document.export_to_markdown()
print(markdown_text)

This is the output we get from running the above code (just the first page).

## UNITED STATES SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549 FORM 10-Q

(Mark One)

- x QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the quarterly period ended September 30, 2023

OR

- o TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934

For the transition period from \_\_\_\_\_\_\_\_\_ to \_\_\_\_\_\_\_\_\_

Commission File Number: 001-34756

## Tesla, Inc.

(Exact name of registrant as specified in its charter)

Delaware

(State or other jurisdiction of incorporation or organization)

1 Tesla Road Austin, Texas

(Address of principal executive offices)

## (512) 516-8177

(Registrant's telephone number, including area code)

## Securities registered pursuant to Section 12(b) of the Act:

| Title of each class   | Trading Symbol(s)   | Name of each exchange on which registered   |
|-----------------------|---------------------|---------------------------------------------|
| Common stock          | TSLA                | The Nasdaq Global Select Market             |

Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 ('Exchange Act') during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. Yes x No o

Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). Yes x No o

Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of 'large accelerated filer,' 'accelerated filer,' 'smaller reporting company' and 'emerging growth company' in Rule 12b-2 of the Exchange Act:

Large accelerated filer

x

Accelerated filer

Non-accelerated filer

o

Smaller reporting company

Emerging growth company

o

If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. o

Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes o No x

As of October 16, 2023, there were 3,178,921,391 shares of the registrant's common stock outstanding.

With the rise of AI code editors and the use of LLMs in general, this technique has become significantly more valuable and relevant. The efficacy of LLMs and code editors can be significantly enhanced by providing them with appropriate context. Often this will entail supplying them with the textual representation of a particular tool or framework’s documentation, API and coding examples.
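One way to put this into practice is to split the exported Markdown into heading-delimited sections, so that only the relevant parts are handed to an LLM as context. The sketch below is a minimal illustration using only the standard library. The function name split_markdown_by_heading and the sample text are hypothetical and not part of Docling, whose own chunking recipes are likely a better fit for production use.

```python
import re

HEADING = re.compile(r"^#{1,6} ")  # Markdown headings such as "## Title"


def split_markdown_by_heading(markdown_text: str) -> list:
    """Return a list of (heading, body) tuples, one per Markdown heading."""
    sections = []
    heading, body = "(preamble)", []
    for line in markdown_text.splitlines():
        if HEADING.match(line):
            text = "\n".join(body).strip()
            # Keep the previous section unless it is an empty preamble
            if text or heading != "(preamble)":
                sections.append((heading, text))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    text = "\n".join(body).strip()
    if text or heading != "(preamble)":
        sections.append((heading, text))
    return sections


# Hypothetical sample, shaped like the Markdown output shown earlier
sample = """## Tesla, Inc.

(Exact name of registrant as specified in its charter)

## (512) 516-8177

(Registrant's telephone number, including area code)
"""

sections = split_markdown_by_heading(sample)
for heading, body in sections:
    print(f"{heading} -> {body}")
```

In the first example, you would pass markdown_text to this function instead of the sample string.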

Converting a PDF to JSON is also straightforward. Just add these two lines of code. The JSON output can be very large, so the print statement shows only the first 10,000 characters; adjust the slice as needed.

json_blob = result.document.model_dump_json(indent=2)

print(json_blob[:10000], "…")
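Rather than printing a large slice of raw JSON, it can be easier to get a feel for the structure first. The following is a small sketch, not a Docling feature: summarise_json is a hypothetical helper, and the sample blob is a made-up stand-in, so the key names in a real Docling export will differ.

```python
import json


def summarise_json(json_blob: str, max_chars: int = 200) -> dict:
    """Describe each top-level key of a JSON object without dumping everything."""
    data = json.loads(json_blob)
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            # For containers, report only the type and the number of entries
            summary[key] = f"{type(value).__name__} with {len(value)} items"
        else:
            text = repr(value)
            summary[key] = text if len(text) <= max_chars else text[:max_chars] + "..."
    return summary


# Made-up stand-in for a real export; real key names are not assumed here
sample_blob = json.dumps({"name": "example", "items": [1, 2, 3], "meta": {"a": 1}})
summary = summarise_json(sample_blob)
for key, description in summary.items():
    print(f"{key}: {description}")
```

With the code above, you would call summarise_json(json_blob) on the string returned by model_dump_json.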

Example 2: Extract complex tables from a PDF

PDFs often store tables as isolated text chunks or, worse, as flattened images. Docling’s table-structure model reassembles rows, columns, and spanning cells, giving you either a Pandas DataFrame or a ready-to-save CSV. Our test input PDF has many tables. Page 11 of the PDF, for example, contains the table below.

Image from Tesla 10-Q PDF

Let’s see if we can extract that data. It’s slightly more complex code than in our first example, but it’s doing more work. The PDF is converted again using Docling’s DocumentConverter class, producing a structured document representation. Then, for each table detected, it retrieves the page number of the table from the document’s provenance metadata. If the table comes from page 11, it transforms the table into a Pandas DataFrame, prints it out in Markdown format, and then breaks the loop (so only the first matching table is shown).

import pandas as pd
from docling.document_converter import DocumentConverter
from time import time
from pathlib import Path

inpath = "/mnt/d//tesla"
infile = "tesla_q10_sept_23.pdf"
data_folder = Path(inpath)
input_doc_path = data_folder / infile

doc_converter = DocumentConverter()
start_time = time()
conv_res = doc_converter.convert(input_doc_path)

# Export table from page 11
for table_ix, table in enumerate(conv_res.document.tables):
    page_number = table.prov[0].page_no if table.prov else "Unknown"
    if page_number == 11:
        table_df: pd.DataFrame = table.export_to_dataframe()
        print(f"## Table {table_ix} (Page {page_number})")
        print(table_df.to_markdown())
        break

end_time = time() - start_time
print(f"Document converted and tables exported in {end_time:.2f} seconds.")

And the output is not too shabby.

## Table 10 (Page 11)
|    |                                        | Three Months Ended September 30,.2023   | Three Months Ended September 30,.2022   | Nine Months Ended September 30,.2023   | Nine Months Ended September 30,.2022   |
|---:|:---------------------------------------|:----------------------------------------|:----------------------------------------|:---------------------------------------|:---------------------------------------|
|  0 | Automotive sales                       | $ 18,582                                | $ 17,785                                | $ 57,879                               | $ 46,969                               |
|  1 | Automotive regulatory credits          | 554                                     | 286                                     | 1,357                                  | 1,309                                  |
|  2 | Energy generation and storage sales    | 1,416                                   | 966                                     | 4,188                                  | 2,186                                  |
|  3 | Services and other                     | 2,166                                   | 1,645                                   | 6,153                                  | 4,390                                  |
|  4 | Total revenues from sales and services | 22,718                                  | 20,682                                  | 69,577                                 | 54,854                                 |
|  5 | Automotive leasing                     | 489                                     | 621                                     | 1,620                                  | 1,877                                  |
|  6 | Energy generation and storage leasing  | 143                                     | 151                                     | 409                                    | 413                                    |
|  7 | Total revenues                         | $ 23,350                                | $ 21,454                                | $ 71,606                               | $ 57,144                               |
Document converted and tables exported in 33.43 seconds.
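The cell values in the table above come back as strings such as "$ 18,582", so they need cleaning before you can do arithmetic on them. Here is a minimal sketch of one way to do that. The helper name parse_amount is my own, and the handling of bracketed negatives is an assumption about how such filings usually format them, not something Docling does for you.

```python
import re
from typing import Union


def parse_amount(cell: str) -> Union[int, float, None]:
    """Turn strings like '$ 18,582' into numbers; return None for non-numeric cells."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    # Assumption: accounting-style negatives are written in brackets, e.g. "(123)"
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1].strip()
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned) if "." in cleaned else int(cleaned)
    return -value if negative else value


# Values copied from the table output above
for raw in ["$ 18,582", "554", "Automotive sales"]:
    print(repr(raw), "->", parse_amount(raw))
```

On a DataFrame, something like table_df[column].map(parse_amount) would apply the same cleaning to a whole column.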

To retrieve all the tables from a PDF, remove the if page_number == 11 check and the break statement from my code, so that every table in the loop is processed.
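As a rough sketch of what exporting every table might look like, the helper below writes each one to its own CSV file. It relies only on the attributes already used in the code above (tables, prov, page_no, export_to_dataframe). The function name save_tables_as_csv and the file-naming scheme are hypothetical choices of mine.

```python
from pathlib import Path


def save_tables_as_csv(document, out_dir) -> list:
    """Write every table in a converted document to out_dir, one CSV per table."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for table_ix, table in enumerate(document.tables):
        # Same provenance lookup as in the page-11 example
        page_number = table.prov[0].page_no if table.prov else "unknown"
        table_df = table.export_to_dataframe()
        csv_path = out_dir / f"table_{table_ix}_page_{page_number}.csv"
        table_df.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```

Following the conversion in the example above, you would call save_tables_as_csv(conv_res.document, "tables") and get back the list of file paths that were written.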

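One thing I have noticed with Docling is that it’s not fast. As shown above, it took almost 34 seconds to produce that single table from a 50-page PDF. Note that this figure includes converting the whole document, since the timer starts before the convert call.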
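Given how long a conversion takes, it may be worth caching the result so that re-running a notebook doesn’t reconvert an unchanged file. The sketch below shows one possible approach, keyed on a hash of the file contents. The names cached_markdown, convert_fn, and the .docling_cache folder are all hypothetical; nothing here is built into Docling.

```python
import hashlib
from pathlib import Path


def cached_markdown(doc_path, convert_fn, cache_dir=".docling_cache") -> str:
    """Return Markdown for doc_path, calling convert_fn only on a cache miss."""
    doc_path = Path(doc_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the cache on the file contents, so an edited file is converted again
    digest = hashlib.sha256(doc_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.md"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    markdown_text = convert_fn(doc_path)
    cache_file.write_text(markdown_text, encoding="utf-8")
    return markdown_text
```

With the converter from the first example, convert_fn could be lambda p: converter.convert(p).document.export_to_markdown(), so the slow call only happens the first time a given file is seen.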

Example 3: Perform OCR on an image

For this example, I scanned a random page from the Tesla 10-Q PDF and saved it as a PNG file. Let’s see how Docling copes with reading that image and converting what it finds into markdown. Here is my scanned image.

Image from Tesla 10-Q PDF

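And our code. We use Tesseract as our OCR engine (others are available). Note that this option drives the Tesseract command-line tool, so the tesseract binary needs to be installed on your system.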

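from pathlib import Path
import time
import pandas as pd

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, ImageFormatOption


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"

    input_doc_path = Path(inpath) / infile

    # Configure OCR for image input. The settings live on a pipeline-options
    # object, which is then handed to ImageFormatOption.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)

    converter = DocumentConverter(
        format_options={
            InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
        }
    )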

    start_time = time.time()

    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()

Here is our output.

--- Table 1 (Page 1) ---
|                          |   Three Months Ended September J0,. | Three Months Ended September J0,.2022   | Nine Months Ended September J0,.2023   | Nine Months Ended September J0,.2022   |
|:-------------------------|------------------------------------:|:----------------------------------------|:---------------------------------------|:---------------------------------------|
| Cost ol revenves         |                                 181 | 150                                     | 554                                    | 424                                    |
| Research an0 developrent |                                 189 | 124                                     | 491                                    | 389                                    |
|                          |                                  95 |                                         | 2B3                                    | 328                                    |
| Total                    |                                 465 | 362                                     | 1,328                                  | 1,141                                  |

--- Full Document (Markdown) ---
## Note 8 Equity Incentive Plans

## Other Pertormance-Based Grants

("RSUs") und stock optlons unrecognized stock-based compensatian

## Summary Stock-Based Compensation Information

|                          | Three Months Ended September J0,   | Three Months Ended September J0,   | Nine Months Ended September J0,   | Nine Months Ended September J0,   |
|--------------------------|------------------------------------|------------------------------------|-----------------------------------|-----------------------------------|
|                          |                                    | 2022                               | 2023                              | 2022                              |
| Cost ol revenves         | 181                                | 150                                | 554                               | 424                               |
| Research an0 developrent | 189                                | 124                                | 491                               | 389                               |
|                          | 95                                 |                                    | 2B3                               | 328                               |
| Total                    | 465                                | 362                                | 1,328                             | 1,141                             |

## Note 9 Commitments and Contingencies

## Operating Lease Arrangements In Buffalo, New York and Shanghai, China

## Legal Proceedings

Between september 1 which 2021 pald has

Processing completed in 7.64 seconds

If you compare this output to the original image, the results are disappointing. A lot of the text in the image was just missed or garbled. This is where a product like AWS Textract comes into its own, as it excels at extracting text from a wide range of sources. 

However, Docling does provide various options for OCR, so if you receive poor results from one system, you can always switch to another.
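When comparing OCR engines, it helps to put a number on how close each one gets rather than eyeballing the output. Below is a minimal sketch using the standard library’s difflib. The OCR strings are taken from the output above, while the reference strings are my own reading of the corresponding labels in the filing, so treat them as assumed ground truth.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, reference_text: str) -> float:
    """Return a 0..1 similarity ratio between OCR output and a reference string."""
    return SequenceMatcher(None, ocr_text.lower(), reference_text.lower()).ratio()


# (OCR output, assumed correct text) pairs
pairs = [
    ("Cost ol revenves", "Cost of revenues"),
    ("Research an0 developrent", "Research and development"),
]

for ocr_text, reference_text in pairs:
    score = similarity(ocr_text, reference_text)
    print(f"{ocr_text!r} vs {reference_text!r}: {score:.2f}")
```

Running the same comparison on the output of each OCR engine gives a quick, if crude, way to decide which one suits your documents best.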

I attempted the same task using EasyOCR, but the results weren’t significantly different from those obtained with Tesseract. If you’d like to try it out, here is the code.

from pathlib import Path
import time
import pandas as pd

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, ImageFormatOption


def main():
    inpath = "/mnt/d//tesla"
    infile = "10q-image.png"

    input_doc_path = Path(inpath) / infile

    # Configure the image pipeline with EasyOCR. The settings live on a
    # pipeline-options object, which is then handed to ImageFormatOption.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)  # use EasyOCR

    converter = DocumentConverter(
        format_options={
            InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()

    conv_res = converter.convert(input_doc_path).document

    # Print all tables as Markdown
    for table_ix, table in enumerate(conv_res.tables):
        table_df: pd.DataFrame = table.export_to_dataframe(doc=conv_res)
        page_number = table.prov[0].page_no if table.prov else "Unknown"
        print(f"\n--- Table {table_ix+1} (Page {page_number}) ---")
        print(table_df.to_markdown(index=False))

    # Print full document text as Markdown
    print("\n--- Full Document (Markdown) ---")
    print(conv_res.export_to_markdown())

    elapsed = time.time() - start_time
    print(f"\nProcessing completed in {elapsed:.2f} seconds")


if __name__ == "__main__":
    main()

Summary

The generative-AI boom has reignited an old truth: garbage in, garbage out. LLMs hallucinate less when they ingest semantically and spatially coherent input. Docling provides that coherence (most of the time) across the many source formats your stakeholders may hand you, and does so locally and reproducibly.

Docling has its uses beyond the AI world, though. Consider the vast number of documents stored in locations such as bank vaults, solicitors’ offices, and insurance companies worldwide. If these are to be digitised, Docling may provide some of the solutions for that.

Its biggest weakness is probably the Optical Character Recognition of text within images. I tried using Tesseract and EasyOCR, and the results from both were disappointing. You’ll probably need to use a commercial product like AWS Textract if you want to reliably reproduce text from those types of sources.

It can also be slow. I’ve a fairly high-spec desktop PC with a GPU, and it took some time on most tasks I set it. However, if your input documents are primarily PDFs, Docling could be a valuable addition to your text processing toolbox.

I have only scratched the surface of what Docling is capable of, and I encourage you to visit their homepage, which can be accessed using the following link to learn more.
