
OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

  • Research Article
  • Open access
  • Published: 22 November 2021
  • Volume 5, pages 861–882 (2022)


  • Thomas Hegghammer   ORCID: orcid.org/0000-0001-6253-1518 1  


Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans ( n  = 322) and Arabic-language article scans ( n  = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.


Introduction

Few technologies hold as much promise for the social sciences and humanities as optical character recognition (OCR). Automated text extraction from digital images can open up large quantities of understudied historical documents to computational analysis, potentially generating deep new insights into the human past.

But OCR is a technology still in the making, and available software provides varying levels of accuracy. The best results are usually obtained with a tailored solution involving corpus-specific pre-processing, model training, or postprocessing, but such procedures can be labour-intensive. Footnote 1 Pre-trained, general OCR processors have a much higher potential for wide adoption in the scholarly community, and hence their out-of-the box performance is of scientific interest.

For a long time, general OCR processors such as Tesseract ([ 27 , 38 ]) only delivered perfect results under what we may call laboratory conditions, i.e., on noise-free, single-column text in a clear printed font. This limited their utility for real-life historical documents, which often contain shading, blur, shine-through, stains, skewness, complex layouts, and other things that produce OCR error. Historically, general OCR processors have also struggled with non-Western languages ([ 16 ]), rendering them less useful for the many scholars working on documents in such languages.

In the past decade, advances in machine learning have led to substantial improvements in standalone OCR processor performance. Moreover, the past 2 years have seen the arrival of server-based processors such as Amazon Textract and Google Document AI, which offer document processing via an application programming interface (API) ([ 43 ]). Media and blog coverage indicate that these processors deliver strong out-of-the-box performance, Footnote 2 but those tests usually involve a small number of documents. Academic benchmarking studies exist ([ 37 , 41 ]), but they predate the server-based processors.

To help fill this gap, I conducted a benchmarking experiment comparing the performance of Tesseract, Textract, and Document AI on English and Arabic page scans. The objective was to generate statistically meaningful measurements of the accuracy of a selection of general OCR processors on document types commonly encountered in social scientific and humanities research.

The exercise yielded specifications for the relative performance of three leading OCR products as well as the differential effects of commonly found noise types. The findings can help scholars identify better OCR solutions for their research needs. The test materials, which have been preserved in the openly available “Noisy OCR Dataset” (NOD), can be used in future research.

The experiment involved taking two document collections of 322 English-language and 100 Arabic-language page scans, replicating them 43 times with different types of artificially generated noise, processing the full corpus of ~18,500 documents in each OCR engine, and measuring the accuracy against ground truth using the Information Science Research Institute (ISRI) tool.

I chose Tesseract, Textract, and Document AI on the basis of their wide use, reputation for accuracy, and availability for programmatic use. Budget constraints prevented the inclusion of additional reputable processors such as Adobe PDF Services and ABBYY Cloud OCR, but these can be tested in the future using the same procedure and test materials. Footnote 3

A full description of these processors is beyond the scope of this article, but Table 1 summarizes their main user-related features. Footnote 4 All the processors are primarily designed for programmatic use and can be accessed in multiple programming languages, including R and Python. The main difference is that Tesseract is open source and installed locally, whereas Textract and Document AI are paid services accessed remotely via a REST API.

For test data, I sought materials that would be reasonably representative of those commonly studied in the social sciences and humanities. That is to say, historical documents containing extended text, as opposed to forms, receipts, and other business documents, which commercial OCR engines are primarily designed for and which tend to get the most attention in media and blog reviews.

Since many scholars work on documents in languages other than English, I also wanted to include test materials in a non-Western language. Historically, these have been less well served by OCR engines, partly because their sometimes more ornate scripts are more difficult to process than Latin script, and partly because market incentives have led the software industry to prioritize the development of English-language OCR. I chose Arabic for three reasons: its size as a world language, its alphabetic structure (which allows accuracy measurement with the ISRI tool), and the complexity of its script. Arabic is known as one of the hardest alphabetic languages for computers to process ([ 14 , 23 ]), so including it alongside English will likely provide something close to the outer performance bounds of OCR engines on alphabetic scripts. I excluded logographic scripts such as Hanzi (Chinese) and Kanji (Japanese) partly due to the difficulty of generating comparable accuracy measures and partly due to my lack of familiarity with such languages.

The English test corpus consisted of the “Old Books Dataset” ([ 2 ]), a collection of 322 colour page scans from ten books printed between 1853 and 1920 (see Fig. 1a, b and Table 2). The dataset comes as 300 DPI and 500 DPI TIFF image files accompanied by ground truth (drawn from the Project Gutenberg website) in TXT files. I used the 300 DPI files in the experiment.

Figure 1: Sample test documents in their original state

The Arabic test materials were drawn from the “Yarmouk Arabic OCR Dataset” ([ 8 ]), a collection of 4587 Wikipedia articles printed out to paper and colour scanned to PDF (see Fig. 1c, d). The dataset contains ground truth in HTML and TXT files. Due to the homogeneity of the collection, a randomly selected subset of 100 pages was deemed sufficient for the experiment.

The Yarmouk dataset is suboptimal because it does not come from historical printed documents, but it is one of very few Arabic-language datasets of some size with accompanying ground truth data. The English and Arabic test materials are thus not directly analogous, and in principle the latter poses a lighter OCR challenge than the former. Another limitation of the experiment is that the test materials include only single-column text, due to the complexities involved in measuring layout parsing accuracy.

Noise application

Social scientists and historians often deal with digitized historical documents that contain visual noise ([ 18 , 47 ]). In practice, virtually any document that existed first on paper and was later digitized—which is to say almost all documents produced before around 1990 and many thereafter—is going to contain some kind of noise. Sometimes it is the original copy that is degraded; at other times the document passed through a poor photocopier, an old microfilm, or a blurry lens before reaching us. The type and degree of noise will vary across collections and individual documents, but most scholars who use archival material will encounter this problem at least occasionally.

A key objective of the experiment was, therefore, to gauge the effect of different types of visual noise on OCR performance. To achieve this, I programmatically applied different types of artificial noise to the test materials, so as to allow isolation of noise effects at the measurement stage. Specifically, the two datasets were duplicated 43 times, each time with a different noise filter. The R code used for noise generation is included in the Appendix. Footnote 5

I began by creating a binary version of each image, so that there were two versions—colour and greyscale—with no added noise (see Fig. 2a, b). I then wrote functions to generate six ideal types of image noise: “blur,” “weak ink,” “salt and pepper,” “watermark,” “scribbles,” and “ink stains” (see Fig. 2c–h). While not an exhaustive list of possible noise types, they represent several of the most common ones found in historical document scans. Footnote 6 I applied each of the six filters to both the colour version and the binary version of the images, thus creating 12 additional versions of each image. Lastly, I applied all available combinations of two noise filters to the colour and binary images, for an additional 30 versions.
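The full noise-generation code is in the Appendix; as a simplified illustration (not the exact functions used in the study, and with a placeholder file name), two of these filters and one two-layer combination could be approximated with the R package magick as follows:

```r
# Simplified noise filters in R with the magick package (illustrative only,
# not the study's exact code; the file name is a placeholder).
library(magick)

img <- image_read("old_books_j020.tif")

# Greyscale/binarized baseline version.
img_bin <- image_convert(img, colorspace = "gray")

# "Blur" filter: Gaussian blur with a fixed sigma.
img_blur <- image_blur(img, radius = 0, sigma = 4)

# "Salt and pepper" filter: impulse noise scattered over the page.
img_salt <- image_noise(img, noisetype = "impulse")

# A two-layer combination, e.g. binarization followed by blur ("bin + blur").
img_bin_blur <- image_blur(img_bin, radius = 0, sigma = 4)

image_write(img_bin_blur, "old_books_j020_bin_blur.png")
```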

This generated a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English test corpus of 14,168 documents and an Arabic test corpus of 4400 documents. The dataset is preserved as the “Noisy OCR Dataset” ([ 12 ]).

Figure 2: Sample test document (“Old Books j020”) with noise applied

The experiment aimed at measuring out-of-the-box performance, so documents were submitted without further preprocessing, using the OCR engines’ default settings. Footnote 7 While this is an uncommon use of Tesseract, it treats the engines equally and helps highlight the degree to which Tesseract is dependent on image preprocessing.

The English corpus was submitted to all three OCR engines in a total of 42,504 document processing requests. The Arabic corpus was only submitted to Tesseract and Document AI—since Textract does not support Arabic—for a total of 8800 processing requests.

The Tesseract processing was done in R with the package tesseract (v4.1.1). The Textract processing was carried out via the R package paws (v0.1.11), which provides a wrapper for the Amazon Web Services API. For Document AI, I used the R package daiR (v0.8.0) to access the Document AI API v1 endpoint. The processing was done in April and May of 2021 and took an estimated net total of 150–200 h to complete. The Document AI and Textract APIs processed documents at a rate of approximately 10–15 s per page. Tesseract took 17 s per page for Arabic and 2 s per page for English on a Linux desktop with a 12-core, 4.3 GHz CPU and 64 GB RAM.
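For orientation, the following sketch shows roughly what a single-page request looks like with each of these packages. It is a minimal illustration rather than the study's replication code: the image path is a placeholder, AWS credentials are assumed to be configured in the environment, and the Document AI call assumes a Google Cloud project set up as described in the daiR documentation.

```r
# Minimal single-page OCR requests in R (sketch only; the image path is a
# placeholder and cloud credentials are assumed to be configured already).
library(tesseract)
library(paws)
library(daiR)

image <- "page_001.png"

# 1. Tesseract, run locally with the English model.
text_tesseract <- ocr(image, engine = tesseract("eng"))

# 2. Amazon Textract via the paws SDK.
svc <- paws::textract()
resp <- svc$detect_document_text(
  Document = list(Bytes = readBin(image, "raw", n = file.size(image)))
)
line_text <- vapply(
  resp$Blocks,
  function(b) if (identical(b$BlockType, "LINE")) b$Text else NA_character_,
  character(1)
)
text_textract <- paste(na.omit(line_text), collapse = "\n")

# 3. Google Document AI via daiR (assumes a configured Google Cloud project).
resp_dai <- dai_sync(image)
text_dai <- text_from_dai_response(resp_dai)
```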

Measurement

Accuracy was measured with the ISRI tool ([ 30 ]) in Eddie Antonio Santos’s (2019) updated version—known as Ocreval—which has UTF-8 support. ISRI is a simple but robust tool that has been used for OCR assessment since its creation in the mid-1990s. Alternatives exist ([ 1 , 5 , 46 ]), but ISRI was deemed sufficient for this exercise.

ISRI compares two versions of a text—in this case OCR output to ground truth—and returns a range of measures for divergence, notably a document’s overall character accuracy and word accuracy expressed in percent. Character accuracy is the proportion of characters in a hypothesis text that match the reference text. Any misread, misplaced, absent, or excess character is considered an error and subtracted from the numerator. This represents the so-called Levenshtein distance ([ 20 ]), i.e., the minimum number of edit operations needed to correct the hypothesis text. Word accuracy is the proportion of non-stopwords in a hypothesis text that match those of the reference text. Footnote 8
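As a concrete illustration of the character-level logic, the snippet below computes a character accuracy with base R's adist() (which returns a Levenshtein distance) and notes how the Ocreval command-line tools are typically invoked; the texts and file names are placeholders.

```r
# Character accuracy as (n - Levenshtein distance) / n, using base R's adist().
reference  <- "The quick brown fox"
hypothesis <- "The qu1ck brown fox."   # one substitution plus one excess character

lev <- drop(adist(hypothesis, reference))                       # edit distance = 2
char_accuracy <- 100 * (nchar(reference) - lev) / nchar(reference)
char_accuracy                                                   # roughly 89.5

# The Ocreval command-line tools apply the same logic at scale, e.g.:
#   accuracy ground_truth.txt ocr_output.txt char_report.txt
#   wordacc  ground_truth.txt ocr_output.txt word_report.txt
```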

Character and word accuracy are usually highly correlated, but the former punishes error harder, since each wrong character detracts from the accuracy rate. Footnote 9 In word accuracy, by contrast, a misspelled word counts as one error regardless of the number of wrong characters that contribute to the error. Moreover, in ISRI’s implementation of word accuracy, case errors and excess words are ignored. Footnote 10

Figure  3 provides some examples of what character and word error rates may correspond to in an actual text. I will return later to the question of how error matters for analysis.

Figure 3: Examples of word error effects

Which of the two measures is better depends on the type of document and the purpose of the analysis. For shorter texts where details matter—such as forms and business documents—character accuracy is considered the more relevant measure. For longer texts to be used for searches or text mining, word accuracy is commonly used as the principal metric. In the following, I therefore report word error rates, obtained by subtracting word accuracy rates from 100. Character accuracy rates are available in the Appendix.

The main results are shown in Fig.  4 and reveal clear patterns. Document AI had consistently lower error rates, with Textract coming in a close second, and Tesseract last. More noise yielded higher error rates in all engines, but Tesseract was significantly more sensitive to noise than the two others. Overall, there was a significant performance gap between the server-based processors (Document AI and Textract) on one side and the local installation (Tesseract) on the other. Only on noise-free documents in English could Tesseract compete.

We also see a marked performance difference across languages. Both Document AI and Tesseract delivered substantially lower accuracy for Arabic than they did for English. This was despite the Arabic corpus consisting of Internet articles in a single, very common font, while the English corpus contained old book scans in several different fonts. An analogous Arabic corpus would likely have produced an even larger performance gap. This said, Document AI represents a significant improvement on Tesseract as far as out-of-the-box Arabic OCR is concerned.

Disaggregating the data by noise type shows a more detailed picture (see Figs. 5 and 6). Beyond the patterns already described, we see, for example, that both Textract and Tesseract performed somewhat better on greyscale versions of the test images than on the colour versions. We also note that all engines struggled with blur, while Tesseract was much more sensitive to salt & pepper noise than the other two engines. Incidentally, it is not surprising that the ink stain filter yielded lower accuracy throughout, since it completely concealed part of the text. The reason we see a bimodal distribution in the “bin + blur” filters on the English corpus is that they yielded many zero values, probably as a result of the image crossing a threshold of illegibility. The same did not happen in the Arabic corpus, probably because the source images there had crisper characters at the outset.

Figure 4: Word error rates by engine and noise level for English and Arabic documents

Figure 5: Word error rates by engine and noise type for English-language documents

Figure 6: Word error rates by engine and noise type for Arabic-language documents

Implications

When is it worth paying for better OCR accuracy? The answer depends on a range of situational factors, such as the state of the corpus, the utility function of the researcher, and the intended use case.

Much hinges on the corpus itself. As we have seen, accuracy gains increase with noise and are higher for certain types of noise. Moreover, if the corpus contains many different types of noise, a better general processor will save the researcher relatively more preprocessing time. Unfortunately we lack good tools for (ground truth-free) noise diagnostics, but there are ways to obtain some information about the noise state of the corpus ([ 10 , 21 , 28 ]). Finally, the size of the dataset matters, since processing costs scale with the number of documents while accuracy gains do not.

The calculus also depends on the economic situation of the researcher. Aside from the absolute size of one’s budget, a key consideration is labour cost, since cloud-based processing is in some sense a substitute for Tesseract processing with additional labour input. The latter option will thus make more sense for a student than for a professor, and more sense for the faster programmer.

Last but not least is the intended use of the OCRed text. If the aim is to recreate a perfect plaintext copy of the original document for, say, a browseable digital archive, then every percentage point matters. But if the purpose is to build a topic model or conduct a sentiment analysis, it is not obvious that a cleaner text will always yield better end results. The downstream effects of OCR error are a complex topic that cannot be explored in full here, but we can get some pointers by looking at the available literature and doing some tests of our own.

Existing research suggests that the effects of OCR error vary by analytical toolset. Broadly speaking, topic models have proved relatively robust to OCR inaccuracy ([ 6 , 9 , 26 , 36 ]), with [ 40 ] suggesting a baseline for acceptable OCR accuracy as low as 80 percent. Classification models have been somewhat more error-sensitive, although the results here have been mixed ([ 6 , 25 , 34 , 40 ]). The biggest problems seem to arise in natural language processing (NLP) tasks where details matter, such as part-of-speech tagging and named entity recognition ([ 11 , 22 , 24 , 40 ]).

To illustrate some of these dynamics and add to the empirical knowledge of OCR error effects, we can run some simple tests on the English-language materials from our benchmarking exercise. The Old Books dataset is small, but similar in kind to the types of text collections studied by historians and social scientists, and hence a reasonably representative test corpus. In the following, I look at OCR error in four analytical settings: sentiment analysis, classification, topic modelling, and named entity recognition. I exploit the fact that the benchmarking exercise yielded 132 different variants (3 engines and 44 noise types) of the Old Book corpus, each with a somewhat different amount of OCR error. Footnote 11 By running the same analyses on all text variants, we should get a sense of how OCR error can affect substantive findings. This said, the exercise as a whole is a back-of-the-envelope test insofar as it covers only a small subset of available text mining methods and does not implement any of them as fully as one would in a real-life setting.

Sentiment analysis

Faced with a corpus like Old Books (see Table 2), a researcher might want to explore text sentiment, for example to examine differences between authors or over time. Using the LSD2015 and ANEW dictionaries via the R package quanteda, I generated document-level sentiment polarity and valence scores for all variants of the corpus after standard preprocessing. To assess the effect of OCR error, I calculated the absolute difference between these scores and those of the ground truth version of the corpus. Figure 7a–d indicates that these differences increase only slightly with OCR error, but also that, for sentiment polarity, the variance is such that just a few percent of OCR error can produce sentiment scores that diverge from ground truth scores by up to two whole points at the document level.
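A stripped-down sketch of the polarity step, using only the LSD2015 dictionary bundled with quanteda and placeholder texts, might look like this (the study's exact preprocessing and its ANEW valence scoring are not reproduced):

```r
# Rough sketch of dictionary-based polarity scoring with quanteda's bundled
# LSD2015 dictionary (placeholder texts; the study also used ANEW for valence).
library(quanteda)

texts <- c(doc1 = "OCR output of page one ...", doc2 = "OCR output of page two ...")

toks <- tokens(texts, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015[1:2])  # negative, positive

m <- dfm(toks)
polarity <- as.numeric(m[, "positive"]) - as.numeric(m[, "negative"])

# The error effect is then the absolute difference between a noisy variant's
# scores and the ground-truth variant's scores, document by document.
```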

Figure 7: OCR error and sentiment analysis accuracy

Text classification

Another common analytical task is text classification. Imagine that we knew which works were represented in the Old Books corpus, but not which work each document belonged to. We could then handcode a subset and train an algorithm to classify the rest. Since we happen to have pre-coded metadata, we can easily simulate this exercise. I trained two multiclass classifiers—Random Forest and Support-Vector Machine—to retrieve the book from which a document was drawn. To avoid imbalance, I removed the smallest subset (“Engraving of Lions, Tigers, Panthers, Leopards, Dogs, &C.”) and was left with 9 classes and 314 documents. For each variant of the corpus, I preprocessed the texts, split them 70/30 for training and testing, and fit the models using the tidymodels R package. Figure 8a, b shows the results. We see that OCR error has only a small negative effect on classifier accuracy up to a threshold of around 20% OCR error, after which accuracy plummets.
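A bare-bones version of such a workflow, assuming a data frame old_books with hypothetical columns text and book, could look like this with tidymodels and textrecipes (the study's exact preprocessing and tuning choices are not reproduced):

```r
# Sketch of the multiclass classification workflow with tidymodels and
# textrecipes (the data frame and its columns `text` and `book` are hypothetical).
library(tidymodels)
library(textrecipes)

split <- initial_split(old_books, prop = 0.7, strata = book)
train <- training(split)
test  <- testing(split)

rec <- recipe(book ~ text, data = train) |>
  step_tokenize(text) |>
  step_stopwords(text) |>
  step_tokenfilter(text, max_tokens = 1000) |>
  step_tfidf(text)

rf_spec <- rand_forest(mode = "classification", trees = 500) |>
  set_engine("ranger")

rf_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec) |>
  fit(data = train)

preds <- predict(rf_fit, test)
accuracy_vec(truth = test$book, estimate = preds$.pred_class)
```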

Figure 8: OCR error and multiclass classifier accuracy

Topic modelling

Assessing the effect of OCR error on topic models is more complicated, since they involve more judgment calls and do not yield an obvious indicator of accuracy. I used the stm R package to fit structural topic models to all the versions of the corpus. As a first step, I ran the stm::searchK() function for a k value range from 6 to 20, on the suspicion that different variants of the text might yield different diagnostics and hence inspire different choices for the number of topics in the model. Figure 9a shows that the k intercept for the high point of the held-out likelihood curve varies from 6 to 12 depending on the version of the corpus. Held-out likelihood is not the only criterion for selecting k, but it is an important one, so these results suggest that even a small amount of OCR error can lead researchers to choose a different topic number than they would have done on a cleaner text, with concomitant effects on the substantive analysis. Moreover, if we hold k fixed at 8—the value suggested by diagnostics of the ground truth version of the corpus—we see in Fig. 9b that the semantic coherence of the model decreases slightly with more noise.
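In outline, and with simplified preprocessing, the diagnostic step and a fixed-k fit could look like this (the objects texts and meta stand in for one corpus variant and its metadata):

```r
# Outline of the searchK diagnostic and a fixed-K fit with the stm package
# (`texts` and `meta` are placeholders for one corpus variant and its metadata).
library(stm)

processed <- textProcessor(texts, metadata = meta)
prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

diag <- searchK(prepped$documents, prepped$vocab, K = 6:20, data = prepped$meta)
plot(diag)   # held-out likelihood, residuals, and semantic coherence by K

# Holding K fixed at the value suggested by the ground-truth diagnostics (8):
fit_k8 <- stm(prepped$documents, prepped$vocab, K = 8, data = prepped$meta)
```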

Figure 9: OCR error and topic model fits

Named entity recognition

Our corpus is full of names and dates, so a researcher might also want to explore it with named entity recognition (NER) models. I used a pretrained spaCy model (en_core_web_sm) to extract entities from all non-preprocessed versions of the corpus and compared the output to that of the ground truth text. In the absence of ground truth NER label data, I treated spaCy’s prediction for the ground truth text as the reference point and calculated the F1 score (the harmonic average of precision and recall) as a metric for accuracy. For simplicity, the evaluation included only predicted entity names, not entity labels. Figure 10 shows that OCR error affected NER accuracy severely. In a real-life setting these effects would be partly mitigated by pre- and postprocessing, but it seems reasonable to suggest that NER is one of the areas where the value added from high-precision OCR is the highest.
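The article ran spaCy's model directly; the sketch below illustrates the same idea through the R wrapper spacyr, with a simple set-based F1 score and placeholder inputs (the article's exact matching scheme is not reproduced):

```r
# Entity extraction and a simple set-based F1 via the spacyr wrapper
# (the article used spaCy's en_core_web_sm; `ground_truth_text` and `ocr_text`
# are placeholders, and the matching scheme here is deliberately simplified).
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

ents_truth <- unique(spacy_extract_entity(ground_truth_text)$text)
ents_ocr   <- unique(spacy_extract_entity(ocr_text)$text)

tp        <- length(intersect(ents_ocr, ents_truth))
precision <- tp / length(ents_ocr)
recall    <- tp / length(ents_truth)
f1        <- 2 * precision * recall / (precision + recall)
```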

Figure 10: OCR error and named entity recognition accuracy

Broadly speaking, these tests indicate that OCR error mattered the most in NER, the least in topic modelling and sentiment analysis, while in classification there was a tipping point at around 20 percent OCR error. At the same time, all the tests showed some accuracy deterioration even at very low OCR error rates.

This article described a systematic test of three general OCR processors on a large new dataset of English and Arabic documents. It suggests that the server-based engines Document AI and Textract deliver markedly higher out-of-the-box accuracy than the standalone Tesseract library, especially on noisy documents. It also indicates that certain types of “integrated” noise, such as blur and salt and pepper, generate more error than “superimposed” noise such as watermarks, scribbles, and even ink stains. Furthermore, it suggests that the “OCR language gap” still persists, although Document AI seems to have partially closed it, at least for Arabic.

The key takeaway for the social sciences and humanities is that high-accuracy OCR is now more accessible than ever before. Researchers who might be deterred by the prospect of extensive document preprocessing or corpus-specific model training now have at their disposal user-friendly tools that deliver strong results out of the box. This will likely lead to more scholars adopting OCR technology and to more historical documents becoming digitized.

The findings can also help scholars tailor OCR solutions to their needs. For many users and use cases, server-based OCR processing will be an efficient option. However, there are downsides to consider, such as processing fees and data privacy concerns, which means that in some cases, other solutions—such as self-trained Tesseract models or even plain Tesseract—might be preferable. Footnote 12 Having baseline data on relative processor performance and differential effects of noise types can help navigate such tradeoffs and optimise one’s workflow.

The study has several limitations, notably that it tested only three processors on two languages with a non-exhaustive list of noise types. This means we cannot say which processor is the very best on the market or provide a comprehensive guide to OCR performance on all languages and noise types. However, the test design used here can easily be applied to other processors, languages, and noise types for a more complete picture. Another limitation is that the experiment only used single-column test materials, which does not capture layout parsing capabilities. Most OCR engines, including Document AI and Textract, still struggle with multi-column text, and even state-of-the-art tools such as Layout Parser ([ 32 ]) require corpus-specific training for accurate results. Future studies will need to determine which processors deliver the best out-of-the-box layout parsing. In any case, we appear to be in the middle of a small revolution in OCR technology with potentially large benefits for the social sciences and humanities.

For pre-processing see, e.g., [ 3 , 7 , 13 , 19 , 42 ], and [ 44 ]. For model training, see, e.g., [ 4 , 29 , 33 ], and [ 45 ]. For postprocessing, see, e.g., [ 17 , 35 ], and [ 39 ].

See, for example, Ted Han and Amanda Hickman, “Our Search for the Best OCR Tool, and What We Found,” OpenNews , February 19, 2019 ( https://source.opennews.org/articles/so-many-ocr-options/ ); Fabian Gringel, “Comparison of OCR tools: how to choose the best tool for your project,” Medium.com , January 20, 2020 ( https://medium.com/dida-machine-learning/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project-bd21fb9dce6b ); Manoj Kukreja, “Compare Amazon Textract with Tesseract OCR—OCR & NLP Use Case,” TowardDataScience.com , September 17, 2020 ( https://towardsdatascience.com/compare-amazon-textract-with-tesseract-ocr-ocr-nlp-use-case-43ad7cd48748 ); Cem Dilmegani, “Best OCR by Text Extraction Accuracy in 2021,” AIMultiple.com , June 6, 2021 ( https://research.aimultiple.com/ocr-accuracy/ ).

As of September 2021, Adobe PDF Services charges a flat rate of $50 per 1000 pages ( https://www.adobe.io/apis/documentcloud/dcsdk/pdf-pricing.html , accessed 3 September 2021). ABBYY Cloud costs between $28 and $60 per 1000 pages depending on one’s monthly plan and the total number of documents (see https://www.abbyy.com/cloud-ocr-sdk/licensing-and-pricing/ , accessed 3 September 2021). By contrast, processing in Amazon Textract and Google Document AI costs $1.50 per 1,000 pages.

For documentation, see the product websites: https://github.com/tesseract-ocr/tesseract , https://aws.amazon.com/textract/ , and https://cloud.google.com/document-ai .

There are other ways of generating synthetic noise, notably the powerful tool DocCreator ([ 15 ]). I chose not to use DocCreator primarily because it is graphical user interface-based, and I found I could generate realistic noise more efficiently with R code.

It would be possible to extend the list of noise types further, to include 10–20 different types, but this would increase the size of the corpus (and thus the processing costs) considerably, probably without affecting the broad result patterns. Since the main aim here is not to map all noise types but to compare processors, I decided on a manageable subset of noise types.

The only exception was the setting of the relevant language libraries in Tesseract.

ISRI only has an English-language stopword list (of 110 words), so in the measurements for Arabic, stopwords are included in the assessment. All else equal, this should produce slightly higher accuracy rates for Arabic, since oft-recurring words are easier for OCR engines to recognize.

ISRI’s character accuracy rates can actually be negative as a result of excess text. OCR engines sometimes introduce garbled text when they see images or blank areas with noise, resulting in output texts that are much longer than ground truth. Since excess characters are treated as errors and subtracted from the numerator, they can result in negative accuracy rates. In the corpus studied here, this phenomenon affected 4.6 percent of the character accuracy measurements, and it occurred almost exclusively in texts processed by Tesseract.

This also means that ISRI’s word accuracy tool does not yield negative rates. As Eddie Antonio Santos explains, “The wordacc algorithm creates parallel arrays of words and checks only for words present in the ground truth. It finds ‘paths’ from the generated file that correspond to ground truth. For this reason, it only detects as many words as there are in ground truth”; private email correspondence, 1 September 2021. However, the word accuracy tool returns NA when the hypothesis text has no recognizable words. This occurred in 9.4 percent of the measurements in this experiment, again almost exclusively in Tesseract output. These NAs are treated as zeroes in Figs. 4, 5, and 6.

In all of the below, “OCR error” refers to word error rates computed with the ISRI tool.

Amazon openly says it “may store and use document and image inputs [...] to improve and develop the quality of Amazon Textract and other Amazon machine-learning/artificial-intelligence technologies” (see https://aws.amazon.com/textract/faqs/ , accessed 3 September 2021). Google says it “does not use any of your content [...] for any purpose except to provide you with the Document AI API service” (see https://cloud.google.com/document-ai/docs/data-usage , accessed 3 September 2021), but it is unclear what lies in the word “provide” and whether it includes the training of the processor.

Alghamdi, Mansoor A., Alkhazi, Ibrahim S., & Teahan, William J. (2016). “Arabic OCR Evaluation Tool.” In 2016 7th International Conference on Computer Science and Information Technology (CSIT) , 1–6. IEEE.

Barcha, Pedro. (2017). Old Books Dataset . GitHub: GitHub Repository. https://github.com/PedroBarcha/old-books-dataset .


Bieniecki, W., Grabowski, S., & Rozenberg, W. (2007). “Image Preprocessing for Improving Ocr Accuracy.” In 2007 International Conference on Perspective Technologies and Methods in MEMS Design , 75–80. IEEE.

Boiangiu, C.-A., Ioanitescu, Radu, & Dragomir, Razvan-Costin. (2016). Voting-Based OCR System. The Proceedings of Journal ISOM, 10 , 470–86.

Carrasco, R. C. (2014). “An Open-Source OCR Evaluation Tool.” In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage , 179–84.

Colavizza, G. (2021). Is your OCR good enough? Probably so. Results from an assessment of the impact of OCR quality on downstream tasks. KB Lab Blog . https://lab.kb.nl/about-us/blog/your-ocr-good-enough-probably-so-results-assessment-impact-ocr-quality-downstream .

Dengel, A., Hoch, R., Hönes, F., Jäger, T., Malburg, M., Weigel, A. (1997) “Techniques for Improving OCR Results.” In Handbook of Character Recognition and Document Image Analysis , 227–58. World Scientific.

Doush, I. A., Faisal, A., & Gharibeh, A. H. (2018). “Yarmouk Arabic OCR Dataset.” In 2018 8th International Conference on Computer Science and Information Technology (CSIT) , 150–54. IEEE.

Grant, P., Sebastian, R., Allassonnière-Tang, M., & Cosemans, S. (2021). Topic modelling on archive documents from the 1970s: global policies on refugees. Digital Scholarship in the Humanities, March. https://doi.org/10.1093/llc/fqab018

Gupta, A., Gutierrez-Osuna, R., Christy, M., Capitanu, B., Auvil, L., Grumbach, L., Furuta, R., & Mandell, L. (2015). “Automatic Assessment of OCR Quality in Historical Documents.” In Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 29. 1.

Hamdi, A., Jean-Caurant, A., Sidere, N., Coustaty, M., & Doucet, A. (2019). “An Analysis of the Performance of Named Entity Recognition over OCRed Documents.” In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) , 333–34. IEEE. https://ieeexplore.ieee.org/document/8791217 .

Hegghammer, T. (2021). Noisy OCR Dataset. Repository details TBC.

Holley, R. (2009). How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs. D-Lib Magazine, 15 (3/4)

Jain, M., Mathew, M., & Jawahar, C. V. (2017). Unconstrained Scene Text and Video Text Recognition for Arabic Script. arXiv:1711.02396 .

Journet, Nicholas, Visani, Muriel, Mansencal, Boris, Van-Cuong, Kieu, & Billy, Antoine. (2017). Doccreator: a new software for creating synthetic ground-truthed document images. Journal of Imaging, 3 (4), 62.


Kanungo, T., Marton, G. A., & Bulbul, O. (1999). Performance Evaluation of Two Arabic OCR Products. In 27th AIPR Workshop: Advances in Computer-Assisted Recognition , 3584:76–83. International Society for Optics; Photonics.

Kissos, I., & Dershowitz, N. (2016). “OCR Error Correction Using Character Correction and Feature-Based Word Classification.” In 2016 12th IAPR Workshop on Document Analysis Systems (DAS) , 198–203. IEEE.

Krishnan, R., & Babu, D. R. R. (2012). A Language Independent Characterization of Document Image Noise in Historical Scripts. International Journal of Computer Applications, 50 (9), 11–18.

Lat, A., & Jawahar, C. V. (2018). “Enhancing Ocr Accuracy with Super Resolution.” In 2018 24th International Conference on Pattern Recognition (ICPR) , 3162–67. IEEE.

Levenshtein, V. I., et al. (1966). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics Doklady , 10:707–10. 8. Soviet Union.

Lins, R. D., Banergee, S., & Thielo, M. (2010). “Automatically Detecting and Classifying Noises in Document Images.” In Proceedings of the 2010 ACM Symposium on Applied Computing , 33–39.

Lopresti, D. (2009). Optical Character Recognition Errors and Their Effects on Natural Language Processing. International Journal on Document Analysis and Recognition (IJDAR) 12 (3): 141–51. http://www.cse.lehigh.edu/~lopresti/tmp/AND08journal.pdf .

Mariner, M. C. (2017). Optical Character Recognition (OCR). Encyclopedia of Computer Science and Technology (pp. 622–29). CRC Press.

Miller, D., Boisen, S., Schwartz, R., Stone, R., & Weischedel, R. (2000). “Named Entity Extraction from Noisy Input: Speech and OCR.” In Sixth Applied Natural Language Processing Conference , 316–24. https://aclanthology.org/A00-1044.pdf .

Murata, M., Busagala, L. S. P., Ohyama, W., Wakabayashi, T., & Kimura, F. (2006). The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification. In Document Analysis Systems VII , edited by Horst Bunke and A. Lawrence Spitz, 506–17. Berlin, Heidelberg: Springer Berlin Heidelberg. https://link.springer.com/chapter/10.1007/11669487_45 .

Mutuvi, S., Doucet, A., Odeo, M., & Jatowt, A. (2018). “Evaluating the Impact of OCR Errors on Topic Modeling.” In International Conference on Asian Digital Libraries , 3–14. Springer.

Patel, C., Patel, A., & Patel, D. (2012). Optical character recognition by open source OCR tool Tesseract: a case study. International Journal of Computer Applications, 55 (10), 50–56.

Reffle, U., & Ringlstetter, C. (2013). Unsupervised Profiling of OCRed Historical Documents. Pattern Recognition, 46 (5), 1346–57.

Reul, C., Springmann, U., Wick, C., & Puppe, F. (2018). “Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting.” In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS) , 423–28. IEEE.

Rice, S. V, & Nartker, T. A. (1996). The ISRI Analytic Tools for OCR Evaluation. UNLV/Information Science Research Institute, TR-96 2.

Santos, E. A. (2019). “OCR Evaluation Tools for the 21st Century.” In Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers) , 23–27. Honolulu: Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-6004 .

Shen, Z., Zhang, R., Dell, M., Lee, B. C. G., Carlson, J., & Li, W. (Eds.). (2021). LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv Preprint arXiv:2103.15348 .

Springmann, U., Najock, D., Morgenroth, H., Schmid, H., Gotscharek, A., & Fink, F. (2014). “OCR of Historical Printings of Latin Texts: Problems, Prospects, Progress.” In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage , 71–75.

Stein, S. S., Argamon, S., & Frieder, O. (2006). “The Effect of OCR Errors on Stylistic Text Classification.” In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval , 701–2. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.67.6791&rep=rep1&type=pdf .

Strohmaier, C. M, Ringlstetter, C., Schulz, K. U., & Mihov, S. (2003). Lexical Postcorrection of OCR-Results: The Web as a Dynamic Secondary Dictionary? In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. , 3:1133–33. Citeseer.

Su, J., Boydell, O., Greene, D., & Lynch, G. (2015). Topic Stability over Noisy Sources. arXiv Preprint arXiv:1508.01067 . https://noisy-text.github.io/2016/pdf/WNUT09.pdf.

Tafti, A. P., Baghaie, A., Assefi, M., Arabnia, H. R., Zeyun, Y., & Peissig, P. (2016). “OCR as a Service: An Experimental Evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym.” In International Symposium on Visual Computing , 735–46. Springer.

tesseract-ocr. (2019). Tesseract OCR 4.1.1. GitHub Repository. GitHub. https://github.com/tesseract-ocr/tesseract .

Thompson, P., McNaught, J., & Ananiadou, S. (2015). “Customised OCR Correction for Historical Medical Text.” In 2015 Digital Heritage , 1:35–42. IEEE.

van Strien, D., Beelen, K., Ardanuy, M., Hosseini, Kasra, McGillivray, B., & Colavizza, G. (2020). “Assessing the Impact of OCR Quality on Downstream NLP Tasks.” INSTICC; SciTePress. https://doi.org/10.5220/0009169004840496 .

Vijayarani, S., & Sakila, A. (2015). Performance Comparison of OCR Tools. International Journal of UbiComp (IJU), 6 (3), 19–30.

Volk, Martin, Furrer, Lenz, & Sennrich, Rico. (2011). Strategies for Reducing and Correcting OCR Errors. Language Technology for Cultural Heritage (pp. 3–22). Springer.

Walker, J., Fujii, Y., & Popat, A. C. (2018). “A Web-Based Ocr Service for Documents.” In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria . Vol. 1.

Wemhoener, D., Yalniz, I. Z., & Manmatha, R. (2013). “Creating an Improved Version Using Noisy OCR from Multiple Editions.” In 2013 12th International Conference on Document Analysis and Recognition , 160–64. IEEE.

Wick, C., Reul, C., & Puppe, F. (2018). Comparison of OCR Accuracy on Early Printed Books Using the Open Source Engines Calamari and OCRopus. J. Lang. Technol. Comput. Linguistics, 33 (1), 79–96.

Yalniz, I. Z., & Manmatha, R. (2011). “A Fast Alignment Scheme for Automatic OCR Evaluation of Books.” In 2011 International Conference on Document Analysis and Recognition , 754–58. https://doi.org/10.1109/ICDAR.2011.157 .

Ye, Peng, & Doermann, David. (2013). “Document Image Quality Assessment: A Brief Survey.” In 2013 12th International Conference on Document Analysis and Recognition , 723–27. IEEE.


Author information

Authors and affiliations.

Norwegian Defence Research Establishment (FFI), Kjeller, Norway

Thomas Hegghammer


Corresponding author

Correspondence to Thomas Hegghammer .

Ethics declarations

Conflict of interest.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

I am grateful to the three anonymous reviewers and to Neil Ketchley for valuable comments. I also thank participants in the University of Oslo Political Data Science seminar on 17 June 2021 for inputs and suggestions, as well as Eddie Antonio Santos for helping solve technical questions related to the ISRI/Ocreval tool. Supplementary information and replication materials are available at https://github.com/Hegghammer/noisy-ocr-benchmark .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Hegghammer, T. OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment. J Comput Soc Sc 5 , 861–882 (2022). https://doi.org/10.1007/s42001-021-00149-1

Download citation

Received : 23 June 2021

Accepted : 06 October 2021

Published : 22 November 2021

Issue Date : May 2022

DOI : https://doi.org/10.1007/s42001-021-00149-1


Keywords: Cloud computing, Benchmarking

Case Study: PDF Insights with AWS Textract and OpenAI integration

Learn how we sought a way to efficiently extract key information from PDF pitchdecks, including dealing with the challenges posed by graphical elements and unstructured content.

A QUICK SUMMARY – FOR THE BUSY ONES

PDF Insights with AWS Textract and OpenAI integration

Project context.

The company approached us with the issue of a large quantity of data to sift through in the form of pitchdecks. We were faced with the task of automating the extraction of the most important information from an unstructured, hard-to-parse format - PDF.

Getting the text contents of the PDF was just the beginning. The text in PDF is all over the place: we had slides with two or three words, some tables, lists, or just paragraphs squished between images.

Reliable text extraction

  • We’ve used AWS Textract to parse PDF files. This way we don’t rely on the internal structure of the PDF to get text from it.
  • To parse the text and pull what we want from it, we went with the OpenAI GPT-3.5 and GPT-4 models.
  • What we still lack, and what is probably the most difficult part, is the ability to interpret the images and spatial relationships in PDF slides.

Read the whole article to learn more about our findings.


Original problem - automated PDF summarization

The company approached us with the issue of a large quantity of data to sift through in the form of pitchdecks. While each pitchdeck is generally fairly short, in most cases around 10 slides, the issue is the number of them to analyze. We were faced with the task of automating the extraction of the most important information from an unstructured, hard-to-parse format - PDF. Additionally, the data is in the form of slides, with a lot of graphical cues and geometric relations between words that convey information not easily inferred from the text itself. To make it easier to analyze a large amount of data, we would need a solution that automates as much of that process as possible: from reading the document itself to finding interesting pieces of information like the names of people involved, financial data, and so on.

Why is text extraction so hard?

The first issue we faced was getting the text contents from a PDF file. While extracting text directly from the PDF using open-source tools like pdf-parse (which is used internally by langchain’s pdf-loader) did the job most of the time, we still had some issues with it: some PDFs were not parsed correctly and the tool returned an empty string (as in the case of the Uber sample pitchdeck), some words were split into individual characters, and so on.

Unfortunately, getting the text contents of the PDF was just the beginning. The text in a PDF is all over the place: we had slides with two or three words, some tables, lists, or just paragraphs squished between images. Below is an example of text extracted from page 2 of the example reproduction of the AirBnB early pitchdeck ( link , extraction done with the pdf-parse library):

[Image in original post: raw text extracted with pdf-parse]

And this is one of the better ones!

While parsing text like this is hard in itself, we would also like to be able to control what we extract from the text. We may want to know which people are involved in a business, or to pull all the financial data, or maybe just the name of the industry. Each type of data extracted requires a different approach to parsing and validating text, and then a lot of testing.

How can it be solved?

First, we’ve decided to leave open-source solutions behind. We’ve used AWS Textract to parse PDF files. This way we don’t rely on the internal structure of the PDF to get text from it (or to get nothing - like in the case of the Uber example). Textract uses OCR and machine learning to get not only text but also spatial information from the document.

Here is the Textract result (with all geometric information stripped) from the same page of the AirBnB pitchdeck reproduction:

[Image in original post: Textract text output for the same page]

But that’s not all! Textract responds with a list of Blocks (like “Page”, or “Line” for a line of text), together with their positions and relationships, which we can use to understand the structure of the document better:

[Image in original post: excerpt of Textract Block objects with geometry]

Most of the time, we don’t need such details, so in our case, we use only a fraction of them.
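The post does not include the extraction code itself; purely as an illustration, a pruning step of this kind could look as follows in an R sketch using the paws AWS SDK (the file name is a placeholder):

```r
# Illustrative R sketch (via the paws AWS SDK; the file name is a placeholder)
# of pruning a Textract response down to line text plus a rough page position.
library(paws)

svc <- paws::textract()
resp <- svc$detect_document_text(
  Document = list(Bytes = readBin("pitchdeck_page2.png", "raw",
                                  n = file.size("pitchdeck_page2.png")))
)

line_blocks <- Filter(function(b) identical(b$BlockType, "LINE"), resp$Blocks)
compact <- lapply(line_blocks, function(b) {
  box <- b$Geometry$BoundingBox
  list(text = b$Text, top = box$Top, left = box$Left)
})
# `compact` is far smaller than the raw Blocks list and is easier to reuse later.
```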

Summarisation process and AI

Now to actually parse the text and pull what we want from it. For that, the only solution that seemed viable was to use a language model. While we tested some open-source solutions, they were not up to the task: hallucinations were too common and responses too unpredictable. Additionally, most of the capable open-source models available today are not licensed to be used commercially. So we went with the OpenAI GPT-3.5 and GPT-4 models.

We’ve decided to first let the model summarise the text and include all information from the pitchdeck in that summary. That way we have text that is complete (not just the outline) and has a structure that is easier to work with. We’ve used the following prompt for each page of the document:

[Image in original post: the per-page summarisation prompt]
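The exact prompt is shown as an image in the original post; purely as an illustration of the call pattern, a per-page request could look like the following R sketch, with the prompt wording paraphrased and the API key read from an environment variable:

```r
# Sketch of a per-page summarisation request against the OpenAI chat completions
# API (prompt wording paraphrased; OPENAI_API_KEY is read from the environment).
library(httr)
library(jsonlite)

summarise_page <- function(page_text) {
  body <- list(
    model = "gpt-3.5-turbo",
    messages = list(
      list(
        role = "system",
        content = paste(
          "Summarise the following pitch deck slide.",
          "Include all information present in the text.",
          "Avoid adding your own opinions or assumptions."
        )
      ),
      list(role = "user", content = page_text)
    )
  )
  resp <- POST(
    "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
    content_type_json(),
    body = toJSON(body, auto_unbox = TRUE)
  )
  content(resp)$choices[[1]]$message$content
}
```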

With additional instructions like “avoid adding your own opinions or assumptions” we minimize the hallucinations (models like to add fake data to the summary; GPT-3 even added a completely fake financial analysis!). When we have a summary of all the pages, we can ask the model to extract information from it. Here is an example of the prompt we’ve used to get the list of people referenced in the document:

[Image in original post: the people-extraction prompt]

The summarisation returned by the models (both GPT-3 and GPT-4) is of good quality: the information returned is factual, and whatever is plainly stated in the document will end up in the summary as well.

However, extracting the list of people is a different story. The models, especially GPT-3, often answer with a list similar to this (not an actual response):

[Image in original post: example of a hallucinated list of people]

Not only is this clearly not a correct list of people; the email was not in the source text at all. The model made it up!

We’ve also experimented with many variations of that prompt like:

  • Adding information that this is text extracted from a PDF doesn’t seem to make any difference - the models treat the input text the same way. Looking at the data, there really isn’t any information for the model to infer anything from; we would need to include actual geometry data.
  • Skipping the summarisation step and asking the model to extract information directly from the text of the whole document. This didn’t have much effect either (although in at least one case the responses seemed slightly worse, the difference was very subtle), which suggests that we don’t strictly need the summarisation step, especially since we run it for each page and thus make quite a lot of requests. We’ve decided to keep it, however, as we may need a summary anyway.
  • Providing GPT with the text together with the spatial information returned by Textract. While this seems like a way to let the model infer some visual cues, it is hard to figure out the right format. The JSON that Textract returns is quite verbose and often too long to pass to the model (even with unnecessary fields stripped); one naive pruning format is sketched after this list. Splitting a page into smaller chunks seems wrong, as the page context is often important for understanding a chunk. This still needs investigating and more experiments.
  • While trying to solve the issue of inaccurate or hallucinated answers, we tried feeding the model its own answer so that it could validate and fix it. Unfortunately, our tests with GPT-3 failed - it didn’t see any issues with its made-up emails and phone numbers on a list that was supposed to contain the names of people. We still need more tests with this approach using the GPT-4 model.
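As flagged in the third point above, the right format for passing geometry to the model is still an open question; one naive possibility, continuing the pruned block list from the earlier Textract sketch, is to flatten it into coordinate-prefixed lines:

```r
# One naive way to flatten the pruned Textract blocks (`compact`, from the
# earlier sketch) into a compact, prompt-friendly layout representation.
layout_lines <- vapply(compact, function(b) {
  sprintf("[y=%.2f x=%.2f] %s", b$top, b$left, b$text)
}, character(1))

prompt_context <- paste(layout_lines, collapse = "\n")
# `prompt_context` stays far shorter than the raw Textract JSON while keeping
# a rough sense of where each line sits on the slide.
```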

What we miss, and what is probably the most difficult, is the ability to interpret the images and spatial relationships in PDF slides. While AWS Textract returns some spatial information, it does not recognize images, and the data returned is hard to pass to the model. We’re still investigating how to make the model understand arrows, charts, and tables. Additionally, we would like to automate the process of online research, e.g., finding more information about companies mentioned in the documents using available APIs (like Crunchbase) or fetching more data on the people involved.

The case study addresses automating the extraction of vital details from numerous PDF pitchdecks. These decks are concise but numerous, making manual analysis impractical. The challenge involves extracting text and interpreting graphical elements. AWS Textract was employed for text extraction due to its OCR and layout understanding capabilities. OpenAI's GPT-3.5 and GPT-4 models were used to summarize and extract information, yet challenges arose in accurately extracting specific data like people's names or financial data. The study acknowledges the need to enhance image interpretation to understand visual elements better.



amazon textract case study

Introduction to Amazon Textract: Now in Preview

Broadcast Date: December 13, 2018

Amazon Textract enables you to easily extract text and data from virtually any document. Today, companies process millions of documents by manually entering the data or by using simple optical character recognition software, which is error-prone and difficult to customize. Join this tech talk to learn how Amazon Textract uses machine learning to simplify document processing by enabling fast and accurate text and data extraction, so you can process millions of documents in hours with no machine learning experience required. We'll dive into how Textract's pre-trained machine learning models eliminate the need for companies to write code for data extraction, and we'll also discuss different use cases.


Learning Objectives

  • Learn about the features and benefits of Amazon Textract
  • Learn about different use cases from media & entertainment to healthcare and more
  • Learn how to get started and sign up for the Amazon Textract preview


Who Should Attend?

Developers, Engineers, IT Professionals, Architects, Business Decision Makers, Technical Decision Makers

  • Speaker: Wendy Tse, Sr. Product Manager, AWS


To learn more about the services featured in this talk, please visit: https://aws.amazon.com/textract/


How Amazon Textract Transforms Document Processing


Amazon Textract is a cutting-edge service provided by Amazon Web Services (AWS) that leverages machine learning to extract text, handwriting, tables, and other data from scanned documents. This fully managed service is designed to transform the way businesses handle their documents, automating data extraction and significantly reducing the need for manual data entry.

Understanding Amazon Textract

Amazon Textract is a sophisticated service offered by Amazon Web Services that utilizes machine learning and Optical Character Recognition (OCR) to extract text, tables, and form data from scanned documents. This service is designed to transform traditional document processing by automating data extraction, thus eliminating the need for manual entry and enhancing the efficiency of business operations.

The Technology Behind Textract

At the heart of Amazon Textract is the integration of OCR technology with advanced machine learning algorithms. This combination enables Textract not only to identify text within documents accurately but also to understand its context and structure. For example, when processing an invoice, Textract can differentiate between the invoice number, date, and total amount by recognizing the layout and correlating elements within the document. This level of comprehension allows for the extraction of structured data from unstructured documents, a task that goes beyond the capabilities of traditional OCR solutions.
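To make the invoice example concrete, here is a minimal sketch (not from the original article) of pulling key-value pairs out of a scanned invoice with the AWS SDK for Python (boto3). The file name is a placeholder, and the block-walking logic is a simplified version of the general pattern for Textract's FORMS output, not a complete parser.

    import boto3

    textract = boto3.client("textract")

    # Analyze a scanned invoice for key-value pairs (FORMS), so fields such as
    # "Invoice Number", "Date", and "Total" come back linked to their values.
    with open("invoice.png", "rb") as f:  # hypothetical local scan
        response = textract.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["FORMS"],
        )

    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_text(block):
        # Concatenate the WORD children of a KEY or VALUE block.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                words += [blocks[i]["Text"] for i in rel["Ids"]
                          if blocks[i]["BlockType"] == "WORD"]
        return " ".join(words)

    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            value_ids = [i for rel in block.get("Relationships", [])
                         if rel["Type"] == "VALUE" for i in rel["Ids"]]
            value_text = " ".join(child_text(blocks[i]) for i in value_ids)
            print(f"{child_text(block)}: {value_text}")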

Core Features

Amazon Textract offers several key features that make it a valuable tool across various industries. Automated text and data extraction allows businesses to efficiently process documents of different types, such as forms, invoices, and identity documents, regardless of whether the text is printed or handwritten. Textract’s ability to understand document structures means it can accurately extract data from complex layouts, like tables and forms, maintaining the relationships between different data points.

An example of Textract’s application can be seen in the healthcare sector, where it facilitates the digitization of patient records by extracting information from clinical notes and insurance claims. This capability not only speeds up the processing of documents but also ensures that critical health information is accurately recorded and easily accessible.

Seamless Integration and Enhanced Security

Amazon Textract is fully integrated within the AWS ecosystem, allowing for seamless connections with other AWS services. This integration extends Textract’s functionality, enabling stored data to be further processed, analyzed, or used to trigger specific workflows. Security is also a top priority, with Textract incorporating robust measures to protect sensitive information throughout the extraction process, adhering to global security standards.

Practical Applications of Amazon Textract

Revolutionizing Financial Document Processing

Amazon Textract significantly enhances the efficiency of financial operations by automating the extraction of data from critical documents such as bank statements, invoices, and expense reports. This automation speeds up the reconciliation process and improves accuracy, reducing the risk of errors. For example, financial institutions can process loan applications faster, offering better customer service with quicker response times.

Innovating in Healthcare Records Management

In the healthcare sector, Amazon Textract simplifies the management of patient records and insurance claims. By extracting data from clinical notes and patient forms, Textract facilitates the digitization of health records, making vital information easily accessible and accurately recorded. This not only boosts operational efficiency but also contributes to improved patient care.

Streamlining Legal Document Analysis

Textract offers a solution for legal professionals by automating the extraction of information from contracts and legal documents. It identifies key clauses and important dates, streamlining contract reviews and compliance checks. This automation allows legal teams to focus on strategic work, relying on Textract for efficient foundational document analysis.

Enhancing Customer Service

Businesses across various sectors leverage Amazon Textract to automate data entry from customer forms and feedback. This results in quicker responses to customer needs, reducing the workload on service teams and elevating the overall customer experience.

Optimizing Government Operations

Government agencies benefit from Textract by automating data extraction from a wide range of documents, including applications and identification papers. This improves public service efficiency, making processes like application approvals for government programs more streamlined and transparent.

Integrating with AWS for Comprehensive Solutions

Textract’s integration with AWS services amplifies its impact across industries. By automating workflows and enabling actions based on extracted data – such as updating databases or initiating processes – Textract enhances operational efficiency and opens up new possibilities for data analysis and insight generation.

Through its advanced data extraction capabilities, Amazon Textract is setting a new standard in document processing. By automating manual tasks, it not only saves time and resources but also enables organizations to harness the full potential of their data, leading to smarter business decisions and enhanced services.

Integration, Scalability, and Security: The Backbone of Amazon Textract

Seamless Integration with the AWS Ecosystem

Amazon Textract is not just a standalone service; it’s a part of the broader AWS ecosystem, designed to work harmoniously with other AWS services. This seamless integration allows businesses to create powerful, end-to-end solutions that leverage the strengths of multiple AWS services. For instance, the extracted data from Textract can be stored in Amazon S3, processed and analyzed with AWS Lambda functions, or used to trigger workflows in AWS Step Functions. This ecosystem approach not only simplifies the architecture of document processing solutions but also enhances their capabilities, making it easier for businesses to innovate and adapt to changing needs.
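As an illustration of that kind of wiring, the sketch below shows a Lambda handler that runs whenever a new scan lands in an S3 bucket, calls Textract synchronously, and writes the extracted text back next to the original. The bucket layout, file naming, and the choice to store a plain-text sidecar are assumptions for the example, not prescriptions.

    import json
    import boto3

    textract = boto3.client("textract")
    s3 = boto3.client("s3")

    def handler(event, context):
        # Triggered by an S3 "ObjectCreated" notification for a single-page image.
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        # Synchronous text detection on the uploaded scan (PNG/JPEG).
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

        # Store the extracted text alongside the original; a downstream step or
        # Step Functions state machine could pick it up from here.
        s3.put_object(Bucket=bucket, Key=key + ".txt",
                      Body="\n".join(lines).encode("utf-8"))
        return {"statusCode": 200, "body": json.dumps({"lines": len(lines)})}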

Scalability to Meet Evolving Business Demands

One of the critical advantages of Amazon Textract is its scalability. Whether a business is dealing with a few documents a day or millions a month, Textract can scale its resources to meet the demand. This scalability ensures that businesses can rely on Textract for their document processing needs, regardless of the size or the volume of their document processing requirements. The ability to scale seamlessly means that businesses can maintain high levels of efficiency and responsiveness as they grow, without worrying about the underlying infrastructure.
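For large or multipage workloads, Textract's asynchronous operations are the usual fit: you start a job, let the service scale the work behind the scenes, and page through the results when it finishes. A minimal sketch follows; the bucket and document names are placeholders, and in production an SNS completion notification would normally replace the polling loop.

    import time
    import boto3

    textract = boto3.client("textract")

    # Start an asynchronous job for a multipage PDF stored in S3.
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "reports/q3.pdf"}}
    )
    job_id = job["JobId"]

    # Poll until the job finishes (an SNS notification channel is preferable at scale).
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    # Results are paginated; follow NextToken to collect every block.
    blocks = []
    if result["JobStatus"] == "SUCCEEDED":
        blocks = result["Blocks"]
        while "NextToken" in result:
            result = textract.get_document_text_detection(JobId=job_id,
                                                          NextToken=result["NextToken"])
            blocks.extend(result["Blocks"])

    print(sum(1 for b in blocks if b["BlockType"] == "LINE"), "lines extracted")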

Uncompromising Security for Sensitive Data

Security is a paramount concern for businesses, especially when dealing with sensitive or confidential documents. AWS’s commitment to security is evident in Textract, which incorporates robust security measures to protect data throughout the document processing pipeline. From encryption at rest and in transit to compliance with global security standards, Textract ensures that sensitive data is handled with the utmost care. Additionally, businesses can leverage AWS Identity and Access Management (IAM) to control access to Textract resources, further enhancing the security of their document processing operations.

Building Trust with Compliance and Data Protection

Beyond security, Amazon Textract adheres to AWS’s strict compliance protocols, ensuring that businesses can meet their regulatory requirements. Whether it’s GDPR for European customers or HIPAA for healthcare data in the United States, Textract is designed to help businesses comply with relevant regulations. This commitment to compliance and data protection builds trust, allowing businesses to focus on leveraging Textract’s capabilities to improve their operations, knowing that their data handling practices are sound.

Pricing and Accessibility: Tailoring Amazon Textract to Your Business Needs

Flexible Pay-As-You-Go Pricing Model

Amazon Textract’s pricing model is designed with flexibility and cost-effectiveness in mind, adhering to a pay-as-you-go structure. This approach allows businesses to pay only for the amount of data they process, without any upfront costs or long-term commitments. Whether a company processes a handful of documents or scales up to handle millions, Textract’s pricing adjusts accordingly, ensuring businesses only pay for what they use. This model is particularly beneficial for startups and small businesses that require scalability without the burden of significant initial investments, as well as for large enterprises managing vast volumes of documents.

Detailed Pricing for Specific Features

The pricing for Amazon Textract is detailed and transparent, with specific costs associated with different features such as text detection, form analysis, and table extraction. This detailed pricing ensures that businesses can plan and optimize their costs based on their specific use cases. For example, a legal firm focusing on extracting data from contracts may prioritize form and table analysis, while a healthcare provider might focus on bulk text extraction from patient records. By understanding the specific pricing for these features, businesses can tailor their use of Textract to achieve the most cost-effective solution for their needs.
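A rough back-of-the-envelope calculation shows how the feature mix drives cost. The per-page rates below are hypothetical placeholders used only to illustrate the arithmetic; actual rates vary by feature, region, and volume tier and should be taken from the current AWS price list.

    # Hypothetical per-page rates, for illustration only (not AWS's published prices).
    pages = 10_000
    detect_text_rate = 0.0015   # $/page, plain text detection (assumed)
    forms_tables_rate = 0.065   # $/page, forms + tables analysis (assumed)

    print(f"Text detection only: ${pages * detect_text_rate:,.2f}")   # ~$15.00
    print(f"Forms + tables:      ${pages * forms_tables_rate:,.2f}")  # ~$650.00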

Accessibility Across Platforms and Languages

Accessibility is a cornerstone of Amazon Textract, designed to be easily integrated into existing workflows. Through the AWS Console, developers and IT professionals can quickly start using Textract without the need for extensive setup. For those looking to automate document processing within their applications, Textract provides SDKs and APIs that support multiple programming languages, including Python, Java, JavaScript, and Go. This wide range of supported languages ensures that developers can work with Textract in their preferred coding environment, facilitating a smoother integration process.
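As a sense of scale for that integration effort, the snippet below is roughly all it takes to run plain text detection through the Python SDK (boto3); equivalent clients exist for the other languages listed. The region and file name are placeholders for the example.

    import boto3

    client = boto3.client("textract", region_name="us-east-1")

    # Read a local scan and run synchronous text detection.
    with open("scanned_page.png", "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})

    # Reassemble the detected LINE blocks into plain text.
    text = "\n".join(b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE")
    print(text)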

Streamlining Integration into Workflows

The ease of integration offered by Amazon Textract allows businesses to seamlessly incorporate advanced document processing capabilities into their existing systems. Whether it’s automating data entry, enhancing content management systems, or enriching customer relationship management (CRM) platforms, Textract’s accessibility ensures that these integrations are straightforward. Additionally, the extensive documentation and support provided by AWS help developers navigate the integration process, ensuring they can leverage Textract’s full potential to streamline operations and improve efficiency.

Amazon Textract is transforming document processing with its advanced machine-learning capabilities. By automating data extraction and offering features like form and table data extraction, document classification, and custom queries, Textract enables businesses to process documents more efficiently and accurately than ever before. As an advanced-tier AWS partner, Cloudvisor is uniquely positioned to help businesses leverage the power of Amazon Textract, driving efficiency and innovation in document processing workflows across Europe, the USA, and beyond.


Introduction to Amazon Textract

  • Mark McQuade
  • AI & Machine Learning, Blog, Data Analytics
  • December 4, 2019


Amazon Textract is an automatic text and data extraction service, designed to simplify and accelerate advanced data extraction processes. Built to harness the power of machine learning, Amazon Textract exceeds the capabilities of simple optical character recognition (OCR) software, identifying and extracting the contents of fields in forms as well as information stored in tables. With support for virtually all kinds of documents and forms, Amazon Textract offers a powerful solution to ease your data extraction workflows.

Solving an old problem

Documents of all types, including contracts, forms, and agreements, are essential to the operations of any business as primary tools of record. The need for documents extends across all industries, from finance, insurance, and law to real estate, education, and medicine. With thousands of documents produced at companies and organizations every year, it becomes increasingly hard to keep track of data in an organized, easy-to-access fashion. Machine learning models allow Amazon Textract to deliver powerful and highly accurate document processing, enabling features like search and discovery through indexing, compliance and control, and business process automation.

Existing Data Extraction Methods

Data extraction and document processing are currently performed in three main ways: manual processing, optical character recognition, and rules- and template-based extraction.

Manual Processing

Manual processing is one of the most common approaches for organizations that require only limited data extraction; it relies on human effort to scan and work through each document. While this method is simple to start with, it is plagued by challenges such as variable outputs across different documents, inconsistent results when different people process the same material, time inefficiency from the need for repeated reviews, and high costs that accumulate from compensating the people doing the processing.

Optical Character Recognition (OCR)

OCR allows for accelerated data extraction that can also be cheaper than manual processing. This method, however, is limited by its error-prone workflow, its compatibility with only simple documents, and the lack of organization in its results, which makes it difficult to decipher the extracted data and put it to use.

Rules and Template Based Extraction

Extracting data with predefined rules and templates can speed up the process dramatically while achieving reasonable accuracy on documents that match the template layouts exactly. In practice, however, documents vary frequently, from differences in scanning practice to input methods ranging from handwriting to digital entry. Small variances between documents can completely throw off rules- and template-based extraction systems, because such systems cannot recognize the individual elements and relationships in the documents being processed.

Comprehensive Document Processing with Machine Learning

Each of the data extraction methods discussed above has its own advantages, coupled with limitations that reduce its viability as a reliable document processing solution. The most prominent limitation is an inability to intelligently identify different types of content, such as form entries, table entries, and stylized text, and apply the appropriate processing to each. With these tools, accuracy requires slow manual extraction, whereas quick processing comes at the cost of inaccurate data with limited usability.

Amazon Textract utilizes machine learning to process documents instantly and accurately, undeterred by variability in document formats or by the complexity of the data being processed. The machine learning models it uses have been trained on millions of documents from almost every industry, comprising document types such as contracts, tax documents, sales orders, benefits applications, insurance claims, and more. Such extensive training allows the models to be flexible across document types, removing the need to write and maintain code as layouts change. Furthermore, Amazon Textract performs these tasks near-instantaneously without sacrificing accuracy, thanks to its ability to intelligently recognize tables, form field content, and the relationships between the data in these more complex entry formats.

Intelligent structured data extraction also enables some highly useful features. Once data has been extracted, it can be indexed in Amazon Elasticsearch Service so that you can quickly search for specific data across thousands of documents. Extracted data can also be used to automate form processing without human intervention, allowing processes like loan approval at banks to be initiated without requiring manual review.
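A minimal sketch of that indexing idea follows, assuming the elasticsearch Python client (8.x-style API) and placeholder bucket, index, and endpoint names; an Amazon Elasticsearch Service or OpenSearch domain endpoint would take the place of the local URL.

    import boto3
    from elasticsearch import Elasticsearch

    textract = boto3.client("textract")
    es = Elasticsearch("http://localhost:9200")  # placeholder for your search endpoint

    def index_document(bucket, key):
        # Extract the plain text of a scanned document stored in S3...
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        text = " ".join(b["Text"] for b in response["Blocks"]
                        if b["BlockType"] == "LINE")
        # ...and index it so thousands of documents become full-text searchable.
        es.index(index="documents", id=key, document={"source": key, "text": text})

    index_document("my-scans-bucket", "contracts/2019-0042.png")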

In addition to all of this, Amazon Textract provides processing and extraction services at very low cost. There are no upfront commitments or long-term contracts, and you pay only for the capacity you use.

Amazon Textract is a powerful service designed to ease and accelerate data extraction, one of the most fundamental processes for any business. If you’d like to learn more about Amazon Textract and see specific examples of how it can prove to be significantly more useful than other data extraction methods, watch our webinar on Getting Started with Amazon Textract.

Ready to get started? Get in touch with our team to learn how we can help you leverage Amazon Textract to accelerate and ease your data extraction workflow.


Analyze documents with Amazon Textract and generate output in multiple formats.

aws-samples/amazon-textract-textractor


Textractor

Textractor is a Python package created to work seamlessly with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages, you can find them using the links below:

  • amazon-textract-caller (to simplify calling Amazon Textract without additional dependencies)
  • amazon-textract-response-parser (to parse the JSON response returned by Textract APIs; a minimal usage sketch follows this list)
  • amazon-textract-overlayer (to draw bounding boxes around the document entities on the document image)
  • amazon-textract-prettyprinter (convert Amazon Textract response to CSV, text, markdown, ...)
  • amazon-textract-geofinder (extract specific information from documents with methods that help navigate the document using geometry and relations, e.g. hierarchical key/value pairs)
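For example, here is a minimal sketch of the response parser in use, combined with a plain boto3 call; the file name is a placeholder, and the field and table traversal follows the parser's documented object model.

    import boto3
    from trp import Document  # installed via the amazon-textract-response-parser package

    textract = boto3.client("textract")
    with open("form.png", "rb") as f:  # placeholder file name
        response = textract.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["FORMS", "TABLES"],
        )

    # trp.Document wraps the raw JSON so pages, form fields, and tables can be
    # traversed as Python objects instead of raw block dictionaries.
    doc = Document(response)
    for page in doc.pages:
        for field in page.form.fields:
            print(field.key, ":", field.value)
        for table in page.tables:
            print("table with", len(table.rows), "rows")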

Installation

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default this installs the minimal version of Textractor, which is suitable for Lambda execution. The following extras can be used to add features:

  • pandas ( pip install "amazon-textract-textractor[pandas]" ) installs pandas which is used to enable DataFrame and CSV exports.
  • pdf ( pip install "amazon-textract-textractor[pdf]" ) includes pdf2image and enables PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
  • torch ( pip install "amazon-textract-textractor[torch]" ) includes sentence_transformers for better word search and matching. This will work on CPU but be noticeably slower than non-machine learning based approaches.
  • dev ( pip install "amazon-textract-textractor[dev]" ) includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]".

Documentation

Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/

While a collection of simplistic examples is presented here, the documentation has a much larger collection of examples with specific case studies that will help you get started.

These two lines are all you need to use Textract; a minimal sketch is shown below. The Textractor instance can be reused across multiple documents, for both synchronous and asynchronous requests.
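The code sample that accompanied this passage is not reproduced above, so here is an approximate reconstruction based on the package's documented interface; the AWS profile name and the sample file path are assumptions.

    from textractor import Textractor

    # The "two lines": create the client once and reuse it across documents.
    extractor = Textractor(profile_name="default")
    document = extractor.detect_document_text(file_source="tests/fixtures/amzn_q2.png")

    print(document.text)  # plain-text rendering of the detected lines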

Further examples cover text recognition, table extraction, form extraction, and receipt processing (Analyze Expense); a sketch of the table and form case is shown below.
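This sketch assumes the TextractFeatures constants and the pandas extra described under Installation; the file path and printed output are illustrative only.

    from textractor import Textractor
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(profile_name="default")
    document = extractor.analyze_document(
        file_source="tests/fixtures/form.png",
        features=[TextractFeatures.TABLES, TextractFeatures.FORMS],
    )

    # With the pandas extra installed, tables convert directly to DataFrames.
    df = document.tables[0].to_pandas()
    print(df.head())

    # Form fields are exposed as key-value pairs.
    for field in document.key_values:
        print(field.key, "->", field.value)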

If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.

Textractor also comes with the textractor script, which supports calling, printing and overlaying directly in the terminal.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES


See the documentation for more examples.

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.

Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).

Contributing

See CONTRIBUTING.md

Textractor can be cited using the CITATION.cff file included in the repository.

This library is licensed under the Apache 2.0 License.

Excavator image by macrovector on Freepik


RELATED RESOURCES

  1. Amazon Textract Customers

    Deciding on Amazon Textract, InfraBeat proposed an SAP IRPA with Amazon Textract solution to achieve a high level of accuracy and minimal adjustments to their logic. We always want high accuracy when it comes to data extraction, and the results from Amazon Textract were above our expectations, consistent across many different layouts, with 90% ...

  2. Accurait Gives Customers More Accurate Leasing Data with Amazon Textract

    Amazon Textract uses AI to detect and identify relevant data on each page, accelerating text scanning and extraction. Accurait deployed a new version of its data extraction tool on Amazon Textract, from proof of concept to production, in just three months. "We evaluated other offerings, but Amazon Textract was superior in terms of features ...

  3. Indecomm Case Study: Enhancing IDX Innovation with Amazon Textract

    Selecting the right technology to power IDX was a pivotal juncture. After an exhaustive evaluation, Indecomm's choice crystallized around Amazon Textract on Amazon Web Services (AWS). This decision was catalyzed by Textract's innate scalability and seamless integration with serverless tools like AWS Lambda.

  4. Unveiling Amazon Textract: An In-Depth Exploration

    Access Amazon Textract: Navigate to the AWS Management Console, locate the Textract service, and configure it to suit your needs. Upload Documents: Upload your documents to an S3 bucket or use ...

  5. Case Study: PDF Insights with AWS Textract and OpenAI Integration

    The case study addresses automating the extraction of vital details from numerous PDF pitch decks. These decks are concise but numerous, making manual analysis impractical. The challenge involves extracting text and interpreting graphical elements. AWS Textract was employed for text extraction due to its OCR and layout understanding capabilities.

  6. Compare Amazon Textract with Tesseract OCR

    In this article we focus on two well-known OCR frameworks: Tesseract OCR, free software released under the Apache License, Version 2.0, whose development has been sponsored by Google since 2006; and Amazon Textract OCR, a fully managed service from Amazon that uses machine learning to automatically extract text and data. We compare the OCR capabilities of ...

  7. Extracting and Sending Text to AWS Comprehend for Analysis

    Amazon Textract lets you include document text detection and analysis in your applications. With Amazon Textract you can extract text from a variety of different document types using both synchronous and asynchronous document processing. The extracted text can then be saved to a file or database, or sent to another AWS service for further processing.

  8. Improving Data Extraction Processes Using Amazon Textract and Idexcel

    Manually extracting data from multiple sources is repetitive, error-prone, and can create a bottleneck in the business process. Idexcel built a solution based on Amazon Textract that improves the accuracy of the data extraction process, reduces processing time, and boosts productivity to increase operational efficiencies. Learn how this approach can solidify your competitive edge, help you ...

  9. Best Practices for Amazon Textract

    Provide a high-quality image, ideally at least 150 DPI. If your document is already in one of the file formats that Amazon Textract supports (PDF, TIFF, JPEG, and PNG), don't convert or downsample the document before uploading it to Amazon Textract. For the best results when extracting text from tables in documents, ensure that ...

  10. Automatically Extract Text and Structured Data from Documents with Amazon Textract

    By using Amazon Textract Response Parser, it's easier to de-serialize the JSON response and use it in your program, the same way Amazon Textract Helper and Amazon Textract PrettyPrinter use it. The GitHub repository shows some examples. Amazon Textract can also provide the inputs required to automatically process forms and tables without human intervention.

  11. Amazon Comprehend Customers and Partners

    Amazon Textract's OCR technology enabled us to extract text from documents. Amazon Comprehend's context-aware NLP APIs extracted business-specific entities and their values from the text. We also incorporated humans in our workflow using Amazon Augmented AI (Amazon A2I) to have our teams review extracted data and provide feedback to the ML ...

  12. Analyzing Document Text with Amazon Textract

    To analyze text in a document via the API, give a user the AmazonTextractFullAccess and AmazonS3ReadOnlyAccess permissions (see Step 1: Set Up an AWS Account and Create a User), then install and configure the AWS CLI and the AWS SDKs (see Step 2: Set Up the AWS CLI and AWS SDKs).

  13. Paytm Case Study

    Paytm is one of India's largest financial services companies, offering solutions ranging from payments and e-commerce to banking and more, serving the underbanked population and businesses in India. To keep pace with business growth and a rapidly increasing pool of users and customers, the company adopted Amazon Textract, which instantly and accurately extracts text and data from users ...

  14. Amazon Textract (documentation PDF)

    Amazon Textract also provides asynchronous operations to extend support to multipage documents. Amazon Textract's API operations have quotas that limit how quickly and how often you can use them. If the limit set for your account is frequently exceeded, you can request a limit increase.