30 Jun 2022
News

Response to accusations regarding TableFormer paper

Update 08/02/22: IEEE reviewed the case internally and dismissed the plagiarism allegations.

We, IBM researchers and the authors of the TableFormer work,1 would like to respond to accusations of plagiarism made by the authors of TableMaster2 (hereafter referred to as “OP”) regarding their ideas and code.

The accusations arose on June 27, 2022 following the publication of our paper at the Computer Vision and Pattern Recognition Conference (CVPR). The authors of TableMaster did not contact us prior to their public accusations, which are ungrounded and easily refuted by a simple comparison of the two papers in question.

First, though, we would like to point out that never, in this or any other instance, have IBM researchers plagiarized anyone’s work. We adhere to the highest ethical standards in researching and publishing our work, be it as a pre-print, at a conference or any other venue, or in a journal.

Our work introduces a different neural network architecture, built on top of the work published in 2019 by our IBM colleagues (EDD3). TableFormer also uses a unique data-processing pipeline, applied directly to programmatic PDF documents. This approach follows the PDF-parsing ideas of our 2018 KDD paper4 and is fundamentally different from TableMaster, which depends on Optical Character Recognition (OCR) applied to images.

Let’s examine the accusations

Did we copy the idea?

The answer is no.

  • The dual decoder approach was introduced3 by our colleagues at IBM in 2019 before the OP’s work (in 2021).
  • The EDD3 public code contains the idea of bounding box regression, which predates the codebase and paper of the OP. In our quantitative analysis section, we refer to it as “EDD+BBox.”
  • The TableFormer network architecture is different from TableMASTER-mmocr. TableMASTER-mmocr uses a dual transformer decoder together with text-line detection (based on PSENet), whereas TableFormer uses a single transformer decoder whose output is first passed to an attention network and then to a DETR5 head to predict the bounding box.
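The single-decoder flow described above can be sketched in a few lines. This is a hedged illustration only, not the paper’s implementation: the hidden sizes, random weights, and layer shapes are made up, and the “DETR-style head” is reduced to a tiny MLP with a sigmoid output.

```python
import numpy as np

# Illustrative sketch of the described prediction flow: one transformer
# decoder state, an attention step over encoded image features, then a
# small MLP ("DETR-style head") regressing a normalized bounding box.
# All shapes and weights are hypothetical.
rng = np.random.default_rng(0)
d = 8                                     # hidden size (illustrative)
dec_out = rng.standard_normal(d)          # decoder state for one structure token
enc_feats = rng.standard_normal((16, d))  # encoded image features

# Attention: the decoder state attends over the image features.
scores = enc_feats @ dec_out
weights = np.exp(scores - scores.max())
weights /= weights.sum()
context = weights @ enc_feats             # attended feature vector, shape (d,)

# DETR-style head: MLP mapping the context to (x, y, w, h) in (0, 1).
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((4, d))
bbox = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ context, 0.0))))

print(bbox.shape)  # (4,)
```

The point of the sketch is structural: a single decoder feeds both the structure-token prediction and, via attention plus a regression head, the cell bounding box, in contrast to a dual-decoder design.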

Did we use any of the OP’s models?

The answer is no.

  • We do not use OCR — instead, we use the content from the original PDFs.
  • We do not use OP’s “text line detection” or “text line recognition.” In fact, we do not need to do this process at all, because we do not use any OCR.
  • We only use the original PDFs that our colleagues used to create the PubTabNet dataset.
  • We apply our own method,4 published in 2018, to extract content (raw files) from PDFs.

Did we use any of the OP’s visualizations?

The answer is no.

  • Using bounding boxes to visualize detections is a standard technique in computer vision.
  • Many papers, published before OP’s work, use bounding boxes to visualize detections in tables. One example is the work6 by our IBM colleagues in 2020.
  • Our visualization is produced using our Javascript/HTML code, which has a unique appearance and simplifies the comparison of predictions at different stages.

Did we copy the OP’s preprocessing?

The answer is no.

  • Our data preparation stage includes steps that are not present in OP’s work. For instance, we have introduced a procedure that generates missing bounding boxes, as explained in our supplementary material.
  • In the implementation details of our paper, we explained why we used 512 tokens.
  • The HTML classification tokens are not defined by OP’s work, but they were first described by EDD3 in 2019.
  • Even OP’s screenshots show that our work is different from theirs, because we use “uncollapsed” tokens (“<td>”, “</td>”), in contrast to their work, which uses “collapsed” tokens (“<td></td>”).
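The difference between the two token conventions can be shown with a toy example. The token lists below are illustrative only (they are not the actual vocabularies of either paper): “uncollapsed” tokenization emits separate opening and closing cell tags, while “collapsed” tokenization merges an empty cell into a single token.

```python
# Toy illustration of "uncollapsed" vs "collapsed" HTML structure tokens
# for a one-row, two-cell table row. Hypothetical token lists, not the
# actual vocabularies used by either paper.

# Uncollapsed: opening and closing cell tags are separate tokens.
uncollapsed = ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]

# Collapsed: each cell is represented by a single merged token.
collapsed = ["<tr>", "<td></td>", "<td></td>", "</tr>"]

print(len(uncollapsed), len(collapsed))  # 6 4
```

The two conventions produce sequences of different lengths over different vocabularies, which is why the token streams in the two works are not interchangeable.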

Did we copy the OP’s postprocessing?

The answer is no.

  • Our TableFormer extracts the text directly from the PDF document and does not use any OCR. The output of our model is therefore different and requires different post-processing.
  • Our post-processing pipeline is more sophisticated than the OP’s. This is explained in detail in our supplementary material.
  • Caching for autoregressive methods during inference is a well-known practice. It has been implemented by Open-Source Neural Machine Translation (OpenNMT) and is described in this blog post.
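The caching idea referred to above is standard for autoregressive decoding: instead of recomputing the whole prefix at every generation step, the per-step states are cached and only the newest token is processed. A minimal, framework-free sketch (the `step` function is a hypothetical stand-in for an expensive per-token computation, not either paper’s model):

```python
# Minimal sketch of inference caching for autoregressive decoding.
# "step" stands in for an expensive per-token computation.

def step(prev_state, token):
    return prev_state + [token * 2]

def decode_no_cache(tokens):
    """Recompute the full prefix at every decoding step."""
    calls = 0
    state = []
    for i in range(1, len(tokens) + 1):
        state = []
        for t in tokens[:i]:
            state = step(state, t)
            calls += 1
    return state, calls

def decode_with_cache(tokens):
    """Carry the cached state forward; process only the new token."""
    calls = 0
    state = []
    for t in tokens:
        state = step(state, t)
        calls += 1
    return state, calls

s1, c1 = decode_no_cache([1, 2, 3, 4])
s2, c2 = decode_with_cache([1, 2, 3, 4])
assert s1 == s2        # identical outputs
print(c1, c2)          # 10 4: quadratic vs. linear work
```

The cached variant produces exactly the same output with linearly many step calls instead of quadratically many, which is why the technique is ubiquitous in transformer inference and not specific to any one codebase.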

Did we mislead anyone?

The answer is no.

  • We were not aware of the OP’s work. Even during the paper’s review process, the existence of OP’s work was not mentioned.
  • As shown by our answers to previous questions, we were building upon our colleagues' work3, 6 that predates the OP’s work.
  • The OP did not contact us before sending a mass email to our work colleagues and posting the accusations on Reddit. Had the OP contacted us first, we would have gladly proven our points, cited the OP’s work, and compared the approaches.
  • We are open to discussion with the OP to further clarify all of the above and prove that our work has not been copied from, or even inspired by, the OP’s work.
  • We demand a retraction of the plagiarism accusation, and an apology email to our colleagues retracting the accusation.
  • If the OP is still not convinced, we don’t mind them reaching out to CVPR. We have overwhelming evidence in terms of code (git history) and documentation to prove that the accusations are completely unfounded.

References

  1. Nassar, Ahmed, et al. "TableFormer: Table Structure Understanding with Transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022.

  2. Ye, Jiaquan, et al. "PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML." arXiv preprint arXiv:2105.01848 (2021).

  3. Zhong, X., ShafieiBavani, E., Jimeno Yepes, A. "Image-Based Table Recognition: Data, Model, and Evaluation." In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12366. Springer, Cham. 2020. https://doi.org/10.1007/978-3-030-58589-1_34. [arXiv preprint arXiv:1911.10683 (2019).]

  4. Staar, Peter W.J., et al. "Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

  5. Zhu, Xizhou, et al. "Deformable DETR: Deformable Transformers for End-to-End Object Detection." arXiv preprint arXiv:2010.04159 (2020).

  6. Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N. X. R. "Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context." 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 697-706. doi: 10.1109/WACV48630.2021.00074. [arXiv preprint arXiv:2005.00589 (2020).]