On Tradeoffs Between Document Signature Methods for a Legal Due Diligence Corpus

Adam Roegiest and Edward Lee • July 2019 • SIGIR 2019

While document signatures are a well-established tool in IR, they have primarily been investigated in the context of web documents. Legal due diligence documents, by their nature, share more structure and language than we would expect of standard web collections. Moreover, many due diligence systems strive to facilitate real-time interactions, so the time from document ingestion to availability should be minimal. Such constraints further limit the possible solution space when identifying near-duplicate documents. We present an examination of the tradeoffs that document signature methods face in the due diligence domain. In particular, we quantify the tradeoff between signature length, time to compute, number of hash collisions, and number of nearest neighbours for a 90,000-document due diligence corpus.
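
To make the idea of a document signature with a tunable length concrete, here is a minimal sketch in Python. The abstract does not name the signature methods that were compared, so the choice of SimHash, the tokenizer, the BLAKE2b token hash, and the function names (tokens, simhash, hamming) are all assumptions made for illustration and are not the authors' implementation.

```python
import hashlib
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Return a `bits`-wide SimHash fingerprint of `text` as an int.

    `bits` is the signature length; BLAKE2b limits it to at most 512 here.
    """
    counts = [0] * bits
    for tok in tokens(text):
        # Hash each token to at least `bits` bits (hash choice is an assumption).
        digest = hashlib.blake2b(
            tok.encode("utf-8"), digest_size=(bits + 7) // 8
        ).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the signature when the positive votes win.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Hypothetical clauses, invented for this example.
    clause_a = "This Agreement shall be governed by the laws of the State of New York."
    clause_b = "This Agreement shall be governed by the laws of the State of Delaware."
    for bits in (16, 64, 256):
        distance = hamming(simhash(clause_a, bits), simhash(clause_b, bits))
        print(bits, distance)
```

In a sketch like this, shorter signatures are cheaper to compute and store but map more distinct documents to the same value, while longer signatures separate documents better at a higher cost per document. That is the kind of tradeoff the abstract describes.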


