Re-OCR
#1
Seems like justice.gov has its own search index, but that index relies on the OCR text embedded in the documents themselves, and that text doesn't seem to be reliable.

Obviously I haven't looked through all the documents, but a lot of them have garbled OCR, either by mistake or on purpose. We should probably re-OCR everything from scratch, this time making sure the output is really high quality. I've used olmo-ocr for this before and it works well; we just need to run it over all of the documents.
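To get a rough idea of how much of the existing OCR is garbled before re-running everything, something like the following could flag suspicious text layers. This is only a sketch under my own assumptions: the function names, the token rules, and the 0.3 threshold are all made up for illustration, and it will misfire on dates, case numbers, and non-English text. It says nothing about how olmo-ocr itself should be invoked.

```python
import re

# Hypothetical heuristic: a token is "suspicious" if, after stripping
# surrounding punctuation, it is neither a plain number nor a word made
# of letters (plus ' and -) that contains at least one vowel.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")
VOWEL_RE = re.compile(r"[aeiouyAEIOUY]")


def garble_score(text: str) -> float:
    """Return the fraction of suspicious tokens (0.0 = clean, 1.0 = all garbled)."""
    tokens = text.split()
    if not tokens:
        return 1.0  # an empty text layer is treated as unusable
    suspicious = 0
    for tok in tokens:
        core = tok.strip(".,;:!?()[]\"'")
        if not core:
            suspicious += 1  # token was punctuation only
        elif core.isdigit():
            continue  # plain numbers are fine
        elif not WORD_RE.fullmatch(core):
            suspicious += 1  # mixed letters/digits/symbols
        elif not VOWEL_RE.search(core):
            suspicious += 1  # letter run with no vowel
    return suspicious / len(tokens)


def needs_reocr(text: str, threshold: float = 0.3) -> bool:
    """Flag a document for re-OCR when its garble score exceeds the (assumed) threshold."""
    return garble_score(text) > threshold
```

Running this over the extracted text of each document would at least tell us which ones are worst, although re-OCRing everything is still the safer option.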

Once we have this, we should also publish a search index.
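As a starting point for the search index, here is a minimal inverted-index sketch over the re-OCR'd text. Everything in it is an assumption for illustration: the document IDs, the lowercase alphanumeric tokenizer, and the AND-only query semantics. A real published index would more likely use an existing search engine.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Use .get so that looking up an unknown term does not add it to the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, keyed by file name.
docs = {
    "a.pdf": "Flight log for January",
    "b.pdf": "Deposition transcript, January",
}
index = build_index(docs)
```

With the sample documents above, `search(index, "flight january")` matches only `a.pdf`, while `search(index, "january")` matches both.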