Using optical character recognition (OCR) to defeat homoglyph attacks

This isn’t a fully baked idea, just one that I thought of at Defcon while playing linecon.

I’ve written about homoglyph attacks before, mostly because it’s a simple trick that really seems like it should have been eradicated by now.

OK, so here’s what I was thinking: the whole issue is humans mistaking one character or collection of characters for another and thereby being phished, pwned, etc. So why not tune an OCR engine to get “fooled” the same way, and then check what it reads against the true input?

The flow:

  1. a string comes in, e.g. grnail.com
  2. render it to an HTML canvas element in a font that often leads to misreading, e.g. Arial, at a fairly small size
  3. use Tesseract to OCR the rendered text back into a string
  4. compare the input string to the OCR’d string
  5. if there’s a mismatch, warn the user (the hope is that grnail.com reads back as gmail.com, which no longer matches the input)
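
For the backend case, the flow above could be sketched roughly like this in Python. This is only a sketch under assumptions, not the demo's actual code: every name here (`check_hostname`, `normalize`, `tesseract_ocr`, `fooled_reader`) is made up for illustration, the font path and size are guesses you would have to tune, and `tesseract_ocr` assumes Pillow, pytesseract and a Tesseract install are available. The OCR step is passed in as a function so the comparison logic can be tried without Tesseract.

```python
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    hostname: str   # what actually came in
    ocr_text: str   # what the OCR engine "saw", normalized
    matched: bool   # False means a human could plausibly misread it


def normalize(text: str) -> str:
    """Lowercase and drop all whitespace so OCR spacing noise is ignored."""
    return "".join(text.split()).lower()


def check_hostname(hostname: str, ocr: Callable[[str], str]) -> CheckResult:
    """Steps 2-5 of the flow: read the hostname back through `ocr`, then compare."""
    ocr_text = normalize(ocr(hostname))
    return CheckResult(hostname, ocr_text, ocr_text == normalize(hostname))


def tesseract_ocr(text: str, font_path: str = "arial.ttf", size: int = 11) -> str:
    """Render `text` small in one font and OCR it.

    Assumes Pillow, pytesseract and Tesseract are installed; `font_path`
    and `size` are hypothetical defaults, not tuned values.
    """
    from PIL import Image, ImageDraw, ImageFont  # imported lazily on purpose
    import pytesseract

    font = ImageFont.truetype(font_path, size)
    left, top, right, bottom = font.getbbox(text)
    image = Image.new("L", (right + 8, bottom + 8), color=255)   # white canvas
    ImageDraw.Draw(image).text((4, 4), text, font=font, fill=0)  # black text
    # --psm 7 asks Tesseract to treat the image as a single line of text
    return pytesseract.image_to_string(image, config="--psm 7")


def fooled_reader(text: str) -> str:
    """Stand-in for a tuned OCR engine: always misreads "rn" as "m"."""
    return text.replace("rn", "m")


if __name__ == "__main__":
    for host in ("grnail.com", "josephkirwin.com"):
        print(check_hostname(host, fooled_reader))
```

With the stand-in reader, grnail.com comes back as gmail.com and gets flagged, while josephkirwin.com passes. Swapping in `tesseract_ocr` gives the real thing, and whether Tesseract actually gets fooled at a given font and size is exactly the tuning question.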

I think this could be cool as a Chrome extension, or as one of the checks in an email flow on a backend.

============== Below is an example you can play with =================

This seems to crash mobile browsers (at least Firefox, Chrome and Brave on my Android phone), so I’d recommend running it on a laptop. Presumably Tesseract is memory hungry?

Try out “grnail.com” and “josephkirwin.com” as examples of it in action, and click view-source to see what it does.

Hostname:
OCR Result:
Did they match?: