Commit Graph

4 Commits

Author SHA1 Message Date
Alex Vandiver ce09c8b65f string_validation: Make `unicode_non_chars` a set, for efficiency. 2022-01-11 15:59:38 -08:00
Alex Vandiver a22a6e941f migrations: Fix inversion of character_is_printable. 2022-01-11 15:42:25 -08:00
Alex Vandiver df50280c54 string_validation: Loosen to allow some `Cn` unicode characters.
Under the unicodedata distributed with Python 3.6, some Emoji are
classified as `Cn`, and not `So`:

```
$ unicode 1f929 --long
U+1F929 GRINNING FACE WITH STAR EYES
UTF-8: f0 9f a4 a9 UTF-16BE: d83edd29 Decimal: &#129321; Octal: \0374451
🤩
Category: So (Symbol, Other); East Asian width: W (wide)
Unicode block: 1F900..1F9FF; Supplemental Symbols and Pictographs
Bidi: ON (Other Neutrals)

$ python3.6 -c 'import unicodedata; print(unicodedata.category("\U0001f929"))'
Cn

$ python3.7 -c 'import unicodedata; print(unicodedata.category("\U0001f929"))'
So
```

Drop `Cn` from the list of excluded Unicode character classes, and
replace it with an explicit list of the 66 non-characters, which are
invariant.

Co-authored-by: Shlok Patel <shlokcpatel2001@gmail.com>
2022-01-11 15:17:53 -08:00
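The approach described in df50280c54 (and the set conversion in ce09c8b65f) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `unicode_non_chars` and `character_is_printable` come from the commit subjects above, but the function body and the `EXCLUDED_CATEGORIES` constant (assumed here to be `Cc` and `Cs`) are assumptions. The 66 noncharacters themselves are fixed by the Unicode Standard: U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.

```python
import unicodedata

# The 66 Unicode noncharacters: U+FDD0..U+FDEF (32 code points) plus
# U+nFFFE and U+nFFFF for each of the 17 planes (34 code points).
# A set gives O(1) membership checks.
unicode_non_chars = {
    chr(code_point)
    for code_range in (
        range(0xFDD0, 0xFDF0),
        range(0xFFFE, 0x110000, 0x10000),
        range(0xFFFF, 0x110000, 0x10000),
    )
    for code_point in code_range
}

# Assumption: the categories that remain excluded once `Cn` is dropped.
EXCLUDED_CATEGORIES = {"Cc", "Cs"}


def character_is_printable(character: str) -> bool:
    """Return True unless the character is in an excluded category or is a noncharacter."""
    return (
        unicodedata.category(character) not in EXCLUDED_CATEGORIES
        and character not in unicode_non_chars
    )
```

Because `Cn` is no longer consulted, U+1F929 passes this check whether the interpreter's `unicodedata` reports it as `Cn` or `So`, while a noncharacter such as U+FDD0 is still rejected.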
Mateusz Mandera 93e18fe289 migrations: Remove disallowed characters from topics.
Following b3c58f454f, we want to clean up
old topics that may contain the disallowed characters. The Message table
is large, so we go in batches, making sure we limit topic fetches and
UPDATE query to no more than BATCH_SIZE Message rows per query.
2021-12-09 09:51:06 -08:00
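The batching idea in 93e18fe289 can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite table rather than the project's real migration framework, and the table name `message`, the column name `subject`, the helper names, and the tiny `BATCH_SIZE` are all hypothetical choices made for the demo. The point it shows is that each SELECT and UPDATE is restricted to an ID window of at most `BATCH_SIZE` rows.

```python
import sqlite3
import unicodedata

BATCH_SIZE = 2  # deliberately tiny for the demo; a real migration would use a far larger value

# The 66 Unicode noncharacters (see the Unicode Standard).
NON_CHARS = {chr(c) for c in range(0xFDD0, 0xFDF0)} | {
    chr(plane * 0x10000 + low) for plane in range(17) for low in (0xFFFE, 0xFFFF)
}


def is_allowed(ch):
    # Assumption: control (Cc) and surrogate (Cs) characters are disallowed.
    return unicodedata.category(ch) not in ("Cc", "Cs") and ch not in NON_CHARS


def clean_topic(topic):
    """Drop every disallowed character from a topic string."""
    return "".join(ch for ch in topic if is_allowed(ch))


def clean_topics_in_batches(conn, batch_size=BATCH_SIZE):
    """Walk the table in ID windows of batch_size rows; return the number of rows updated."""
    min_id, max_id = conn.execute("SELECT MIN(id), MAX(id) FROM message").fetchone()
    if min_id is None:  # empty table
        return 0
    updated = 0
    lower = min_id
    while lower <= max_id:
        upper = lower + batch_size - 1
        # Fetch only the distinct topics that occur inside this ID window.
        rows = conn.execute(
            "SELECT DISTINCT subject FROM message WHERE id BETWEEN ? AND ?",
            (lower, upper),
        ).fetchall()
        for (topic,) in rows:
            fixed = clean_topic(topic)
            if fixed != topic:
                # The UPDATE is limited to the same ID window.
                cur = conn.execute(
                    "UPDATE message SET subject = ? "
                    "WHERE subject = ? AND id BETWEEN ? AND ?",
                    (fixed, topic, lower, upper),
                )
                updated += cur.rowcount
        conn.commit()
        lower = upper + 1
    return updated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, subject TEXT)")
conn.executemany(
    "INSERT INTO message (subject) VALUES (?)",
    [("lunch",), ("bad\x07topic",), ("emoji \U0001f929",), ("non\ufddechar",), ("bad\x07topic",)],
)
updated = clean_topics_in_batches(conn)
topics = [row[0] for row in conn.execute("SELECT subject FROM message ORDER BY id")]
```

Windowing by primary key keeps each query's cost bounded regardless of table size, at the price of issuing many small queries instead of one large one.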