Now we only tokenize the file once, and we pass
**validated** tokens to the pretty printer.
There are a few reasons for this:
* It obviously saves a lot of extra computation
just in terms of tokenization.
* It allows our validator to add fields
to the Token objects that help the pretty
printer.
I also removed/tweaked a lot of legacy tests for
pretty_print.py that were exercising bizarrely
formatted HTML that we now simply ban during the
validation phase.
This accomplishes a few things:
* lighten the load for the main validation loop
* defer indentation checks until we are sure the author
even knows how to match up tags
* add some info to the Token objects that we may soon
consume in our pretty-printer
We now complain about programmers who don't use
4-space indents in template files, rather than
letting the pretty printer fix them.
This is partly just to simplify the pretty printer
code (in future commits), but it also makes the
symptom more obvious to newbie developers. They
are probably just as able to react to the direct
error messages as they are able to figure out how
to read diffs from the pretty printer and grok
the --fix syntax. And once they learn the convention
and configure their editor, it should then be a
one time problem.
We now create tokens for whitespace and text, such that you
could rebuild the template file with "".join(token.s for
token in tokens).
I also fixed a few bugs related to not parsing
whitespace-control tokens.
We no longer ignore template variables, although we could do
a lot better at validating them.
The most immediate use case for the more thorough parser is
to simplify the pretty printer, but it should also make it
less likely for us to skip over new template constructs
(i.e. the tool will fail hard rather than acting strange).
Note that this speeds up the tool by almost 3x, which may be
slightly surprising considering we are building more tokens.
The reason is that we are now munching efficiently through
big chunks of whitespace and text at a time, rather than
checking each individual character to see if it starts one
of the N other token types.
The changes to the pretty_print module here are a bit ugly,
but they should mostly be made irrelevant in subsequent
commits.
I rewrote most of tools/lib/pretty-printer.py, which
was fairly easy due to being able to crib some
important details from the previous implementation.
The main motivation for the rewrite was that we weren't
handling else/elif blocks correctly, and it was difficult
to modify the previous code. The else/elif shortcomings
were somewhat historical in nature--the original parser
didn't recognize them (since they weren't in any Zulip
templates at the time), and then the pretty printer was
mostly able to hack around that due to the "nudge"
strategy. Eventually the nudge strategy became too
brittle.
The "nudge" strategy was that we would mostly trust
the existing templates, and we would just nudge over
some lines in cases of obviously faulty indentation.
Now we are bit more opinionated and rigorous, and
we basically set the indentation explicitly for any
line that is not in a code/script block. This leads
to this diff touching several templates for mostly
minor fix-ups.
We aren't completely opinionated, as we respect the
author's line wrapping decisions in many cases, and
we also allow authors not to indent blocks within
the template language's block constructs.
In cases where an opening tag is so long that we stretch
it to 2+ lines of code, we should try to use block-style
formatting in the template code.
Unfortunately, we have lots of legacy code that violates
this concept, so this is a timid fix.
There are also legit use cases like textarea where we
probably need to keep the ugly template syntax for things
to render properly.
We disallow this HTML:
junk-text-before-open-tag<p>
This is a paragraph.
</p>
We rarely see the above mistake, but we want to eliminate
the possibility to be somewhat rigorous, and so that we
can eliminate a pretty-printer mis-feature.
This reverses the policy that was set, but incompletely enforced, by
commit 951514dd7d. The self-closing tag
syntax is clearer, more consistent, simpler to parse, compatible with
XML, preferred by Prettier, and (most importantly now) required by
FormatJS.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Generated by pyupgrade --py36-plus --keep-percent-format.
Now including %d, %i, %u, and multi-line strings.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Fixes#2665.
Regenerated by tabbott with `lint --fix` after a rebase and change in
parameters.
Note from tabbott: In a few cases, this converts technical debt in the
form of unsorted imports into different technical debt in the form of
our largest files having very long, ugly import sequences at the
start. I expect this change will increase pressure for us to split
those files, which isn't a bad thing.
Signed-off-by: Anders Kaseorg <anders@zulip.com>
Automatically generated by the following script, based on the output
of lint with flake8-comma:
import re
import sys
last_filename = None
last_row = None
lines = []
for msg in sys.stdin:
m = re.match(
r"\x1b\[35mflake8 \|\x1b\[0m \x1b\[1;31m(.+):(\d+):(\d+): (\w+)", msg
)
if m:
filename, row_str, col_str, err = m.groups()
row, col = int(row_str), int(col_str)
if filename == last_filename:
assert last_row != row
else:
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
with open(filename) as f:
lines = f.readlines()
last_filename = filename
last_row = row
line = lines[row - 1]
if err in ["C812", "C815"]:
lines[row - 1] = line[: col - 1] + "," + line[col - 1 :]
elif err in ["C819"]:
assert line[col - 2] == ","
lines[row - 1] = line[: col - 2] + line[col - 1 :].lstrip(" ")
if last_filename is not None:
with open(last_filename, "w") as f:
f.writelines(lines)
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
We now prevent these variations:
* <hr/>
* <hr />
* <br/>
* <br />
We could enforce similar consistency for other void
tags, if we wished, but these two are particularly
prevalent.
We now forbid tags of the form `<foo ... />` in most
places, and we also forbid it even for several void
tags.
We make exceptions for tags that are already formatted
in two different ways in our codebase. This is mostly
svg tags, plus these common cases:
- br
- hr
- img
- input
It would be nice to lock down a convention for these,
even though the HTML specification is unopinionated
on these. We'll probably want to stay flexible for
svg tags, since they are sometimes copy/pasted from
other sources (although it's probably rare enough for
them that we can tolerate just doing minor edits as
needed).
If folks put something like '<br/>' in the HTML,
we would think the tag's name was "br/" instead
of "br". I think we were assuming most folks
would write either "<br>" or <br />".
ASIDE:
We should probably have a consistent
preference among these styles:
* <br>
* <br/>
* <br />
I prefer the first.
Generated by `pyupgrade --py3-plus --keep-percent-format` on all our
Python code except `zthumbor` and `zulip-ec2-configure-interfaces`,
followed by manual indentation fixes.
Signed-off-by: Anders Kaseorg <anders@zulipchat.com>
In this commit, we basically match any kinda of jinja2 start tag,
no matter its special kind (eg. jinja2_whitespace_stripped_start)
to any kinda jinja2 end tag (eg. jinja2_whitespace_stripped_end)
Idea is special operators like `-` do not change the meaning of
inline tag and thus matching shouldn't depend upon this.