Compilers: Difference between revisions
Brodriguez (talk | contribs) (Create page) |
Brodriguez (talk | contribs) (Add todo note) |
||
(One intermediate revision by the same user not shown) | |||
Line 1: | Line 1: | ||
Compilers are essentially programs that read in one set of code, potentially | {{ todo | Rework this page/section. Not very useful as-is. }} | ||
Compilers are essentially programs that read in one set of code, potentially optimize it, and then output a different set of code. | |||
== Aspects of a Compiler == | == Aspects of a Compiler == | ||
Line 10: | Line 13: | ||
== Front End / Input == | === Front End / Input === | ||
The "Front End" or "Input" effectively reads code in from a file. The compiler needs to be programmed to be able to understand the structure of each language it's expected to use as input. The structure of a given language (whether that be English, Python, or even Math in general) is often called a Grammar ({{ todo | Link to grammars }}. | The "Front End" or "Input" effectively reads code in from a file. The compiler needs to be programmed to be able to understand the structure of each language it's expected to use as input. The structure of a given language (whether that be English, Python, or even Math in general) is often called a Grammar ({{ todo | Link to grammars }}. | ||
Line 18: | Line 21: | ||
Ideally at some point, the Front End will also check that the syntax of the original input file is correct for the corresponding language. After all, it doesn't make sense to do work on something that's initially broken. | Ideally at some point, the Front End will also check that the syntax of the original input file is correct for the corresponding language. After all, it doesn't make sense to do work on something that's initially broken. | ||
== Middle End == | === Middle End === | ||
The "Middle End" or "Optimization" effectively checks for various optimizations that can be run on the code. | The "Middle End" or "Optimization" effectively checks for various optimizations that can be run on the code. | ||
Line 39: | Line 40: | ||
== Back End == | === Back End === | ||
The "Back End' or "Output' effectively takes the "intermediate representation" of the code and outputs it to a file in a given language. It's basically the reverse of what the Front End of the compiler does. | The "Back End' or "Output' effectively takes the "intermediate representation" of the code and outputs it to a file in a given language. It's basically the reverse of what the Front End of the compiler does. | ||
By keeping it separate from the Front End, the compiler can basically read in one language for the input file, then output an entirely different language for the output file. For example, it might take in C code and output an equivalent, optimized Assembly file. {{ todo | link to languages}}. | By keeping it separate from the Front End, the compiler can basically read in one language for the input file, then output an entirely different language for the output file. For example, it might take in C code and output an equivalent, optimized Assembly file. {{ todo | link to languages}}. | ||
== Parse Trees == | |||
Parse trees (and related tree types) are a common form of intermediate representation. In general, a given grammar will be broken down into tokens, the same way you would break down, say, the English language. These tokens are then organized onto the tree such that when the leaf nodes are read in, they will allow direct reconstruction of the original file. The tree is often organized for Depth-First, In-Order traversal {{ todo | links }}. | |||
* Parse Trees, aka Concrete Syntax Trees (CSTs) - These are a direct representation of the code, symbol for symbol, where reading in the tree leaf nodes will allow direct recreation of the original input file. | |||
** See https://medium.com/basecs/grammatically-rooting-oneself-with-parse-trees-ec9daeda7dad for more info. | |||
* Syntax Trees, aka Abstract Syntax Trees (ASTs) - These are very similar to CSTs, but some of the original syntax is abstracted away. Thus, only the "core, most important" aspects remain to be represented in the tree. | |||
** See https://medium.com/basecs/leveling-up-ones-parsing-game-with-asts-d7a6fc2400ff for more info. | |||
Whenever parse trees are involved, the end result for the compiler is almost always an AST. |
Latest revision as of 07:08, 6 March 2023
Compilers are essentially programs that read in one set of code, potentially optimize it, and then output a different set of code.
Aspects of a Compiler
There are three main parts to a compiler:
- Front End/Input
- Middle End/Optimization
- Back End/Output
Each of these are explained in further detail in the following sections.
Front End / Input
The "Front End" or "Input" effectively reads code in from a file. The compiler needs to be programmed to be able to understand the structure of each language it's expected to use as input. The structure of a given language (whether that be English, Python, or even Math in general) is often called a Grammar (
.
Once the input file is read in, the "front end" will often convert this to some internally recognized "intermediate language" or "intermediate representation". This representation can be in literally any format, including ones that are only internally recognized by the compiler itself.
The point of an "Intermediate representation" language is so that the compiler can read in any language (say, either C or Fortran
) as input, and then convert them to a single format for the "Middle End" part of the compiler to handle. That way, the compiler can use one set of optimizations, regardless of what the original input file was.
Ideally at some point, the Front End will also check that the syntax of the original input file is correct for the corresponding language. After all, it doesn't make sense to do work on something that's initially broken.
Middle End
The "Middle End" or "Optimization" effectively checks for various optimizations that can be run on the code.
Depending on implementation of the compiler, one of two formats can happen:
- The user can specify exactly which optimizations occur from a list of the compiler's known optimizations.
- The compiler attempts to execute all known optimizations.
Regardless of which format the compiler takes, it will run through the code and attempt to run said optimizations.
Note that the compiler has to be programmed with each individual optimization first. It doesn't just arbitrarily know them ahead of time. Furthermore, each optimization should ideally be programmed such that it will never break the code. Generally speaking, it's better that an optimization "doesn't do enough but never breaks any input files" rather than "changes as much as possible, but breaks for some specific input files".
This can actually be incredibly difficult to program, as a given optimization may work with many input files, but break with one single specific file. Or alternatively, an optimization may seem to work with all input files when run alone. But then when the optimization is ran alongside others, it ends up breaking the input files.
Regardless, assuming the optimizations found anything to improve (which is often), then the end result should be an improved form of the "intermediate representation" code that effectively accomplishes the same logic as the original code.
Back End
The "Back End' or "Output' effectively takes the "intermediate representation" of the code and outputs it to a file in a given language. It's basically the reverse of what the Front End of the compiler does.
By keeping it separate from the Front End, the compiler can basically read in one language for the input file, then output an entirely different language for the output file. For example, it might take in C code and output an equivalent, optimized Assembly file.
.
Parse Trees
Parse trees (and related tree types) are a common form of intermediate representation. In general, a given grammar will be broken down into tokens, the same way you would break down, say, the English language. These tokens are then organized onto the tree such that when the leaf nodes are read in, they will allow direct reconstruction of the original file. The tree is often organized for Depth-First, In-Order traversal
.
- Parse Trees, aka Concrete Syntax Trees (CSTs) - These are a direct representation of the code, symbol for symbol, where reading in the tree leaf nodes will allow direct recreation of the original input file.
- Syntax Trees, aka Abstract Syntax Trees (ASTs) - These are very similar to CSTs, but some of the original syntax is abstracted away. Thus, only the "core, most important" aspects remain to be represented in the tree.
Whenever parse trees are involved, the end result for the compiler is almost always an AST.