> Here is the compiler from dsl to python code :
> Give me the full python code with all methods, variables, ... To create the program from this file : $file, be more generic csv do not have column name. Name of dataset is 'test_binary.csv', if you need to load a pretrained model use 'model.pth' Here is the first line of my csv : 5.1,3.5,1.4,0.2,Iris-setosa
> Give only the code without any comments or annotation. Give the code between \`
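As an illustration only, the template above could be instantiated per example roughly as follows; the `build_prompt` helper, the `string.Template` approach, and the abbreviated template text are assumptions for the sketch, not the exact script we used.

```python
# Hypothetical helper: fill the "$file" placeholder with one example's DSL source.
from pathlib import Path
from string import Template

PROMPT_TEMPLATE = Template(
    "Here is the compiler from dsl to python code :\n"
    "Give me the full python code with all methods, variables, ... "
    "To create the program from this file : $file, ...\n"
    "Give only the code without any comments or annotation. Give the code between `"
)

def build_prompt(example_dir: str) -> str:
    dsl_source = Path(example_dir, "programme.maj").read_text()
    return PROMPT_TEMPLATE.substitute(file=dsl_source)
```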
---
## Automation
This section describes how we iterate over the example folders, run both compilers on each `programme.maj`, and compare the resulting outputs. Verification proceeds in three steps: execution validation, functional equivalence, and textual similarity.
### 1. Compiling with Our DSL Compiler
To compile all `programme.maj` files in each `example_i` folder using our DSL compiler, run the following command:
```bash
cd maj && node bin/cli.js compile-all
```
### 2. Compiling with LLM-Compiler
To compile all `programme.maj` files in each `example_i` folder using the LLM-compiler, run:
```bash
python LLM-compiler.py compile-all
```
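For illustration, the kind of `compile-all` driver this script could implement looks roughly like the sketch below; the `query_llm` stub, the `example_*` glob, the abbreviated `PROMPT` constant, and the `programme_llm.py` output name are assumptions, not the actual implementation.

```python
# Rough sketch of a compile-all driver: iterate over the example folders, send the
# prompt to an LLM, and keep only the code returned between backticks.
import re
from pathlib import Path

PROMPT = "Here is the compiler from dsl to python code : ... $file ..."  # full text in the prompt section

def query_llm(prompt: str) -> str:
    # Placeholder: call whichever LLM provider/client is actually used.
    raise NotImplementedError

def compile_all(root: str = ".") -> None:
    for example_dir in sorted(Path(root).glob("example_*")):
        dsl_source = (example_dir / "programme.maj").read_text()
        prompt = PROMPT.replace("$file", dsl_source)
        answer = query_llm(prompt)
        # The prompt asks for the code between backticks, so extract the fenced block.
        match = re.search(r"```(?:python)?\s*(.*?)```", answer, re.DOTALL)
        code = match.group(1) if match else answer
        (example_dir / "programme_llm.py").write_text(code)

if __name__ == "__main__":
    compile_all()
```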
### 3. Testing Compilation and Execution
To verify that the compiled files can execute correctly, run the unit tests with:
```bash
python test_LLM.py
```
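As a sketch of what such an execution check can look like (the `programme_llm.py` file name and the folder layout are assumptions here; `test_LLM.py` itself may be structured differently):

```python
# Run every LLM-generated program as a subprocess and require a clean exit.
import subprocess
import unittest
from pathlib import Path

class TestLLMOutputsExecute(unittest.TestCase):
    def test_generated_programs_run(self):
        for script in sorted(Path(".").glob("example_*/programme_llm.py")):
            with self.subTest(script=str(script)):
                result = subprocess.run(
                    ["python", script.name], cwd=script.parent,
                    capture_output=True, text=True, timeout=300,
                )
                self.assertEqual(result.returncode, 0, msg=result.stderr)

if __name__ == "__main__":
    unittest.main()
```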
### 4. Comparing Outputs for Similarity
To compare the outputs generated by the LLM-compiler and our DSL compiler, we rely on three checks (a minimal similarity sketch follows this list):
- **Execution Validation**: The unit tests in `test_LLM.py` check that each generated program runs. Unfortunately, all tests failed: the LLM-compiler's output still contains Python errors.
- **Functional Equivalence**: Validating that the two outputs behave identically is impossible at this stage, since none of the LLM-generated programs compile.
- **Textual Similarity**: We measure textual similarity with the Levenshtein distance, as shown in the sketch below.
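A minimal, self-contained sketch of the similarity metric; the output file names in the last lines are hypothetical, shown only to illustrate usage.

```python
# Levenshtein distance ("number of differing characters") and a normalized
# similarity score in [0, 1] between two generated programs.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

# Hypothetical file names, for illustration only.
dsl_output = open("example_1/programme.py").read()
llm_output = open("example_1/programme_llm.py").read()
print(levenshtein(dsl_output, llm_output), round(similarity(dsl_output, llm_output), 2))
```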
---
## Results
In our testing, we ran each of the ten example programs through both the traditional compiler and the LLM-based compiler. While the traditional compiler generated valid, runnable solutions in all ten cases, the LLM-based compiler failed to produce compilable code for every example. In some instances, the LLM-based approach introduced non-existent libraries or incorrect syntax, while in others it deviated significantly from the DSL specification. Consequently, none of the LLM-generated outputs could be executed as intended.
By contrast, the traditional compiler demonstrated consistency and reliability. Its strictly rule-based translation of the DSL into executable code adhered closely to the grammar, ensuring correctness for each example. These results underscore the limitations of a purely LLM-driven approach at this stage and highlight the value of a traditional, formal method for DSL processing. Despite the creative potential of large language models, achieving consistency and accuracy in code generation remains challenging without additional guardrails or iterative feedback loops.
The Textual Similarity results for each example show varying similarity scores between the outputs of our DSL compiler and the LLM-compiler. These scores are based on the number of differing characters between the two outputs, as measured by the Levenshtein distance.
Note: we removed all comments from the generated code beforehand, to strip as much irrelevant text as possible.
|Example|Number of differing characters|Similarity score (max = 1)|
|-----------|-----------|----------|
|1|2829|0.42|
|2|1688|0.44|
|3|2333|0.43|
|4|1931|0.48|
|5|2194|0.47|
|6|1760|0.43|
|7|1427|0.54|
|8|3077|0.43|
|9|2598|0.48|
|10|1865|0.48|
The variation in similarity scores can largely be explained by the different variable names chosen by the LLM-compiler compared to the DSL compiler. While the overall structure and logic of the programs may be very similar, differences in naming conventions introduce character-level discrepancies that lower the similarity score. This is consistent with all scores falling in the narrow range [0.42, 0.54]. Another factor is that our compiler prints a lot of information; even after removing every line containing a print statement, the scores remain between [0.44, 0.54]. A sketch of this normalization is shown below.
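The normalization mentioned above can be approximated as follows; this is a rough sketch rather than the exact script we used, and the comment stripping is naive (it ignores `#` characters inside string literals).

```python
# Strip comments, and optionally every line containing a print call, before
# computing the Levenshtein-based similarity.
def strip_comments(code: str) -> str:
    kept = []
    for line in code.splitlines():
        line = line.split("#", 1)[0].rstrip()  # naive: ignores '#' inside strings
        if line:
            kept.append(line)
    return "\n".join(kept)

def strip_prints(code: str) -> str:
    return "\n".join(l for l in code.splitlines() if "print(" not in l)
```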
Regarding structural similarities, both implementations define the network using a class and separate functions for training and model evaluation. A key difference, however, lies in the data loading process. The LLM-compiler creates a custom class for data handling, while our compiler takes a more streamlined approach, using only a few lines of Python code. This is because the CSV structure that needs to be processed is already consistent, allowing for a simpler, more direct solution in our compiler.
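For concreteness, the streamlined loading our compiler relies on looks roughly like this; it is a sketch assuming the headerless CSV from the prompt, where the last column is the class label, and the exact generated code may differ.

```python
# Load a headerless CSV such as test_binary.csv: every column but the last is a
# feature, the last column (e.g. "Iris-setosa") is the class label.
import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("test_binary.csv", header=None)
features = torch.tensor(df.iloc[:, :-1].values, dtype=torch.float32)
labels = torch.tensor(LabelEncoder().fit_transform(df.iloc[:, -1]), dtype=torch.long)
```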
---
## Improvements
During our experimentation with the LLM-based compiler, we observed that certain issues stemmed from how the model interpreted and expanded upon the DSL constructs. To improve the results, refining the prompt or switching to an alternative model is an important mitigation: by crafting more explicit instructions and constraints, the LLM can be guided toward generating code that is more reliable and faithful to the grammar's requirements.
In parallel, we introduced minor refinements to the traditional compiler to ensure that edge cases were handled more gracefully. These refinements included stricter validation checks, clearer error reporting, and a modular redesign that makes the addition of new features more straightforward. Consequently, while the LLM-based approach still struggles with certain aspects of code generation, the traditional compiler has become more robust and adaptable, demonstrating that lessons learned in one approach can inform improvements in the other.