Recommended Posts

  On 12/26/2020 at 6:14 AM, DarthParametric said:

That seems fine to me. I was more concerned that certain files might not able to be decompiled at all. This is the case for DeNCS as well, based on my own experiments.

My concern with nwnnsscomp is that we now know its output differs from that of the original compiler. That leads to three possibilities:

  1. The output files are fully reversible, even if they differ slightly from the source. This is acceptable.
  2. The output files are fully reversible, but require additional code to detect and handle the differences from the original compiler. This will be very annoying.
  3. The output files are not reversible. This is unacceptable.

At the moment, I have no idea which of these is the case. With over 2,500 files, there is no reasonable way I can develop a reverse compiler while inspecting each script in a timely fashion. I feel it's something to take note of, and look into once the reverse compiler is complete.

  On 12/26/2020 at 12:25 PM, JCarter426 said:

This is still common practice, at least as far as I've been taught. Yeah, there are security issues if you then try to read one, but modern compilers either won't let you do that or will initialize it with a default value, if the language itself doesn't. With that in mind, I don't see the issue.

I don't know who taught you this, but it's a practice that's frowned upon even with modern compilers. We try to learn good habits rather than relying on the compiler to handle things we could easily do ourselves. It's not uncommon for bugs to creep into compilers, and you can track these down more readily with proper code. This advice comes from the best programmers at places like Apple, Google, and Microsoft, and even the ISO C and C++ committees.

  On 12/26/2020 at 12:25 PM, JCarter426 said:

The issues you've mentioned before don't seem like they should cause problems to me. Like with the nested else/if vs a simple else if, that'll only be functionally different if you declare something local to the else branch's scope. And, obviously, don't do that. And I doubt the compiler would've optimized for it that way if it would produce functionally different code. It usually knows better than we do.

Why not create variables local to the else branch's scope? That's the whole point of having scopes, to create variables only where they're needed. Variables placed in too high a scope, e.g. the global scope, are at risk of being overwritten erroneously.

As for the compiler knowing better than we do, that's true of professionally written compilers. Which nwnnsscomp is not. I've already listed some of the basic things it fails to optimize, e.g. failing to remove empty functions. To be fair, the original compiler from BioWare has an even bigger problem in that logical AND and logical OR are incorrect. Which nwnnsscomp doesn't fix...

Share this post


Link to post
Share on other sites
  On 12/26/2020 at 2:36 PM, AmanoJyaku said:

I don't know who taught you this, but it's a practice that's frowned upon even with modern compilers. We try to learn good habits rather than relying on the compiler to handle things we could easily do ourselves. It's not uncommon for bugs to creep into compilers, and you can track these down more readily with proper code. This advice comes from the best programmers at places like Apple, Google, and Microsoft, and even the ISO C and C++ committees.

I've seen numerous examples in the documentation for C# and Java, including the very page on variable initialization in C#. The C++ textbooks I've read (Starting Out with C++, Professional C++, C++ Notes for Professionals) obviously warned about the dangers of using uninitialized variables, but still used the extensively when it was safe.

I was taught that a) it is theoretically more efficient in languages that don't require variables to be initialized (e.g. C++ and C#) as minimal as that efficiency gain may be, and b) it's more readable—if you don't know what the value will be, don't say that's the value, as that's potentially misleading.

  On 12/26/2020 at 2:36 PM, AmanoJyaku said:

Why not create variables local to the else branch's scope? That's the whole point of having scopes, to create variables only where they're needed. Variables placed in too high a scope, e.g. the global scope, are at risk of being overwritten erroneously.

This, on the other hand, I was taught was bad practice because it increases the chance of hiding variables by accident and creating conflicts. So I was told that it's better to avoid this except for temp variables in a for loop or for a swap. This was just within a function, though—obviously global variables have their own problems.

I was also told that, in terms of memory, it's better to allocate it at the start of a function and have it crash there if it's going to crash rather than in the middle of execution. But I question whether that's a realistic problem anymore.

  On 12/26/2020 at 2:36 PM, AmanoJyaku said:

As for the compiler knowing better than we do, that's true of professionally written compilers. Which nwnnsscomp is not. I've already listed some of the basic things it fails to optimize, e.g. failing to remove empty functions. To be fair, the original compiler from BioWare has an even bigger problem in that logical AND and logical OR are incorrect. Which nwnnsscomp doesn't fix...

Well... yes. I believe you. 😧

Share this post


Link to post
Share on other sites
  On 12/26/2020 at 2:36 PM, AmanoJyaku said:

At the moment, I have no idea which of these is the case. With over 2,500 files, there is no reasonable way I can develop a reverse compiler while inspecting each script in a timely fashion. I feel it's something to take note of, and look into once the reverse compiler is complete.

You might want to see if the round-trip holds, at least using nwnnsscomp and existing source files.

Like, source file -> nwnnsscomp (ncs #1) -> decompiler -> nwnnsscomp (ncs #2), then compare whether ncs #1 and ncs #2 identical. There shouldn't be any differences in the bytecode, probably. Possibly also run it through the decompiler again to check that the output is stable.

You can automate that steps and just focus on the examples where it fails. That's possible also good as a sort of test suite between big changes or releases.

There never was a public release of a KotOR-era BioWare compiler, though, right? Has anybody ever tried rigging the compiler in the Neverwinter Nights toolset (does it read the nwscript.nss or is it more hardcoded)? If it can be rigged, that would give you another angle to test on, though it's not necessarily guaranteed any of the NWN releases matches the version (versions?) that was used during KotOR development.

Share this post


Link to post
Share on other sites
  On 12/26/2020 at 5:55 PM, JCarter426 said:

I've seen numerous examples in the documentation for C# and Java, including the very page on variable initialization in C#. The C++ textbooks I've read (Starting Out with C++, Professional C++, C++ Notes for Professionals) obviously warned about the dangers of using uninitialized variables, but still used the extensively when it was safe.

Often, documentation is written with the assumption that the reader has some familiarity with the topic and is simply looking for a reference as a refresher on minutae. For example, the C# link shows a class definition A with an unassigned static scalar variable x, and unassigned instance scalar variable y. This is valid because it states:

  Quote

Static variables

The initial value of a static variable is the default value (Default values) of the variable's type.

Instance variables

The initial value of an instance variable of a class is the default value (Default values) of the variable's type.

The default value of an int is 0. This is achieved by the compiler writing the assignment code for you, so the initialization occurs. In contrast:

  Quote

Local variables

A local variable introduced by a local_variable_declaration is not automatically initialized and thus has no default value.

With the exception of the section "Try-catch-finally statements", the remainder of the document specifically uses uninitialized variables to point out their dangers. The "Try-catch-finally statements" section demonstrates a contrived (meaning, unrealistic) example of "safe" use of uninitialized variables. Documentation and tutorials use contrived examples to demonstrate how something works, but often caution such examples are not acceptable in production code.

I should point out Microsoft's documentation is mostly filled contrived examples. Post one of their examples on StackOverflow and prepare to be burned.

tl;dr: never assume uninitialized variables are safe.

  On 12/26/2020 at 5:55 PM, JCarter426 said:

I was taught that a) it is theoretically more efficient in languages that don't require variables to be initialized (e.g. C++ and C#) as minimal as that efficiency gain may be, and b) it's more readable—if you don't know what the value will be, don't say that's the value, as that's potentially misleading.

There's a mantra you should repeat to yourself: code for correctness, worry about performance later. Modern CPUs are fast, and compilers are good at optimizing your code for those CPUs. A CPU that operates at 3 billion cycles per second does not benefit from saving a few dozen cycles by omitting a single variable initialization. If the compiler thinks you can benefit, it will often make the change for you. For example, creating variables inside of loops is considered a bad practice because you're just going to create and destroy that variable repeatedly. Which is why compilers perform variable hoisting, meaning the variable is taken out of the loop and placed before it. So:

int Count = 10;
while (Count > 0)
{
  int i = 0;
  //Do something with i
  --Count;
}

Becomes:

int Count = 10;
int i = 0;
while (Count > 0)
{
  //Do something with i
  --Count;
}

But, you should probably just use a for loop:

for (int i = 0, Count = 10; Count > 0; --Count)
{
  //Do something with i
}

Readability is a concern, yes. But, it's more readable when you give a variable a descriptive name, initialize it with an appropriate starting value, and place it where you intend to use it. In the example, I can guess Count is the number of times to execute the loop. But, I should give i a better name. Maybe it should be Meters, or Years, or Euros? Both variables are created once, and destroyed once the loop completes. See how much we can learn from this simple code fragment?

  On 12/26/2020 at 5:55 PM, JCarter426 said:

This, on the other hand, I was taught was bad practice because it increases the chance of hiding variables by accident and creating conflicts. So I was told that it's better to avoid this except for temp variables in a for loop or for a swap. This was just within a function, though—obviously global variables have their own problems.

I was also told that, in terms of memory, it's better to allocate it at the start of a function and have it crash there if it's going to crash rather than in the middle of execution. But I question whether that's a realistic problem anymore.

I'm not sure I know what you mean by hiding variables, but if you place them where you use them then they aren't hidden. Conflicts can easily be avoided by using descriptive names. And a function should only be as large as the single task it performs.

And, please, no global variables. I don't think there is ever a reason for global variables.

As for memory, that's only a problem with embedded systems and some 32-bit platforms. No matter how little RAM a computer has it still has access to virtual memory. It won't crash, but it will be slow.

  On 12/26/2020 at 6:54 PM, DrMcCoy said:
  On 12/26/2020 at 6:54 PM, DrMcCoy said:

You might want to see if the round-trip holds, at least using nwnnsscomp and existing source files.

Like, source file -> nwnnsscomp (ncs #1) -> decompiler -> nwnnsscomp (ncs #2), then compare whether ncs #1 and ncs #2 identical. There shouldn't be any differences in the bytecode, probably. Possibly also run it through the decompiler again to check that the output is stable.

You can automate that steps and just focus on the examples where it fails. That's possible also good as a sort of test suite between big changes or releases.

That's the plan.

  On 12/26/2020 at 6:54 PM, DrMcCoy said:

There never was a public release of a KotOR-era BioWare compiler, though, right? Has anybody ever tried rigging the compiler in the Neverwinter Nights toolset (does it read the nwscript.nss or is it more hardcoded)? If it can be rigged, that would give you another angle to test on, though it's not necessarily guaranteed any of the NWN releases matches the version (versions?) that was used during KotOR development.

I have no idea, I only learned about these tools when I answered your request for help. It's embarrassing, because I have two copies of NWN, I just need to find them...

Share this post


Link to post
Share on other sites
  On 12/26/2020 at 8:06 PM, AmanoJyaku said:

I should point out Microsoft's documentation is mostly filled contrived examples. Post one of their examples on StackOverflow and prepare to be burned.

Yeah, I know. And with a managed language, they have the luxury of being more willy-nilly than if it were C++ or C. Still, I don't think they would be doing it all over the place if it were so universally unacceptable. It's safe as long as you don't read before assigning at some point, and modern IDEs and compilers go out of their way to prevent you from doing that by accident.

Again, if you have a compiler you can trust. I know the Visual Studio C++ compiler has some silly bugs, and Clang lets you do things it really shouldn't do because that code won't compile with anything else, to say nothing of NWNNSSComp.

  On 12/26/2020 at 8:06 PM, AmanoJyaku said:

There's a mantra you should repeat to yourself: code for correctness, worry about performance later. Modern CPUs are fast, and compilers are good at optimizing your code for those CPUs. A CPU that operates at 3 billion cycles per second does not benefit from saving a few dozen cycles by omitting a single variable initialization.

I know in most use cases it's minimal, but it can make a difference if you're processing huge amounts of data in a database, which C# and .NET are often used for. The reason C# allows uninitialized local variables, unlike, say, Java, is specifically for the performance gain. It mattered to somebody somewhere at some point.

  On 12/26/2020 at 8:06 PM, AmanoJyaku said:

I'm not sure I know what you mean by hiding variables, but if you place them where you use them then they aren't hidden. Conflicts can easily be avoided by using descriptive names.

I mean if you give a variable in an inner scope the same identifier as one in the outer scope. The variable in the outer scope is hidden.

Which you can also file under don't do that, but people make mistakes.

  On 12/26/2020 at 8:06 PM, AmanoJyaku said:

And a function should only be as large as the single task it performs.

Well yes, ideally you shouldn't have too much nested code in the first place. Another reason why I don't mind NWNNSSComp replacing nested else/if with else if.

Share this post


Link to post
Share on other sites

Life has been crazy, and I've seen things I never thought would happen in my country. But, coding never stops. Time for an overdue update!

An earlier update outlined the structure of a subroutine:

return_type subA(Argument_1, Argument_2...)
{
  //Part 1 - The code that does stuff
  //Part 2 - Values returned
  //Part 3 - Destruction of local variables
  //Part 4 - Destruction of arguments
}

Part 2 returns values from a subroutine using the CPDOWNSP opcode. Parts 3 and 4 destroy local variables and arguments using the MOVSP opcode. These opcodes can appear in any part of a subroutine. So, how do we know if we're returning a value, and destroying variables and/or arguments?

We analyze part 1 to determine what, if anything, is on the stack.

Computers execute operations in sequence unless directed otherwise. In high-level languages like NWScript, control statements alter the sequence. An if statement conditionally executes code. If the test condition fails, this code is jumped over. So, what does an if statement look like in the compiled NCS?

102	CPTOPSP -3 1
110	CONSTI 0
116	EQUALxx
118	JZ 170

124	CPTOPSP -1 1
132	CPTOPSP -3 1
140	CONSTO object
146	JSR 298
152	MOVSP -3
158	JMP 296
164	JMP 170

In the example, the opcode at offset 102 of the NCS file copies the value of a variable to the top of the stack. Offset 110 places the value 0 onto the stack. Offset 116 performs a test of equality of the two values, removes them from the stack, then places the result of the test onto the stack. Offset 118 performs a conditional jump based on the result, removing the result in the process. Offsets 124-164 are the body of the if statement, which consists of a call to a subroutine followed by a return statement:

    if ( nKillMode == 0 )
    {
        DamagingExplosion(OBJECT_SELF, nDelay, nDamage);
        return;
    }

So many ops, so few statements. That's why we write in high-level languages instead of assembly, because the compiler allows us to focus on what the program should do rather than creating the proper sequence of opcodes to keep the program from breaking. Notice in the body of the if statement there is a MOVSP. That is the destruction of local variables, part 3. There is no part 2, because the subroutine returns void (nothing). But, how do we know the MOVSP is part 3 and not part 4? And why are there two unconditional jumps???

The challenge of writing a reverse compiler is knowing the rules used to compile. I've spent the past six months observing, analyzing, and documenting those rules. For example, all control statements other than the switch statement are created using the jump-if-zero (JZ) opcode. (Switch statements are created using the jump-if-not-zero [JNZ] opcode.) However, the existence of a JZ opcode does not mean you've encountered a control statement! JZ is used to create the following:

  1. Logical AND
  2. Logical OR
  3. if statements
  4. else if statements
  5. while statements
  6. for statements
  7. do-while statements (I finally found one!!!)

This means we need to some way of determining what the range of addresses created by the JZ opcode is. In the example above, the range created by the if statement is [124, 170). In this range notation a bracket "[" means the offset is included in the range, and a parenthesis ")" means the offset is outside of the range. The 170 is obvious because it's specified by the JZ opcode; if nKillMode == 0 returns false we jump to 170 and resume execution from there. Where does the 124 come from? Remember what was stated above: "computers execute operations in sequence unless directed otherwise." A JZ opcode is 6 bytes in length and the JZ is at offset 118, so 118 + 6 means the next op in the sequence is at offset 124. If nKillMode == 0 returns true we enter the body of the if statement at offset 124. Offset 124 is the first address in the range, offset 170 is the first address outside of the range.

All control statements are ranges, but not all ranges are control statements. The NCS file is itself a range: [0, file size). Subroutines are ranges. However, our focus is on the ranges created by JZ. How do we determine what they are? The unconditional jump, JMP.

Notice the example ends with two JMPs. This caused me lots of stress because I didn't know why they were there. I'm now convinced this is one of the rules of compiling NCS files: control statements end with a JMP. The if statement is easily identified as such because the second JMP goes to the same offset as the JZ. The true block of an if-else statement ends in a JMP to an offset greater than that of the JZ. The if and else-if statements all end in JMPs to the same offset. Loops all end in JMPs to an offset less than the offset of the JZ.

Logical AND and Logical Or are not control statements, so they do not end in a JMP.

Our example has two JMPs. Why? Most likely the original compiler was dumb. The NWScript return statement is translated into a JMP to the end of the subroutine, offset 296. The opcode at this offset is RETN, the last opcode in a subroutine. Anytime you see a return in NWScript that isn't the last statement in the subroutine, you know it will be implemented as a JMP:

    if ( nKillMode == 0 )
    {
        return;
    }

    if ( nKillMode == 1 )
    {
        return;
    }

    if ( nKillMode == 2 )
    {
       return;
    }

is

118	JZ 170
158	JMP 296
164	JMP 170

186	JZ 230
218	JMP 296
224	JMP 230

246	JZ 290
278	JMP 296
284	JMP 290

296	RETN

The original compiler just stuck the body of the if statement in between the JZ and JMP. A modern compiler would have removed the second JMP since it's unreachable code.

OK, this is great and all. But what about the reverse compiler? Well, now that I have the rules down I'm close to the finish line. The reverse compiler accurately finds ranges and determines what types they are. The output looks like this:

This is definitely a K2 file

Subroutine 13, 21
13, 21		//Subroutine

Subroutine 21, 298
21, 298		//Subroutine
124, 170	//if
192, 230	//if
252, 290	//if

Subroutine 298, 1345
298, 1345	//Subroutine
320, 396	//if
704, 722	//logical AND or OR
728, 1216	//loop
787, 805	//logical AND or OR
811, 1135	//if
914, 970	//if-else
1017, 1043	//if-else

Subroutine 1345, 1601
1345, 1601	//Subroutine
1459, 1553	//if
1511, 1547	//if

Subroutine 1601, 1637
1601, 1637	//Subroutine

Subroutine 1637, 1813
1637, 1813	//Subroutine
1659, 1727	//if

There's more to be done, particularly for loops because they support the break and continue statements (also implemented with JMP). But, hard part seems to be behind me. Should have an update in two weeks.

  • Like 3

Share this post


Link to post
Share on other sites

I promised an update in two weeks... How about two days?

The approach of using ranges has borne fruit. Consider the following code:

int subroutine(int I, float F, vector V)
{
    int i = 1;
    while(true)
    {
      	if(i > 10)
        {
		//do stuff
        }
        else
        {
		return -1;
        }
    }
    return 0;
}

The reverse compiler needs to be able to find the return type and parameter types. To do this, it:

  1. Counts the number of local variables in Part 1
  2. Counts the number of variables returned in Part 2
  3. Destroys the local and returned variables in Part 3
  4. Verifies the stack is empty, and assumes the count being destroyed is the number of input arguments in Part 4

Using the above code as an example, the method outlined in the post from two days ago effectively ignores everything inside the while loop. So:

  1. 1 local variable is counted, created by the RSADDI opcode
  2. 1 return variable is counted, created by the CPTOPSP opcode and copied by the CPDOWNSP opcode
  3. 1 local variable and 1 returned variable are destroyed by the MOVSP opcode
  4. 5 input arguments are destroyed by the MOVSP opcode (vectors are three floats)

Here's the output of the KoTOR2 file k_contain_unlock.ncs:

This is definitely a K2 file

Subroutine 13, 21
JSR

Subroutine 21, 995
JSR

Subroutine 995, 1122
Return variables: 0     Local variables: 0      Parameters: 0

Subroutine 1122, 1766
Return variables: 0     Local variables: 0      Parameters: 3

Subroutine 1766, 2948
Return variables: 1     Local variables: 0      Parameters: 2

Subroutine 2948, 6915
JSR

Subroutine 6915, 7116
Return variables: 1     Local variables: 0      Parameters: 1

A subroutine with the status "JSR" means it depends upon another subroutine to be reverse compiled first:

  1. Sub13 calls Sub21
  2. Sub21 calls Sub995
  3. Sub2948 calls Sub6915

Here's what the other status codes mean:

  1. Return variables: 0 - returns void; 1 - returns a single value, e.g. int, float, string or object; a value of 2 or more returns a struct or vector
  2. Local variables: 0 - should always be 0, because it's calculated after the stack is cleared; non-zero value means the analysis is flawed
  3. Parameters: 0 - there are no input parameters; 1 or more means there are input parameters

So, what are the subroutines?

  1. Sub13 - void _start() - the hidden routine
  2. Sub21 - void _global() - the routine that creates global variables
  3. Sub995 - int main()
  4. Sub1122 - void PlaceTreasureDisposable(object oContainer = OBJECT_SELF, int numberOfItems = 1, int nItemType = 900)
  5. Sub1766 - string GetTreasureBundle (int nItemLevel, int nItemType = 0)
  6. Sub2948 - string GetBundlePrefix (int nItemLevel, int nItemType)
  7. Sub6915 - string SWTR_GetQuantity(int iCount)

Still targeting an update in two weeks. Or earlier. :)

  • Like 2

Share this post


Link to post
Share on other sites

The Phantom Bug

  Reveal hidden contents

Attack of the Cases

  Reveal hidden contents

Revenge of the Switch

  Reveal hidden contents

 

  • Like 2

Share this post


Link to post
Share on other sites

I fixed the switch bug and decided to try the first step of reverse compiling: finding statements. For example, the following has six statements:

//c_003handfight

int StartingConditional()
{
    string tString = GetScriptStringParameter();
    int tInt = GetScriptParameter( 1 );
    int LevelInt = GetScriptParameter( 2 );

    if ( ( GetGlobalNumber(tString) == tInt ) && ( GetGlobalNumber ("G_PC_LEVEL") < LevelInt ) )
    {
        return TRUE;
    }
    return FALSE;
}

The first three are declaration statements, they introduce entities into the StartingConditional subroutine scope. In C, variables and subroutines (functions) are entities. Entities have names and types. In an NCS file, subroutines are found by the JSR opcode and variables are created by the RSADDx opcode.

The return statement is one of the jump statements, which includes the break and continue statements. All three are implemented by the JMP opcode.

The if statement is one of the selection statements, which includes the switch statement. The if statement is also a compound statement or block, meaning it can contain multiple statements.

The previous updates detailed my efforts to correctly identify the various types of statements. Here's what the reverse compiler produces:

This is definitely a K2 file

Sub13, 23
13 adds 1
15 Sub23 removes 0 returns 1
21


Sub23, 250
23 adds 1
25 GetScriptStringParameter adds 1
30
38 removes -1

44 adds 1
46 adds 1
52 GetScriptParameter removes -1 adds 1
57
65 removes -1

71 adds 1
73 adds 1
79 GetScriptParameter removes -1 adds 1
84
92 removes -1

98 adds 1
106 GetGlobalNumber removes -1 adds 1
111 adds 1
119 removes -1
121 adds 1
129 removes -1
135 adds 1
149 GetGlobalNumber removes -1 adds 1
154 adds 1
162 removes -1
164 removes -1
166 if()  removes -1

210 adds 1
216
224 removes -4

230
248

Sub13 is the hidden _start() function. This file is a conditional script used in dialogs, so it returns a value. This is the file you use to practice fighting with Handmaiden, and she won't fight you if you aren't good enough. The value returned is obtained from Sub23 and stored in the space reserved by the RSADDx opcode at offset 13 (written above as "add 1").

Sub23 is int StartingConditional(), which is what is required for conditional scripts. The alternative is void main() for non-dialog scripts. If this was a non-dialog script offset 13 would not be a reservation. It would be Sub21, because the JSR opcode is six bytes and the next opcode is the two-byte RETN (13 + 6 + 2 = 21).

Offsets 23-38 are the a declaration statement, string tString = GetScriptStringParameter(). You see three elements added to the stack, and two removed. Offset 30 is the CPDOWNSP opcode. Blank lines mean elements have not been added to or removed from the stack. Instead, CPDOWNSP copies the value returned by GetScriptStringParameter() to the element reserved by offset 23, string tString. CPDOWNSP is the assignment operator, "=". Offset 38 then destroys the value returned by GetScriptStringParameter() since it is no longer needed. In C this is known as discarding.

The key to writing a reverse compiler is finding these discarded values. The MOVSP opcode discards values, so it's a safe bet that a MOVSP opcode is the end of a statement. In the above example, offsets 38, 65, and 92 are discarding values. You can think of MOVSP as the semicolon at the end of the statement. Offset 164 is the logical AND, 98-162 are the operands. Since we have already found the ranges created by JZ and JNZ, we already know the boundaries of those block statements. Offset 166 is the if () statement, which ends at offset 210. And blocks are surrounded by curly braces.

Offset 210 places the return value onto the stack. This is after the if () statement, so it's value is FALSE. It's then returned at offset 216 using CPDOWNSP to the reservation created at offset 13. Offset 224 then destroys the three local variables and the return value using MOVSP. Finally, the RETN opcode should be at offset 230, but for some reason it's a JMP to 248. (I swear, I hate the NCS compilers.) Offset 248 has the RETN. Neither the JMP or RETN would be printed.

TL;DR

I just have to remove diagnostic info, group expressions into statements, and ensure the branching statements and nested statements are handled correctly. Look forward to another update soon.

  • Like 1

Share this post


Link to post
Share on other sites

Stripped out some of the diagnostic output and added some source code output. Here's the original Handmaiden conditional script again:

//c_003handfight

int StartingConditional()
{
    string tString = GetScriptStringParameter();
    int tInt = GetScriptParameter( 1 );
    int LevelInt = GetScriptParameter( 2 );

    if ( ( GetGlobalNumber(tString) == tInt ) && ( GetGlobalNumber ("G_PC_LEVEL") < LevelInt ) )
    {
        return TRUE;
    }
    return FALSE;
}

Here's the output from the reverse compiler:

//Kotor2\c_003handfight.ncs
This is definitely a K2 file

Sub13, 23
}

Sub23, 250
38 string = GetScriptStringParameter();
65 int = GetScriptParameter(1, );
92 int = GetScriptParameter(2, );
166 if (GetGlobalNumber(CPTOPSP, ) == CPTOPSP && GetGlobalNumber("G_PC_LEVEL", ) < CPTOPSP)
{
}
224 ;
}

Recognizable!

Here's the output of k_inc_npckill.ncs:

  Reveal hidden contents

And k_contain_unlock:

  Reveal hidden contents

Rough, but getting there!

  • Like 1

Share this post


Link to post
Share on other sites

I wanted to give an update a few weeks ago, but I had to tackle a problem I ignored for months. The reverse compiler is now more than 5,000 lines in length, and that makes it difficult to maintain. So, almost 4,000 lines of source code have been moved into a library. Additionally, many functions did too much and have since been split into smaller functions or replaced entirely. It's not sexy, but coding never is.

Or is it?

Cue: Salt-N-Peppa's "Let's talk about sex"

Let's talk about Hex, baby!
Let's talk about Binary!
Let's talk about Octal numbers,
Signed and unsigned, number theory!

OK, OK, I'm back to "normal". Anyway, after three weeks of maintenance I got back to the real work and decided to rewrite the remaining 1,000+ lines entirely. Everything from reading an NCS file from disk to producing NWScript is new. And so much better than before, since I eliminated some bugs in the code. I'll provide a more detailed post later, but here's some sample output from TSL's k_inc_npckill.ncs:

  Reveal hidden contents

I purposely left out nested scopes because I wanted to focus local variables. This script is particularly interesting because it contains a vector. See the really long line that starts with float = GetPositionFromLocation? That's one of the vector's members. The source code looks like this:

  Reveal hidden contents

The really long line in the sample output essentially merges the two lines above, but if you look closely you'll see it's the combination of creating vPos.z, adding 1.0f to vPos.z, then storing the result back to vPos.z. Here's a cut-and-paste from the NSS file to compare to the sample output:

  Reveal hidden contents

Now that I've confirmed the stack is being created correctly, the next tasks are to:

  1. Add variable declarations, e.g. int nParam, object oParam
  2. Properly display modification of variable values, e.g. vPos.z = vPoz.z + 1.0f
  3. Insert nested scopes (if-else, while, etc...)
  4. Probably other stuff I can't think of right now
  • Like 1

Share this post


Link to post
Share on other sites

Just a quick update. Here's output from TSL's k_contain_unlock.ncs:

  Reveal hidden contents

Overall, the results look good! Though, there are things that need to be dealt with:

  1. Subroutines only list the number of return values and arguments; type information needs to be substituted, e.g. void Sub1122(object, int, int)
  2. Arguments in expressions are listed as CPTOPSP; this needs to be replaced with a variable name and argument position, e.g. if (!GetLocalBoolean(Arg0, 57))
  3. Variable declarators with initializers are written as two lines; while correct, these should be merged, e.g. object Object0 = 0;
  4. Arguments that are assigned to need to be printed, e.g. if (Arg2 < 900) Arg2 = 0;
  5. For loops are presented as while loops; while correct, for (Int0 = 1 ; Int0 <= Arg1; ++Int0) is more compact than
    1. Int0 = 0;
    2. while (Int0 <= Arg1)
    3. ++Int0
  6. Increment and decrement are not yet printed (this is trivial, I just didn't do it)
  7. Printing expressions needs to take place at a different point in the process; the current method prints some expressions twice, and omits others
  8. Switch statements need to be printed
  9. Vectors and structs need to be detected, which will require multiple methods
  10. Return statements need to be printed
  11. Break and continue statements in loops need to be detected
  12. Arguments of type "action" (AssignCommand, DelayCommand, ActionDoCommand) need to be printed

These are largely minor issues, but they're taking a backseat to a bug I introduced along the way:

  Reveal hidden contents

The else-if statements have the wrong test condition. They should all be pointing at the same variable as the preceding if statement, Arg1. This error is propagated further as reflected in the last line. Removing 5 arguments? There are only 2!!! Printing code was added yesterday, so I expect this to be a quick fix.

Share this post


Link to post
Share on other sites

Sorry for the delay. I wanted to put out an update, but I've been addressing bugs. More important, I'm dealing with real life and needed to take some time for myself. This project has consumed most of my free time, I haven't played a game since February of last year. I'll pick this up next week, but for now I'm playing some games!

  • Like 2

Share this post


Link to post
Share on other sites

Sometimes, I learn more from mistakes than I do from getting it right the first time...

My last post mentioned bug fixes. In an earlier post I described the structure of an NCS subroutine:

  1. Step 1 - The code that does the stuff you want
  2. Step 2 - Placing a return value on the stack
  3. Step 3 - Destroying local variables and temporaries on the stack
  4. Step 4 - Destroying arguments on the stack

None of these steps are required, I'm almost certain I've seen an empty subroutine that simply returned. That means a missing step cannot impede the analysis of successive steps. So, how do we analyze the following?

Subroutine 1601
1601	CONSTI 0
1607	CPTOPSP -3 1
1615	CPTOPSP -3 1
1623	JSR 298
1629	MOVSP -2
1635	RETN

There is a single path through the code, that is there are no jumps so the ops execute in sequence. (Technically, the JSR is a jump. However, subroutine calls return to the op from which they were called and proceed to the next op in the sequence.) So, how does the stack change? The first three ops each add a value to the stack. The fourth op, a subroutine call, removes those three values from the stack. The subroutine returns void, i.e. nothing, so no return value is placed on the stack. And yet, the fifth op removes values from the empty stack...

This tells us a few things:

  1. The fourth op is the last in Step 1
  2. Subroutine 1601 does not place a value onto the stack after Step 1, so there is no Step 2
  3. There are no local variables and no temporary values on the stack, so there is no Step 3
  4. Therefore, the fifth op must be Step 4
  5. Subroutine 1601 takes two arguments and returns void, i.e. void Sub1601(Arg0, Arg1)

Let's look at a more complicated example:

Subroutine 87299
87299   CPTOPSP -1 1
87307   CPTOPBP -171 1
87315   EQUALxx
87317   JZ
89610   CONSTx "NO COMBO SELECTED"
89631   CPDOWNSP -3 1
89639   MOVSP -1
89657   MOVSP -1

The first two ops each place a value onto the stack. The third removes those two values, and replaces them with a boolean (actually, an integer) result. The fourth removes the result, then performs a conditional jump. To simplify things, I've removed the conditional code branches. At this point, the stack is empty.

The fifth op places a string onto the stack. The sixth copies the string down three positions from the top of the stack. However, there is only one value on the stack so this may be a return... The seventh op clears the stack. The eighth tries to remove a value from the empty stack. So, what does this tell us?

  1. The fourth op is the last in Part 1
  2. Ops five and six may be Part 2
  3. The seventh op is Part 3
  4. The eighth and final op is Part 4

You'll notice the repeated use of "may". We should not assume all writes outside of the subroutine's stack are returns. Subroutine arguments also exist outside of the subroutine's stack, and this turned out to be the source of the bug. One of the subroutines analyzed wrote to its argument, and this was mistakenly marked as a return value. Thus, a subroutine that returned void, i.e. nothing, but also modified an argument messed up the analysis of any other subroutine that called it.

The fix was simple enough: record the last write outside of the stack, along with the difference between the negation of destination and size of the stack. In the example above, the destination is -3 and the stack size is 1: -(-3) - 1 yields 2. When analyzing Part 4, the destruction of subroutine arguments, we compare the result to the size of the negation of number of arguments being destroyed: 2 vs -(-1), or 2 > 1. This tells us the final write is to a position greater than the furthest argument, thus it must be a return value. Subroutine 87299 takes one argument and returns one value. In short, we have to analyze Part 4 before we can analyze Part 2.

Fixing bugs wasn't the only benefit. It's removed from the example, but there is an unconditional jump after the first MOVSP that transfers execution to the second MOVSP. This jump bypasses an op, creating dead code. I seem to have finally discovered the source of dead code: return statements. This also solves a different problem: I now know how to identify return statements in different parts of the subroutine. More in a future update!

  • Like 1

Share this post


Link to post
Share on other sites

"Might as well jump!" - Van Halen

The last update mentioned the mystery of the unconditional jump at the end of a subroutine, and explained it's a return statement. Let's look at it in a bit more detail.

7054	CONSTS [
7059	CPTOPSP -4 1
7067	ADDxx
7069	CONSTS ]
7074	ADDxx
7076	CPDOWNSP -6 1
7084	MOVSP -4
7090	JMP 7108
*7096	MOVSP -1*	//Dead code
*7102	MOVSP -3*	//Dead code
7108	MOVSP -1
7114	RETN

Why is this happening? It's the result of the compiler building the subroutine in stages. Let's separate the op codes into subroutine Parts:

Part 2
7054	CONSTS [
7059	CPTOPSP -4 1
7067	ADDxx
7069	CONSTS ]
7074	ADDxx
7076	CPDOWNSP -6 1

Part 3 (New)
7084	MOVSP -4
7090	JMP 7108
7096	MOVSP -1

Part 3 (Original)
7102	MOVSP -3

Part 4
7108	MOVSP -1
7114	RETN

Oho! This subroutine returns a value (Part 2), clears local variables (Part 3), and clears arguments (Part 4). But, why are there two Part 3's?

Top to bottom, the stack looks like this:

  1. Temporary return values
  2. Local variables
  3. Arguments
  4. Return values

The compiler first writes the code for clearing the arguments (Part 4), which is easy because there is a fixed number. Then, it writes code for clearing the local variables (Part 3 - Original). In the example, the subroutine has 3 local variables. Next, the compiler calculates the number of values returned in Part 2 and adds that to the number of local variables. The result is 3 + 1. The compiler does not remove the original Part 3, so it's necessary to jump over it (Part 3 - New).

So, why is there a MOVSP after the jump? As far as I can tell, it's a quirk of the compiler. Part 2 returns a value by evaluating a set of expressions. The result is then copied down the stack past the arguments. The result is then cleared in the new Part 3. However, the compiler inserts a MOVSP after the jump that would have cleared the result of the last expression.

That leaves us with two MOVSPs that are jumped over: one to clear the local variables, and one to clear the last expression (in this case, the return value). I'm rewriting my code to take advantage of this. Instead of calculating all of the changes made to the stack in order to find the subroutine signature, it could be easier to look for a jump and evaluate the code around it. The question is how to handle subroutines that don't return values, and therefore lack the jump...

This newfound knowledge is useful for another reason: I now know how to properly analyze switch statements. More in a future post.

  • Like 2

Share this post


Link to post
Share on other sites

I wrote my first NWScript in order to test the findings from the previous post. The script:

  Reveal hidden contents

I specifically chose to add a single local variable and return a vector (struct of three floats) to prove the results. Let's break down the test subroutine:

  Reveal hidden contents

Part 1 is where the local variable I is declared and initialized. Part 2 is where the vector is returned. Part 3 is where things get interesting.

Part 3 (Original) clears the local variable. Part 3 (New) clears the return values and local variables. Since Part 3 (Original) is not removed there is a jump over it in Part 3 (New). That jump is followed by the compiler quirk, the MOVSP that clears the last expression from the stack. That expression is the result of GetPositionFromLocation(), a vector.

It's good to finally understand this weird JMP/MOVSP business, but this is awesome for two other reasons. The first is code analysis can be simplified to look for this JMP/MOVSP combination, it's no longer necessary to calculate stack changes for each op code in order to determine if a subroutine returns values. This eliminates the majority of bugs. It's also faster, but that's not as much of a concern.

The second reason is it fundamentally changes the reverse compiler construction. I'll explain more in an upcoming post, but the tl;dr is it's now possible to find the beginning and end of statements!

  Reveal hidden contents

Notice a pattern? 🤔

  • Like 1

Share this post


Link to post
Share on other sites

So, I finally discovered a bug in nwnnsscomp... I think.

The purpose of a reverse compiler is to recreate the syntax of the high-level language. This can only be accomplished if we understand statements, expressions, and values.

Statement: HK-47 is ready to serve, master.

- HK-47

  Reveal hidden contents

If you care for others, then dispense with pity and sacrifice and recognize the value in letting them fight their own battles.

- Kreia

  Reveal hidden contents

Battle is a pure form of expression.

- Handmaiden

  Reveal hidden contents

So, what's all this got to do with nwnnsscomp?

  Reveal hidden contents
  • Like 1

Share this post


Link to post
Share on other sites
  On 6/6/2021 at 8:12 PM, AmanoJyaku said:

There are very few continue statements in the game files

Interesting. I can't claim to have gone through all the K1 scripts (not least because some still can't be decompiled), but in the course of working on K1CP have browsed one or two and I have never come across this one. I did a quick search through the global scripts and found a use of it in TSL's k_inc_force. Is it a TSL-only thing?

Share this post


Link to post
Share on other sites
  On 6/7/2021 at 5:41 AM, DarthParametric said:

Interesting. I can't claim to have gone through all the K1 scripts (not least because some still can't be decompiled), but in the course of working on K1CP have browsed one or two and I have never come across this one. I did a quick search through the global scripts and found a use of it in TSL's k_inc_force. Is it a TSL-only thing?

The continue statement is as old as the C language NWScript is derived from. It exists in K1, and probably NWN, it simply was never used. Another example is do-while; there is only one script out of 2,600+ that has a do-while statement. I haven't seen the ternary operator, the const keyword, or the bitwise operators, either.

Share this post


Link to post
Share on other sites

How is everyone? It's been a while... Several changes have been made to the reverse compiler, resulting in a rewrite from the ground up:

  1. All Opcodes are now properly decoded, including:
    1. RSADDx - Now able to differentiate variables from subroutine return values
    2. JSR - Number of subroutine arguments and return values fully deduced
    3. JZ - Now able to differentiate logical AND|OR expressions from if|while|for statements
  2. Subroutine signatures are now partially deduced:
    1. Number, but not types, of arguments and return values
    2. To do: Collection of type information is dependent upon evaluation of statements
  3. Proto-statements (I made up this term) are discovered:
    1. Input for recreating complete statements (I made up this term, too)
    2. int i = 0; is a complete statement
    3. int i; is a complete statement, and a proto-statement of int i = 0;
    4. i = 0; is a complete statement, and a proto-statement of int i = 0;
    5. To do: Proto-statements must be combined to recreate complete statements
  4. Error handling is slowly being added:
    1. Has helped discover programming and algorithm bugs
    2. Is meant to discover incorrect NCS files, which compilers can generate under certain conditions
    3. To do: "Full" error handling won't happen in the first release

If anyone has questions I'll be happy to explain further. For now, I'm working to complete this before the two-year anniversary of starting this project. 😬

  • Like 2

Share this post


Link to post
Share on other sites
  On 11/6/2021 at 2:41 PM, AmanoJyaku said:

I'm working to complete this before the two-year anniversary of starting this project.

Cool. I'm looking forward to finally being able to decode some of those holdout scripts that still remain impervious to DeNCS's charms (like some of the Star Forge level 4 ones for example). If you manage to release something before the end of the year we can clear out the last of the remaining hijacked scripts in K1CP before v1.9 releases.

  • Like 1

Share this post


Link to post
Share on other sites
  On 11/6/2021 at 2:48 PM, DarthParametric said:

Cool. I'm looking forward to finally being able to decode some of those holdout scripts that still remain impervious to DeNCS's charms (like some of the Star Forge level 4 ones for example). If you manage to release something before the end of the year we can clear out the last of the remaining hijacked scripts in K1CP before v1.9 releases.

I'm not promising anything, but we'll see.

 

I don't know how many people are coders, but I came across a series of articles on creating your own compiler. The articles are largely useless for writing a reverse compiler like this WIP, but they give some insight into how early design choices impact development later on. In part 4 the author wanted to make one change to a feature added in part 2, yet it led to multiple changes because so much was dependent on that one change.

This has been my experience from the start: numerous rewrites from the ground up as I learned more about the NCS byte code. Finally, I'm past interpreting individual codes and on to interpreting statements and data structures. Statements are easy (except switch statements), but data structures will require some thought. Two different data structures could have the same member definitions (e.g. struct S1 {int i; float f;} and struct S2 {int i; float f;}), so I'll need logic to detect this. Finding structures is surprisingly easy, though.

tl;dr

The two obstacles left are switch statements and user-defined structs, and I'm nearly finished with switch statements.

http://visualstudiomagazine.com/articles/2014/05/01/how-to-write-your-own-compiler-part-1.aspx

http://visualstudiomagazine.com/articles/2014/06/01/compiler-basics-part-2.aspx

http://visualstudiomagazine.com/articles/2014/07/01/syntax-analysis.aspx

https://visualstudiomagazine.com/articles/2014/09/01/compiler-basics-how-to-part-4.aspx

Share this post


Link to post
Share on other sites

The last few updates were light on samples, so let's address that. TSL's k_inc_treasure.nss:

  Reveal hidden contents

And the reverse compiler output:

  Reveal hidden contents

And the two overlapped for comparison, with the reverse compiled script labeled as "RCScript":

  Reveal hidden contents

In some cases, the number of RCScript statements matches that of the original NWScript statements, e.g. the assignment expression statement "str = pad + str;".

[An earlier post referred to RCScript statements as proto-statements, and NWScript statements as complete statements. They're all terms I made up.]

In others, multiple RCScripts statements combine to form the NWScript statement. The simplest example is a variable declaration with an initializer: the initializer is an expression statement that immediately follows the declaration.

A more complicated example is the iteration statement, which is followed by a jump statement. Then, the body of the iteration statement (in the example, the assignment expression statement) is placed between them. The return statement is similar, however it has a slight quirk in that it starts off as regular expression statement that gets split in two. Then a jump statement is placed in between the two pieces, with the second being the original MOVSP. The first piece is modified with a new MOVSP that removes the result of the expression and all other values on the stack in order to clear the stack before exiting the subroutine.

However, the ultimate is the switch statement. It's so complicated it needs its own post. It has optional parts...

The final three RCScript statements are unprintable. The first clears all variables from the stack. However, this was already dealt with in the return statement. The jump statement inserted into the return statement jumps past this MOVSP and lands on the final MOVSP. Since the stack has already been cleared, this must be clearing arguments. And then the subroutine returns.

To Do:

  1. Write code to generate expressions, e.g. ADDxx becomes <operand 1> + <operand 2>
  2. Regroup RCScript statements back into NWScript statements
  3. Integrate the value stack so variables can be referenced by unique names
  • Like 1

Share this post


Link to post
Share on other sites

Been a while, time for an update!

Writing code isn't as "simple" as using a modeling tool like 3DS Max or Maya (is Lightwave 3D still a thing?) to create 3D characters. A modeling tool uses physics; imagine being the coder who has to create the physics. Sir Amano Newton has spent the past two years unraveling NCS "physics", because NCS "god" (BioWare) didn't leave us precise notes.

Let's look at we have to do to get started writing a reverse compiler:

Binary, Decimal, and Hexadecimal Output

  Reveal hidden contents

Converting From Binary to Hex

  Reveal hidden contents

Printing

  Reveal hidden contents

Formatting to the Rescue

  Reveal hidden contents

What I Start To Work With

  Reveal hidden contents

Yes, I'm reading that and write code to reverse it. Looks like fun, eh?

  • Like 1

Share this post


Link to post
Share on other sites
  On 1/20/2022 at 10:23 PM, AmanoJyaku said:

OK, so we want to print in hex. How? I had to write my own code for this, I couldn't figure out the C++ formatting and conversion library.

The C++98-era method would be to use stringstream. It's a bit terrible, though.

You could the C function snprintf():

#include <cstdint>
#include <cstdio>
#include <string>

std::string hex_string(uint8_t data) {
	char tmp[3];

	std::snprintf(tmp, 3, "%02X", data);

	return tmp;
}

In C++20, you'll be able to use std::format() instead, which provides a syntax similar to Python's string formating. I.e. you could do std::format("{:02X}", data), which already returns a std::string. Compiler support for that is still sparse though.

Alternatively, you can pull in the (optionally header-only) library fmt, which is actually what became C++20's std::format().

Share this post


Link to post
Share on other sites
Guest
This topic is now closed to further replies.