Recommended Posts

7 hours ago, DarthParametric said:

That seems fine to me. I was more concerned that certain files might not able to be decompiled at all. This is the case for DeNCS as well, based on my own experiments.

My concern with nwnnsscomp is that we now know its output differs from that of the original compiler. That leads to three possibilities:

  1. The output files are fully reversible, even if they differ slightly from the source. This is acceptable.
  2. The output files are fully reversible, but require additional code to detect and handle the differences from the original compiler. This will be very annoying.
  3. The output files are not reversible. This is unacceptable.

At the moment, I have no idea which of these is the case. With over 2,500 files, there is no reasonable way I can develop a reverse compiler while inspecting each script in a timely fashion. I feel it's something to take note of, and look into once the reverse compiler is complete.

1 hour ago, JCarter426 said:

This is still common practice, at least as far as I've been taught. Yeah, there are security issues if you then try to read one, but modern compilers either won't let you do that or will initialize it with a default value, if the language itself doesn't. With that in mind, I don't see the issue.

I don't know who taught you this, but it's a practice that's frowned upon even with modern compilers. We try to learn good habits rather than relying on the compiler to handle things we could easily do ourselves. It's not uncommon for bugs to creep into compilers, and you can track these down more readily with proper code. This advice comes from the best programmers at places like Apple, Google, and Microsoft, and even the ISO C and C++ committees.

1 hour ago, JCarter426 said:

The issues you've mentioned before don't seem like they should cause problems to me. Like with the nested else/if vs a simple else if, that'll only be functionally different if you declare something local to the else branch's scope. And, obviously, don't do that. And I doubt the compiler would've optimized for it that way if it would produce functionally different code. It usually knows better than we do.

Why not create variables local to the else branch's scope? That's the whole point of having scopes, to create variables only where they're needed. Variables placed in too high a scope, e.g. the global scope, are at risk of being overwritten erroneously.

As for the compiler knowing better than we do, that's true of professionally written compilers. Which nwnnsscomp is not. I've already listed some of the basic things it fails to optimize, e.g. failing to remove empty functions. To be fair, the original compiler from BioWare has an even bigger problem in that logical AND and logical OR are incorrect. Which nwnnsscomp doesn't fix...

Share this post


Link to post
Share on other sites
2 hours ago, AmanoJyaku said:

I don't know who taught you this, but it's a practice that's frowned upon even with modern compilers. We try to learn good habits rather than relying on the compiler to handle things we could easily do ourselves. It's not uncommon for bugs to creep into compilers, and you can track these down more readily with proper code. This advice comes from the best programmers at places like Apple, Google, and Microsoft, and even the ISO C and C++ committees.

I've seen numerous examples in the documentation for C# and Java, including the very page on variable initialization in C#. The C++ textbooks I've read (Starting Out with C++, Professional C++, C++ Notes for Professionals) obviously warned about the dangers of using uninitialized variables, but still used the extensively when it was safe.

I was taught that a) it is theoretically more efficient in languages that don't require variables to be initialized (e.g. C++ and C#) as minimal as that efficiency gain may be, and b) it's more readable—if you don't know what the value will be, don't say that's the value, as that's potentially misleading.

3 hours ago, AmanoJyaku said:

Why not create variables local to the else branch's scope? That's the whole point of having scopes, to create variables only where they're needed. Variables placed in too high a scope, e.g. the global scope, are at risk of being overwritten erroneously.

This, on the other hand, I was taught was bad practice because it increases the chance of hiding variables by accident and creating conflicts. So I was told that it's better to avoid this except for temp variables in a for loop or for a swap. This was just within a function, though—obviously global variables have their own problems.

I was also told that, in terms of memory, it's better to allocate it at the start of a function and have it crash there if it's going to crash rather than in the middle of execution. But I question whether that's a realistic problem anymore.

3 hours ago, AmanoJyaku said:

As for the compiler knowing better than we do, that's true of professionally written compilers. Which nwnnsscomp is not. I've already listed some of the basic things it fails to optimize, e.g. failing to remove empty functions. To be fair, the original compiler from BioWare has an even bigger problem in that logical AND and logical OR are incorrect. Which nwnnsscomp doesn't fix...

Well... yes. I believe you. 😧

Share this post


Link to post
Share on other sites
4 hours ago, AmanoJyaku said:

At the moment, I have no idea which of these is the case. With over 2,500 files, there is no reasonable way I can develop a reverse compiler while inspecting each script in a timely fashion. I feel it's something to take note of, and look into once the reverse compiler is complete.

You might want to see if the round-trip holds, at least using nwnnsscomp and existing source files.

Like, source file -> nwnnsscomp (ncs #1) -> decompiler -> nwnnsscomp (ncs #2), then compare whether ncs #1 and ncs #2 identical. There shouldn't be any differences in the bytecode, probably. Possibly also run it through the decompiler again to check that the output is stable.

You can automate that steps and just focus on the examples where it fails. That's possible also good as a sort of test suite between big changes or releases.

There never was a public release of a KotOR-era BioWare compiler, though, right? Has anybody ever tried rigging the compiler in the Neverwinter Nights toolset (does it read the nwscript.nss or is it more hardcoded)? If it can be rigged, that would give you another angle to test on, though it's not necessarily guaranteed any of the NWN releases matches the version (versions?) that was used during KotOR development.

Share this post


Link to post
Share on other sites
2 hours ago, JCarter426 said:

I've seen numerous examples in the documentation for C# and Java, including the very page on variable initialization in C#. The C++ textbooks I've read (Starting Out with C++, Professional C++, C++ Notes for Professionals) obviously warned about the dangers of using uninitialized variables, but still used the extensively when it was safe.

Often, documentation is written with the assumption that the reader has some familiarity with the topic and is simply looking for a reference as a refresher on minutae. For example, the C# link shows a class definition A with an unassigned static scalar variable x, and unassigned instance scalar variable y. This is valid because it states:

Quote

Static variables

The initial value of a static variable is the default value (Default values) of the variable's type.

Instance variables

The initial value of an instance variable of a class is the default value (Default values) of the variable's type.

The default value of an int is 0. This is achieved by the compiler writing the assignment code for you, so the initialization occurs. In contrast:

Quote

Local variables

A local variable introduced by a local_variable_declaration is not automatically initialized and thus has no default value.

With the exception of the section "Try-catch-finally statements", the remainder of the document specifically uses uninitialized variables to point out their dangers. The "Try-catch-finally statements" section demonstrates a contrived (meaning, unrealistic) example of "safe" use of uninitialized variables. Documentation and tutorials use contrived examples to demonstrate how something works, but often caution such examples are not acceptable in production code.

I should point out Microsoft's documentation is mostly filled contrived examples. Post one of their examples on StackOverflow and prepare to be burned.

tl;dr: never assume uninitialized variables are safe.

2 hours ago, JCarter426 said:

I was taught that a) it is theoretically more efficient in languages that don't require variables to be initialized (e.g. C++ and C#) as minimal as that efficiency gain may be, and b) it's more readable—if you don't know what the value will be, don't say that's the value, as that's potentially misleading.

There's a mantra you should repeat to yourself: code for correctness, worry about performance later. Modern CPUs are fast, and compilers are good at optimizing your code for those CPUs. A CPU that operates at 3 billion cycles per second does not benefit from saving a few dozen cycles by omitting a single variable initialization. If the compiler thinks you can benefit, it will often make the change for you. For example, creating variables inside of loops is considered a bad practice because you're just going to create and destroy that variable repeatedly. Which is why compilers perform variable hoisting, meaning the variable is taken out of the loop and placed before it. So:

int Count = 10;
while (Count > 0)
{
  int i = 0;
  //Do something with i
  --Count;
}

Becomes:

int Count = 10;
int i = 0;
while (Count > 0)
{
  //Do something with i
  --Count;
}

But, you should probably just use a for loop:

for (int i = 0, Count = 10; Count > 0; --Count)
{
  //Do something with i
}

Readability is a concern, yes. But, it's more readable when you give a variable a descriptive name, initialize it with an appropriate starting value, and place it where you intend to use it. In the example, I can guess Count is the number of times to execute the loop. But, I should give i a better name. Maybe it should be Meters, or Years, or Euros? Both variables are created once, and destroyed once the loop completes. See how much we can learn from this simple code fragment?

2 hours ago, JCarter426 said:

This, on the other hand, I was taught was bad practice because it increases the chance of hiding variables by accident and creating conflicts. So I was told that it's better to avoid this except for temp variables in a for loop or for a swap. This was just within a function, though—obviously global variables have their own problems.

I was also told that, in terms of memory, it's better to allocate it at the start of a function and have it crash there if it's going to crash rather than in the middle of execution. But I question whether that's a realistic problem anymore.

I'm not sure I know what you mean by hiding variables, but if you place them where you use them then they aren't hidden. Conflicts can easily be avoided by using descriptive names. And a function should only be as large as the single task it performs.

And, please, no global variables. I don't think there is ever a reason for global variables.

As for memory, that's only a problem with embedded systems and some 32-bit platforms. No matter how little RAM a computer has it still has access to virtual memory. It won't crash, but it will be slow.

 
 1 hour ago, DrMcCoy said:

You might want to see if the round-trip holds, at least using nwnnsscomp and existing source files.

Like, source file -> nwnnsscomp (ncs #1) -> decompiler -> nwnnsscomp (ncs #2), then compare whether ncs #1 and ncs #2 identical. There shouldn't be any differences in the bytecode, probably. Possibly also run it through the decompiler again to check that the output is stable.

You can automate that steps and just focus on the examples where it fails. That's possible also good as a sort of test suite between big changes or releases.

That's the plan.

1 hour ago, DrMcCoy said:

There never was a public release of a KotOR-era BioWare compiler, though, right? Has anybody ever tried rigging the compiler in the Neverwinter Nights toolset (does it read the nwscript.nss or is it more hardcoded)? If it can be rigged, that would give you another angle to test on, though it's not necessarily guaranteed any of the NWN releases matches the version (versions?) that was used during KotOR development.

I have no idea, I only learned about these tools when I answered your request for help. It's embarrassing, because I have two copies of NWN, I just need to find them...

Share this post


Link to post
Share on other sites
1 hour ago, AmanoJyaku said:

I should point out Microsoft's documentation is mostly filled contrived examples. Post one of their examples on StackOverflow and prepare to be burned.

Yeah, I know. And with a managed language, they have the luxury of being more willy-nilly than if it were C++ or C. Still, I don't think they would be doing it all over the place if it were so universally unacceptable. It's safe as long as you don't read before assigning at some point, and modern IDEs and compilers go out of their way to prevent you from doing that by accident.

Again, if you have a compiler you can trust. I know the Visual Studio C++ compiler has some silly bugs, and Clang lets you do things it really shouldn't do because that code won't compile with anything else, to say nothing of NWNNSSComp.

1 hour ago, AmanoJyaku said:

There's a mantra you should repeat to yourself: code for correctness, worry about performance later. Modern CPUs are fast, and compilers are good at optimizing your code for those CPUs. A CPU that operates at 3 billion cycles per second does not benefit from saving a few dozen cycles by omitting a single variable initialization.

I know in most use cases it's minimal, but it can make a difference if you're processing huge amounts of data in a database, which C# and .NET are often used for. The reason C# allows uninitialized local variables, unlike, say, Java, is specifically for the performance gain. It mattered to somebody somewhere at some point.

1 hour ago, AmanoJyaku said:

I'm not sure I know what you mean by hiding variables, but if you place them where you use them then they aren't hidden. Conflicts can easily be avoided by using descriptive names.

I mean if you give a variable in an inner scope the same identifier as one in the outer scope. The variable in the outer scope is hidden.

Which you can also file under don't do that, but people make mistakes.

1 hour ago, AmanoJyaku said:

And a function should only be as large as the single task it performs.

Well yes, ideally you shouldn't have too much nested code in the first place. Another reason why I don't mind NWNNSSComp replacing nested else/if with else if.

Share this post


Link to post
Share on other sites

Life has been crazy, and I've seen things I never thought would happen in my country. But, coding never stops. Time for an overdue update!

An earlier update outlined the structure of a subroutine:

return_type subA(Argument_1, Argument_2...)
{
  //Part 1 - The code that does stuff
  //Part 2 - Values returned
  //Part 3 - Destruction of local variables
  //Part 4 - Destruction of arguments
}

Part 2 returns values from a subroutine using the CPDOWNSP opcode. Parts 3 and 4 destroy local variables and arguments using the MOVSP opcode. These opcodes can appear in any part of a subroutine. So, how do we know if we're returning a value, and destroying variables and/or arguments?

We analyze part 1 to determine what, if anything, is on the stack.

Computers execute operations in sequence unless directed otherwise. In high-level languages like NWScript, control statements alter the sequence. An if statement conditionally executes code. If the test condition fails, this code is jumped over. So, what does an if statement look like in the compiled NCS?

102	CPTOPSP -3 1
110	CONSTI 0
116	EQUALxx
118	JZ 170

124	CPTOPSP -1 1
132	CPTOPSP -3 1
140	CONSTO object
146	JSR 298
152	MOVSP -3
158	JMP 296
164	JMP 170

In the example, the opcode at offset 102 of the NCS file copies the value of a variable to the top of the stack. Offset 110 places the value 0 onto the stack. Offset 116 performs a test of equality of the two values, removes them from the stack, then places the result of the test onto the stack. Offset 118 performs a conditional jump based on the result, removing the result in the process. Offsets 124-164 are the body of the if statement, which consists of a call to a subroutine followed by a return statement:

    if ( nKillMode == 0 )
    {
        DamagingExplosion(OBJECT_SELF, nDelay, nDamage);
        return;
    }

So many ops, so few statements. That's why we write in high-level languages instead of assembly, because the compiler allows us to focus on what the program should do rather than creating the proper sequence of opcodes to keep the program from breaking. Notice in the body of the if statement there is a MOVSP. That is the destruction of local variables, part 3. There is no part 2, because the subroutine returns void (nothing). But, how do we know the MOVSP is part 3 and not part 4? And why are there two unconditional jumps???

The challenge of writing a reverse compiler is knowing the rules used to compile. I've spent the past six months observing, analyzing, and documenting those rules. For example, all control statements other than the switch statement are created using the jump-if-zero (JZ) opcode. (Switch statements are created using the jump-if-not-zero [JNZ] opcode.) However, the existence of a JZ opcode does not mean you've encountered a control statement! JZ is used to create the following:

  1. Logical AND
  2. Logical OR
  3. if statements
  4. else if statements
  5. while statements
  6. for statements
  7. do-while statements (I finally found one!!!)

This means we need to some way of determining what the range of addresses created by the JZ opcode is. In the example above, the range created by the if statement is [124, 170). In this range notation a bracket "[" means the offset is included in the range, and a parenthesis ")" means the offset is outside of the range. The 170 is obvious because it's specified by the JZ opcode; if nKillMode == 0 returns false we jump to 170 and resume execution from there. Where does the 124 come from? Remember what was stated above: "computers execute operations in sequence unless directed otherwise." A JZ opcode is 6 bytes in length and the JZ is at offset 118, so 118 + 6 means the next op in the sequence is at offset 124. If nKillMode == 0 returns true we enter the body of the if statement at offset 124. Offset 124 is the first address in the range, offset 170 is the first address outside of the range.

All control statements are ranges, but not all ranges are control statements. The NCS file is itself a range: [0, file size). Subroutines are ranges. However, our focus is on the ranges created by JZ. How do we determine what they are? The unconditional jump, JMP.

Notice the example ends with two JMPs. This caused me lots of stress because I didn't know why they were there. I'm now convinced this is one of the rules of compiling NCS files: control statements end with a JMP. The if statement is easily identified as such because the second JMP goes to the same offset as the JZ. The true block of an if-else statement ends in a JMP to an offset greater than that of the JZ. The if and else-if statements all end in JMPs to the same offset. Loops all end in JMPs to an offset less than the offset of the JZ.

Logical AND and Logical Or are not control statements, so they do not end in a JMP.

Our example has two JMPs. Why? Most likely the original compiler was dumb. The NWScript return statement is translated into a JMP to the end of the subroutine, offset 296. The opcode at this offset is RETN, the last opcode in a subroutine. Anytime you see a return in NWScript that isn't the last statement in the subroutine, you know it will be implemented as a JMP:

    if ( nKillMode == 0 )
    {
        return;
    }

    if ( nKillMode == 1 )
    {
        return;
    }

    if ( nKillMode == 2 )
    {
       return;
    }

is

118	JZ 170
158	JMP 296
164	JMP 170

186	JZ 230
218	JMP 296
224	JMP 230

246	JZ 290
278	JMP 296
284	JMP 290

296	RETN

The original compiler just stuck the body of the if statement in between the JZ and JMP. A modern compiler would have removed the second JMP since it's unreachable code.

OK, this is great and all. But what about the reverse compiler? Well, now that I have the rules down I'm close to the finish line. The reverse compiler accurately finds ranges and determines what types they are. The output looks like this:

This is definitely a K2 file

Subroutine 13, 21
13, 21		//Subroutine

Subroutine 21, 298
21, 298		//Subroutine
124, 170	//if
192, 230	//if
252, 290	//if

Subroutine 298, 1345
298, 1345	//Subroutine
320, 396	//if
704, 722	//logical AND or OR
728, 1216	//loop
787, 805	//logical AND or OR
811, 1135	//if
914, 970	//if-else
1017, 1043	//if-else

Subroutine 1345, 1601
1345, 1601	//Subroutine
1459, 1553	//if
1511, 1547	//if

Subroutine 1601, 1637
1601, 1637	//Subroutine

Subroutine 1637, 1813
1637, 1813	//Subroutine
1659, 1727	//if

There's more to be done, particularly for loops because they support the break and continue statements (also implemented with JMP). But, hard part seems to be behind me. Should have an update in two weeks.

  • Like 3

Share this post


Link to post
Share on other sites

I promised an update in two weeks... How about two days?

The approach of using ranges has borne fruit. Consider the following code:

int subroutine(int I, float F, vector V)
{
    int i = 1;
    while(true)
    {
      	if(i > 10)
        {
		//do stuff
        }
        else
        {
		return -1;
        }
    }
    return 0;
}

The reverse compiler needs to be able to find the return type and parameter types. To do this, it:

  1. Counts the number of local variables in Part 1
  2. Counts the number of variables returned in Part 2
  3. Destroys the local and returned variables in Part 3
  4. Verifies the stack is empty, and assumes the count being destroyed is the number of input arguments in Part 4

Using the above code as an example, the method outlined in the post from two days ago effectively ignores everything inside the while loop. So:

  1. 1 local variable is counted, created by the RSADDI opcode
  2. 1 return variable is counted, created by the CPTOPSP opcode and copied by the CPDOWNSP opcode
  3. 1 local variable and 1 returned variable are destroyed by the MOVSP opcode
  4. 5 input arguments are destroyed by the MOVSP opcode (vectors are three floats)

Here's the output of the KoTOR2 file k_contain_unlock.ncs:

This is definitely a K2 file

Subroutine 13, 21
JSR

Subroutine 21, 995
JSR

Subroutine 995, 1122
Return variables: 0     Local variables: 0      Parameters: 0

Subroutine 1122, 1766
Return variables: 0     Local variables: 0      Parameters: 3

Subroutine 1766, 2948
Return variables: 1     Local variables: 0      Parameters: 2

Subroutine 2948, 6915
JSR

Subroutine 6915, 7116
Return variables: 1     Local variables: 0      Parameters: 1

A subroutine with the status "JSR" means it depends upon another subroutine to be reverse compiled first:

  1. Sub13 calls Sub21
  2. Sub21 calls Sub995
  3. Sub2948 calls Sub6915

Here's what the other status codes mean:

  1. Return variables: 0 - returns void; 1 - returns a single value, e.g. int, float, string or object; a value of 2 or more returns a struct or vector
  2. Local variables: 0 - should always be 0, because it's calculated after the stack is cleared; non-zero value means the analysis is flawed
  3. Parameters: 0 - there are no input parameters; 1 or more means there are input parameters

So, what are the subroutines?

  1. Sub13 - void _start() - the hidden routine
  2. Sub21 - void _global() - the routine that creates global variables
  3. Sub995 - int main()
  4. Sub1122 - void PlaceTreasureDisposable(object oContainer = OBJECT_SELF, int numberOfItems = 1, int nItemType = 900)
  5. Sub1766 - string GetTreasureBundle (int nItemLevel, int nItemType = 0)
  6. Sub2948 - string GetBundlePrefix (int nItemLevel, int nItemType)
  7. Sub6915 - string SWTR_GetQuantity(int iCount)

Still targeting an update in two weeks. Or earlier. :)

  • Like 2

Share this post


Link to post
Share on other sites

The Phantom Bug

Spoiler

Given the progress in the last few updates, I figured I'd make an attempt at reverse compiling to see what issues might be lurking. It's easy enough to reverse this:


18344	CONSTI 15
18350	CONSTO object
18356	int GetHitDice
18361	ADDxx

That's 15 + GetHitDice(OBJECT_SELF).

Similarly, the JSR issue is easily dealt with:


RSADDx
JSR
JSR

If we determine the signature of the first JSR returns 1 variable and has 0 parameters, then we know the code is JSR(JSR()).

It makes sense to use the previously discussed strategy of skipping ranges to reverse compile. And the reverse compiler plowed through all of the files... Then suddenly got stuck on one.

Attack of the Cases

Spoiler

It was a file from the M478 mod, c_m478_globl_1.ncs. It wasn't a problem before, so it was clearly related to the output of the range finding algorithm. But what? Let's look at a switch statement:


switch (var)
{
  case 0:
    //more code
    break;
  case 1:
    //more code
    break;
  case 2:
    //more code
   break;
  default:
    //more code
    break;
}

The case labels in NCS look like this:


CPTOPSP -1 1
CONSTI 0
EQUALxx
JNZ offset_1

CPTOPSP -1 1
CONSTI 1
EQUALxx
JNZ offset_2

CPTOPSP -1 1
CONSTI 1
EQUALxx
JNZ offset_3

JMP offset_4

offset_1
//more code
offset_2
//more code
offset_3
//more code
offset_4
//more code

The difference between JZ and JNZ is more than just jumping based on 0 or not 0. Since JNZ is only used for switch case labels, the offsets are the first opcodes in the case statements. What's important to note is that, unlike JZ, JNZ does not tell you the end of the range. To find the end, you need to look at the next JNZ. The next JNZ's start is the previous JNZ's end.

Which begs the question: how do you find the end of the last JNZ? It's the offset specified by the JMP. That's the default case. And as you can see, we don't know where it ends, either. In fact, it's not even a guarantee that the switch has a default case at all!

Revenge of the Switch

Spoiler

The JMP is always present, so we need another way to determine if there is a default case. You can't do it easily be examining the code starting from the JMP offset, so I tried another way. Each case label that ends with a break statement is implemented as a JMP to the end of the switch. So, I read through each of the case ranges for the first one to end with a JMP. Why the first one? Because all JMPs go to the end, so there's no need to check all of them. However, case statements aren't required to end with a break statement (this is known as "fall through"), so those that don't should lack a JMP.


switch (var)
{
  case 0:	//falls through to case 1, case 2, and default case
  case 1:	//falls through to case 2 and default case
  case 2:	//falls through to the default case 
  default:
    break;
}

And there was the flaw in my logic. What I never considered was a case statement that falls through, but also includes code that ends in a JMP. What does that?


switch (var1)
{
  case 0:
    if (var2){/*more code*/}
  case 1:
    if (var3){/*more code*/}
  case 2:
    if (var3){/*more code*/}
  default:
    if (var4){/*more code*/}
    break;
}

Remember, if, else-if, while, for, and do-while all end with a JMP. The loops jump backward, if and else-if jump to the next op. The next op is the start of the next case offset. And that offset was set as the end of the default case's range!!!

As it turns out, I was testing a more efficient method of traversing the inside of a subroutine that relied on input data being perfect. My earlier traversal code read all ops, and skipped them if they were in the ranges of if, else-if, else, while, for, and do-while statements. It was slow, but it always moved forward. The new code specifically looks for child ranges, reads up to the start of those ranges, then jumps to the end of those ranges:


Sub()
{
	//Read this
	if()
	{
		//Skip this
		while()
		{
			//Which means this is skipped, too
		}
	}
	//Resume reading from here
}

The new traversal code is correct, but the default case range was incorrectly defined as ending with a jump backward because of the combination fall through, an if statement, and the crappy method of using the switch cases to find the end of the switch statement.

The potential solution? Remove identification of switch statement ranges until a later point in analysis that includes the stack. Can't complain, I suppose, since stack analysis is part of the phase that produces text output. I'm not adding anything in, I'm just moving things around. It's just pure luck that I caught this bug due to the most unlikely combination of things that are rarely done.

 

  • Like 2

Share this post


Link to post
Share on other sites

I fixed the switch bug and decided to try the first step of reverse compiling: finding statements. For example, the following has six statements:

//c_003handfight

int StartingConditional()
{
    string tString = GetScriptStringParameter();
    int tInt = GetScriptParameter( 1 );
    int LevelInt = GetScriptParameter( 2 );

    if ( ( GetGlobalNumber(tString) == tInt ) && ( GetGlobalNumber ("G_PC_LEVEL") < LevelInt ) )
    {
        return TRUE;
    }
    return FALSE;
}

The first three are declaration statements, they introduce entities into the StartingConditional subroutine scope. In C, variables and subroutines (functions) are entities. Entities have names and types. In an NCS file, subroutines are found by the JSR opcode and variables are created by the RSADDx opcode.

The return statement is one of the jump statements, which includes the break and continue statements. All three are implemented by the JMP opcode.

The if statement is one of the selection statements, which includes the switch statement. The if statement is also a compound statement or block, meaning it can contain multiple statements.

The previous updates detailed my efforts to correctly identify the various types of statements. Here's what the reverse compiler produces:

This is definitely a K2 file

Sub13, 23
13 adds 1
15 Sub23 removes 0 returns 1
21


Sub23, 250
23 adds 1
25 GetScriptStringParameter adds 1
30
38 removes -1

44 adds 1
46 adds 1
52 GetScriptParameter removes -1 adds 1
57
65 removes -1

71 adds 1
73 adds 1
79 GetScriptParameter removes -1 adds 1
84
92 removes -1

98 adds 1
106 GetGlobalNumber removes -1 adds 1
111 adds 1
119 removes -1
121 adds 1
129 removes -1
135 adds 1
149 GetGlobalNumber removes -1 adds 1
154 adds 1
162 removes -1
164 removes -1
166 if()  removes -1

210 adds 1
216
224 removes -4

230
248

Sub13 is the hidden _start() function. This file is a conditional script used in dialogs, so it returns a value. This is the file you use to practice fighting with Handmaiden, and she won't fight you if you aren't good enough. The value returned is obtained from Sub23 and stored in the space reserved by the RSADDx opcode at offset 13 (written above as "add 1").

Sub23 is int StartingConditional(), which is what is required for conditional scripts. The alternative is void main() for non-dialog scripts. If this was a non-dialog script offset 13 would not be a reservation. It would be Sub21, because the JSR opcode is six bytes and the next opcode is the two-byte RETN (13 + 6 + 2 = 21).

Offsets 23-38 are the a declaration statement, string tString = GetScriptStringParameter(). You see three elements added to the stack, and two removed. Offset 30 is the CPDOWNSP opcode. Blank lines mean elements have not been added to or removed from the stack. Instead, CPDOWNSP copies the value returned by GetScriptStringParameter() to the element reserved by offset 23, string tString. CPDOWNSP is the assignment operator, "=". Offset 38 then destroys the value returned by GetScriptStringParameter() since it is no longer needed. In C this is known as discarding.

The key to writing a reverse compiler is finding these discarded values. The MOVSP opcode discards values, so it's a safe bet that a MOVSP opcode is the end of a statement. In the above example, offsets 38, 65, and 92 are discarding values. You can think of MOVSP as the semicolon at the end of the statement. Offset 164 is the logical AND, 98-162 are the operands. Since we have already found the ranges created by JZ and JNZ, we already know the boundaries of those block statements. Offset 166 is the if () statement, which ends at offset 210. And blocks are surrounded by curly braces.

Offset 210 places the return value onto the stack. This is after the if () statement, so it's value is FALSE. It's then returned at offset 216 using CPDOWNSP to the reservation created at offset 13. Offset 224 then destroys the three local variables and the return value using MOVSP. Finally, the RETN opcode should be at offset 230, but for some reason it's a JMP to 248. (I swear, I hate the NCS compilers.) Offset 248 has the RETN. Neither the JMP or RETN would be printed.

TL;DR

I just have to remove diagnostic info, group expressions into statements, and ensure the branching statements and nested statements are handled correctly. Look forward to another update soon.

  • Like 1

Share this post


Link to post
Share on other sites

Stripped out some of the diagnostic output and added some source code output. Here's the original Handmaiden conditional script again:

//c_003handfight

int StartingConditional()
{
    string tString = GetScriptStringParameter();
    int tInt = GetScriptParameter( 1 );
    int LevelInt = GetScriptParameter( 2 );

    if ( ( GetGlobalNumber(tString) == tInt ) && ( GetGlobalNumber ("G_PC_LEVEL") < LevelInt ) )
    {
        return TRUE;
    }
    return FALSE;
}

Here's the output from the reverse compiler:

//Kotor2\c_003handfight.ncs
This is definitely a K2 file

Sub13, 23
}

Sub23, 250
38 string = GetScriptStringParameter();
65 int = GetScriptParameter(1, );
92 int = GetScriptParameter(2, );
166 if (GetGlobalNumber(CPTOPSP, ) == CPTOPSP && GetGlobalNumber("G_PC_LEVEL", ) < CPTOPSP)
{
}
224 ;
}

Recognizable!

Here's the output of k_inc_npckill.ncs:

Spoiler

//Kotor2\k_inc_npckill.ncs
This is definitely a K2 file

Sub13, 21
13 Sub21()
}

Sub21, 298
42 int = GetScriptParameter(1, );
69 int = GetScriptParameter(2, );
96 int = GetScriptParameter(3, );
118 if (CPTOPSP == 0)
{
}
186 if (CPTOPSP == 1)
{
}
246 if (CPTOPSP == 2)
{
}
290 ;
}

Sub298, 1345
314 if (CPTOPSP > 0)
{
}
412 int = 15;
434 int = 0;
463 location = GetLocation(CPTOPSP, );
492 float = GetFacing(CPTOPSP, );
525 vector_z = GetPositionFromLocation(CPTOPSP, );
563 vector_z = GetPositionFromLocation(CPTOPSP, ) = CPTOPSP + 1.000000;
600 location = Location(CPTOPSP, CPTOPSP, CPTOPSP, CPTOPSP, );
671 object = GetFirstObjectInShape(4, 4.000000, CPTOPSP, 0, 65, 0.000000, 0.000000, 0.000000, );
722 while(GetIsObjectValid(CPTOPSP, ) && CPTOPSP > 0)
{
}
1253 ApplyEffectAtLocation(0, EffectVisualEffect(3003, 0, ), CPTOPSP, 0.000000, )
1289 AssignCommand(CPTOPSP, )
1326 DestroyObject(CPTOPSP, 0.000000, 1, 0.000000, 0, )
1331 ;
1337 ;
}

Sub1345, 1601
1374 int = GetHasFeat(56, CPTOPSP, );
1409 int = GetHasFeat(57, CPTOPSP, );
1431 int = 0;
1453 if (CPTOPSP == 1)
{
}
1569 ;
1593 ;
}

Sub1601, 1637
1623 Sub298(CPTOPSP, CPTOPSP, 0, )
1629 ;
}

Sub1637, 1813
1653 if (CPTOPSP > 0)
{
}
1760 effect = EffectDeath(0, 0, 1, );
1794 ApplyEffectToObject(0, CPTOPSP, CPTOPSP, 0.000000, )
1799 ;
1805 ;
}

 

And k_contain_unlock:

Spoiler

//Kotor2\k_contain_unlock.ncs
This is definitely a K2 file

Sub13, 21
13 Sub21()
}

Sub21, 995
37 int = 0;
59 int = 1;
81 int = 2;
103 int = 3;
125 int = 4;
147 int = 5;
169 int = 6;
191 int = 7;
213 int = 8;
235 int = 9;
257 int = 10;
279 int = 11;
301 int = 12;
323 int = 13;
345 int = 14;
367 int = 15;
389 int = 16;
411 int = 17;
433 int = 18;
455 int = 19;
477 int = 1100;
501 int = -6;
525 int = -5;
549 int = -4;
573 int = -2;
597 int = -1;
619 int = 0;
641 int = 1;
663 int = 1;
685 int = 2;
707 int = 3;
729 int = 4;
751 int = 5;
773 int = 6;
795 int = 7;
817 int = 8;
839 int = 9;
861 int = 10;
883 int = 11;
905 int = 12;
927 int = 13;
949 int = 14;
971 int = 15;
979 Sub995()
987 ;
}

Sub995, 1122
1011 object = 0;
1038 if (!GetLocalBoolean(CPTOPSP, 55, ))
{
}
1114 ;
}

Sub1122, 1766
1143 if (!GetLocalBoolean(CPTOPSP, 57, ))
{
}
1758 ;
}

Sub1766, 2948
1782 int = 0;
1819 int = GetGlobalNumber("G_PC_LEVEL", );
1839 string = "";
1890 if (CPTOPSP == 990 && !IsAvailableCreature(7, ))
{
}
1938 else if(CPTOPSP == 0)
{
}
2053 else if(CPTOPSP % 100 == 0)
{
}
2347 else if(CPTOPSP % 10 == 0)
{
}
2916 ;
2940 ;
}

Sub2948, 6915
2964 int = 1;
2987 string = GetModuleName();
3363 if (CPTOPSP == "851NIH" || CPTOPSP == "852NIH" || CPTOPSP == "901MAL" || CPTOPSP == "902MAL" || CPTOPSP == "903MAL" || CPTOPSP == "904MAL" || CPTOPSP == "905MAL" || CPTOPSP == "906MAL")
{
}
3636    case label
3658    case label
3680    case label
3702    case label
3724    case label
3746    case label
3768    case label
3790    case label
3812    case label
3834    case label
3856    case label
3878    case label
3900    case label
3922    case label
3944    case label
3966    case label
3988    case label
4010    case label
4032    case label
4054    case label
4076    case label
4098    case label
4120    case label
4142    case label
4164    case label
4186    case label
4208    case label
4230    case label
6843 string = GetModuleName();
6883 ;
6907 ;
}

Sub6915, 7116
6938 string = IntToString(CPTOPSP, );
6959 string = "0";
6981 int = 4;
7010 while(GetStringLength(CPTOPSP, ) < CPTOPSP)
{
}
7084 ;
7108 ;
}

 

Rough, but getting there!

  • Like 1

Share this post


Link to post
Share on other sites

I wanted to give an update a few weeks ago, but I had to tackle a problem I ignored for months. The reverse compiler is now more than 5,000 lines in length, and that makes it difficult to maintain. So, almost 4,000 lines of source code have been moved into a library. Additionally, many functions did too much and have since been split into smaller functions or replaced entirely. It's not sexy, but coding never is.

Or is it?

Cue: Salt-N-Peppa's "Let's talk about sex"

Let's talk about Hex, baby!
Let's talk about Binary!
Let's talk about Octal numbers,
Signed and unsigned, number theory!

OK, OK, I'm back to "normal". Anyway, after three weeks of maintenance I got back to the real work and decided to rewrite the remaining 1,000+ lines entirely. Everything from reading an NCS file from disk to producing NWScript is new. And so much better than before, since I eliminated some bugs in the code. I'll provide a more detailed post later, but here's some sample output from TSL's k_inc_npckill.ncs:

Spoiler

21, 298()
{
        int = GetScriptParameter(1);
        int = GetScriptParameter(2);
        int = GetScriptParameter(3);
}

298, 1345()
{
        int = 15;
        int = 0;
        location = GetLocation(Argument);
        float = GetFacing(Argument);
        float = GetPositionFromLocation(location = GetLocation(Argument));
        float = GetPositionFromLocation(location = GetLocation(Argument));
        float = GetPositionFromLocation(location = GetLocation(Argument)) = float = GetPositionFromLocation(location = GetLocation(Argument)) + 1.000000;
        location = Location(vector, float = GetFacing(Argument));
        object = GetFirstObjectInShape(4, 4.000000, location = GetLocation(Argument), 0, 65, vector);
        Destroying 3 arguments
}

1345, 1601()
{
        int = GetHasFeat(56, Argument);
        int = GetHasFeat(57, Argument);
        int = 0;
        return int = 0;
        Destroying 1 arguments
}

1601, 1637()
{
        Destroying 2 arguments
}

1637, 1813()
{
        effect = EffectDeath(0, 0, 1);
        Destroying 2 arguments
}

 

I purposely left out nested scopes because I wanted to focus local variables. This script is particularly interesting because it contains a vector. See the really long line that starts with float = GetPositionFromLocation? That's one of the vector's members. The source code looks like this:

Spoiler

vector vPos = GetPositionFromLocation( oLoc );

vPos.z = vPos.z + 1.0f ;

 

The really long line in the sample output essentially merges the two lines above, but if you look closely you'll see it's the combination of creating vPos.z, adding 1.0f to vPos.z, then storing the result back to vPos.z. Here's a cut-and-paste from the NSS file to compare to the sample output:

Spoiler

int nDC = 15;
int nDCCheck = 0;

location oLoc = GetLocation(oCreature);
float oOri = GetFacing(oCreature);
vector vPos = GetPositionFromLocation( oLoc );

vPos.z = vPos.z + 1.0f ;
location oExplosionLoc = Location( vPos, oOri );
object oTarget = GetFirstObjectInShape(SHAPE_SPHERE, 4.0, oLoc, FALSE, 65);

 

Now that I've confirmed the stack is being created correctly, the next tasks are to:

  1. Add variable declarations, e.g. int nParam, object oParam
  2. Properly display modification of variable values, e.g. vPos.z = vPoz.z + 1.0f
  3. Insert nested scopes (if-else, while, etc...)
  4. Probably other stuff I can't think of right now
  • Like 1

Share this post


Link to post
Share on other sites

Just a quick update. Here's output from TSL's k_contain_unlock.ncs:

Spoiler

//void main()
0 995, 1122(0)
{
        object Object0;
        Object0 = 0
        if (!GetLocalBoolean(Object0, 55))
        {
                SetLocalBoolean(Object0, 55, 1);
                Sub1122(Object0, Random(8) + 1, 920);
        }
}

//void PlaceTreasureDisposable(object oContainer = OBJECT_SELF, int numberOfItems = 1, int nItemType = 900)
0 1122, 1766(3)
{
        if (!GetLocalBoolean(CPTOPSP, 57))
        {
                SetLocalBoolean(CPTOPSP, 57, 1);
                int Int0;
                int Int1;
                Int1 = GetGlobalNumber("G_PC_LEVEL")
                if (CPTOPSP < 900)
                {
                }
                int Int2;
                Int2 = Int1 + Random(6) - 4
                if (Int2 < 1)
                {
                        Int2 = 1
                }
                if (Int2 > 30)
                {
                        Int2 = 30
                }
                Int0 = 1
                while (Int0 <= CPTOPSP)
                {
                        string String3;
                        string String4;
                        int Int5;
                        Int5 = 1
                        int Int6;
                        String3 = Sub1766(Int2, CPTOPSP)
                        Int6 = FindSubString(String3, "[")
                        if (FindSubString(String3, "[") >= 0)
                        {
                                Int5 = StringToInt(GetSubString(String3, Int6 + 1, 4))
                                String4 = GetSubString(String3, 0, Int6)
                        }
                        else
                        {
                                String4 = String3
                        }
                }
        }
        Removing -3 arguments
}

 

Overall, the results look good! Though, there are things that need to be dealt with:

  1. Subroutines only list the number of return values and arguments; type information needs to be substituted, e.g. void Sub1122(object, int, int)
  2. Arguments in expressions are listed as CPTOPSP; this needs to be replaced with a variable name and argument position, e.g. if (!GetLocalBoolean(Arg0, 57))
  3. Variable declarators with initializers are written as two lines; while correct, these should be merged, e.g. object Object0 = 0;
  4. Arguments that are assigned to need to be printed, e.g. if (Arg2 < 900) Arg2 = 0;
  5. For loops are presented as while loops; while correct, for (Int0 = 1 ; Int0 <= Arg1; ++Int0) is more compact than
    1. Int0 = 0;
    2. while (Int0 <= Arg1)
    3. ++Int0
  6. Increment and decrement are not yet printed (this is trivial, I just didn't do it)
  7. Printing expressions needs to take place at a different point in the process; the current method prints some expressions twice, and omits others
  8. Switch statements need to be printed
  9. Vectors and structs need to be detected, which will require multiple methods
  10. Return statements need to be printed
  11. Break and continue statements in loops need to be detected
  12. Arguments of type "action" (AssignCommand, DelayCommand, ActionDoCommand) need to be printed

These are largely minor issues, but they're taking a backseat to a bug I introduced along the way:

Spoiler

//string GetTreasureBundle (int nItemLevel, int nItemType = 0)
1 1766, 2948(2)
{
        int Int0 = 0;
        int Int1;
        int Int2 = GetGlobalNumber("G_PC_LEVEL")
        string String3 = ""
        if (Arg1 == 990 && !IsAvailableCreature(7))
        {
        }
        if (Arg1 == 0)
        {
                Int1 = Random(0) + 9
                String3 = Sub1766(CPTOPSP, Int1 * 100)
        }
        else if (String3)
        {
                Int0 = Random(CPTOPSP) + 1
                Int2 = Sub1766(CPTOPSP, CPTOPSP + 10 * Int0)
        }
        else if (Int2)
        {
                Int1 = Sub2948(CPTOPSP, CPTOPSP + CPTOPSP)
        }
        else
        {
                Int1 = Sub2948(CPTOPSP, CPTOPSP)
        }
        Removing -5 arguments
}

 

The else-if statements have the wrong test condition. They should all be pointing at the same variable as the preceding if statement, Arg1. This error is propagated further as reflected in the last line. Removing 5 arguments? There are only 2!!! Printing code was added yesterday, so I expect this to be a quick fix.

Share this post


Link to post
Share on other sites

Sorry for the delay. I wanted to put out an update, but I've been addressing bugs. More important, I'm dealing with real life and needed to take some time for myself. This project has consumed most of my free time, I haven't played a game since February of last year. I'll pick this up next week, but for now I'm playing some games!

  • Like 2

Share this post


Link to post
Share on other sites

Sometimes, I learn more from mistakes than I do from getting it right the first time...

My last post mentioned bug fixes. In an earlier post I described the structure of an NCS subroutine:

  1. Step 1 - The code that does the stuff you want
  2. Step 2 - Placing a return value on the stack
  3. Step 3 - Destroying local variables and temporaries on the stack
  4. Step 4 - Destroying arguments on the stack

None of these steps are required, I'm almost certain I've seen an empty subroutine that simply returned. That means a missing step cannot impede the analysis of successive steps. So, how do we analyze the following?

Subroutine 1601
1601	CONSTI 0
1607	CPTOPSP -3 1
1615	CPTOPSP -3 1
1623	JSR 298
1629	MOVSP -2
1635	RETN

There is a single path through the code, that is there are no jumps so the ops execute in sequence. (Technically, the JSR is a jump. However, subroutine calls return to the op from which they were called and proceed to the next op in the sequence.) So, how does the stack change? The first three ops each add a value to the stack. The fourth op, a subroutine call, removes those three values from the stack. The subroutine returns void, i.e. nothing, so no return value is placed on the stack. And yet, the fifth op removes values from the empty stack...

This tells us a few things:

  1. The fourth op is the last in Step 1
  2. Subroutine 1601 does not place a value onto the stack after Step 1, so there is no Step 2
  3. There are no local variables and no temporary values on the stack, so there is no Step 3
  4. Therefore, the fifth op must be Step 4
  5. Subroutine 1601 takes two arguments and returns void, i.e. void Sub1601(Arg0, Arg1)

Let's look at a more complicated example:

Subroutine 87299
87299   CPTOPSP -1 1
87307   CPTOPBP -171 1
87315   EQUALxx
87317   JZ
89610   CONSTx "NO COMBO SELECTED"
89631   CPDOWNSP -3 1
89639   MOVSP -1
89657   MOVSP -1

The first two ops each place a value onto the stack. The third removes those two values, and replaces them with a boolean (actually, an integer) result. The fourth removes the result, then performs a conditional jump. To simplify things, I've removed the conditional code branches. At this point, the stack is empty.

The fifth op places a string onto the stack. The sixth copies the string down three positions from the top of the stack. However, there is only one value on the stack so this may be a return... The seventh op clears the stack. The eighth tries to remove a value from the empty stack. So, what does this tell us?

  1. The fourth op is the last in Part 1
  2. Ops five and six may be Part 2
  3. The seventh op is Part 3
  4. The eighth and final op is Part 4

You'll notice the repeated use of "may". We should not assume all writes outside of the subroutine's stack are returns. Subroutine arguments also exist outside of the subroutine's stack, and this turned out to be the source of the bug. One of the subroutines analyzed wrote to its argument, and this was mistakenly marked as a return value. Thus, a subroutine that returned void, i.e. nothing, but also modified an argument messed up the analysis of any other subroutine that called it.

The fix was simple enough: record the last write outside of the stack, along with the difference between the negation of destination and size of the stack. In the example above, the destination is -3 and the stack size is 1: -(-3) - 1 yields 2. When analyzing Part 4, the destruction of subroutine arguments, we compare the result to the size of the negation of number of arguments being destroyed: 2 vs -(-1), or 2 > 1. This tells us the final write is to a position greater than the furthest argument, thus it must be a return value. Subroutine 87299 takes one argument and returns one value. In short, we have to analyze Part 4 before we can analyze Part 2.

Fixing bugs wasn't the only benefit. It's removed from the example, but there is an unconditional jump after the first MOVSP that transfers execution to the second MOVSP. This jump bypasses an op, creating dead code. I seem to have finally discovered the source of dead code: return statements. This also solves a different problem: I now know how to identify return statements in different parts of the subroutine. More in a future update!

  • Like 1

Share this post


Link to post
Share on other sites

"Might as well jump!" - Van Halen

The last update mentioned the mystery of the unconditional jump at the end of a subroutine, and explained it's a return statement. Let's look at it in a bit more detail.

7054	CONSTS [
7059	CPTOPSP -4 1
7067	ADDxx
7069	CONSTS ]
7074	ADDxx
7076	CPDOWNSP -6 1
7084	MOVSP -4
7090	JMP 7108
*7096	MOVSP -1*	//Dead code
*7102	MOVSP -3*	//Dead code
7108	MOVSP -1
7114	RETN

Why is this happening? It's the result of the compiler building the subroutine in stages. Let's separate the op codes into subroutine Parts:

Part 2
7054	CONSTS [
7059	CPTOPSP -4 1
7067	ADDxx
7069	CONSTS ]
7074	ADDxx
7076	CPDOWNSP -6 1

Part 3 (New)
7084	MOVSP -4
7090	JMP 7108
7096	MOVSP -1

Part 3 (Original)
7102	MOVSP -3

Part 4
7108	MOVSP -1
7114	RETN

Oho! This subroutine returns a value (Part 2), clears local variables (Part 3), and clears arguments (Part 4). But, why are there two Part 3's?

Top to bottom, the stack looks like this:

  1. Temporary return values
  2. Local variables
  3. Arguments
  4. Return values

The compiler first writes the code for clearing the arguments (Part 4), which is easy because there is a fixed number. Then, it writes code for clearing the local variables (Part 3 - Original). In the example, the subroutine has 3 local variables. Next, the compiler calculates the number of values returned in Part 2 and adds that to the number of local variables. The result is 3 + 1. The compiler does not remove the original Part 3, so it's necessary to jump over it (Part 3 - New).

So, why is there a MOVSP after the jump? As far as I can tell, it's a quirk of the compiler. Part 2 returns a value by evaluating a set of expressions. The result is then copied down the stack past the arguments. The result is then cleared in the new Part 3. However, the compiler inserts a MOVSP after the jump that would have cleared the result of the last expression.

That leaves us with two MOVSPs that are jumped over: one to clear the local variables, and one to clear the last expression (in this case, the return value). I'm rewriting my code to take advantage of this. Instead of calculating all of the changes made to the stack in order to find the subroutine signature, it could be easier to look for a jump and evaluate the code around it. The question is how to handle subroutines that don't return values, and therefore lack the jump...

This newfound knowledge is useful for another reason: I now know how to properly analyze switch statements. More in a future post.

  • Like 2

Share this post


Link to post
Share on other sites

I wrote my first NWScript in order to test the findings from the previous post. The script:

Spoiler

vector test()
{
	int I = 0;
	return GetPositionFromLocation(GetLocation(OBJECT_SELF));
}

void main()
{
	vector V = test();
}

 

I specifically chose to add a single local variable and return a vector (struct of three floats) to prove the results. Let's break down the test subroutine:

Spoiler

Part 1
61      RSADDx
63      CONSTx
69      CPDOWNSP -2 1
77      MOVSP -1
Part 2
83      CONSTx
89      1 GetLocation(1)
94      3 GetPositionFromLocation(1)
99      CPDOWNSP -7 3
Part 3 (New)
107     MOVSP -4
113     JMP 131
119     MOVSP -3
Part 3 (Original)
125     MOVSP -1
131     RETN

 

Part 1 is where the local variable I is declared and initialized. Part 2 is where the vector is returned. Part 3 is where things get interesting.

Part 3 (Original) clears the local variable. Part 3 (New) clears the return values and local variables. Since Part 3 (Original) is not removed there is a jump over it in Part 3 (New). That jump is followed by the compiler quirk, the MOVSP that clears the last expression from the stack. That expression is the result of GetPositionFromLocation(), a vector.

It's good to finally understand this weird JMP/MOVSP business, but this is awesome for two other reasons. The first is code analysis can be simplified to look for this JMP/MOVSP combination, it's no longer necessary to calculate stack changes for each op code in order to determine if a subroutine returns values. This eliminates the majority of bugs. It's also faster, but that's not as much of a concern.

The second reason is it fundamentally changes the reverse compiler construction. I'll explain more in an upcoming post, but the tl;dr is it's now possible to find the beginning and end of statements!

Spoiler

void _start()
//Hidden start function
//Essentially, main();
13      JSR 21()
19      RETN

void main()
//vector v = test();
21      RSADDx
23      RSADDx
25      RSADDx
27      RSADDx
29      RSADDx
31      RSADDx
33      JSR 61()
39      CPDOWNSP -6 3
47      MOVSP -3
//Not printed
53      MOVSP -3
59      RETN

vector test()
//int i = 0;
61      RSADDx
63      CONSTx
69      CPDOWNSP -2 1
77      MOVSP -1
//return GetPositionFromLocation(GetLocation(OBJECT_SELF));
83      CONSTx
89      1 GetLocation(1)
94      3 GetPositionFromLocation(1)
99      CPDOWNSP -7 3
107     MOVSP -4
//Not printed
113     JMP 131
119     MOVSP -3
125     MOVSP -1
131     RETN

 

Notice a pattern? 🤔

  • Like 1

Share this post


Link to post
Share on other sites

So, I finally discovered a bug in nwnnsscomp... I think.

The purpose of a reverse compiler is to recreate the syntax of the high-level language. This can only be accomplished if we understand statements, expressions, and values.

Statement: HK-47 is ready to serve, master.

- HK-47

Spoiler

There are seven categories of statements in NWScript:
Declaration statements
    int I;
    float F;
    string S;
    etc...
Expression statements
    I = 0;
    ++I;
    PrintString("Print me!!!");
    etc...
Selection statements
    if (condition)
    switch (condition)
Iteration statements
    while (condition)
    for (condition)
    do while(condition)
Labeled statements
    case 0:
    default:
Jump statements
    break;
    continue;
    return;
Compound statements
    {
        //statement 0
        //statement 1
        //...
    }
    
Each statement is identified by locating an initiating op code, terminating op code, or combination of both. For example, variable declarations are a single op code, RSADDx. However, RSADDx is also used for return values of subroutines. If the Random engine function were written as a subroutine call it might look like this:

RSADDx
CONSTI 10
JSR <offset to subroutine>

Which means we identify RSADDx as a code of interest, but need to process the op codes that follow to be certain.

As it turns out, finding the terminator is usually easier than finding the initiator. But, why?

If you care for others, then dispense with pity and sacrifice and recognize the value in letting them fight their own battles.

- Kreia

Spoiler

The majority of statements in a program are expression statements, so it's necessary to understand expressions. In general, expressions produce values:

1 + 1;

What happens to the value 2? Before we get to that, consider that not all expressions produce values. Function/subroutine calls are expressions that can produce void, or no value:

//PrintString doesn't produce a value
void PrintString(string);

It's also important to understand the NCS runtime places variables and values onto the same stack. Thus:

int I = 0;
float F = 0.0;
1 + 1;
//more statements...

looks like this:
//Top of stack

  1. 2
  2. 0.0
  3. 0

//Bottom of stack

At least, it looks that way before proceeding to the statement after 1 + 1. A variable is essentially a value that persists until the end of its enclosing scope (the curly braces). All other values are guaranteed to be destroyed by the end of their statement. Thus, another term for an expression statement is discarded-value expression.

Battle is a pure form of expression.

- Handmaiden

Spoiler

When we reach the end of a statement any value that is not a variable is guaranteed to be destroyed. The MOVSP op code is what destroys the values of expression statements. This was illustrated in the previous update. It can also be seen in declarations with initializers:

RSADDx            //Place a variable on the stack
CONSTx    0        //Place a value on the stack
CPDOWNSP -2 1    //Assign the value to the variable (initialization)
MOVSP -1        //Destroy the value, we've reached the end of a statement

But, what about expressions that return void? Expressions can produce void, but they cannot accept void. Thus, a void expression ends a statement.

What about selection, iteration, and labeled statements? Their conditions are expressions. What happens to their resulting values?

int I = 0;
//The result of the comparison is destroyed by the if statement
if (I > 0)
{
}

Thus, by the time the program proceeds to a branch of a selection or iteration statement all values are destroyed and only variables are left on the stack. This isn't true for labeled statements due to the way switch statements work, but once you exit a switch entirely all values are destroyed.

tl;dr
We find statements in many ways, including detecting the destruction of expressions.

So, what's all this got to do with nwnnsscomp?

Spoiler

That thing I mentioned about labeled statements and the way switch statements work. Usually, destroying the condition of a switch statement is the very last thing to occur. The exception to this is when a switch contains a return statement:

switch (condition)
{
case 0:
    //CPDOWNSP here, e.g. Part 2 of subroutine
    //MOVSP -N here, e.g. Part 3 of subroutine
    return 0;
case 1:
    //Jumps to the MOVSP -1
    break;
}
//MOVSP -1 here to destroy condition

We've seen the analysis of returns, so this should be familiar. So, what's the bug? Consider the following:

while (condition)
{
    switch (condition)
    {
    case 0:
        break;
    case 1:
        continue;
    }
}

Case 1 exits the switch and jumps to the end of the while statement, so the switch condition should be destroyed before the jump. It's not. Nwnnsscomp leaves the condition on the stack, which means any reference to the stack is off by the number of times the continue is executed. There are very few continue statements in the game files, and none that I could find inside of a switch statement. But, if someone were to write perfectly valid code like this, well... kaboom!!!

//Top of stack

  1. Condition
  2. Condition
  3. Condition
  4. ...
  5. Variables

//Bottom of stack

Why am I not certain this isn't a bug? It clearly is, I'm just not certain which version of nwnnsscomp is the latest, and whether or not it's been fixed. I have three versions of the program, and I've only tested one.

  • Like 1

Share this post


Link to post
Share on other sites
9 hours ago, AmanoJyaku said:

There are very few continue statements in the game files

Interesting. I can't claim to have gone through all the K1 scripts (not least because some still can't be decompiled), but in the course of working on K1CP have browsed one or two and I have never come across this one. I did a quick search through the global scripts and found a use of it in TSL's k_inc_force. Is it a TSL-only thing?

Share this post


Link to post
Share on other sites
8 hours ago, DarthParametric said:

Interesting. I can't claim to have gone through all the K1 scripts (not least because some still can't be decompiled), but in the course of working on K1CP have browsed one or two and I have never come across this one. I did a quick search through the global scripts and found a use of it in TSL's k_inc_force. Is it a TSL-only thing?

The continue statement is as old as the C language NWScript is derived from. It exists in K1, and probably NWN, it simply was never used. Another example is do-while; there is only one script out of 2,600+ that has a do-while statement. I haven't seen the ternary operator, the const keyword, or the bitwise operators, either.

Share this post


Link to post
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.