Every Time You Try to Micro-Reuse Code, God Kills a Kitten.

– where by ‘macro-reuse’ I mean reusing objects or libraries (which is good), and by ‘micro-reuse’ I mean seeing a class that has some overlap with your needs and saying ‘hey, let’s derive from that and just override these two interfaces’, or seeing a function that looks similar to the one you were about to code and saying ‘hey, let’s call that and add an internal branch for my new case’ (which kills kittens).

Code reuse seems to hold the promise of software heaven. Forced into a paragraph, the mantra says: identify common functionality, code it at a single location, then use it all over the place. When a fix is needed, just apply it at that single location – and have your entire app enjoy it at no additional cost.

The world of software design seems to consider this an unquestionable truism[*]. Alas, it does not hold in (at least my subset of) reality.

I have virtually never seen code whose lifecycle followed this sterile path: code and maintain at a single place, enjoy everywhere. When some functionality is indeed made to service multiple clients, the probable real-life scenario is that no one will ever, ever, dare to maintain it.

The cases where a product bug is identified that is truly internal to some well defined common functionality, regardless of its users, are rare at best. Much more often, the process is –

  1. The needs of a specific code-client change, or a behavioral bug is detected in a specific context.
  2. The poor soul who is tasked with the fix cannot possibly know of all other code clients and thus dares not touch the common functionality.
  3. The specific bug or requirements change are addressed with a patch at the client code.
  4. The common functionality quickly turns into a code fossil, never to be modified again.

Thus blind code reuse, intended to increase maintainability, ends up creating tight coupling – thereby vastly reducing maintainability.

I see this all the time, both in twisted flow control with an overload of flow-control flags, and in ridiculous class hierarchies. I recently had to deal with ~6 layers of ~30 classes (!) aimed solely at reusing existing code – occasionally as few as 10 lines reused from a parent, but apparently someone was very literal (and very thorough) in interpreting the code-reuse mantra. The result is, of course, a huge spaghetti bowl that no one dares touch – one always prefers overriding to risking a regression. Such code is also unmaintainable at another level: the functionality of a single concrete object is spread across as many as 6 classes! Just following the action paths, or the contents of various containers (modified throughout the hierarchy layers), was a formidable task.
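To make the flow-control-flags failure mode concrete, here's a minimal made-up sketch (all names are hypothetical): a "reused" helper that each new client patches with another flag, versus small per-client functions that can each be changed without fear.

```cpp
#include <string>

// The "reused" function slowly accumulates caller-specific flags,
// coupling every client to every other client:
std::string FormatLabel(int id, bool legacyMode, bool reportFix)
{
    std::string s = "item-" + std::to_string(id);
    if (legacyMode) s = "L" + s;  // patch for client 1
    if (reportFix)  s += "*";     // patch for a bug seen only by client 2
    return s;
}

// Versus two small functions, each maintainable in isolation:
std::string LegacyLabel(int id) { return "Litem-" + std::to_string(id); }
std::string ReportLabel(int id) { return "item-" + std::to_string(id) + "*"; }
```

Every fix to `FormatLabel` must be re-verified against every flag combination; the separate functions carry no such tax.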

______________________

[*] Only after writing this up did I find a software heavyweight who holds similar opinions. Others have also articulately phrased similar objections, specifically to reuse via inheritance.

Posted in Design | 2 Comments

Debugging Memory Leaks, part 3.5: Hacks with Hooks

Alan rightly comments that setting a conditional breakpoint at _heap_alloc_dbg significantly slows down the application.

If run time is a concern even in debug builds and re-compilation is not, here’s an alternative trick: use an allocation hook that detects a given allocation size. Within the hook itself you can set a break and inspect the stack yourself, or dump the stack via a tracepoint. Somewhere early in the application you’d need to register it with _CrtSetAllocHook.

The hook can be as simple as:

int MyAllocHook(int allocType, void* userData, size_t size, int blockType,
                long requestNumber, const unsigned char* filename, int lineNumber)
{
    if (size == 68)
        DebugBreak();
    return 1;  // TRUE – let the allocation proceed
}

And the assignment as simple as:

BOOL MyApp::InitInstance()
{
    _CrtSetAllocHook(MyAllocHook);

    …

To dump the call stack to the output window using the debugger, set a tracepoint in a line that runs only when your requested size is detected.

(Alternatively, set a regular breakpoint, then right-click it in the breakpoints window and select ‘When Hit…’.)

Then, type $CALLSTACK at the message line.

Allocation hooks are often used in much heavier debugging facilities – usually involving procedurally walking call stacks (more on that some day). That would be overkill for a hack such as this, and the debugger supplies convenient alternatives anyway.
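_CrtSetAllocHook and DebugBreak are CRT/Windows-specific, but the size-trap idea itself is portable. Here’s a sketch of the same trick using a replaced global operator new – counting hits instead of breaking, purely for illustration (names and sizes are made up):

```cpp
#include <cstdlib>
#include <new>

static std::size_t g_trapSize = 0;  // the suspicious allocation size to watch
static int g_trapHits = 0;          // how many times it was requested

void* operator new(std::size_t size)
{
    if (size == g_trapSize)
        ++g_trapHits;               // a real debugging hook would break here
    if (void* p = std::malloc(size))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }

// Demo: trap allocations of a given size and report how many fired.
int CountTrapped(std::size_t sizeToTrap)
{
    g_trapSize = sizeToTrap;
    g_trapHits = 0;
    char* p = new char[sizeToTrap];  // fires the trap
    delete[] p;
    return g_trapHits;
}
```

In real code you’d replace the counter with a breakpoint or a stack capture, exactly as the CRT hook above does.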

Posted in Debugging, VC++ | 1 Comment

Debugging Memory Leaks, Part 3: Breaking on Allocations of Given Size

When battling memory leaks, you often start from the output of _CrtMemDumpAllObjectsSince or _CrtDumpMemoryLeaks (called for you if you use MFC) – something similar to:

C:\myfile.cpp(20): {130} normal block at 0x00780E80, 68 bytes long.
Data: <      > CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD
Object dump complete.

I’ve mentioned a way to break at the 130th allocation by setting _crtBreakAlloc to 130 (either by _CrtSetBreakAlloc or from the debugger). Unfortunately this works only if the allocation number is consistent across runs – and this is not the case when the code contains threaded regions (with allocations) up to the leaking allocation.

As an alternative, here’s another trick that works very well for me: the allocated sizes are virtually always consistent across runs. They are not guaranteed to be unique, of course, but you can often notice at least one seemingly non-random requested size – notice the bold 68 in the dump above. Once you do, all that remains is to set a breakpoint at _heap_alloc_dbg, and set a condition on it:

nSize == RequestedAllocSizeToTrack


There is no direct CRT API for this, but the hassle is minimal, and I prefer to maximize use of the debugger and avoid recompilation anyway.

I hoped to be able to set the breakpoint from the breakpoints window using the context operator, but I can’t locate _heap_alloc_dbg in msvcrXXd.dll exports.  Any advice would be appreciated!

You can try similar tricks with _malloc_dbg (which is visibly exported), but that would probably work only if you’re positive your allocations are channeled through it (and not, e.g., _aligned_malloc_dbg).

Posted in Debugging, VC++ | 3 Comments

Checking Memory Corruption with _CrtCheckMemory – From the Debugger

Edit: In VS2015+ versions this trick is still useful but is a bit different.


Posted in Debugging, VC++, Visual Studio | 2 Comments

/d1reportAllClassLayout – Dumping Object Memory Layout

Roman just posted a nice investigation he did, mostly using the g++ switch -fdump-class-hierarchy – which dumps a graphical representation of the generated classes’ layout to stdout.

VC++ has no official similar switch, but deep inside its undocumented pile of goodies lies the /d1reportAllClassLayout compiler switch, which achieves identical results. It made its first (AFAIK) public appearance here, and is mentioned in an official MS support reply and sporadically in other places. A single-object layout dump is available too, as /d1reportSingleClassLayout.

To see it in action, compile the following with /d1reportAllClassLayout:

class A
{
  int m_a, m_b;
  virtual void cookie() {}
  virtual void monster() {}
};

class B : public A
{
  double m_c;
  virtual void cookie() {}
};

And observe the output (excerpt):

class A size(12):
+---
0 | {vfptr}
4 | m_a
8 | m_b
+---
A::$vftable@:
| &A_meta
|  0
0 | &A::cookie
1 | &A::monster
A::cookie this adjustor: 0
A::monster this adjustor: 0
class B size(24):
+---
| +--- (base class A)
0 | | {vfptr}
4 | | m_a
8 | | m_b
| +---
| <alignment member> (size=4)
16 | m_c
+---
B::$vftable@:
| &B_meta
|  0
0 | &B::cookie
1 | &A::monster
B::cookie this adjustor: 0

As the VC team blog post demonstrated, this can be an irreplaceable tool in tracking down obj-compatibility problems – which can go uncaught by the compiler. It is also a great learning tool, and since /d1reportAllClassLayout dumps all compiled objects, including CRT internals – it is also a unique reversing tool, already used for some deep spelunking into VC++ internals.
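The dump above is from a 32-bit build (4-byte vfptr). The layout effects themselves – the hidden vfptr member and the alignment padding before the double – can be observed portably with sizeof, without any compiler switch. A quick sketch:

```cpp
// Same classes as above; the hidden vfptr and padding show up in sizeof.
class A
{
  int m_a, m_b;
  virtual void cookie() {}
  virtual void monster() {}
};

class B : public A
{
  double m_c;
  virtual void cookie() {}
};

int SizeofA() { return (int)sizeof(A); }  // vfptr + two ints
int SizeofB() { return (int)sizeof(B); }  // A subobject + padding + double
```

On the 32-bit build above these return 12 and 24; exact numbers vary by platform, but the vfptr overhead and the padding before m_c are always visible.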

Posted in Debugging, VC++ | 3 Comments

fflush Fails to Switch From Read to Write on Files fopen’ed with ‘r+’ Access

The fopen documentation states:

When the "r+", "w+", or "a+" access type is specified, both reading and writing are allowed (the file is said to be open for "update"). However, when you switch between reading and writing, there must be an intervening fflush, fsetpos, fseek, or rewind operation.

Indeed? A brief experiment, if you please.

This indeed succeeds:

char buff[10];
FILE* fp = fopen("c:\\test.txt", "r+");
// file contains, say, "1 \n 2 \n 3 \n 4 \n 5 \n"
fgets(buff, 10, fp); // buff contains "1"
fgets(buff, 10, fp); // buff contains "2"
fseek(fp, 0, SEEK_CUR);
fputs("a", fp); // succeeds
fclose(fp);

While this fails miserably:

char buff[10];
FILE* fp = fopen("c:\\test.txt", "r+");
fgets(buff, 10, fp);
fgets(buff, 10, fp);
fflush(fp);
fputs("a", fp); // fails
fclose(fp);
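For a runnable reference, here’s a self-contained version of the succeeding path – it creates the test file itself, then performs the fseek(fp, 0, SEEK_CUR) transition before writing (the file name is arbitrary):

```cpp
#include <cstdio>

// Returns true if the write after the fseek transition succeeded.
bool SwitchReadToWrite(const char* path)
{
    // Create the test file first.
    FILE* fp = std::fopen(path, "w");
    if (!fp) return false;
    std::fputs("1\n2\n3\n4\n5\n", fp);
    std::fclose(fp);

    char buff[10];
    fp = std::fopen(path, "r+");
    if (!fp) return false;
    std::fgets(buff, 10, fp);     // buff contains "1\n"
    std::fgets(buff, 10, fp);     // buff contains "2\n"
    std::fseek(fp, 0, SEEK_CUR);  // the mandated intervening positioning call
    bool ok = std::fputs("a", fp) >= 0;
    std::fclose(fp);
    std::remove(path);
    return ok;
}
```

Replace the fseek with fflush and the fputs fails on the VC++ CRT, exactly as in the second snippet above.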

Stepping into the CRT source gives some conjectured reasons:

(1) fopen with ‘r+’ sets fp->_flag to _IORW (with neither _IOREAD nor _IOWRT set).

(2) fgets adds _IOREAD.

(3) fputs contains code (in _flswbuf), logically equivalent to –

    if ((stream->_flag & _IOREAD) && !(stream->_flag & _IOEOF))
    {
        stream->_flag |= _IOERR;
        return (_TEOF);
    }

with an explicit comment (snip):

/* Note that _IOREAD and IOEOF both being set implies switching 
from read to write at end-of-file, which is allowed by ANSI. */

So indeed an explicit transition is mandated.

(4) fseek contains just such a transition. Code in _fseek_nolock reads:

 /* If file opened for read/write, clear flags since we 
     don't know what the user is going to do next [..snip] */
     if (stream->_flag & _IORW) 
     stream->_flag &= ~(_IOWRT|_IOREAD); 

(5) However, fflush does NOT. It does contain code (in _flush) equivalent to –

    if ((stream->_flag & (_IOREAD | _IOWRT)) == _IOWRT [..snip])
    {
        _write([..snip]);
        if (stream->_flag & _IORW)
            stream->_flag &= ~_IOWRT;
    }

So fflush allows transitioning ONLY from write to read.

This is either a CRT bug (my guess), or a documentation issue. Might it be a subtle ANSI-C conformance thingy (as hinted by the _flswbuf comment) that I’m missing? (The C standard’s fopen wording, for one, requires a file positioning function – not fflush – between input and output, which would make the CRT conformant and this purely a documentation issue.) Opinions are very welcome.

This is almost verbatim the feedback I submitted at MS Connect a while ago, following forum advice. The issue was triaged (I imagine) and escalated to the product team, and after a few weeks its status changed to ‘fixed’ without a word of explanation. Oh well. Hopefully this post will help someone out there some day.

Posted in VC++ | 2 Comments

std::vector of Aligned Elements – Revisited

A while ago I posted about vector::resize – how it takes a parameter by value and thus cannot store aligned elements. I confessed I didn’t understand Stephan T. Lavavej’s words:

…There’s a reason that our resize() is currently taking T by value – it guards against being given one of the vector’s own elements …

Some interpretations were suggested by commenters, but the mystery pretty much remained. In the last few days, a StackOverflow post – which I can’t find now – led me to Standard Library Defect #679:

The C++98 standard specifies that one member function alone of the containers passes its parameter (T) by value instead of by const reference:

void resize(size_type sz, T c = T());

This fact has been discussed / debated repeatedly over the years, the first time being even before C++98 was ratified. The rationale for passing this parameter by value has been:

So that self referencing statements are guaranteed to work, for example:

v.resize(v.size() + 1, v[0]);

However this rationale is not convincing as the signature for push_back is:

void push_back(const T& x);

And push_back has similar semantics to resize (append). And push_back must also work in the self referencing case:

v.push_back(v[0]);  // must work
…

In the former post I noted that gcc implements vector::resize with a by-reference argument, and wrongly concluded that the by-value argument is an MS issue. Turns out it is (well, was) an acknowledged C++ standard issue – and unlike gcc, MS were being strictly conformant. You could say, though, that MS did implement large parts of C++0x in VS2010 and might have squeezed in this correction – but the fact is they didn’t, and since it isn’t yet the standard, it isn’t yet a bug.

Either way, Stephen’s words make sense now. I still think, however, that the current implementation can take a const reference as is, and I can’t see any further internal changes required.  (Than again, that is probably part of the reason that I’m not the one maintaining MS’s STL…).

Posted in VC++ | 1 Comment

PGO and OpenMP

Alan Ning just posted about a cool hack of VS’s PGO switch, which was a sad reminder of all the wonderful goodies I’d never taste in my present job.

PGO builds can incorporate many optimizations that regular builds cannot, leading to ~7.5% speedup (as measured by the VC++ team) with *zero* coding effort. Still, eagerly trying it, I was struck with "Fatal Error C1310: profile guided optimizations are not available with OpenMP".

You see occasional rants about this incompatibility, but these are met with consistent MS reluctance to address it. Quoting from the Connect bug:

Since PGO and OpenMP are both designed to increase performance, it is a shame that they can’t be used together, but there were some major design issues preventing this. There is an ordering dependency in the designs of PGO and OpenMP. OpenMP changes some things about the program state that are important to PGO, and it does this after PGO has already processed the program state and made many decisions based on it.

And further yet –

We will be considering a modification of the designs of PGO and/or OpenMP to get this to work in a future product version.

That was written in April 2005. 3 VC++ versions later, the incompatibility persists.

We use OpenMP extensively, as probably do most dev shops today that care about performance – and anyone writing new native code today probably cares deeply about performance. Those are exactly the devs that both OpenMP and PGO are aimed at! The stated obstacles to addressing this issue (essentially, getting the order of operations right) don’t sound that insurmountable to me. I’d be surprised if other compilers suffer similar limitations (but frankly I have no idea – does anyone have better info?).

As crescens2k answered me some time ago,

The way Microsoft generally do things is by demand. So if you think this is something which should get fixed soon, the best thing you can do is submit a suggestion and get everyone you know to support it.

Well said.  Please, take 20 seconds and upvote it.

Posted in VC++ | Leave a comment

Ternary Operator and Type Compatibility

The (?:) operator, as in –

int a = (b > 5) ? 6 : 7;

is commonly referred to as the Ternary Operator. (This is a common abuse of terminology: a ternary operator is any operator that operates on three arguments; this particular ternary operator is the Conditional Expression operator.) It has several surprising attributes.

First – did you know that it can result in an lvalue? This is perfectly valid C++ code:

int a, b, c = 5;
((c>4) ? a : b ) = 42;

I can imagine some scenarios where this could make code more succinct, but for the most part (quoting my friend Liron Leist here) this is just the kind of code that makes readers scratch their heads and go open a C++ book.
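For the record, a compilable version of the lvalue trick (helper name made up):

```cpp
// The conditional expression itself is an lvalue, so it can sit on the
// left of an assignment. Returns a*100 + b to show which variable changed.
int ChooseAndAssign(int c)
{
    int a = 0, b = 0;
    ((c > 4) ? a : b) = 42;
    return a * 100 + b;
}
```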

Here’s a second gotcha that bit me a while ago, and only recently did I feel I actually understand it. Lo the following:

class Pet   {};
class Dog : public Pet {};
class Cat : public Pet  {};

Pet* pPet = NULL;

// This compiles perfectly:
if( IsSocialGuy() )
    pPet = new Dog;
else
    pPet = new Cat;

// While this doesn't:
pPet = IsSocialGuy() ? new Dog : new Cat;

Kinda surprising. The compiler feedback even more so:

error C2446: ‘:’ : no conversion from ‘Cat *’ to ‘Dog *’

Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast

Huh? Cast Cat* to Dog*? Who ordered that? MSDN isn’t much help either:

If both expressions are of pointer types (…) , pointer conversions are performed to convert them to a common type.

Ok, what? Seems both of the expression alternatives need to be evaluated as expressions – why must they result in identical types? Even worse: suppose you can accept this as a C++ quirk – isn’t it painfully obvious what casting has to be done to convert both Cat* and Dog* to a common type (namely, Pet*)?

Well – thank the universe for Eric Lippert – no, it is not obvious one bit. The code continues:

void PutInMicrowave(Dog* pPet)
{
    cout << _T("Woof?  Splat.");
}

void PutInMicrowave(Cat* pPet)
{
    cout << _T("Meaw?  Splat.");
}

// How would you suggest compiling this call?
PutInMicrowave( IsThai() ? new Cat : new Dog);

If the types of the two expression alternatives do not coincide, the compiler faces the risk of being unable to resolve overloads. As Eric beautifully put it, ‘We need to be able to work out the type of an expression without knowing what it is being assigned to. Type information flows out of an expression, not into an expression.’

Elsewhere, Eric lays out the details of the logic behind ‘pointer conversions are performed to convert them to a common type’. This is, however, in C# context (as was his StackOverflow reply) – I can only speculate that the C++ semantic logic is similar, but can’t find any authoritative links (suggestions are welcome!)
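For completeness: when you do want the Pet* result, you can state the common type yourself on one arm, and the other arm then converts to it (here with the condition passed as a parameter, to keep the sketch self-contained):

```cpp
class Pet { public: virtual ~Pet() {} };
class Dog : public Pet {};
class Cat : public Pet {};

// Casting one arm to Pet* gives the conditional a common type to converge on.
Pet* MakePet(bool socialGuy)
{
    return socialGuy ? static_cast<Pet*>(new Dog) : new Cat;
}
```

Type information now flows out of the expression – Pet* – regardless of what it is later assigned to or passed into.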

Posted in C++ | 4 Comments

On _purecall and the Overhead(s) of Virtual Functions

A while ago a friend asked me whether pure virtual functions have higher overhead than regular virtual functions. At the time I answered that this cannot be – the pure/non-pure distinction is meaningful only at compile time, and non-existent at runtime. Only a (long) while later did I connect the dots and understand what he meant then (sorry Mordy…).

Regular Overhead

Virtual calls are known to be more costly than calls that are resolved at compile time. Elan Ruskin measured a ~50% difference – I measured a bit less, but the difference is certainly there. For functions that do real work the call overhead can mostly be neglected, but for functions that get called a lot in a scenario you’re struggling to optimize, you can get tangible results by eliminating virtual calls. It’s widely considered good practice to use the added flexibility of virtual functions when you have a concrete reason, and not just for the fun of it.

There are two reasons for the added call cost, the main one being the CPU instruction-prefetch mechanism. A regular call is resolved at compile time into something like –

call 0xabcd1234

– which the instruction caching apparatus (a.k.a. trace cache) easily resolves ahead of time, so it can tell where to continue fetching from. However, a virtual call is compiled into something like –

call eax

Faced with the impossible task of predicting the contents of eax dozens of instructions in advance, the trace cache just stalls.

The second potential extra cost is an extra dereference. The first DWORD_PTR of a C++ object (neglecting virtual inheritance) is a pointer to its virtual table – a table common to all instances of the same class. A call to a virtual function is resolved by first dereferencing that vtable pointer, and only then calling the function at a fixed offset from the vtable start. Maciej Sinilo tried to isolate this cost by comparing calls via explicit function pointers to calls via virtual functions, and it turns out the difference is practically non-measurable. (BTW, I didn’t check, but I suspect part of the reason is the compiler’s ability to resolve that dereference at compile time in many situations.)

Pure-Virtual Extra Overhead??

Well, not really. At least not directly. But the function ‘containers’ – classes that are used as interfaces – do come with some extra weight.

In short, an interface class by default has its own constructor and destructor – just like any other class. These are called just before/after a child class constructor/destructor – as you’d expect in any hierarchy.

But wait… what? Why? If you didn’t provide an implementation for such a constructor,  what can it possibly do?

What every compiler-generated constructor does: set up the class vtable pointer. In this case it is a very short-lived pointer – one that is immediately overwritten by the child ctor.

Take the following code:

class A
{
public:
	__declspec(noinline) A()
		{ _tprintf_s(_T(" A::A "));}
	virtual ~A()
		{ _tprintf_s(_T(" A::~A "));}

	virtual	void	f() = 0;
	virtual	void	g() = 0;
};

class B : public A
{
public:
	__declspec(noinline) B()
		{ _tprintf_s(_T(" B::B "));}
	virtual ~B()
		{ _tprintf_s(_T(" B::~B "));}

	virtual	void	f()
		{ _tprintf_s(_T(" B::f "));}
	virtual	void	g()
		{ _tprintf_s(_T(" B::g "));}
};

int _tmain(int argc, _TCHAR* argv[])
{
	B b;
	return 0;
}

The order of events in B’s construction is as follows:

(1) The child ctor, B::B(), is called and immediately calls the parent ctor A::A().

(2) The parent ctor A::A() sets the object’s vfptr to point to the common A vtable –

~A
xxx
xxx

(keep those xxx placeholders in mind – more on them soon). Watching the object state in VS at this point shows the instruction that populates the object vfptr, and the referenced vtable itself.

(3) B::B() continues, and modifies the object vfptr to point to the common B vtable:

~B
B::f()
B::g()

and again the VS view shows the vfptr, now referencing B’s vtable.

If you’d call f() from B::B() A::A() [thanks Roman!], which is the sort of self-foot-shots c++ allows,  you’d be using A’s vtable (which is the only one the object knows at that time), end up calling the nonexistent xxx and gloriously crash in runtime. Of course it’s never that direct, and it’s pretty much a consensus you should never call a class virtuals from its own ctor/dtor.

Why all that hassle??

Frankly, I don’t know. It seems C++ goes out of its way to build and tear apart abstract-parent vtables, than exposes them briefly only during child construction/destruction, and then the community expert unanimously recommend never to use them. Herb Sutter does give 3 scenarios where you might consider using them – I find none of them convincing, and generally consider this to be one of the C++ semantic warts.

So, can this extra weight be mitigated?

Yes – at least in MS-specific ways. The direct way is adding a __declspec(novtable) modifier to the abstract interface declaration. If you can guarantee that the interface class will never need any constructors/destructors (which can be tricky at times), it would be more readable to use __interface instead.

Beyond the direct saving of the extra ctor/dtor work, a happy side effect of novtable is that it eliminates all references to the modified interface vtable. The linker is then able to remove it from the binary altogether – thereby reducing the binary size and providing some extra boost. (When applied to a lot of interfaces, this can get tangible results!)

Bonus – so what is xxx, really?

Google xxx and see for yourself. I’ll wait until you return. It’d probably be a while.

Ok – this probably deserved its own post, as there still appears to be some online confusion regarding it. Apparently the ‘= 0’ pure-virtual syntax leads some to believe that the xxx entries are truly zeros. In fact, MSDN columnist Paul DiLascia wrote sometime in 2000 that –

…the compiler still generates a vtable all of whose entries are NULL and still generates code to initialize the vtable in the constructor or destructor for A.

That may actually have been true then (I’m not even sure of that), but it certainly isn’t now.

xxx is the address of the CRT function _purecall, which is essentially a debugging hook. You can control xxx’s value directly by defining _purecall yourself, or alternatively use _set_purecall_handler to route into your own handler from within _purecall. You might consider doing so, e.g., to collect stack traces or minidumps in production code.

Posted in VC++ | 4 Comments