Disclaimer: I've dabbled as a driver writer in a past life - but not OpenGL ES.
The problem is that a 100% compatibility layer is neither easy nor necessarily valuable. The makers of OpenGL ES don't want a lifetime of maintaining someone else's problem. There is also a line beyond which you lose hardware acceleration and the mapping breaks down.
Their charter is to make a new lightweight API that meets the needs of device manufacturers and low-level app developers. As soon as they adopt 100% compatibility at their core, or even offer an additional adaptation layer, they will be taking time and effort away from that focus.
In this instance any OpenGL shim is an Apple responsibility, as they are the SDK and environment provider. Apple and Videologic need to nut that one out themselves.
As to a shim speeding up code: it essentially comes down to whatever impedance mismatch exists between the application writer and the API. This is the same question as buffered versus unbuffered I/O, and whose responsibility it is to filter idempotent operations.
When you look at a typical call stack, you'll see an application (potentially caching and filtering state) calling a library shim (potentially caching and filtering state), which queues and batches calls to a device driver (potentially caching and filtering state), which dispatches to a management layer (potentially caching and filtering state), and so on, eventually reaching the graphics processor (potentially caching and filtering state) and finally a pipeline or set of functional blocks (which may do some idempotent de-duping as well).
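The caching-and-filtering trick each layer can play is the same everywhere in that stack. A minimal sketch, using hypothetical names rather than any real OpenGL binding: a shim that remembers the last value set for each piece of state and forwards only genuine changes to the layer below.

```python
# Sketch of an idempotent-state-filtering shim. Hypothetical interface,
# not a real OpenGL binding; any layer in the stack could apply the same
# idea: cache the last value per state key, forward only actual changes.

class StateFilteringShim:
    def __init__(self, downstream):
        self.downstream = downstream  # next layer (driver queue, etc.)
        self.cache = {}               # state key -> last value forwarded

    def set_state(self, key, value):
        # Drop idempotent calls: only forward when the value changes.
        if self.cache.get(key) != value:
            self.cache[key] = value
            self.downstream.append((key, value))

driver_calls = []
shim = StateFilteringShim(driver_calls)
shim.set_state("blend", True)
shim.set_state("blend", True)    # redundant: filtered out
shim.set_state("texture", 7)
shim.set_state("blend", False)   # value changed: forwarded
```

Here four application calls become three driver calls; in a real frame with thousands of redundant state sets, the savings compound, but so does the cost of every layer re-doing the same filtering.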
Again, how this is communicated to the developer, or how it is structured, is an issue for the platform provider.
Apple can choose to say "we optimize nothing (i.e. add no fat, waste no extra cycles); it's up to you to dispatch minimal state changes", or "we optimize a, b and c, so don't repeat that work, but consider adding optimizations for d, e and f". That's something they need to document and advise on for their platform. It's not part of most standards.
Warm fuzzies for calling us Videologic instead of Imagination or PowerVR. Your description of the layers between an application and execution on the graphics core on iOS is pretty good. There's nothing between driver and hardware though.
As for why OpenGL ES is different to OpenGL, it's documented in myriad places. The resulting API might be bad in many ways, but it was never designed to allow easy porting of OpenGL (at the same generational level). It was designed to be small, efficient and not bloated, to allow for small, less complicated drivers and execution on resource-constrained platforms. It mostly succeeds.
Long live mgl/sgl! The mention of hardware dedupe/filtering was more a hat tip to the culling of sub-pixel triangles and the early culling of obscured primitives that seems to happen on many chips these days :)
We tip our hat right back! It happens to be pixel-perfect for us in this context, and it's a large part of why we draw so efficiently. Oh, and I still have a working m3D-based system that plays SGL games under DOS!
There actually are PDFs out there from the various GPU IP vendors on how to write best for their hardware (Adreno, PowerVR, etc.). Sometimes they even disagree: on one chip, triangle strips joined with degenerate triangles can beat separate triangles, while on another the opposite holds, depending on their optimizations. Apple also has recommendations:
http://developer.apple.com/library/ios/#documentation/3DDraw...
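To make the degenerate-triangle technique concrete, here is a hedged sketch: joining two strips into one submission just means repeating the last index of the first strip and the first index of the second, producing zero-area triangles the GPU rejects cheaply. The function name and index format are illustrative, not from any vendor PDF.

```python
# Sketch: joining triangle strips with degenerate triangles.
# Repeating the last index of strip A and the first index of strip B
# inserts zero-area (degenerate) triangles, which the rasterizer rejects
# cheaply, letting several strips go out as a single draw call.

def join_strips(strips):
    out = []
    for strip in strips:
        if out:
            # Duplicate an index on each side of the seam.
            out.append(out[-1])    # repeat last index of previous strip
            out.append(strip[0])   # repeat first index of next strip
        out.extend(strip)
    return out

joined = join_strips([[0, 1, 2, 3], [4, 5, 6, 7]])
# joined == [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]
```

Whether this wins over submitting separate triangle lists depends on the chip's vertex cache and primitive setup costs, which is exactly where the vendor PDFs disagree.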
Although I don't recall offhand whether any of them mention sorting commands by state and deduping, which I suppose is one of the most basic optimizations for OpenGL APIs.