osgEarth 2.10 OpenThreads Undefined behavior

classic Classic list List threaded Threaded
8 messages Options
Blanky Blanky
Reply | Threaded
Open this post in threaded view
|

osgEarth 2.10 OpenThreads Undefined behavior

Hey,

So I've been on the 2.9 branch for quite some time, but finally had the chance to upgrade to the osgEarth 2.10 branch. I had previously relied upon the osgEarth::ViewerWidget, but knew that it would be deprecated in 2.10. As a result of this, I took it upon myself to subclass the QOpenGLWidget for myself to work with Qt still, which has been working rather well. In a previous post I had concerns over why draping did not work, but this was because of the FBO that Qt wouldn't recognize. Not a big deal, I can live without it to be honest. However, something strange happened when I switched from 2.9 to 2.10.

Half of the time when calling osgViewer::Viewer::frame() on start up, my window's viewer will seg-fault due to the OpenThreads::Atomic::operator-- call. It seems it is trying to unreference a pointer inside of a FeatureNode's base implementation for AnnotationNode when attempting to call the osgEarth::SpatialReference::operator= on what I assume to be the SRS of the underlying node's field. I find this strange because this has never happened before in a graph traversal in 2.9. Did something change with how the SRS are compared under the hood? Also, if the program starts without crashing, then it is stable, it's only an issue on start up. It seems like a race condition or some osg::ref_ptr<T> seems to be released much too early. Does anyone have an idea of what might be wrong?

Thanks for any assistance!

    - Blanky
Blanky Blanky
Reply | Threaded
Open this post in threaded view
|

Re: osgEarth 2.10 OpenThreads Undefined behavior

Further investigation into the issue has come up with the possibility of one of the new osgEarth features in 2.10 causing it. I'm not quite sure, but here is the stack trace and all relevant functions along the stack trace I thought seemed useful in depicting the code execution and finding the actual culprit code.

bottom up stack - # implies not in the stack trace directly, but called in between function calls to get the better idea of where in the execution loop we are, this was from byte code comments so they may not be formatted completely correctly:
   

Final stack trace line of code:
    13a: this = 0x40 (Also nullptr?)
13: return __sync_sub_and_fetch(&value, 1);

    12a: Then in osg::Referenced the code appears as:
        int newRef;
        osg::Referenced::unref(){
            newRef = --_refCount;
            bool needDelete = (newRef == 0);
        }
        The variables according to the stack trace were the following:
            this = 0x30 (nullptr?)
            newRef = 0
            needDelete = true => the SRS is being deleted but is still held by the MapNode???
12: osg::ref_ptr<osgEarth::SpatialReference const>::operator=(osgEarth::SpatialReference const *)
    11i : #osgEarth::SpatialReference=
    11h: #osgEarth::AltitudeCullCallback(mapNode->getMapSRS())
    11g: #osgEarth::HorizonCullCallback::setEnabled()
    11f : #osgEarth::MapNode::getMapSRS()
    11e: #osgEarth::Annotation::AnnotationNode::setMapNode(osgEarth::MapNode *)
    11d: #osgEarth::MapNode::isGeocentric()
    11c: #osgEarth::MapNode::getTerrain()
    11b: #osgEarth::Annotation::FeatureNode::setMapNode(osgEarth::MapNode *)
    11a: #osgEarth::TerrainCallbackAdapter
11: osgEarth::Annotation::FeatureNode::setMapNode(osgEarth::MapNode *)
10: osgEarth::Annotation::AnnotationNode::traverse(osg::NodeVisitor &)
    9d:#osg::ref_ptr<Terrain> terrain = osgEarth::MapNode::getTerrain() ;
    9c:#osg::ref_ptr<Terrain>::TerrainValid()
    9b:#osgEarth::Terrain::getGraph()
    9a:#osgEarth::Annotation::FeatureNode::clamp(osg::Node*, Terrain*)
9: osgEarth::Annotation::FeatureNode::traverse(osg::NodeVisitor &)
8: osgEath::Annotation::FeatureNode::accept(osg::NodeVisitor &)
7: osg::Group::traverse(osg::NodeVisitor &)
6: osgEarth::MapNode::traverse(osg::NodeVisitor &)
5: osg::Group::accept(osg::NodeVisitor &)
4: osgViewer::Viewer::updateTraversal()
3: osgViewer::ViewerBase::frame(double)
2 :MyGLWidget::paintGL()
1: MyGLWidget::event()
... (Many useless Qt calls to our code)
main()


After diffing some files from osgEarth 2.9 and 2.10, we've noticed the osgEarth::MapNode is mostly untouched, save for some comments. The osgEarth::Annotation::FeatureNode seems to have some very small differences. The osgEarth::Annotation::AnnotationNode seems to have a different clamp function added to it for a new callback. Also, the osgEarth::SpatialReference seems to have some internal fields removed/renamed for deprecated functionality reasons, like the ECEF boolean.

In the middle of the stack trace it seems that both the HorizonCullCallback and AltitudeCullCallback are being fired off as per #11g and #11h. Right after this, it calls the equal operator on the osgEarth::SpatialReference and then apparently gets deleted as per trace #12a's code and its current state's values.

However, I believe there seems to be undefined behavior of some kind because this is before we've ever even populated the MapNode ourselves. We're using both a projected and geocentric earth and both will crash at the same point, whether we spawn a single window with just one projection of either type. The Earth file has not changed from osgEarth 2.9 and this has never ever occurred in 2.9 once. All we changed was the includes of our files of our Qt .pro to use 2.10 and we removed any compiler errors for deprecated functions (which weren't many changes at all really since we already scrapped using the osgEarth::ViewerWidget long ago). Any of the changes we did make were after the seg-fault implying our code wasn't the source of the issue. Is there a chance one of the new callbacks added to the AnnotationNode is not playing well the QOpenGLWidget that we're using because clamping isn't supported by Qt's FBO? I saw that the HorizonCullCallback::setEnabled is being called, but I don't know if with a true or false flag by default. We're not quite sure what else to try at this point, because we need 2.10 to allow for Camera specific layer callbacks to work such that each camera has its own unique group of VisibleLayers and the rest of them to be culled for every other camera since we're sharing one osgEarth::MapNode amongst multiple Cameras/Viewers and wish to have separate brightness/contrast/opacity control for each layer.

Really stuck at this point and don't know how to move forward with integrating 2.10 with our current codebase. Any help or even an idea of a config setting that might have an effect would be great. Thanks.

- Blanky
gwaldron gwaldron
Reply | Threaded
Open this post in threaded view
|

Re: osgEarth 2.10 OpenThreads Undefined behavior

Blanky,
The whole thing smells like a bad build to me. Your "this" pointer is garbage, and SpatialReference instances actually never get deleted in osgEarth. If you are 100% positive the build and path are good, I would recommend setting a breakpoints and stepping through the offending code and see if there's a null reference somewhere.
Glenn Waldron / Pelican Mapping
Blanky Blanky
Reply | Threaded
Open this post in threaded view
|

Re: osgEarth 2.10 OpenThreads Undefined behavior

From what I tested, the code won't crash unless the viewer's frame() is called. If I defer starting a QTimer to update the QOpenGLWidget, nothing will crash.

I thought maybe something changed with the MapNodeHelper, so I tested what both the mapNode's mapSRS() and said mapSRS' geographicSRS() since we use both and the operator= is what's complaining in the stack trace, but both turned out to be correctly formatted pointers that we could access correctly. I'm not even sure which base class Annotation node is being called that's causing the offending undefined behavior since like I said, it happens before I actually add anything myself to the osgEarth::mapNode.

We essentially never changed any code when reformatting from 2.9 to 2.10 other than replacing any of the deprecated constructors that took an osgEarth::MapNode*, like the osgEarth::Annotations::FeatureNode() and then using setMapNode(osgEarth::MapNode*) on it instead. I can't set breakpoints either because the crash is always occurring in one of your .so libraries so the stack trace is greyed out in Qt's debugger, not our code. The last trace we can even see is the frame() call in our custom widget. I think we'll try and do a clean install of osgEarth2.10 because like you said, it sounds like a bad build. In the .pro for the Qt projects I had it defined by an environment path variable that definitely points to the correct folder location.

I'll try a minimal and complete example that can replicate the behavior on our system and get back to you on that. Thanks for all of the help as usual!

    - Blanky
Blanky Blanky
Reply | Threaded
Open this post in threaded view
|

Re: osgEarth 2.10 OpenThreads Undefined behavior

In reply to this post by gwaldron
Hey Glenn,

We created a minimal example and like you said, you thought we did something funky or the build/path was wrong. Our build was fine as exemplified when I made a minimal example. I had to strip away literally everything from our code to find the culprit and it turned out to be a osgEarth::Annotation::FeatureNode that we mistakenly set to have a Clamping set for its Altitude Symbol Style. However, we never set a size for the points for its osgEarth::Geometry on creation and it was a mistake to even have it made before the user clicked and dragged, so we assumed it wasn't being added to the scene graph during the traversal, but I guess in 2.10 something happened to make it become incorrectly traversed. We were implementing a rubber band tool and added it to the scene graph in some setup code and never suspected it to be the culprit since this behavior was never present in 2.9. We wrapped the FeatureNode in an osg::ref_ptr<T>() because we were trying to assign a new parent to it while removing the previous parent. It seems when reparenting it that the setMapNode on it was causing the issue, but the code wouldn't crash until we called osgViewer::Viewer::frame() since it wasn't being traversed yet. Why the behavior was undefined and why the code ran only sometimes is a little strange, but alas we found the underlying issue. Thanks for pointing us towards some things to look at because it helped us isolate the issue finally.

- Blanky
gwaldron gwaldron
Reply | Threaded
Open this post in threaded view
|

Re: osgEarth 2.10 OpenThreads Undefined behavior

Blanky,
I'm glad you found the issue. So just to clarify, trying to create a FeatureNode from an empty osgEarth::Geometry resulted in the crash? If so would you mind opening a GitHub issue so we can track and fix it? Thanks.
Glenn Waldron / Pelican Mapping
Blanky Blanky
Reply | Threaded
Open this post in threaded view
|

Re: osgEarth 2.10 OpenThreads Undefined behavior

I would love to open the issue, but my work is classified for the most part and I don't have a working environment at my home environment for osg nor osgEarth. From the top of my head we set up the osgEarth::Annotation::FeatureNode* like this:
   

    osgEarth::MapNode* mapNode = dynamic_cast<osgEarth::MapNode*>(...//map node helper code that returned the osg::Node* from the agument parser using a .earth file path);

    osgEarth::Symbology::Style s;

    osgEarth::Features::Feature* feature = new osgEarth::Features::Feature(new osgEarth::Geometry, mapNode->getMapSRS()->getGeographicSRS(), s); //This was the thing you may have been worried about since we constructed a feature with an empty Geometry

    osgEarth::Annotation::FeatureNode* featureNode = new osgEarth::Annotation::FeatureNode(feature, s);

    mapNode->addChild(featureNode);


Then later we wrapped the featureNode object in an osg::ref_ptr<osgEarth::Annotation::FeatureNode*>(featureNode) when trying to "re-parent it" to another internal scene graph since we wanted the rubber band tool to be visually only to the current in focus viewer window and camera for that projection's scene graph. Since we had both projected and geocentric views we had to have two different osgEarth::MapNodes*, although we essentially synced them together(doing something on one we did on the other). Sometimes we would re-parent the node from one osgEarth::MapNode* to another when the user clicked in a different viewer and dragged. Either the code would run, regardless of the fact that we had just one viewer of one type or not, or it would crash. I'm not sure if this helps, but removing this node from the code prevented our crashes in 2.10, but this behavior was never noticed in 2.9.

- Blanky
Blanky Blanky
Reply | Threaded
Open this post in threaded view
|

Re: osgEarth 2.10 OpenThreads Undefined behavior

Hey Glenn,

We looked into a bit a more and have ruled out it being due to an empty osgEarth::Geometry. We learned (and we're not sure why) installing a custom node call back on the osgEarth::Annotation::FeatureNode was causing the issue. However, it's unclear why. We even made the operator() do nothing but call traverse on the node and its node visitor (or even nothing, just return), but it will crash in an undefined fashion regardless. Strange undefined behavior occurred when we manipulated the node's user values too. Things we noticed:

1. Setting the userValue with something like node->setUserValue(_cameraName, true); will crash it more often than not (like 5% success on the first viewer->frame() call)

2. Setting a random userValue that does nothing/never used like node->setUserValue("test", 32154); will make it more stable (but only after setting the original _cameraName user value too, like 85% success on first viewer->frame() call)

3. Deferring adding of the callback until much later fixed the issue (only on the actual click event in our osg::handler instead of in the constructor for our handler object which gets called on scene graph setup before the viewer even loads/calls frame)

Rolling back to 2.9 never presents this behavior (we checked multiple times). The only real solution seemed to be deferring the addition of the node callback until later, but we're just unsure as to why it would be necessary since we're using the same version of osg and nothing there should have changed.

Also, I'd like to note we made a minimal example and we couldn't replicate the behavior in a smaller test case. The only difference I would say was that it was being done in a separate constructor call for an osg::handler. We add it to the osgEarth::MapNode in said constructor both in the crashing and correct code (so nothing different there).

- Blanky