Simulation de particules de l'ascenseur de Newton sur CUDA



https://www.youtube.com/playlist?list=PLwr8DnSlIMg0KABru36pg4CvbfkhBofAi



Sur HabrĂ©, je suis tombĂ© sur un article assez curieux "Mythes scientifiques et techniques, partie 1. Pourquoi les avions volent-ils?" ... L'article dĂ©crit en dĂ©tail les problĂšmes qui surviennent lorsque l'on tente d'expliquer la portance des ailes en termes de loi de Bernoulli ou de modĂšle de portance newtonienne . Et bien que l'article offre d'autres explications, j'aimerais quand mĂȘme m'attarder plus en dĂ©tail sur le modĂšle de Newton. Oui, le modĂšle de Newton est incomplet et comporte des hypothĂšses, mais il donne une description plus prĂ©cise et intuitive des phĂ©nomĂšnes que la loi de Bernoulli.



— . - , , .



, . ? GitHub.



Computational Fluid Dynamics



, (Computational fluid dynamics, CFD), -.



, , 7- . 38 (“”), 40 (“ ‘’ ”) 41 (“ ‘’ ”). , - — . (, ) . , , .



: . . .



— . , SpaceX . , , , . . , . , (Smoothed-particle hydrodynamics, SPH). : , , , . . , . , , (screen-space meshes, dual contouring, marching tetrahedra, metaballs). , .



Discrete Element Method



, .



(Discrete Element Method, DEM) , . , , , .





, ( X Y), . , . . — NASA . , STS-1 . Mission Anomalies.



DEM — (Discrete Collision Detection). .



Continuous Collision Detection (CCD), , . , , . . CCD , « ». , Unity Unreal .





Continuous Collision Detection



CCD . — . . — , , , . , “ ”





, DEM.





, , , , . , . , . .



National Geographic. , . , , .



( National Geographic)



. — . — . . .



https://youtu.be/cx8XbaQNnxw?t=2206



, . , . (heatmap) ( ) ( ) . .



.





( )





,









, . :



  1. .
  2. CPU CUDA.


. . CUDA- , , 16 .





. , , , , , .., . . .



, . (Ordinary Differential Equation, ODE). dx/dt=f(x,t), x — , f — , . x0 dx/dt, x1=x0+dxdtdt=x0+dx.



, f, — .



'Differential Equation Basics' and 'Particle Dynamics' https://www.cs.cmu.edu/~baraff/sigcourse/



3Blue1Brown :

https://www.youtube.com/playlist?list=PLZHQObOWTQDNPOjrT6KVlfJuKtYTftqH6





, — (Forward Euler). - 4- (RK4), , . RK4 , , . , , - . , , 'Differential Equation Basics' lecture notes, section 3, 'Adaptive Stepsizes'.



, , . , . .



CPU- . GPU- .
float CSimulationCpu::ComputeMinDeltaTime(float requestedDt) const
{
    auto rad = m_state.particleRad;
    auto velBegin = m_curOdeState.cbegin() + m_state.particles;
    auto velEnd = m_curOdeState.cend();

    return std::transform_reduce(std::execution::par_unseq, velBegin, velEnd, requestedDt, [](const auto t1, const auto t2)
    {
        return std::min(t1, t2);
    }, [&](const auto& v)
    {
        auto vel = glm::length(v);
        auto radDt = rad / vel;
        return radDt;
    });
}

float CSimulationCpu::Update(float dt)
{
    dt = ComputeMinDeltaTime(dt);
    m_odeSolver->NextState(m_curOdeState, dt, m_nextOdeState);
    ColorParticles(dt);
    m_nextOdeState.swap(m_curOdeState);
    return dt;
}




— f, “ ”. CPU CUDA . , CPU , . CUDA , . . “ ”.



//CPU- 
void CDerivativeSolver::Derive(const OdeState_t& curState, OdeState_t& outDerivative) 
{
    ResetForces();
    BuildParticlesTree(curState);
    ResolveParticleParticleCollisions(curState);
    ResolveParticleWingCollisions(curState);
    ParticleToWall(curState); 
    ApplyGravity();
    BuildDerivative(curState, outDerivative);
} 

//CUDA- 
void CDerivativeSolver::Derive(const OdeState_t& curState, OdeState_t& outDerivative) 
{ 
    BuildParticlesTree(curState);
    ReorderParticles(curState);
    ResetParticlesState();
    ResolveParticleParticleCollisions();
    ResolveParticleWingCollisions();
    ParticleToWall();
    ApplyGravity();
    BuildDerivative(curState, outDerivative);
}




, , , — . 2’097’152 . - . , . — .



— Uniform Grid, . GPU “Chapter 32. Broad-Phase Collision Detection with CUDA”.





( Tero Karras, NVIDIA Corporation)



, O(1). 9 (3x3) 27 (3x3x3) 2D 3D . — . , , RCU lock-free . Nvidia , malloc()/free() device , .





CppCon 2017: Fedor Pikus “Read, Copy, Update, then what? RCU for non-kernel programmers”



, :



  1. .
  2. RAM/VRAM, -, .
  3. , .
  4. .
  5. GPU, lock-free (https://youtu.be/86seb-iZCnI?t=2311, ).


, — BVH- Z-. , .



— Z-, .





Z- ( Wikipedia)





—

( Wikipedia)



, space-filling curve, , . 2D/3D , , , , , . . , , , , , .



, . , . , Tero Karras Nvidia, .



“Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees”.

:



  1. : https://developer.nvidia.com/blog/thinking-parallel-part-ii-tree-traversal-gpu/
  2. : https://developer.nvidia.com/blog/thinking-parallel-part-iii-tree-construction-gpu/


, . N , bounding box , , ( Z-). .



( Tero Karras, NVIDIA Corporation)



. , , , N N-1 . , . . , . , . .





( Tero Karras, NVIDIA Corporation)





( Tero Karras, NVIDIA Corporation)



, , BVH-.



BVH- ( Tero Karras, NVIDIA Corporation)



N , , . , , , . atomicExch() 1. , . 0, . , , . . 1, , , .



.





“-” “- ”.



“-” , , “” , . Tero Karras. , A-B B-A , . . (rightmost leaf), . , rightmost(N2) = 4, rightmost(N3) = 8. , , , O6, , N2. rightmost, , O6 N2. O6 N2. , , N2, N3. , O6 , , N2.



:



template<typename TDeviceCollisionResponseSolver, size_t kTreeStackSize>
void CMortonTree::TraverseReflexive(const TDeviceCollisionResponseSolver& solver);


“- ” :



template<typename TDeviceCollisionResponseSolver, size_t kTreeStackSize>
void CMortonTree::Traverse(const thrust::device_vector<SBoundingBox>& objects, const TDeviceCollisionResponseSolver& solver);


TDeviceCollisionResponseSolver — , :



struct Solver
{
    struct SDeviceSideSolver
    {
        ... 
        __device__ SDeviceSideSolver(...);
        __device__ void OnPreTraversal(TIndex curId);
        __device__ void OnCollisionDetected(TIndex leafId);
        __device__ void OnPostTraversal();
    };
    Solver(...);
    __device__ SDeviceSideSolver Create();
}; 


, , . Create(). OnPreTraversal, . - , OnCollisionDetected . . OnPostTraversal.



. -. , Tero Karras. . NxO, N — , O — . , . . , (coalesced memory access). , . , .



“-”
struct SParticleParticleCollisionSolver
{
    struct SDeviceSideSolver
    {
        CDerivativeSolver::SIntermediateSimState& simState;
        TIndex curIdx;
        float2 pos1;
        float2 vel1;
        float2 totalForce;
        float totalPressure;

        __device__ SDeviceSideSolver(CDerivativeSolver::SIntermediateSimState& state) : simState(state)
        {

        }

        __device__ void OnPreTraversal(TIndex curLeafIdx)
        {
            curIdx = curLeafIdx;
            pos1 = simState.pos[curLeafIdx];
            vel1 = simState.vel[curLeafIdx];
            totalForce = make_float2(0.0f);
            totalPressure = 0.0f;
        }

        __device__ void OnCollisionDetected(TIndex anotherLeafIdx)
        {
            const auto pos2 = simState.pos[anotherLeafIdx];
            const auto deltaPos = pos2 - pos1;
            const auto distanceSq = dot(deltaPos, deltaPos);
            if (distanceSq > simState.diameterSq || distanceSq < 1e-8f)
                return;

            const auto vel2 = simState.vel[anotherLeafIdx];

            auto dist = sqrtf(distanceSq);
            auto dir = deltaPos / dist;
            auto springLen = simState.diameter - dist;

            auto force = SpringDamper(dir, vel1, vel2, springLen);
            auto pressure = length(force);
            totalForce += force;
            totalPressure += pressure;

            atomicAdd(&simState.force[anotherLeafIdx].x, -force.x);
            atomicAdd(&simState.force[anotherLeafIdx].y, -force.y);
            atomicAdd(&simState.pressure[anotherLeafIdx], pressure);
        }

        __device__ void OnPostTraversal()
        {
            atomicAdd(&simState.force[curIdx].x, totalForce.x);
            atomicAdd(&simState.force[curIdx].y, totalForce.y);
            atomicAdd(&simState.pressure[curIdx], totalPressure);
        }
    };

    CDerivativeSolver::SIntermediateSimState simState;
    SParticleParticleCollisionSolver(const CDerivativeSolver::SIntermediateSimState& state) : simState(state)
    {
    }
    __device__ SDeviceSideSolver Create()
    {
        return SDeviceSideSolver(simState);
    }
};

void CDerivativeSolver::ResolveParticleParticleCollisions()
{
    m_particlesTree.TraverseReflexive<SParticleParticleCollisionSolver, 24>(SParticleParticleCollisionSolver(m_particles.GetSimState()));
    CudaCheckError();
}


, , OnCollistionDetected . : - A, B, C D, Z , :



lock-step # Thread #1 Thread #2 Thread #3
1 OnCollisionDetected

A <-> C
OnCollisionDetected

B <-> C
OnCollisionDetected

C <-> D
2 OnCollisionDetected

A <-> D
OnCollisionDetected

B <-> D
INACTIVE
3 OnPostTraversal(1) OnPostTraversal(2) OnPostTraversal(3)


, 1 2 #1 #2 atomicAdd C D OnCollistionDetected. atomic .



Volta, Nvidia warp-vote . match_any warp, , .



match_any match_all



, warp shuffle .



Warp-wide



, . SM . , Pascal (1080 Ti) , . , , . , atomic , Arithmetic Workload . Volta Turing . , RTX 2060 , atomic . “ ”.





, , .



SoA



Structure of Arrays — , .





. , , SoA:



typedef uint32_t TIndex; 

struct STreeNodeSoA 
{
    const TIndex leafs;

    int* __restrict__ atomicVisits; 
    TIndex* __restrict__ parents; 
    TIndex* __restrict__ lefts; 
    TIndex* __restrict__ rights; 
    TIndex* __restrict__ rightmosts; 
    SBoundingBox* __restrict__ boxes; 
    const uint32_t* __restrict__ sortedMortonCodes; 
};


:



struct SIntermediateSimState 
{ 
    const TIndex particles; 
    const float particleRad; 
    const float diameter; 
    const float diameterSq; 

    float2* __restrict__ pos; 
    float2* __restrict__ vel; 
    float2* __restrict__ force; 
    float* __restrict__ pressure; 
}; 


, bounding box’ SoA , . Array of Structures (AoS):



struct SBoundingBox 
{ 
    float2 min; 
    float2 max; 
}; 

struct SBoundingBoxesAoS 
{ 
    const size_t  count; 
    SBoundingBox* __restrict__ boxes; 
}; 




, , . , . . , uncoalesced memory access.



GPU. coalesced memory access, . , . SpaceX: https://youtu.be/vYA0f6R5KAI?t=1939 ( ).





( SpaceX)



50% : 8 FPS 12 FPS .



UPD: , 16 , 192 . , .



Shared Memory



, . , — . SM , uncoalesced. , Shared Memory Streaming Multiprocessor’.



__device__ void traverseIterative( CollisionList& list,
                                   BVH& bvh, 
                                   AABB& queryAABB, 
                                   int queryObjectIdx)
{
    // Allocate traversal stack from thread-local memory,
    // and push NULL to indicate that there are no postponed nodes.
    NodePtr stack[64]; //<----------------------------  
    NodePtr* stackPtr = stack;
    *stackPtr++ = NULL; // push

    // Traverse nodes starting from the root.
    NodePtr node = bvh.getRoot();
    do
    {
        // Check each child node for overlap.
        NodePtr childL = bvh.getLeftChild(node);
        NodePtr childR = bvh.getRightChild(node);
        bool overlapL = ( checkOverlap(queryAABB, 
                                       bvh.getAABB(childL)) );
        bool overlapR = ( checkOverlap(queryAABB, 
                                       bvh.getAABB(childR)) );

        // Query overlaps a leaf node => report collision.
        if (overlapL && bvh.isLeaf(childL))
            list.add(queryObjectIdx, bvh.getObjectIdx(childL));

        if (overlapR && bvh.isLeaf(childR))
            list.add(queryObjectIdx, bvh.getObjectIdx(childR));

        // Query overlaps an internal node => traverse.
        bool traverseL = (overlapL && !bvh.isLeaf(childL));
        bool traverseR = (overlapR && !bvh.isLeaf(childR));

        if (!traverseL && !traverseR)
            node = *--stackPtr; // pop
        else
        {
            node = (traverseL) ? childL : childR;
            if (traverseL && traverseR)
                *stackPtr++ = childR; // push
        }
    }
    while (node != NULL);
}


Shared Memory
template<typename TDeviceCollisionResponseSolver, size_t kTreeStackSize, size_t kWarpSize = 32>
__global__ void TraverseMortonTreeKernel(const CMortonTree::STreeNodeSoA treeInfo, const SBoundingBoxesAoS externalObjects, TDeviceCollisionResponseSolver solver)
{
    const auto threadId = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadId >= externalObjects.count)
        return;

    const auto objectBox = externalObjects.boxes[threadId];
    const auto internalCount = treeInfo.leafs - 1;

    __shared__ TIndex stack[kTreeStackSize][kWarpSize]; //  

    unsigned top = 0;
    stack[top][threadIdx.x] = 0;

    auto deviceSideSolver = solver.Create();
    deviceSideSolver.OnPreTraversal(threadId);

    while (top < kTreeStackSize) //top == -1 also covered
    {
        auto cur = stack[top--][threadIdx.x];

        if (!treeInfo.boxes[cur].Overlaps(objectBox))
            continue;

        if (cur < internalCount)
        {
            stack[++top][threadIdx.x] = treeInfo.lefts[cur];
            if (top + 1 < kTreeStackSize)
                stack[++top][threadIdx.x] = treeInfo.rights[cur];
            else
                printf("stack size exceeded\n");

            continue;
        }

        deviceSideSolver.OnCollisionDetected(cur - internalCount);
    }

    deviceSideSolver.OnPostTraversal();
}


Shared Memory 43%: 14 FPS 20 FPS. :



https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses





, — 1080 Ti Pascal. , , . 20- . .





20- RTX . .



, , . . :





.



atomic- lock-free . , (Out-of-order execution), , . lock-free . , , , . -, , , C++ . std::memory_order.



__device__ void CMortonTree::STreeNodeSoA::BottomToTopInitialization(size_t leafId)
{
    auto cur = leafs - 1 + leafId;
    auto curBox = boxes[cur];

    while (cur != 0)
    {
        auto parent = parents[cur];
        //1.  atomic 
        auto visited = atomicExch(&atomicVisits[parent], 1);
        if (!visited)
            return;

        TIndex siblingIndex;
        SBoundingBox siblingBox;

        TIndex rightmostIndex;
        TIndex rightmostChild;

        auto leftParentChild = lefts[parent];
        if (leftParentChild == cur)
        {
            auto rightParentChild = rights[parent];
            siblingIndex = rightParentChild;
            rightmostIndex = rightParentChild;
        }
        else
        {
            siblingIndex = leftParentChild;
            rightmostIndex = cur;
        }

        siblingBox = boxes[siblingIndex];
        rightmostChild = rightmosts[rightmostIndex];

        SBoundingBox parentBox = SBoundingBox::ExpandBox(curBox, siblingBox);
        boxes[parent] = parentBox;
        rightmosts[parent] = rightmostChild;

        cur = parent;
        curBox = parentBox;

        //2.   . 
        //       
        __threadfence();
    }
}


, , BVH , . , . , SIMT ( Volta SIMT Model Starvation-Free Algorithms) Nvidia Volta Out-of-order execution. , , atomicExch(), .. , Turing . , , , , . data race.



thrust::device_vector



, thurst::device_vector . resize() GPU cudaDeviceSynchronize(). , . , . , , . — . — (reduce, sum, min, max) , . Cuda UnBound (CUB) . , thrust , .





, , .





,





, GPU -, . , . “ ”, APOD .



, , , , GPU.



Si vous souhaitez commencer Ă  apprendre CUDA mais ne savez pas par oĂč commencer, il existe un excellent cours sur Youtube d'Udacity «Introduction Ă  la programmation parallĂšle». Je recommande de revoir.

https://www.youtube.com/playlist?list=PLAwxTw4SYaPnFKojVQrmyOGFCqHTxfdv2



Enfin, quelques simulations vidéo supplémentaires:





Version CPU, 8 threads, 131'072 particules





Version CUDA, 4,2 millions de particules, simulation de 30 minutes




All Articles