拆分重叠段

提问者：小点点

这是一个段的向量

class Segment
{
public:
   size_t left;
   size_t right;
   char ID;
   Segment(size_t a, size_t b, char c):left(a), right(b), ID(c){assert(left<right);}
};


std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}}

向量a基于属性left进行排序。可以将该向量的内容绘制为

       -------  'A'
           ---------------  'B'
                  ---  'C'
                    ---   'D' 
                             ------  'E'
                                ----   'F'
                                      ---  'G'
                                      ---  'H'
                                                  ---  'I'
                                                        -------  'J'

我想创建对象b，它包含来自a中那些大段的不重叠的小段。要进入b，我们必须收缩彼此重叠的段，并为所有重叠的地方创建一个ID为x的新段。矢量b也需要基于left属性进行排序。

对于上面的示例，预期的输出是

std::vector<Segment> B = {{3, 7, 'A'}, {7, 10, `X`}, {10, 14, 'B'}, {14, 19, 'X'}, {19, 22, 'B'} , {25, 28, 'E'}, {28, 31, 'X'}, {31, 32, 'F'}, {34, 37, 'X'}, {46, 49, 'I'}, {52, 59, 'J'}}


       ----  'A'
           ---  'X'  (overlap between 'A' and 'B')
              ----  'B'
                  --------  'X' (overlap between 'B', 'C' and 'D')
                          ---  'B'  -> Note that 'B' is now split in two
                                ---  'E'
                                   ---   'X'  (overlap between 'E' and 'F')
                                      -  'F'
                                         ---  'X' (overlap between 'G' and 'H')
                                                     ---  'I'
                                                           -------  'J'

有人能帮我一把吗？

注意，与上面的示例不同，a中的两个段实际上可以具有相同的ID（无论它们是否重叠）。还要注意，a不是const，可以在操作过程中修改。出于性能考虑，请注意向量a通常相对较短；长度在1到大约100（或几百）段之间。值left和right通常非常大（在0到大约1e9)，并且这些段中只有很少部分会相交。通常，当有很少的段时，那些段会相当宽（当size为1时，单个段的宽度通常约为1e9)。最后，您可以用

void print(std::vector<Segment>& v)
{
    for (auto& elem : v)
    {
        std::cout << "{" << elem.left << ", " << elem.right << ", " << elem.ID << "} ";
    }
    std::cout << "\n";

    for (auto& elem : v)
    {
        for (size_t i = 0 ; i < elem.left ; ++i)
            std::cout << " ";
        for (size_t i = 0 ; i < elem.right - elem.left ; ++i)
            std::cout << "-";
        std::cout << "  " << elem.ID << "\n";           
    }
    std::cout << "\n\n\n";  
}

尝试

为了显示努力，这里有一个尝试，1）是错误的，2）将是一个相对缓慢的实现。让我们调用“断点”，在下面的b向量中的任何左边的右边。其思想是通过系统地在前面和后面的段中搜索潜在的下一个断点，从一个断点跳到下一个断点。在这样做的时候，它应该跟踪应该在新的段中给出什么ID（如果断点之间的距离与a中的至少一个段匹配）。

std::vector<Segment> foo(std::vector<Segment>& A)
{
    if (A.size() <= 1) return A;

    std::vector<Segment> B;
    B.reserve(A.size());
    
    size_t A_index = 0;
    size_t currentPos = A[A_index].left;
    while ( A_index < A.size())
    {
        auto nextPos = A[A_index].right;

        //std::cout << "currentPos = " << currentPos << "\n";
        //std::cout << "nextPos before search = " << nextPos << "\n";

        bool isIntersection = false;
        // Search in preceding Segments
        for (size_t i = A_index - 1 ; i < A.size() ; --i)
        {
            if (A[i].right > currentPos && A[i].right < nextPos )
            {
                nextPos = A[i].right;
                isIntersection = true;
                //std::cout << "Found " << nextPos << " in preceding segment\n";
            }
        }

        // Search in following Segments
        for (size_t i = A_index+1 ; i < A.size() ; ++i)
        {
            if ( A[i].left > currentPos && A[i].left < nextPos)
            {
                nextPos = A[i].left;
                //std::cout << "Found left of " << nextPos << " in following segment\n";
                break;
            }

            if ( A[i].right > currentPos &&  A[i].right < nextPos )
            {
                nextPos = A[i].right;
                isIntersection = true;
                //std::cout << "Found right of " << nextPos << " in following segment\n";
                break;
            }
        }

        // create new Segment
        if (!isIntersection)
        {
            B.push_back({currentPos, nextPos, A[A_index].ID});
        } else
        {
            B.push_back({currentPos, nextPos, 'X'});
        }
        if (nextPos == A[A_index].right)
        {
            ++A_index;
            nextPos = A[A_index].left;
        }
        currentPos = nextPos;
    }

    return B;
}


int main()
{
    std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}};
    print(A);
    auto B = foo(A);
    print(B);
}

共1个答案

匿名用户

下面的并不是最有效率的，但它产生了预期的产出。战略如下:

为了方便起见，我使用了

struct Overlap {
    std::vector<char> IDs;
    Overlap() {}
    void add(char id) {IDs.push_back(id);}
};

最后一步的实现需要删除构造函数中的left<；right要求。

int main() {
    std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}};   
    // dissect
    std::map<size_t,Overlap> over;
    for (const auto& s : A) {
        for (size_t i = s.left; i < s.right; ++i) {
            over[i].add(s.ID);
        }
    }
    // assign new segments 
    std::map<size_t,char> pieces;
    for (const auto& o : over) {
        if (o.second.IDs.size() == 1) {
            pieces[o.first] = o.second.IDs.front();
        } else {
            pieces[o.first] = 'X';
        }
    }
    // glue them
    std::vector<Segment> result;
    auto it = pieces.begin();
    Segment current(it->first,it->first,it->second);  // here left==right !
    ++it;
    for ( ; it != pieces.end(); ++it) {
        if (it->second == current.ID) continue;
        current.right = it->first -1;
        result.push_back(current);
        current = Segment{it->first,it->first,it->second};
    }

    print(result);
}

输出:

{3, 6, A} {7, 9, X} {10, 13, B} {14, 18, X} {19, 24, B} {25, 27, E} {28, 30, X} {31, 33, F} {34, 45, X} {46, 51, I} 
   ---  A
       --  X
          ---  B
              ----  X
                   -----  B
                         --  E
                            --  X
                               --  F
                                  -----------  X
                                              -----  I

活生生的例子

实际上唯一昂贵的部分是1（O（段数x它们的宽度）。我相信可以通过使用两个容器来提高效率，一个根据left排序，另一个根据right排序，以便更容易地检测重叠。然而，上面的简单实现是我要开始的。另外，如果宽度相对于段数来说相当小，那么对于排序来说，O（段数x宽度）可能比O（段数x段数）更好。