Skip to content

Guide

Usage

The package provides two different kinds of functions. The first class is intended to operate on two arrays. The second class operates on two or more arrays and calls the first class of functions internally.

Two-way functions

Two sorted numpy arrays can be intersected with the intersect function, which takes two numpy arrays and returns the sorted intersection of the two arrays.

## intersect.py
import numpy as np
import sortednp as snp

a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])

i = snp.intersect(a, b)
print(i)

If you run this, you should see the intersection of both arrays as a sorted numpy array.

$ python3 intersect.py
[3 7]

Two numpy sorted arrays can be merged with the merge function, which takes two numpy arrays and returns the sorted union of the two arrays.

## merge.py
import numpy as np
import sortednp as snp

a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])

m = snp.merge(a, b)
print(m)

If you run this, you should see the union of both arrays as a sorted numpy array.

$ python3 merge.py
[0 1 2 3 3 4 5 6 7 7 9]

Since version 0.4.0, the library provides the issubset(a, b) function which checks if the array a is a subset of b, and the isitem(v, a) function which checks if value is contained in array a.

## set.py
import numpy as np
import sortednp as snp

a = np.array([2, 4, 5, 10])
b = np.array([1, 2, 3, 4, 5, 6, 10, 11])

print(snp.issubset(a, b))  # a is subset of b
print(snp.issubset(b, a))  # b is not a subset of a

print(snp.isitem(4, a))  # 4 is an item of a
print(snp.isitem(3, a))  # 3 is not an item of a

If you execute this example, you get the expected result: a is a subset ob b, 4 is a member of a.

$ python3 set.py
True
False
True
False

Returning array indices

The intersect function takes an optional argument indices which is False by default. If this is set to True, the return value consists of the intersection array and a tuple with the indices of the common values for both arrays. The index arrays have the length of the output. The indices show the position in the input from which the value was copied.

## intersect_indices.py
import numpy as np
import sortednp as snp

a = np.array([2,4,6,8,10])
b = np.array([1,2,3,4])

intersection, indices = snp.intersect(a,b, indices=True)

print(intersection)
print(indices)

The above example gives:

$ python3 intersect_indices.py
[2 4]
(array([0, 1]), array([1, 3]))

The first line shows the intersection of the two arrays. The second line prints a tuple with the indices where the common values appeared in the input arrays. For example, the value 4 is at position 1 in array a and at position 3 in array b.

Since version 0.3.0, the merge has to indices argument too. The returned indices have the length of the inputs. The indices show the position in the output to which an input value was copied.

## merge_indices.py
import numpy as np
import sortednp as snp

a = np.array([2,4])
b = np.array([3,4,5])

merged, indices = snp.merge(a,b, indices=True)

print(merged)
print(indices)

The above example gives:

$ python3 merge_indices.py
[2 3 4 4 5]
(array([0, 2]), array([1, 3, 4]))

The first line shows that the two arrays have been merged. The second line prints a tuple with the indices. For example, the value 3 from array b can be found at position 1 in the output.

Duplicate treatment

Since version 0.3.0, sortednp supported multiple different strategies to deal with duplicated entries.

Duplicates during intersecting

There are three different duplicate treatments for the intersect function:

  • DROP: Ignore any duplicated entries. The output will contain only unique values.

  • KEEP_MIN_N: If an entry occurs n times in one input array and m times in the other input array, the output will contain the entry min(n, m) times.

  • KEEP_MAX_N: If an entry occurs n times in one input array and m times in the other input array, the output will contain the entry max(n, m) times (assuming the entry occurs at least once in both arrays, i.e. n > 0 and m > 0).

The strategy can be selected with the optional duplicates argument of intersect. The default is sortednp.IntersectDuplicates.KEEP_MIN_N. Consider the following example.

## intersect_duplicates.py
import numpy as np
import sortednp as snp

a = np.array([2, 4, 4, 5])    # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times

intersect_drop = snp.intersect(
    a, b,
    duplicates=snp.IntersectDuplicates.DROP
)
print(intersect_drop)  # Contains a single 4

intersect_min = snp.intersect(
    a, b,
    duplicates=snp.IntersectDuplicates.KEEP_MIN_N
)
print(intersect_min)  # Contains 4 twice

intersect_max = snp.intersect(
    a, b,
    duplicates=snp.IntersectDuplicates.KEEP_MAX_N
)
print(intersect_max)  # Contains 4 three times

The above example gives:

$ python3 intersect_duplicates.py
[4 5]
[4 4 5]
[4 4 4 5]

Duplicates during merging

The merge function offers three different duplicates treatment strategies:

  • DROP: Ignore any duplicated entries. The output will contain only unique values.

  • DROP_IN_INPUT: Ignores duplicated entries in the input arrays separately. This is the same as ensuring that each input array unique values. The output contains every value at most twice.

  • KEEP: Keep all duplicated entries. If an item occurs n times in one input array and m times in the other input array, the output contains the item n + m times.

The strategy can be selected with the optional duplicates. The default is KEEP. Consider the following example.

## merge_duplicates.py
import numpy as np
import sortednp as snp

a = np.array([2, 4, 4, 5])    # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times

merge_drop = snp.merge(
    a, b,
    duplicates=snp.MergeDuplicates.DROP
)
print(merge_drop)  # Contains a single 4

merge_dii = snp.merge(
    a, b,
    uplicates=snp.MergeDuplicates.DROP_IN_INPUT)
print(merge_dii)  # Contains 4 twice

merge_keep = snp.merge(
  a, b,
  duplicates=snp.MergeDuplicates.KEEP)
print(merge_keep)  # Contains 4 five times

The above example gives:

$ python3 merge_duplicates.py
[2 3 4 5]
[2 3 4 4 5 5]
[2 3 4 4 4 4 4 5 5]

Duplicates during subset checks

The issubset function offers two different duplicates treatment strategies:

  • IGNORE: Ignore any duplications. The function returns True if each value in the first array is contained at least once in the second array. Duplicated entries in the first array do not change the return value.

  • REPEAT: For each duplicated item in the first array, require at least as many items in the second array. If for one value the first array contains more duplicated entries than the second array, the function returns False.

The strategy can be selected with the optional duplicates. The default is IGNORE. Consider the following example.

## subset_duplicates.py
import numpy as np
import sortednp as snp

a = np.array([3, 4, 4, 5])    # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times

# Number of occurances ignored
print(snp.issubset(a, b, duplicates=snp.SubsetDuplicates.IGNORE))  # is subset
print(snp.issubset(b, a, duplicates=snp.SubsetDuplicates.IGNORE))  # is subset

# Number of in subset must be smaller or equal
print(snp.issubset(a, b, duplicates=snp.SubsetDuplicates.REPEAT))  # is subset

# three 4s not subset of two 4s
print(snp.issubset(b, a, duplicates=snp.SubsetDuplicates.REPEAT))

The above example gives:

$ python3 subset_duplicates.py
True
True
True
False

Index tracking and duplicates

Tracking indices with the indices=True argument is possible while selecting a non-default duplicate treatment strategy. For merging the indices point to the position in the output array. If the input has duplicates that were skipped, the index is simply repeated. For example with snp.IntersectDuplicates.DROP, if the input is [9, 9, 9, 9], the index array for this input contains four times the position where 9 is found in the output.

Similarly, with snp.IntersectDuplicates.KEEP_MAX_N and intersect, the index of the last item in the array with less occurrences is duplicates.

## duplicates_index.py
import numpy as np
import sortednp as snp

a = np.array([2, 4, 4, 5])    # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times

# Merge
merge_drop, (index_a, index_b) = snp.merge(
    a, b,
    duplicates=snp.IntersectDuplicates.DROP,
    indices=True
)
print(index_b)

# Intersect
intersect_max, (index_a, index_b) = snp.intersect(
    a, b,
    duplicates=snp.IntersectDuplicates.KEEP_MAX_N,
    indices=True
)
print(index_a)

The above example gives:

$ python3 duplicates_index.py
[1 2 2 2 3]
[1 2 2 3]

For merging, this means that the three 4s from the input all appear at same position in the output, namely position 2.

For the intersect, this means that the second and third occurrence of 4 in the output, both came from item at position 2 in the input.

k-way functions

Similarly, the k-way intersect and merge functions take two or more arrays and perform the merge or intersect operation on its arguments.

## kway_intersect.py
import numpy as np
import sortednp as snp

a = np.array([0, 3, 4, 6, 7])
b = np.array([0, 3, 5, 7, 9])
c = np.array([1, 2, 3, 5, 7, 9])
d = np.array([2, 3, 6, 7, 8])

i = snp.kway.intersect(a, b, c, d)
print(i)

If you run this, you should see the intersection of all four arrays as a sorted numpy array.

$ python3 kway.intersect.py
[3 7]

The k-way merger sortednp.kway.merge works analogously. However, the native numpy implementation is faster compared to the merge provided by this package. The k-way merger has been added for completeness. The package heapq provides efficient functions to merge multiple arrays simultaneously.

The functions sortednp.kway.merge and sortednp.kway.intersect accept the optional keyword argument assume_sorted. By default, it is set to True. If it is set to False, the function calls sort() on the input arrays before performing the operation. The default should be kept if the arrays are already sorted to save the time it takes to sort the arrays.

Since the arrays might be too large to keep all of them in memory at the same time, it is possible to pass a callable instead of an array to the functions. The callable is expected to return the actual array. It is called immediately before the array is required. This reduces the memory consumption.

Algorithms

Intersections are calculated by iterating both arrays. For a given element in one array, the function needs to search the other and check if the element is contained. In order to make this more efficient, we can use the fact that the arrays are sorted. There are three search functions, which can be selected via the optional keyword argument algorithm.

  • SIMPLE_SEARCH: Search for an element by linearly iterating over the array element-by-element. More Information.
  • BINARY_SEARCH: Slice the remainder of the array in halves and repeat the procedure on the slice which contains the searched element. More Information.
  • GALLOPING_SEARCH: First, search for an element linearly, doubling the step size after each step. If a step goes beyond the search element, perform a binary search between the last two positions. More Information.

The default is sortednp.Algorithms.GALLOPING. The performance of all three algorithms is compared in the next section. The functions issubset() and isitem() also support the algorithm keyword.