Guide
Usage
The package provides two different kinds of functions. The first class is intended to operate on two arrays. The second class operates on two or more arrays and calls the first class of functions internally.
Two-way functions
Two sorted numpy arrays can be intersected with the intersect function, which takes two numpy arrays and returns the sorted intersection of the two arrays.
## intersect.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])
i = snp.intersect(a, b)
print(i)
If you run this, you should see the intersection of both arrays as a sorted numpy array.
Two numpy sorted arrays can be merged with the merge function, which takes two numpy arrays and returns the sorted union of the two arrays.
## merge.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])
m = snp.merge(a, b)
print(m)
If you run this, you should see the union of both arrays as a sorted numpy array.
Since version 0.4.0, the library provides the
issubset(a, b) function which checks if the array a
is a
subset of b
, and the isitem(v, a) function which checks if
value
is contained in array a
.
## set.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 5, 10])
b = np.array([1, 2, 3, 4, 5, 6, 10, 11])
print(snp.issubset(a, b)) # a is subset of b
print(snp.issubset(b, a)) # b is not a subset of a
print(snp.isitem(4, a)) # 4 is an item of a
print(snp.isitem(3, a)) # 3 is not an item of a
If you execute this example, you get the expected result: a
is a subset ob
b
, 4
is a member of a
.
Returning array indices
The intersect function takes an optional argument
indices
which is False
by default. If this is set to True
, the return
value consists of the intersection array and a tuple with the indices of the
common values for both arrays. The index arrays have the length of the output.
The indices show the position in the input from which the value was copied.
## intersect_indices.py
import numpy as np
import sortednp as snp
a = np.array([2,4,6,8,10])
b = np.array([1,2,3,4])
intersection, indices = snp.intersect(a,b, indices=True)
print(intersection)
print(indices)
The above example gives:
The first line shows the intersection of the two arrays. The second line
prints a tuple with the indices where the common values appeared in the input
arrays. For example, the value 4
is at position 1
in array a
and at
position 3
in array b
.
Since version 0.3.0, the merge has to indices
argument too. The
returned indices have the length of the inputs. The indices show the position
in the output to which an input value was copied.
## merge_indices.py
import numpy as np
import sortednp as snp
a = np.array([2,4])
b = np.array([3,4,5])
merged, indices = snp.merge(a,b, indices=True)
print(merged)
print(indices)
The above example gives:
The first line shows that the two arrays have been merged. The second line
prints a tuple with the indices. For example, the value 3
from array b
can
be found at position 1
in the output.
Duplicate treatment
Since version 0.3.0, sortednp supported multiple different strategies to deal with duplicated entries.
Duplicates during intersecting
There are three different duplicate treatments for the intersect function:
-
DROP: Ignore any duplicated entries. The output will contain only unique values.
-
KEEP_MIN_N: If an entry occurs
n
times in one input array andm
times in the other input array, the output will contain the entrymin(n, m)
times. -
KEEP_MAX_N: If an entry occurs
n
times in one input array andm
times in the other input array, the output will contain the entrymax(n, m)
times (assuming the entry occurs at least once in both arrays, i.e.n > 0
andm > 0
).
The strategy can be selected with the optional duplicates
argument of
intersect. The default is
sortednp.IntersectDuplicates.KEEP_MIN_N
. Consider the following example.
## intersect_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
intersect_drop = snp.intersect(
a, b,
duplicates=snp.IntersectDuplicates.DROP
)
print(intersect_drop) # Contains a single 4
intersect_min = snp.intersect(
a, b,
duplicates=snp.IntersectDuplicates.KEEP_MIN_N
)
print(intersect_min) # Contains 4 twice
intersect_max = snp.intersect(
a, b,
duplicates=snp.IntersectDuplicates.KEEP_MAX_N
)
print(intersect_max) # Contains 4 three times
The above example gives:
Duplicates during merging
The merge
function offers three different duplicates treatment strategies:
-
DROP: Ignore any duplicated entries. The output will contain only unique values.
-
DROP_IN_INPUT: Ignores duplicated entries in the input arrays separately. This is the same as ensuring that each input array unique values. The output contains every value at most twice.
-
KEEP: Keep all duplicated entries. If an item occurs
n
times in one input array andm
times in the other input array, the output contains the itemn + m
times.
The strategy can be selected with the optional duplicates
.
The default is KEEP. Consider the following example.
## merge_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
merge_drop = snp.merge(
a, b,
duplicates=snp.MergeDuplicates.DROP
)
print(merge_drop) # Contains a single 4
merge_dii = snp.merge(
a, b,
uplicates=snp.MergeDuplicates.DROP_IN_INPUT)
print(merge_dii) # Contains 4 twice
merge_keep = snp.merge(
a, b,
duplicates=snp.MergeDuplicates.KEEP)
print(merge_keep) # Contains 4 five times
The above example gives:
Duplicates during subset checks
The issubset function offers two different duplicates treatment strategies:
-
IGNORE: Ignore any duplications. The function returns True if each value in the first array is contained at least once in the second array. Duplicated entries in the first array do not change the return value.
-
REPEAT: For each duplicated item in the first array, require at least as many items in the second array. If for one value the first array contains more duplicated entries than the second array, the function returns False.
The strategy can be selected with the optional duplicates
.
The default is IGNORE. Consider the following example.
## subset_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([3, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
# Number of occurances ignored
print(snp.issubset(a, b, duplicates=snp.SubsetDuplicates.IGNORE)) # is subset
print(snp.issubset(b, a, duplicates=snp.SubsetDuplicates.IGNORE)) # is subset
# Number of in subset must be smaller or equal
print(snp.issubset(a, b, duplicates=snp.SubsetDuplicates.REPEAT)) # is subset
# three 4s not subset of two 4s
print(snp.issubset(b, a, duplicates=snp.SubsetDuplicates.REPEAT))
The above example gives:
Index tracking and duplicates
Tracking indices with the indices=True
argument is possible while selecting a
non-default duplicate treatment strategy. For merging the indices point to the
position in the output array. If the input has duplicates that were skipped, the
index is simply repeated. For example with
snp.IntersectDuplicates.DROP, if
the input is [9, 9, 9, 9]
, the index array for this input contains four times
the position where 9
is found in the output.
Similarly, with
snp.IntersectDuplicates.KEEP_MAX_N
and intersect
, the index of the last item in the array with less occurrences
is duplicates.
## duplicates_index.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
# Merge
merge_drop, (index_a, index_b) = snp.merge(
a, b,
duplicates=snp.IntersectDuplicates.DROP,
indices=True
)
print(index_b)
# Intersect
intersect_max, (index_a, index_b) = snp.intersect(
a, b,
duplicates=snp.IntersectDuplicates.KEEP_MAX_N,
indices=True
)
print(index_a)
The above example gives:
For merging, this means that the three 4
s from the input all appear at same position
in the output, namely position 2
.
For the intersect, this means that the second and third occurrence of 4
in the
output, both came from item at position 2
in the input.
k-way functions
Similarly, the k-way intersect and merge functions take two or more arrays and perform the merge or intersect operation on its arguments.
## kway_intersect.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([0, 3, 5, 7, 9])
c = np.array([1, 2, 3, 5, 7, 9])
d = np.array([2, 3, 6, 7, 8])
i = snp.kway.intersect(a, b, c, d)
print(i)
If you run this, you should see the intersection of all four arrays as a sorted numpy array.
The k-way merger sortednp.kway.merge works analogously.
However, the native
numpy
implementation is faster compared to the merge provided by this package.
The k-way merger has been added for completeness. The package heapq
provides
efficient functions to merge multiple arrays simultaneously.
The functions sortednp.kway.merge and
sortednp.kway.intersect accept the optional keyword
argument assume_sorted
. By default, it is set to True
. If it is set to False
,
the function calls sort()
on the input arrays before performing the operation.
The default should be kept if the arrays are already sorted to save the time it
takes to sort the arrays.
Since the arrays might be too large to keep all of them in memory at the same
time, it is possible to pass a callable
instead of an array to the functions.
The callable
is expected to return the actual array. It is called immediately
before the array is required. This reduces the memory consumption.
Algorithms
Intersections are calculated by iterating both arrays. For a given element in
one array, the function needs to search the other and check if the element is
contained. In order to make this more efficient, we can use the fact that the
arrays are sorted. There are three search functions, which can be selected via the
optional keyword argument algorithm
.
- SIMPLE_SEARCH: Search for an element by linearly iterating over the array element-by-element. More Information.
- BINARY_SEARCH: Slice the remainder of the array in halves and repeat the procedure on the slice which contains the searched element. More Information.
- GALLOPING_SEARCH: First, search for an element linearly, doubling the step size after each step. If a step goes beyond the search element, perform a binary search between the last two positions. More Information.
The default is sortednp.Algorithms.GALLOPING. The performance of all three algorithms is compared in the next section. The functions issubset() and isitem() also support the algorithm keyword.