sortednp
Sortednp (sorted numpy) is a python package which provides methods to perform efficient set operations on sorted numpy arrays. This includes intersecting and merging sorted numpy arrays. The returned intersections and unions are also sorted.
Classes
Algorithms
Bases: IntEnum
Intersections are calculated by iterating both arrays. For a given element
in one array, the method needs to search the other and check if the element
is contained. In order to make this more efficient, we can use the fact
that the arrays are sorted. There are three search methods, which can be
selected via the optional keyword argument algorithm.
Attributes
BINARY_SEARCH
class-attribute
instance-attribute
Slice the remainder of the array in halves and repeat the procedure on the slice which contains the searched element. More Information.
GALLOPING_SEARCH
class-attribute
instance-attribute
First, search for an element linearly, doubling the step size after each step. If a step goes beyond the search element, perform a binary search between the last two positions. More Information.
SIMPLE_SEARCH
class-attribute
instance-attribute
Search for an element by linearly iterating over the array element-by-element. More Information.
IntersectDuplicates
Bases: IntEnum
Specify how to handle duplicated items in the input arrays when computing their intersection.
Attributes
DROP
class-attribute
instance-attribute
With DROP, the intersection ignore any duplicated entries. The output will contain only unique values.
KEEP_MAX_N
class-attribute
instance-attribute
With KEEP_MAX_N, the
intersection an item occurs n > 0 times in one input array and m > 0
times in the other array, the output will contain the item min(n, m)
times.
KEEP_MIN_N
class-attribute
instance-attribute
With KEEP_MIN_N, if an item
occurs n > 0 times in one input array and m > 0 times in the other
array, the output will contain the item min(n, m) times.
MergeDuplicates
Bases: IntEnum
Specify how to handle duplicated items in the input arrays during merging.
Attributes
DROP
class-attribute
instance-attribute
Ignore any duplicated elements. The output contains only unique values.
DROP_IN_INPUT
class-attribute
instance-attribute
Ignores duplicated elements in the input arrays separately. This is the same as ensuring that each input array unique values. The output contains every value at most twice.
SubsetDuplicates
Bases: IntEnum
Procedure to handle duplicates for issubset() function.
Attributes
IGNORE
class-attribute
instance-attribute
Ignore duplicates in the potential subset and the superset arrays. The issubset() function returns true if every unique values in the first array exists at least once in the superset array.
REPEAT
class-attribute
instance-attribute
Account for the number of occurrences of each item in the potential subset. The issubset() function returns true if and only if every item in the first array appears at least as often in the superset array as it is contained in the subset array.
Functions
intersect
intersect(
a: np.ndarray,
b: np.ndarray,
algorithm=Algorithms.GALLOPING_SEARCH,
duplicates=IntersectDuplicates.KEEP_MIN_N,
indices: bool = False,
) -> Union[np.ndarray, Tuple[np.ndarray, Tuple[np.ndarray, np.ndarray]]]
Return a sorted array with all elements that appear in both a and b, i.e., it computes the intersection of a and b.
import numpy as np
import sortednp as snp
a = np.array([1, 3, 5])
b = np.array([1, 2, 3])
assert snp.intersect(a, b).tolist() == [1, 3]
The algorithm parameter determines the search algorithm to use. See
Algorithms for possible values. The default is
GALLOPING_SEARCH.
If duplicates is set to DROP, the
return value contains only unique values that appear in both input arrays.
This behavior represents mathematical set intersection and ignores any
duplication in the input.
a = np.array([2, 2])
b = np.array([2, 2, 2])
intersection = snp.intersect(
a, b,
duplicates=snp.IntersectDuplicates.DROP
)
assert intersection.tolist() == [2]
If duplicates is set to
KEEP_MIN_N, the return value
contains each item min(n, m) times, where n and m are the number of
occurrences in array a and b respectively. The rational is that
each duplicated item in the input arrays matches exactly once with an item
in the other array. For example, consider:
a = np.array([2, 2])
b = np.array([2, 2, 2])
intersection = snp.intersect(
a, b,
# That's the default behavior
duplicates=snp.IntersectDuplicates.KEEP_MIN_N
)
assert intersection.tolist() == [2, 2]
The first 2 in array a matches with the first 2 in array b. The
second 2 in array a matches with the second 2 in array b. However,
the third 2 in array b has no match in array a. The previous
duplications are already taken.
If duplicates is set to
KEEP_MAX_N, the return value
contains each item max(n, m) times, where n and m are the number of
occurrences in array a and b respectively. The rational is that
duplicates items in one input array can all match with a single item in the
other array. For example, consider:
a = np.array([2, 2])
b = np.array([2, 2, 2])
intersection = snp.intersect(
a, b,
duplicates=snp.IntersectDuplicates.KEEP_MAX_N
)
assert intersection.tolist() == [2, 2, 2]
For every occurrence of 2 in array b, we can verify that the element,
i.e., 2, is contained in array a.
If the optional parameter indices is set to True, the function returns
the intersection array and a tuple of arrays with integer indices. For
each element in the intersection, the corresponding indices point to the
position where the element is contained in the input arrays. Assuming
intersection, (indices_a, indices_b) = snp.intersect(a, b, indices=True),
the following conditions hold for all i in the valid range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
a |
ndarray
|
First array to intersect |
required |
b |
ndarray
|
Second array to intersect |
required |
algorithm |
Algorithms
|
Search algorithm to search common items in the arrays |
GALLOPING_SEARCH
|
duplicates |
IntersectDuplicates
|
Specifies how to handle duplicated items in the input arrays |
KEEP_MIN_N
|
indices |
bool
|
If True, return the indices of the intersection items in the input arrays. |
False
|
Returns:
| Type | Description |
|---|---|
Union[ndarray, tuple[ndarray, tuple[ndarray, ndarray]]]
|
The intersection of the input arrays. If |
Source code in sortednp/__init__.py
311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 | |
isitem
Returns True if element is contained in the array, otherwise False.
import numpy as np
import sortednp as snp
a = np.array([1, 2, 3])
assert snp.isitem(1, a)
assert not snp.isitem(4, a)
The algorithm parameter determines the search algorithm to use. See Algorithms for possible values. The default is GALLOPING_SEARCH.
Warning
Please note that the search element is casted to the numpy type of the given array. This might lead to unexpected results if the element is outside the domain of the date type.
Future numpy versions will raise an Exception.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
element |
Any
|
The element to search in array |
required |
array |
ndarray
|
The sorted array to search the element. Adjacent items must be in ascending order. The behavior of the function is undefined if the array is not sorted. |
required |
algorithm |
int
|
The algorithm to use for the search. See Algorithms for possible values. |
GALLOPING_SEARCH
|
Returns:
| Type | Description |
|---|---|
bool
|
True if subset is a subset of array, otherwise False. |
Source code in sortednp/__init__.py
issubset
issubset(
subset: np.ndarray,
array: np.ndarray,
algorithm: Algorithms = Algorithms.GALLOPING_SEARCH,
duplicates: SubsetDuplicates = SubsetDuplicates.IGNORE,
)
Returns True if all values in subset appear in array,
i.e., return True if subset is a subset of array, otherwise False.
import numpy as np
import sortednp as snp
a = np.array([1, 3])
b = np.array([1, 2, 3])
assert snp.issubset(a, b)
The algorithm parameter determines the search algorithm to use. See
Algorithms for possible values. The default is
GALLOPING_SEARCH.
If duplicates is
SubsetDuplicates.IGNORE, returns True
if and only if each unique element in subset appears at least once in
array.
a = np.array([2])
b = np.array([2, 2])
assert snp.issubset(a, b, duplicates=snp.SubsetDuplicates.IGNORE)
assert snp.issubset(b, a, duplicates=snp.SubsetDuplicates.IGNORE)
If duplicates is
SubsetDuplicates.REPEAT, returns True
if and only if each unique element in subset appears at least as often
in array as it appears in subset.
a = np.array([2])
b = np.array([2, 2])
assert snp.issubset(a, b, duplicates=snp.SubsetDuplicates.REPEAT)
assert not snp.issubset(b, a, duplicates=snp.SubsetDuplicates.REPEAT)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
subset |
ndarray
|
The potential subset in question. |
required |
array |
ndarray
|
The sorted array to check if subset is a true subset. Adjacent items must be in ascending order. The behavior of the function is undefined if the array is not sorted. |
required |
algorithm |
int
|
The algorithm to use for the search. See Algorithms for possible values. |
GALLOPING_SEARCH
|
duplicates |
int
|
How to handle duplicates. See Duplicates for possible values. |
IGNORE
|
Returns:
| Type | Description |
|---|---|
bool
|
True if subset is a subset of array, otherwise False. |
Source code in sortednp/__init__.py
kway_intersect
Return a sorted array containing the intersection of all input arrays.
If the optional flag assume_sorted is set to False, the function sorts the arrays prior to intersecting. The arrays are order by increasing size before starting to intersection the arrays one-by-one.
The method raises a TypeError, if the array count is zero.
The arguments can load the arrays on the fly. If an argument is callable, its return value is used as an array instead of the argument itself. This makes it possible to load one array after another to avoid having all arrays in memory at the same time. Arrays deferred in this way, are loaded only after all in-memory arrays are intersected. The computed intersection is empty at any point, the function stops and returns an empty array.
Note
Note on the performance: The function intersects the arrays one-by-one. This is not the most performant implementation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
*arrays |
ndarray or callable
|
The arrays to intersect as numpy arrays or deferred arrays. |
()
|
assume_sorted |
bool
|
If True, the function assumes that the arrays are sorted. If False, the function sorts the arrays before intersecting. The default is True. |
True
|
**kwds |
dict
|
Any other keyword argument is passed to the native intersect function. |
{}
|
Returns:
| Type | Description |
|---|---|
ndarray
|
The union array. |
Source code in sortednp/kway.py
kway_merge
Return a sorted array containing the union of all input arrays.
If the optional flag assume_sorted is set to False, the function sorts the arrays before merging.
The method raises a TypeError, if the array count is zero.
The arguments can load the arrays on the fly. If an argument is callable, its return value is used as an array instead of the argument itself. This makes it possible to load one array after another to avoid having all arrays in memory at the same time.
Note
Note on the performance: The function merges the arrays one-by-one. This is not the most performant implementation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
*arrays |
ndarray or callable
|
The arrays to merge as numpy arrays or deferred arrays. |
()
|
assume_sorted |
bool
|
If True, the function assumes that the arrays are sorted. If False, the function sorts the arrays before merging. The default is True. |
True
|
**kwds |
dict
|
Any other keyword argument is passed to the native merge function. |
{}
|
Returns:
| Type | Description |
|---|---|
ndarray
|
The merged array. |
Source code in sortednp/kway.py
merge
Return a sorted array containing all elements from both arrays.
import numpy as np
import sortednp as snp
a = np.array([2, 4])
b = np.array([1, 2, 3])
assert snp.merge(a, b).tolist() == [1, 2, 2, 3, 4]
The algorithm parameter determines the search algorithm to use. See
Algorithms for possible values. The default is
GALLOPING_SEARCH.
If duplicates is set to DROP, the
any duplicates in the merged output are omitted. The return values
only contains unique values.
a = np.array([2, 2, 2])
b = np.array([2, 2, 2, 2])
merged = snp.merge(
a, b,
sortednp.MergeDuplicates.DROP
)
assert merged.tolist() == [2]
If duplicates is set to
DROP_IN_INPUT, the output is
obtained by merging the unique values from the input arrays. Values occur
at most twice in the output. A value appearing twice in the output implies
it exists in both input arrays.
a = np.array([2, 2, 2, 3, 3])
b = np.array([2, 2, 2, 2])
merged = snp.merge(
a, b,
sortednp.MergeDuplicates.KEEP_IN_INPUT
)
assert merged.tolist() == [2, 2, 3]
If duplicates is set to DROP, the
retains any duplication in the input arrays. If an item occurs n times
in one input array and m times in the other input array, the output
contains the item n + m times.
a = np.array([2, 2, 3, 3])
b = np.array([2, 2, 2])
merged = snp.merge(
a, b,
sortednp.MergeDuplicates.KEEP_IN_INPUT
)
assert merged.tolist() == [2, 2, 2, 2, 2, 3, 3]
If the optional parameter indices is set to True, the function returns
the union array and a tuple of arrays with integer indices. For
each element in the union, the corresponding indices point to the
position where the element is contained in the input arrays. Assuming
union, (indices_a, indices_b) = snp.merge(a, b, indices=True),
the following conditions hold for all i in the valid range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
a |
ndarray
|
First array to merge |
required |
b |
ndarray
|
Second array to merge |
required |
duplicates |
MergeDuplicates
|
Specifies how to handle duplicated items in the input arrays |
KEEP
|
indices |
bool
|
If True, return the indices of the union items in the input arrays. |
False
|
Returns:
| Type | Description |
|---|---|
Union[ndarray, tuple[ndarray, tuple[ndarray, ndarray]]]
|
The union of the input arrays. If |
Source code in sortednp/__init__.py
435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 | |
resolve
Helper function to check whether the given object is callable. If yes, return its return value, otherwise return the object itself.
This function is used by the package to load large datasets on demand to avoid out-of-memory scenarios.