Dataset helper functions

Some commonly used functions for datasets.

Checking

check_duplicate_in_dataset

 check_duplicate_in_dataset (x, dataset)

Check if ‘x’ is in ‘dataset’
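As a rough illustration of such a membership test, the sketch below compares `x` against every row of a 2D dataset. The helper name `is_row_in_dataset` is hypothetical and the code is an assumption about how the check can be done, not the library's implementation.

```python
import torch

def is_row_in_dataset(x, dataset):
    # Hypothetical helper, not part of the documented API.
    # Assumes dataset has shape (n, f) and x has shape (f,).
    # A row matches only if all of its entries equal those of x.
    return bool((dataset == x).all(dim=-1).any())

d = torch.tensor([[0.11, 1, 0.5],
                  [0.70, 1, 0.5]])

print(is_row_in_dataset(torch.tensor([0.70, 1, 0.5]), d))  # True
print(is_row_in_dataset(torch.tensor([0.30, 1, 0.5]), d))  # False
```

Note that this uses exact equality, so floating-point values that differ only by rounding are treated as different elements.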

source

check_duplicates_in_dataset

 check_duplicates_in_dataset (xs, dataset, return_ind=False, invert=False)

Checks which elements of xs are in dataset. The boolean invert selects whether duplicates are counted (False) or elements that are not in dataset (True). Uses torch.vmap, which copies dataset for every element in xs.

Check if this works:

import torch

xs = torch.tensor(
    [[0.7, 1, 0.5], 
     [0.3, 1, 0.5],
     [  0, 1, 0.5]])

d = torch.tensor([
    [0.11, 1, 0.5],
    [0.70, 1, 0.5],      # duplicate of xs[0]
    [0.71, 1, 0.5],
    [0.3 , 1, 0.5]])     # duplicate of xs[1]

check_duplicates_in_dataset(xs, d, return_ind=True)

(2, tensor([0, 1]))
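For intuition, the same count and indices can be reproduced with plain broadcasting instead of torch.vmap. This is a minimal sketch under the assumption of 2D inputs; `count_in_dataset` is a hypothetical name and not the library function. Like the vmap version, it compares every element of xs against every element of the dataset, so the intermediate tensor grows with len(xs) × len(dataset).

```python
import torch

def count_in_dataset(xs, dataset, return_ind=False, invert=False):
    # Hypothetical stand-in, assuming xs is (n_xs, f) and dataset is (n_data, f).
    # Broadcast to (n_xs, n_data, f) and require all features to match.
    matches = (xs[:, None, :] == dataset[None, :, :]).all(dim=-1)
    found = matches.any(dim=1)        # True where xs[i] occurs in dataset
    if invert:
        found = ~found                # count elements NOT in dataset instead
    num = int(found.sum())
    if return_ind:
        return num, found.nonzero().flatten()
    return num

xs = torch.tensor([[0.7, 1, 0.5],
                   [0.3, 1, 0.5],
                   [0.0, 1, 0.5]])
d = torch.tensor([[0.11, 1, 0.5],
                  [0.70, 1, 0.5],
                  [0.71, 1, 0.5],
                  [0.30, 1, 0.5]])

print(count_in_dataset(xs, d, return_ind=True))               # (2, tensor([0, 1]))
print(count_in_dataset(xs, d, return_ind=True, invert=True))  # (1, tensor([2]))
```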

Manipulating

source

shuffle_tensor_dataset

 shuffle_tensor_dataset (x, y=None, *z, cpu_copy=True)

Assumes x, y and z are numpy arrays or tensors of the same length.
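The key point when shuffling several aligned objects is that one permutation is applied to all of them, so that corresponding entries stay paired. A minimal sketch for tensors only (the name `shuffle_together` is hypothetical, and the `cpu_copy` option is not modeled here):

```python
import torch

def shuffle_together(*tensors):
    # Hypothetical helper: draw one permutation and index every tensor with it.
    # Assumes all tensors have the same length along dim 0.
    perm = torch.randperm(len(tensors[0]))
    return tuple(t[perm] for t in tensors)

x = torch.arange(10)
y = x * 2                      # y[i] is paired with x[i]

xs, ys = shuffle_together(x, y)
print(torch.equal(ys, xs * 2))  # True: the pairing survives the shuffle
```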

source

get_unique_elements_indices

 get_unique_elements_indices (tensor)

Returns the indices of the unique elements in tensor.
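One way to obtain such indices is to record the first position at which each distinct row appears. The sketch below is a simple, non-vectorized illustration for 2D tensors; `first_occurrence_indices` is a hypothetical name, and whether the library keeps the first occurrence of each element is an assumption.

```python
import torch

def first_occurrence_indices(t):
    # Hypothetical helper, assuming t is a 2D tensor (one element per row).
    # Rows are converted to tuples so they can be used as dictionary keys.
    seen = {}
    for i, row in enumerate(t.tolist()):
        seen.setdefault(tuple(row), i)   # keep only the first index per row
    return torch.tensor(sorted(seen.values()))

t = torch.tensor([[1, 2],
                  [3, 4],
                  [1, 2],
                  [5, 6],
                  [3, 4]])

idx = first_occurrence_indices(t)
print(idx)       # tensor([0, 1, 3])
print(t[idx])    # the three distinct rows, in original order
```

The same index tensor can then be used to select the matching entries of any aligned labels, which keeps x and y consistent after deduplication.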

source

uniquify_tensor_dataset

 uniquify_tensor_dataset (x, y=None, *z)

x has to be a tensor; y and z are assumed to be numpy arrays or tensors.

source

balance_tensor_dataset

 balance_tensor_dataset (x, y, *z, samples:int=None,
                         make_unique:bool=True, y_uniques=None,
                         shuffle_lables:bool=True,
                         add_balance_fn:callable=None, njobs=1)

Assumes x is tensor and y is tensor or numpy.
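To illustrate what balancing by label can look like, the sketch below keeps at most the same number of samples for every distinct value of y. It is only an assumption about the general idea: `balance_by_label` is a hypothetical name, the default of using the smallest class size is a guess, and options such as make_unique, y_uniques, shuffle_lables, add_balance_fn and njobs are not modeled.

```python
import torch

def balance_by_label(x, y, samples=None):
    # Hypothetical sketch, assuming y is a 1D tensor of labels aligned with x.
    labels, counts = torch.unique(y, return_counts=True)
    # Assumption: default to the size of the smallest class.
    n = int(counts.min()) if samples is None else samples
    keep = []
    for label in labels:
        idx = (y == label).nonzero().flatten()       # positions of this label
        idx = idx[torch.randperm(len(idx))[:n]]      # random subset, at most n
        keep.append(idx)
    keep = torch.cat(keep)
    return x[keep], y[keep]

x = torch.arange(10).float().unsqueeze(1)   # x[i] stores its own index i
y = torch.tensor([0] * 7 + [1] * 3)         # imbalanced: 7 vs 3

xb, yb = balance_by_label(x, y)
print(torch.unique(yb, return_counts=True)[1])  # tensor([3, 3])
```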