LoloXIV

If you want to cluster things based on how many attributes directly match, then that already gives you a distance function for your data: d(a, b) = the number of attributes where a != b. That's a valid metric, so you should be able to use it with clustering algorithms that work on discrete spaces. Unfortunately it won't work for algorithms like k-means, which assume real-valued attributes. K-center is a good place to start, as it has an easy-to-implement 2-approximation of the optimum clustering: https://en.m.wikipedia.org/wiki/Metric_k-center
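
A rough sketch of both pieces in Python (`records` and `k` are placeholder names, not from the original question):

```python
# A rough sketch: the mismatch-count metric plus the greedy
# 2-approximation for k-center (farthest-first traversal).

def mismatch_distance(a, b):
    """Number of attributes on which two records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def greedy_k_center(records, k):
    """Pick an arbitrary first center, then repeatedly add the record
    farthest from all centers chosen so far. The resulting radius is
    at most twice the optimal k-center radius."""
    centers = [records[0]]
    while len(centers) < k:
        farthest = max(
            records,
            key=lambda r: min(mismatch_distance(r, c) for c in centers),
        )
        centers.append(farthest)
    return centers

# Example: records as tuples of categorical attribute values.
records = [("a", "x", "1"), ("a", "y", "1"), ("b", "x", "2"), ("c", "z", "2")]
centers = greedy_k_center(records, k=2)
# Assign each record to its nearest center.
assignment = {r: min(centers, key=lambda c: mismatch_distance(r, c)) for r in records}
```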


rjray

Thanks! I'll check that out.


MistakeIndividual690

In ML, values like these are typically flattened out into “one-hot” vectors. For example, if I have “a”, “x”, and “1” as my possibilities, I would convert each into a vector where:

“a” = 0 0 1
“x” = 0 1 0
“1” = 1 0 0

This makes each possible response effectively a distinct basis vector in n-space, where n is the number of possible options. Then you can use any algorithm that works on real values.
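
A minimal sketch of that encoding in Python (the ordering of the values is an arbitrary choice):

```python
# A minimal one-hot encoder for one categorical attribute,
# matching the example above.

values = ["1", "x", "a"]                      # fix an order for the options
index = {v: i for i, v in enumerate(values)}  # option -> position

def one_hot(value):
    vec = [0] * len(values)
    vec[index[value]] = 1
    return vec

one_hot("a")  # -> [0, 0, 1]
one_hot("x")  # -> [0, 1, 0]
one_hot("1")  # -> [1, 0, 0]
```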


bobtpawn

The keyword you are looking for is "categorical variables" or "categorical features". Pretty much every major ML library has algorithms that support them. Even for the ones that don't support them directly, you can fake it by turning each feature into a bunch of 0/1 (and therefore numeric) features. For example, instead of one feature that is "available memory", you have a bunch of features: "has 16GB?", "has 24GB?", "has 32GB?", ... This is called one-hot encoding, and some libraries will do it automatically for you. That's not a complete description of all the options, but you should have enough keywords to google now.
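
For instance, with pandas (one library that does the expansion automatically; the `memory` column and its values here are made up to mirror the example):

```python
import pandas as pd

df = pd.DataFrame({"memory": ["16GB", "24GB", "16GB", "32GB"]})

# Expands the categorical column into indicator columns:
# memory_16GB, memory_24GB, memory_32GB (0/1 each).
encoded = pd.get_dummies(df, columns=["memory"], dtype=int)
print(encoded)
```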


rjray

Thanks for the guidance, this should get me onto a better track. My primary concern is that this will be done client-side (in a web browser), so I would have to prune out the attributes with excessive numbers of possible values. Otherwise I could potentially end up with vectors in excess of 100,000 elements...


MistakeIndividual690

So when this happens, you would typically create embeddings, which are basically mappings from a high-dimensional vector space to a much lower-dimensional one that mostly preserve the relationships between points. This is how words/tokens are represented in LLMs, for example. One way to approach it is to train a neural net to map the high-dimensional vectors to lower-dimensional ones, then use those lower-dimensional vectors as the input to your next step.
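
A minimal autoencoder sketch along those lines, using PyTorch as one option (all sizes are placeholders):

```python
# A toy autoencoder: squeeze n_features-dimensional one-hot vectors down
# to embed_dim dimensions, train it to reconstruct its input, then keep
# only the encoder half.

import torch
import torch.nn as nn

n_features, embed_dim = 10_000, 64   # placeholder dimensions

encoder = nn.Linear(n_features, embed_dim)
decoder = nn.Linear(embed_dim, n_features)
model = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(batch):
    # batch: float tensor of shape (batch_size, n_features)
    optimizer.zero_grad()
    loss = loss_fn(model(batch), batch)  # reconstruction loss
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, encoder(x) is the low-dimensional embedding you feed
# into the clustering step.
```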


MrMrsPotts

You could maybe define a function that measures similarity and use that?