## 1 Introduction

In machine learning, many features are categorical, such as color, country, user ID and item ID. In multi-class classification, the labels are categorical too. No ordering relation exists among the values of such a category. Categorical variables are usually represented by one-hot feature vectors: for example, red is encoded as 100, yellow as 010 and blue as 001. But if the number of categories is very large, as with user IDs and item IDs in e-commerce applications, the one-hot encoding scheme needs too many resources to compute classification results.

In the years when SVMs were widely used, the ECOC (error-correcting output codes) method was proposed for handling huge numbers of output class labels. The idea of ECOC is to reduce a multi-class classification problem with a huge number of classes to several two-class classification problems by using a binary error-correcting code. No similar method exists for handling a huge number of input categorical features, because the categories cannot be separated by a linear model unless one-hot encoding is used.

In recent years, deep neural networks have improved greatly in performance and speed. The coding method can be applied to deep neural networks with some beneficial modifications.

In the classification problem, the number of labels of a single neural network need not be two, so if we use a deep network as a base learner, it is not necessary to restrict the code to be binary. In fact, there is a trade-off between the number of classes of one base learner and the number of base learners used. According to information theory, if we use $n$-class classifiers as base learners to solve an $N$-class classification problem, we need at least $\lceil \log_n N \rceil$ base learners. For example, to solve a classification problem with 1M classes using binary classifiers as base learners, we need at least 20 base learners. In some classical applications, for example CNN image classification, we would need to build a CNN for every binary classifier, which is a huge cost in computation and memory. But if we combine base learners with 1000 classes each, we need at least 2 base learners. The number of parameters in a deep neural network is usually large, hence using a small number of base learners reduces the cost in computing and storage.

On the other hand, because a neural network is capable of non-linear representation, we can use the encoding for categorical features too. Can we use a classical error-correcting code for categorical features? In machine learning, sparsity is a basic requirement, but classical error-correcting codes are not sparse. Hence we need to design a new sparse coding scheme for this application.
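The counting argument for the number of base learners can be checked with a few lines of Python. This is a minimal sketch; the function name `min_base_learners` is an illustrative choice and does not come from the paper.

```python
def min_base_learners(num_classes: int, classes_per_learner: int) -> int:
    """Smallest m such that classes_per_learner ** m >= num_classes."""
    if num_classes < 1 or classes_per_learner < 2:
        raise ValueError("need num_classes >= 1 and classes_per_learner >= 2")
    m, capacity = 0, 1
    # Integer arithmetic avoids floating-point error in the logarithm.
    while capacity < num_classes:
        capacity *= classes_per_learner
        m += 1
    return m


print(min_base_learners(1_000_000, 2))     # 20 binary base learners
print(min_base_learners(1_000_000, 1000))  # 2 base learners with 1000 classes each
```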

In this paper, we give some new encoding methods that can be applied to both label encoding and feature encoding and that perform better than the classical methods. In Section 2, we define category coding (CC) and propose three classes of CC, namely Polynomial CC, Remainder CC and Gauss CC, which have good properties. In Section 3, we discuss the application of CC to label encoding. In Section 4, we discuss the application of CC to feature encoding. Our main tools are finite field theory and number theory, for which we refer to ff and NT.

## 2 Category coding

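For an $N$-class categorical feature or label, we define a category coding (CC) as a map

$$f = (f_1, \dots, f_n) : \mathbb{Z}/N\mathbb{Z} \to \prod_{i=1}^{n} \mathbb{Z}/N_i\mathbb{Z},$$

where each $f_i : \mathbb{Z}/N\mathbb{Z} \to \mathbb{Z}/N_i\mathbb{Z}$ is called a “site-position function” of the category coding, for $i = 1, \dots, n$.

Generally, $N$ is a huge number, and the $N_i$ are numbers of middle size.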

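Through a CC, we can reduce an $N$-class classification problem to $n$ classification problems of middle size.

We can also use an $n$-hot, $(\sum_{i} N_i)$-bit binary encoding instead of the one-hot encoding as the representation of the feature, i.e., use the composite of the CC map and the natural embedding

$$\prod_{i=1}^{n} \mathbb{Z}/N_i\mathbb{Z} \to \prod_{i=1}^{n} \{0,1\}^{N_i},$$

which sends each coordinate to its one-hot vector, to get an $n$-hot encoding.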
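The natural embedding can be sketched as follows: each site value is one-hot encoded in its own block of bits and the blocks are concatenated. The function name `n_hot` and the toy site sizes are illustrative assumptions.

```python
def n_hot(codeword, site_sizes):
    """Concatenate the one-hot vectors of the site values of a CC codeword."""
    bits = []
    for value, size in zip(codeword, site_sizes):
        if not 0 <= value < size:
            raise ValueError("site value out of range")
        block = [0] * size
        block[value] = 1  # one-hot inside this site's block
        bits.extend(block)
    return bits


# A codeword with 3 sites of sizes 3, 4 and 5 becomes a 3-hot vector of 12 bits.
print(n_hot((1, 0, 3), (3, 4, 5)))
```

The result always has exactly as many ones as there are sites, so the encoding stays sparse while using far fewer bits than a one-hot vector over all categories.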

For a CC $f$, we call $\max_{x \neq y} \#\{i : f_i(x) = f_i(y)\}$ the collision number of $f$, and denote it by $C(f)$. We have the following theorem.

###### Theorem 2.1.

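For a CC $f : \mathbb{Z}/N\mathbb{Z} \to \prod_{i=1}^{n} \mathbb{Z}/N_i\mathbb{Z}$, where $N_1 \le N_2 \le \dots \le N_n$, we have $C(f) \ge k - 1$, where $k$ is the smallest integer such that $\prod_{i=1}^{k} N_i \ge N$.

Proof. Let $k = \min\{m : \prod_{i=1}^{m} N_i \ge N\}$. Suppose $C(f) < k - 1$, i.e.

$$\max_{x \neq y} \#\{i : f_i(x) = f_i(y)\} \le k - 2.$$

Hence for any $x \neq y$, there are at most $k - 2$ equal site-position values between $f(x)$ and $f(y)$. Hence $(f_1, \dots, f_{k-1})$ is an injection, and hence $N \le \prod_{i=1}^{k-1} N_i$. This contradicts the definition of $k$. ∎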
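For small examples, the collision number and its lower bound can be computed by brute force. The sketch below assumes that the collision number is the largest number of sites on which two distinct codewords agree, and that the bound is $k - 1$ with $k$ the smallest number of (smallest) site sizes whose product reaches the number of classes. The function names and the toy code are illustrative.

```python
from itertools import combinations


def collision_number(code, num_classes):
    """Largest number of sites on which two distinct labels share a value."""
    return max(
        sum(a == b for a, b in zip(code(x), code(y)))
        for x, y in combinations(range(num_classes), 2)
    )


def collision_lower_bound(site_sizes, num_classes):
    """k - 1, where k is the least count of smallest sites whose sizes multiply to >= num_classes."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    sizes = sorted(site_sizes)
    k, capacity = 0, 1
    while capacity < num_classes:
        if k == len(sizes):
            raise ValueError("sites are too small for an injective code")
        capacity *= sizes[k]
        k += 1
    return k - 1


def toy_code(x):
    """Toy code with three sites of sizes 3, 4 and 5 (remainders of x)."""
    return (x % 3, x % 4, x % 5)


print(collision_number(toy_code, 10), collision_lower_bound((3, 4, 5), 10))
```

In this toy case both numbers coincide, so the code attains the bound.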

If a CC $f$ attains the lower bound of Theorem 2.1, i.e. $C(f) = k - 1$, we say that it has the minimal collision property. In both label encoding and feature encoding, we want the code to have the minimal collision property.

We give three classes of CC, namely Polynomial CC, Remainder CC and Gauss CC, which satisfy the minimal collision property.

### 2.1 Polynomial CC

For any prime number $p$, we can represent any non-negative integer less than $p^k$ uniquely in the form $a_0 + a_1 p + \dots + a_{k-1} p^{k-1}$ with $0 \le a_l < p$, which gives a bijection $\mathbb{Z}/p^k\mathbb{Z} \to \mathbb{F}_p^k$, where $\mathbb{F}_p$ is the Galois field (finite field) of $p$ elements.

For an $N$-class classification problem, a small positive integer $k$ (for example, $k = 2, 3$) and a small real number $\epsilon > 0$, we take a prime number $p$ in $[N^{1/k}, (1+\epsilon) N^{1/k}]$ (according to the Prime Number Theorem (Riemann, Prime_Number_Theorem), there are about $k \epsilon N^{1/k} / \log N$ such prime numbers), and get an injection $\mathbb{Z}/N\mathbb{Z} \to \mathbb{F}_p^k$ by the $p$-adic representation.

###### Theorem 2.2.

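For $n$ different elements $\alpha_1, \dots, \alpha_n$ in $\mathbb{F}_p$, the code defined by the composite of the $p$-adic representation map, the map

$$\mathbb{F}_p^k \to \mathbb{F}_p[T]_{\deg < k}, \quad (a_0, \dots, a_{k-1}) \mapsto \sum_{l=0}^{k-1} a_l T^l,$$

and the map

$$\mathbb{F}_p[T]_{\deg < k} \to \mathbb{F}_p^n, \quad g \mapsto (g(\alpha_1), \dots, g(\alpha_n)),$$

has the minimal collision property.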

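Proof. We need to prove that $C(f) = k - 1$. By Theorem 2.1 we know that $C(f) \ge k - 1$, hence we need only prove $C(f) \le k - 1$, i.e., for any $x \neq y$, $\#\{i : f_i(x) = f_i(y)\} \le k - 1$.

Because the $p$-adic representation map is an injection and the map from coefficient vectors to polynomials is a bijection, we need only show that for any two different polynomials $g \neq h$ of degree less than $k$, $\#\{i : g(\alpha_i) = h(\alpha_i)\} \le k - 1$. Suppose there are $k$ of the $\alpha_i$ such that $g(\alpha_i) = h(\alpha_i)$. Then the nonzero polynomial $g - h$ of degree at most $k - 1$ has at least $k$ roots, which contradicts the fact that a nonzero polynomial of degree $d$ over a field has at most $d$ roots. ∎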

Remark. The composite of the last two maps in the above theorem is known as the Reed-Solomon code (Reed_and_Solomon). The Reed-Solomon codes are a class of non-binary MDS (maximum distance separable) codes (Singleton). The MDS property is an excellent property in error-correcting coding. Unfortunately, no nontrivial binary MDS code has been found so far; in fact, for some situations it has been proved that no nontrivial binary MDS code exists (Guerrini_and_Sala, and Proposition 9.2 on p. 212 of Vermani). This is also an advantage of CC over ECOC in label encoding.
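As an illustration, a Polynomial CC can be sketched in Python as follows. The sketch assumes that labels are integers in $[0, p^k)$ and that the evaluation points are distinct residues modulo $p$; the function name `polynomial_cc` and the toy parameters $p = 5$, $k = 2$ are illustrative choices only.

```python
def polynomial_cc(label, p, k, points):
    """Evaluate the polynomial whose coefficients are the p-adic digits of label."""
    if not 0 <= label < p ** k:
        raise ValueError("label must lie in [0, p**k)")
    # p-adic digits a_0, ..., a_{k-1} of the label.
    digits = []
    for _ in range(k):
        label, digit = divmod(label, p)
        digits.append(digit)
    codeword = []
    for point in points:
        value = 0
        for digit in reversed(digits):  # Horner's rule modulo p
            value = (value * point + digit) % p
        codeword.append(value)
    return tuple(codeword)


# 7 = 2 + 1*5, so the polynomial is 2 + T, evaluated at T = 0, 1, 2, 3 modulo 5.
print(polynomial_cc(7, 5, 2, (0, 1, 2, 3)))  # (2, 3, 4, 0)
```

With $k = 2$, two distinct labels give two distinct polynomials of degree at most 1, so their codewords agree on at most one site.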

### 2.2 Remainder CC

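For the original label set $\mathbb{Z}/N\mathbb{Z}$, a small number $k$ such as 2 or 3, and a small positive number $\epsilon$, select $n$ pairwise co-prime numbers $m_1, \dots, m_n$ in the interval $[N^{1/k}, (1+\epsilon) N^{1/k}]$. (According to the Prime Number Theorem (Riemann, Prime_Number_Theorem), there are about $k \epsilon N^{1/k} / \log N$ primes, and hence at least that many pairwise co-prime numbers, in this interval.)

We define the remainder CC as

$$f : \mathbb{Z}/N\mathbb{Z} \to \prod_{i=1}^{n} \mathbb{Z}/m_i\mathbb{Z}, \quad x \mapsto (x \bmod m_1, \dots, x \bmod m_n),$$

where $0 \le x < N$, and $(m_1, \dots, m_n)$ is called its moduli. Then we have the following theorem: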

###### Theorem 2.3.

The remainder CC has the minimal collision property.

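Proof. We need only show that, for any $x \neq y$, there are at most $k - 1$ indices $i$ such that $x \equiv y \pmod{m_i}$.

Suppose there exist $k$ different indices $i$ such that $x \equiv y \pmod{m_i}$; without loss of generality, suppose this holds for $i = 1, \dots, k$. Then $m_i \mid x - y$ for all $i = 1, \dots, k$. Because $m_1, \dots, m_k$ are pairwise co-prime, we have $\prod_{i=1}^{k} m_i \mid x - y$. But we know $\prod_{i=1}^{k} m_i \ge N$, while $x - y$ lies in $(-N, N)$, hence $x = y$, a contradiction. ∎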
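A Remainder CC is short to write down in code. The following is a minimal sketch; the moduli `(10, 11, 13)` and the label count 100 are toy values chosen so that any two moduli multiply to at least the number of labels, and the function names are illustrative.

```python
from itertools import combinations
from math import gcd


def pairwise_coprime(moduli):
    """True if every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def remainder_cc(label, moduli):
    """Codeword of a label: its remainders modulo each modulus."""
    return tuple(label % m for m in moduli)


moduli = (10, 11, 13)
assert pairwise_coprime(moduli)
print(remainder_cc(57, moduli))  # (7, 2, 5)

# Brute-force check for 100 labels: two distinct labels agree on at most one site.
worst = max(
    sum(a == b for a, b in zip(remainder_cc(x, moduli), remainder_cc(y, moduli)))
    for x, y in combinations(range(100), 2)
)
print(worst)  # 1
```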

### 2.3 Gauss CC

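We write the ring of Gauss integers as $\mathbb{Z}[i]$. For a big integer $N$, let $R$ be the minimal positive real number such that the number of Gauss integers in the closed disc $\bar{D}(R) = \{z \in \mathbb{C} : |z| \le R\}$ is not less than $N$, i.e. $\#(\mathbb{Z}[i] \cap \bar{D}(R)) \ge N$ and $\#(\mathbb{Z}[i] \cap \bar{D}(R - \delta)) < N$ for any small $\delta > 0$. In general, $\#(\mathbb{Z}[i] \cap \bar{D}(R))$ is about $\pi R^2$, hence such an $R$ is about $\sqrt{N/\pi}$.

We can embed the original IDs into the Gauss integers in the closed disc $\bar{D}(R)$.

Let $k$ be a small positive integer, such as 2 or 3, and let $\epsilon$ be a small positive real number. Let $m_1, \dots, m_n$ be pairwise co-prime Gauss integers satisfying $(2R)^{1/k} < |m_j| \le (1+\epsilon)(2R)^{1/k}$. We define the category mapping

$$f : \mathbb{Z}[i] \cap \bar{D}(R) \to \prod_{j=1}^{n} \mathbb{Z}[i]/(m_j), \quad z \mapsto (z \bmod (m_1), \dots, z \bmod (m_n)),$$

where $(m_j)$ means the principal ideal of $\mathbb{Z}[i]$ generated by $m_j$, $j = 1, \dots, n$. The tuple $(m_1, \dots, m_n)$ is called the moduli of this Gauss CC, and we have the following theorem.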

###### Theorem 2.4.

The Gauss CC has the minimal collision property.

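Proof. From the way the $m_j$ are chosen, we know $C(f) \ge k - 1$. Hence we need only show that, for any $x \neq y$, there are at most $k - 1$ indices $j$ such that $x \equiv y \pmod{(m_j)}$.

Suppose there exist $k$ different indices $j$ such that $x \equiv y \pmod{(m_j)}$; without loss of generality, suppose that

$$x \equiv y \pmod{(m_j)}, \quad j = 1, \dots, k.$$

Then we have $x - y \in (m_j)$ for all $j = 1, \dots, k$.

Because $m_1, \dots, m_k$ are pairwise co-prime Gauss integers, $(m_1), \dots, (m_k)$ are pairwise co-prime ideals of $\mathbb{Z}[i]$, and we have $\bigcap_{j=1}^{k} (m_j) = \prod_{j=1}^{k} (m_j)$. Hence $x - y \in \prod_{j=1}^{k} (m_j)$, i.e. $\prod_{j=1}^{k} m_j \mid x - y$, and hence $|x - y| \ge \prod_{j=1}^{k} |m_j|$ or $x = y$. But we know $|m_j| > (2R)^{1/k}$, hence $\prod_{j=1}^{k} |m_j| > 2R$. On the other hand, we know $|x|, |y| \le R$, hence $|x - y| \le 2R$, and hence $x = y$. ∎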
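A Gauss CC needs a concrete way to label the residue classes of $\mathbb{Z}[i]/(m)$. One possible sketch, under the extra assumption (not made in the text) that each modulus $m = a + bi$ satisfies $\gcd(a, b) = 1$ and $a^2 + b^2 > 1$, uses the ring isomorphism $\mathbb{Z}[i]/(m) \cong \mathbb{Z}/(a^2 + b^2)\mathbb{Z}$ that sends $i$ to $-a b^{-1}$. The function names and the toy moduli are illustrative, and the modular inverse via `pow` requires Python 3.8 or later.

```python
from math import gcd


def gauss_residue(x, y, a, b):
    """Index in [0, a*a + b*b) of the class of x + y*i modulo m = a + b*i."""
    norm = a * a + b * b
    if gcd(a, b) != 1 or norm == 1:
        raise ValueError("this sketch assumes gcd(a, b) == 1 and a*a + b*b > 1")
    # Modulo m we have a + b*i = 0, so i corresponds to u = -a / b (mod norm).
    u = (-a * pow(b % norm, -1, norm)) % norm
    return (x + y * u) % norm


def gauss_cc(z, moduli):
    """Codeword of the Gauss integer z = (x, y); moduli are given as (a, b) pairs."""
    x, y = z
    return tuple(gauss_residue(x, y, a, b) for a, b in moduli)


# Toy example: moduli 2+i, 3+2i, 4+i (norms 5, 13, 17) and all Gauss integers with |z| <= 2.
moduli = [(2, 1), (3, 2), (4, 1)]
points = [(x, y) for x in range(-2, 3) for y in range(-2, 3) if x * x + y * y <= 4]
for z in points[:3]:
    print(z, gauss_cc(z, moduli))
```

In this toy setting the difference of two distinct points has norm at most 16, while the product of any two moduli has norm at least 65, so two distinct points can agree on at most one site.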

## 3 Application to label encoding

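For an $N$-class classification problem, we use a CC

$$f = (f_1, \dots, f_n) : \mathbb{Z}/N\mathbb{Z} \to \prod_{i=1}^{n} \mathbb{Z}/N_i\mathbb{Z}$$

to reduce the $N$-class classification problem to $n$ classification problems of middle size; in this role we call the CC a label mapping (LM). Suppose the training dataset is $\{(x_j, y_j)\}$, where $x_j$ is a feature and $y_j$ is a label. Then we train a base learner on the dataset $\{(x_j, f_i(y_j))\}$ for every $i$. We call this the label encoding method.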

A CC that is good for label encoding should satisfy the following properties:

Classes high separable. For two different labels $y \neq y'$, there should be as many site-position functions $f_i$ as possible such that $f_i(y) \neq f_i(y')$.

Base learners independence. When $y$ is selected uniformly at random from $\mathbb{Z}/N\mathbb{Z}$, the mutual information of $f_i(y)$ and $f_j(y)$ is approximately 0 for $i \neq j$.

The property “Classes high separable” ensures that, for any two different classes, as many base learners as possible are trained to separate them. The property “Base learners independence” ensures that any two different base learners learn little common information.

Remark. These properties are the non-binary analogues of the properties “Row separable” and “Column separable” of ECOC (Dietterich_and_Bakiri).

The minimal collision property ensures that the CCs satisfy “Classes high separable”. We will show that they also satisfy “Base learners independence”.
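The independence property can be examined numerically on small codes. The sketch below computes the exact mutual information (in nats) between two sites when the label is uniform on `range(num_classes)`; the function name and the toy remainder code are illustrative assumptions.

```python
from collections import Counter
from math import log


def site_mutual_information(code, num_classes, i, j):
    """Mutual information of sites i and j of code(label) for a uniform label."""
    joint = Counter()
    for label in range(num_classes):
        word = code(label)
        joint[(word[i], word[j])] += 1
    left, right = Counter(), Counter()
    for (a, b), count in joint.items():
        left[a] += count
        right[b] += count
    # Sum of P(a, b) * log(P(a, b) / (P(a) * P(b))) over observed pairs.
    return sum(
        count / num_classes * log(count * num_classes / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )


def toy_remainder_code(x):
    """Toy two-site code with the co-prime moduli 10 and 11."""
    return (x % 10, x % 11)


print(site_mutual_information(toy_remainder_code, 110, 0, 1))  # sites are independent
print(site_mutual_information(toy_remainder_code, 100, 0, 1))  # small but positive
```

When the number of labels equals the product of the two moduli, the two sites are exactly independent by the Chinese remainder theorem; with slightly fewer labels the mutual information is small but not zero.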

### 3.1 Polynomial CC

We will prove that the Polynomial CC also satisfies the property “Base learners independence”.

###### Theorem 3.1.

If $x$ is a random variable with uniform distribution on $\mathbb{Z}/N\mathbb{Z}$, and $X_i$ and $X_j$ are the $i$-site value and the $j$-site value ($i \neq j$) of the codeword of $x$ under the Polynomial CC described above, then the mutual information of $X_i$ and $X_j$ approaches $0$ as $N$ grows.

Proof.

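For any $x$ in $\mathbb{Z}/N\mathbb{Z}$, the $i$-th site value is $\sum_{l} a_l \alpha_i^l$, where the $a_l$ are the coefficients of the $p$-adic representation of $x$ and $\alpha_i$ is the $i$-th evaluation point. We denote this map by $g_i$.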

Let , consider the following commutative diagram:
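
$$\begin{array}{ccc}
\mathbb{Z}/p^{t}\mathbb{Z} & \longrightarrow & \mathbb{Z}/p^{t}\mathbb{Z} \\
\downarrow {\scriptstyle g_i} & & \downarrow {\scriptstyle g_i} \\
\mathbb{Z}/p\mathbb{Z} & \longrightarrow & \mathbb{Z}/p\mathbb{Z}
\end{array}$$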

The horizontal arrow in up line is defined by , and the horizontal arrow in down line is defined by . The horizontal arrows are bijections, which shows that the numbers of the pre-images in of every element in are same and hence equal to .

On the other hand, we have the commutative diagram:
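
$$\begin{array}{ccccc}
\mathbb{Z}/p^{t-1}\mathbb{Z} & \longrightarrow & \mathbb{Z}/N\mathbb{Z} & \longrightarrow & \mathbb{Z}/p^{t}\mathbb{Z} \\
 & \searrow & \downarrow & \swarrow & \\
 & & \mathbb{Z}/p\mathbb{Z} & &
\end{array}$$

where the horizontal arrows are the natural embeddings, and the other arrows are the restrictions of $g_i$.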

But the number of pre-images in of every element in is , and the same logic shows that the number of pre-images in of every element in is . Therefore the number of pre-images in of every element in is or .

Hence if is a random variable with uniformly distribution on

, its probability at every point in

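is , then the probability of at every point in are or . The same logic shows that the probability of at every point in are or .

Let , we have the commutative diagram for any :

$$\begin{array}{ccc}
\mathbb{Z}/p^{2s}\mathbb{Z} & \longrightarrow & \mathbb{Z}/p^{2s}\mathbb{Z} \\
\downarrow {\scriptstyle (g_i, g_j)} & & \downarrow {\scriptstyle (g_i, g_j)} \\
\mathbb{Z}/p\mathbb{Z} \times \mathbb{Z}/p\mathbb{Z} & \longrightarrow & \mathbb{Z}/p\mathbb{Z} \times \mathbb{Z}/p\mathbb{Z}
\end{array}$$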

where the up horizontal arrow is defined by , and the down horizontal arrow is defined by . Both the horizontal arrows are bijections.

Because we know that when runs over all the pairs in the down horizontal map maps to all the pairs in . Therefore all the number of pre-images in of any element in are same, and hence equal to .

A similar method shows that if is a random variable with uniformly distribution on , the joint probability of at every point in are or .

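We know that the mutual information of $X_i$ and $X_j$ is $I(X_i; X_j) = \sum_{a, b} P(X_i = a, X_j = b) \log \frac{P(X_i = a, X_j = b)}{P(X_i = a) P(X_j = b)}$.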

a.) When , i.e. , we know and on ’s point in and on other points. Hence we have

However, implies that , hence we have

b.) When , i.e. , we have

Because and , we have

However, implies that , hence we have

∎

### 3.2 Remainder CC and Gauss CC

Theorems 2.3 and 2.4 tell us that the Remainder CC and the Gauss CC satisfy the “Classes high separable” property. In fact, they also satisfy the property “Base learners independence”.

###### Theorem 3.2.

Let $f$ be a Remainder CC with moduli $m_1, \dots, m_n$, and let $x$ be selected uniformly at random from $\mathbb{Z}/N\mathbb{Z}$. Then for any $i \neq j$, the mutual information of $f_i(x)$ and $f_j(x)$ is approximately 0.

Proof.

Let and for every . We have that the probabilities of at every point in are or and the probabilities of at every point in are or by using the similar method in the proof of Theorem 3.1.

We know that the mutual information of and is

a.) When , we have and hence and on ’s point in and on other points. Hence we have

b.) When , we have , and

Because

We have

This theorem tells us that the Remainder CC satisfies the property “Base learners independence”.

Similarly, we have

###### Theorem 3.3.

Let $f$ be a Gauss CC, and let $x$ be selected uniformly at random from the label set $\mathbb{Z}[i] \cap \bar{D}(R)$. Then for any $i \neq j$, the mutual information of $f_i(x)$ and $f_j(x)$ is approximately 0.

∎

This theorem tells us that the Gauss CC also satisfies the property “Base learners independence”.

### 3.3 Decoding Algorithm

Suppose we use the LM to reduce a classification problem with $N$ classes to classification problems with $N_i$ classes, and train a base learner for every $i$, so that the output of every base learner is a distribution on $\mathbb{Z}/N_i\mathbb{Z}$. Now, for an input feature vector, how do we combine the outputs of the base learners to get the predicted label?

In this paper, we search the such that is maximal, and let such be the decoded label. (In fact, , where is the Delta distribution at , and is the marginal distribution of induced by .)
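A brute-force decoder over all labels can be sketched as follows. The scoring rule is an assumption made for this sketch, since the exact criterion is not recoverable from the text: each candidate label is scored by the sum of the log-probabilities that the base learners assign to its site values, and the label with the highest score is returned. The function and variable names are illustrative.

```python
from math import log


def decode(site_probs, code, num_classes, floor=1e-12):
    """Return the label whose codeword has the highest total log-probability.

    site_probs[i][v] is the probability that base learner i assigns to site value v.
    Summing log-probabilities is an assumed scoring rule, not taken from the text.
    """
    def score(label):
        return sum(
            log(max(probs[value], floor))  # floor guards against log(0)
            for probs, value in zip(site_probs, code(label))
        )

    return max(range(num_classes), key=score)


def toy_code(x):
    """Toy two-site code with sites of size 3 and 4."""
    return (x % 3, x % 4)


# Base learner outputs that favour site values 1 and 3, i.e. the codeword of label 7.
site_probs = [[0.1, 0.8, 0.1], [0.05, 0.05, 0.1, 0.8]]
print(decode(site_probs, toy_code, 12))  # 7
```

The exhaustive search costs one pass over all labels per input, which is practical for label sets of the size used in the experiments but may call for a smarter search on much larger ones.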

### 3.4 Numeric Experiments

We use the Inception V3 network and LM on the dataset “CJK characters”. CJK is a collective term for the Chinese, Japanese, and Korean languages, all of which use Chinese characters and derivatives (collectively, CJK characters) in their writing systems. The dataset “CJK characters” consists of grey-level images of size 139x139 of the 20901 CJK characters (0x4e00 to 0x9fa5) in 8 fonts.

We use 7 fonts as the training set and the remaining font as the test set. We use the Inception V3 network as the base learner, and train the networks with batch size 128 and 100 batches per epoch.

We use the following three CCs, together with an ECOC baseline, and obtain the performance shown in Table 1.

a. The polynomial CCs with k=2 and p=181. These Polynomial CCs are defined by , where , and , and r=2 or r=6.

b. The Remainder CCs with k=2 and . These Remainder CCs are defined by , where , , and .

c. The Gauss CCs with k=2 and . These Gauss CCs are defined by , where , and , and r=2 or r=6.

d. ECOC of 15 bit.

| Epoch | ECOC of 15 bit | Poly. CC of 2 sites | Rem. CC of 2 sites | Gauss CC of 2 sites | Poly. CC of 6 sites | Rem. CC of 6 sites | Gauss CC of 6 sites |
|---|---|---|---|---|---|---|---|
| 20 | 0.0069 | 0.0118 | 0.0081 | 0.0017 | 0.0640 | 0.0459 | 0.0230 |
| 40 | 0.0795 | 0.6657 | 0.6130 | 0.4308 | 0.9878 | 0.9667 | 0.9760 |
| 60 | 0.3660 | 0.8172 | 0.7629 | 0.8436 | 0.9968 | 0.9962 | 0.9966 |
| 80 | 0.5740 | 0.8684 | 0.8757 | 0.9195 | 0.9988 | 0.9983 | 0.9985 |
| param. num | | | | | | | |

We can see that, even though the CCs with 2 sites use far fewer base learners (2) than ECOC (15), the CCs perform better than ECOC, whose networks have more trainable parameters than those of the CCs.

## 4 Application to feature encoding

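For a categorical feature taking values in $\mathbb{Z}/N\mathbb{Z}$, where $N$ is a huge integer, we can use the composite of a CC and the natural embedding

$$\prod_{i=1}^{n} \mathbb{Z}/N_i\mathbb{Z} \to \prod_{i=1}^{n} \{0,1\}^{N_i}$$

to get an $n$-hot encoding. We use this $n$-hot encoding as the feature encoding.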

Apart from CC feature encoding, more natural ideas for feature encoding are:

COO. Cut-off of one-hot encoding. We call a -bit binary code the “cut-off of one-hot” if the most frequently used IDs are one-hot encoded in the front bits, and all the other IDs are encoded to the code .

RMP. Using a code frequently used in error-correcting coding, for example a Reed-Muller code RM with punch by a random subset of bits. For a binary code and a subset of elements, the punch of by means the code .

We will show that our Polynomial CC, Remainder CC and Gauss CC perform better than both the COO and the RMP codes.
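To make the RMP baseline concrete, here is a minimal sketch of a first-order Reed-Muller code RM(m,1) restricted to a random subset of bit positions. Two points are assumptions of this sketch rather than statements of the text: an ID is mapped to a message through its binary digits, and the punch keeps the positions in the chosen subset. The function names are illustrative.

```python
import random


def id_to_message(identifier, m):
    """Binary digits (a_0, a_1, ..., a_m) of an ID below 2**(m + 1)."""
    if not 0 <= identifier < 2 ** (m + 1):
        raise ValueError("ID does not fit into m + 1 message bits")
    return [(identifier >> j) & 1 for j in range(m + 1)]


def rm1_codeword(message, m):
    """RM(m,1) codeword: values of a_0 + a_1*v_1 + ... + a_m*v_m (mod 2) at all v in {0,1}^m."""
    a0, rest = message[0], message[1:]
    word = []
    for v in range(2 ** m):
        bit = a0
        for j in range(m):
            bit ^= rest[j] & (v >> j) & 1
        word.append(bit)
    return word


def encode_id(identifier, m, kept_positions):
    """Punched RM(m,1) encoding of an ID, keeping only the given positions."""
    word = rm1_codeword(id_to_message(identifier, m), m)
    return [word[position] for position in kept_positions]


# 582 random positions out of the 4096 bits of RM(12,1), as in a 582-bit UserID code.
kept = sorted(random.Random(0).sample(range(2 ** 12), 582))
print(len(encode_id(6040, 12, kept)))  # 582
```

Unlike the $n$-hot CC encodings, the resulting vectors are dense: every nonconstant RM(m,1) codeword has exactly half of its $2^m$ bits set, so a random subset of positions keeps roughly half ones.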

### 4.1 Numeric Experiments

We use the dataset “MovieLens” (Movie_Lends), which has the columns UserID, MovieID, Rating and Timestamp. The UserIDs range between 1 and 6040, and the MovieIDs range between 1 and 3952; ratings are made on a 5-star scale, and the timestamp is represented in seconds. Each user has at least 20 ratings. We use only the columns UserID, MovieID and Rating, and use a DNN with an embedding layer and two fully connected layers. In the embedding layer, the User code and the Movie code are each embedded into real vectors of dimension 32; the output dimensions of the two fully connected layers are 64 and 1 respectively. After the first fully connected layer we use ReLU; after the second fully connected layer we use . We use this network as a regression model, and train it by minimizing the MSE. The ratio between training data and validation data is 8:2. We compare the validation loss of the following methods:

1. 582-bit cut-off of the one-hot code for UserID, and 474-bit cut-off of the one-hot code for MovieID.

2. 582-bit random punch of RM(12,1) for UserID, and 474-bit random punch of RM(11,1) for MovieID.

3. 582-bit 6-hot Polynomial code based on a finite field for UserID, and 474-bit 6-hot Polynomial code based on a finite field for MovieID.

4. 582-bit Remainder code with moduli for UserID, and 474-bit Remainder code with moduli for MovieID.

5. 582-bit Gauss code with moduli for UserID, and 474-bit Remainder code with moduli
