The purpose of this paper is to develop a method to predict the
probability that a Chinese student is admitted when applying to graduate school.
The paper uses a logistic regression algorithm based on student features, such as GPA, GRE and TOEFL scores,
to find their relation to the final result of the application.
First, a dataset is collected from Chinese forum posts [1], which cover
a large amount of previous students' application data.
The raw data must be scrubbed before training and
transformed into a suitable format. The algorithm is then trained on the data to find the optimal
coefficients for classification. The application takes instance
features as input data and outputs structured numeric values. The
trained model performs a simple regression calculation on this input data and
determines which class the input data should belong to. Given a
student's features, the output is the probability of being accepted by the graduate school.
Keywords Machine Learning,
Each year universities in the United States receive many applications from
different countries, including numerous Chinese students who apply to graduate
school intending to go further in their academic research area. Each graduate
school sets qualifications for students, such as a minimum
undergraduate grade point average (GPA). The Graduate Record
Examination (GRE) is a
standardized test that is an admission requirement for most graduate schools in the
United States. International students are also required to submit a Test of English as a Foreign
Language (TOEFL) score. Graduate admission involves many uncertain
factors, but most schools have admission requirements. Even for the
strongest applicants
there are random factors. Not all students with good grades can be accepted into
There is no doubt that many factors affect admission, but
a few of them have an enormous impact on admission
rates: GPA and GRE scores. The TOEFL score also serves as a standard to verify
whether international students have the ability to attend better
universities. No one really knows how universities evaluate and filter
students from a large number of applications based on these indicators.
Because some of the data cannot be measured by a precise numerical value,
this paper does not consider a student's academic research experience or background.
2.1 Collecting Dataset
All the data were collected from public sources. The Chinese forum gter.com
includes a large number of posts with previous students' application data. The forum posts contain detailed information
about each student's GPA, GRE and TOEFL scores. Students also post the list of
graduate schools they applied to and the application results. An open source web crawler from
GitHub was used to collect the data from the forum.
2.2 Data Cleaning and
Even though a web crawler can collect a large amount of data from the internet,
some of the data are meaningless for this project. Data cleaning is especially
required when integrating heterogeneous data sources, where some of the data values
are missing or anomalous. Some of the data formats should be addressed together
with schema-related data. [2] To improve the quality of the data, data
cleaning deals with detecting and removing errors and inconsistencies.
Most of the data are reliable, and each instance in the
data set includes features. Because the dataset comes from the internet, some important values are unavoidably missing. Data
cleaning needs to modify or remove data according to requirements. One
method to solve this problem is to use the mean of all the available
data to substitute for the missing value. Some instances, however, are missing too
many features, which may mislead the final result; in this case we have to
To quantify the data, the raw data are transformed into schema data, which
facilitates processing. Some of the Chinese text needs to be translated
into English, and measurable data are stored as numbers instead of strings.
For example, a received offer is marked as 1 and a rejection as 0 as one instance
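The cleaning and transformation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample records and the `MAX_MISSING` cutoff are hypothetical, and dropping records with too many missing features is an assumed policy rather than one confirmed by the text.

```python
from statistics import mean

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"gpa": "3.6", "gre": "320", "toefl": "102", "result": "offer"},
    {"gpa": "3.2", "gre": None, "toefl": "95", "result": "rejected"},
    {"gpa": None, "gre": None, "toefl": None, "result": "offer"},
    {"gpa": "3.8", "gre": "328", "toefl": None, "result": "offer"},
]

FEATURES = ("gpa", "gre", "toefl")
MAX_MISSING = 1  # assumed cutoff: drop records missing more than one feature


def clean(records):
    # Convert strings to numbers and encode the result as 1 (offer) or 0 (rejected).
    parsed = []
    for rec in records:
        row = {f: float(rec[f]) if rec[f] is not None else None for f in FEATURES}
        row["admitted"] = 1 if rec["result"] == "offer" else 0
        parsed.append(row)

    # Drop records that are missing too many features (assumed policy).
    kept = [r for r in parsed if sum(r[f] is None for f in FEATURES) <= MAX_MISSING]

    # Replace each remaining missing value with the mean of the available values.
    for f in FEATURES:
        fill = mean(r[f] for r in kept if r[f] is not None)
        for r in kept:
            if r[f] is None:
                r[f] = fill
    return kept


cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

With this toy input, the third record is dropped and the two remaining gaps are filled with the column means.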
3.1 Logistic Regression Applied to Admission Results
The admission result can be modeled as a binary target variable:
the applicant was either admitted to the program or rejected. Logistic regression gives the probability of
occurrence of an event as a function of a set of independent
variables. The data features are the independent variables: each student has a unique
set of test scores, GRE and TOEFL grades and cumulative grade point average
3.2 Data Features
Graduate Record Exam (GRE), a generalized test for prospective graduate
students including verbal reasoning and quantitative reasoning; each section is scored
between 130 and 170.
Test of English as a Foreign Language (TOEFL), a standardized test to measure the
English language ability of non-native speakers; continuous between 0
GPA, cumulative grade point average; continuous between 0.0 and 4.0.
Admission result, a binary variable, 0 or 1, where 1 means the applicant
was admitted to the program.
3.3 From Logistic Regression to an Optimization Problem
Logistic regression is a popular classification method that limits
its output to the range between 0 and 1. Denote the possible observations by 0 and 1; each series
of trials therefore gives a sequence of 0's and 1's. A value of 1 means the
applicant was admitted to the program and 0 means rejected. [3]
The sigmoid function has the behavior logistic regression needs: it acts like a
smooth step function between the values 0 and 1,
$$g(z) = \frac{1}{1 + e^{-z}} \tag{3.1}$$
Figure 3.1 shows that on a larger scale the sigmoid appears similar to a
step function at x=0.
Based on Figure 3.1, an instance is classified as 1 when g(z) is greater than 0.5 and
as 0 when g(z) is below 0.5.
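As a minimal sketch of this decision rule, the sigmoid and the 0.5 threshold can be written as follows. The function names are hypothetical, the two-branch form is only there to avoid overflow for large negative z, and assigning class 0 when g(z) is exactly 0.5 is an assumption, since the text does not cover that case.

```python
import math


def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), written in two branches so that
    # math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z, threshold=0.5):
    # Class 1 when g(z) is above the threshold, otherwise class 0
    # (the tie at exactly 0.5 goes to class 0 by assumption).
    return 1 if sigmoid(z) > threshold else 0


print(sigmoid(0.0))    # the midpoint of the curve
print(classify(2.0))   # positive z falls in class 1
print(classify(-2.0))  # negative z falls in class 0
```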
$$z = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = f(x) = \sum_{i=0}^{n} w_i x_i = w^T x \tag{3.2}$$
The two vectors are multiplied element by element and the products are summed over all the features. In vector
notation this is written z=w^T x. The vector x holds the instance's features as input data, and we want to find the best coefficients w.
The binomial logistic regression model classifies an instance into one of two
outputs. As with any other classification method, the output can be interpreted as
the probability of an event given the input. P(y | x)
denotes the conditional probability distribution, where the variable y is 0 or 1.
$$P(y=1 \mid x; w) = \frac{1}{1 + e^{-w^T x}} = g(w^T x) \tag{3.3}$$
$$P(y=0 \mid x; w) = 1 - g(w^T x) \tag{3.4}$$
Combining equations (3.3) and (3.4):
$$P(y \mid x; w) = \left(g(w^T x)\right)^{y} \left(1 - g(w^T x)\right)^{1-y} \tag{3.5}$$
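For reference, the combined form (3.5) leads to the quantity that is usually optimized. The sketch below assumes m independent training instances; the symbols m, x^{(j)}, y^{(j)} and \ell are introduced here for illustration and do not appear in the original text.

```latex
% Log-likelihood of m independent instances, obtained by taking the
% logarithm of the product of (3.5) over the training set:
\ell(w) = \sum_{j=1}^{m} \left[ y^{(j)} \log g\!\left(w^T x^{(j)}\right)
        + \left(1 - y^{(j)}\right) \log\!\left(1 - g\!\left(w^T x^{(j)}\right)\right) \right]

% Its gradient with respect to w, using g'(z) = g(z)\,(1 - g(z)):
\frac{\partial \ell(w)}{\partial w}
  = \sum_{j=1}^{m} \left( y^{(j)} - g\!\left(w^T x^{(j)}\right) \right) x^{(j)}
```

Maximizing \ell(w) is the same as minimizing its negative, so either form can serve as the objective of the search for w.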
To optimize this function, we need another
optimization algorithm to get the best result. After choosing the logistic
regression model and selecting the initial feature set, the next step is to
obtain the optimal parameters so that the trained logistic regression model
produces the best classification results.
This process can be regarded as a search: how do we find, in the solution space of
logistic regression models, the solution that matches the model we designed?
To obtain the corresponding optimal logistic
regression model, we need to design a search strategy and decide which
criteria determine the optimal model.
$$w^{t+1} = w^{t} - \alpha \frac{\partial L(w)}{\partial w}$$
This equation introduces \alpha (0 < \alpha < 1), which is called the learning rate.
Gradient ascent is a method to find a local optimum of a
function by using gradient information. It is also one of the simplest and most
commonly used optimization methods in machine learning. In this logistic
regression problem we need to find the maximum of the likelihood, so each step moves
uphill along its gradient; this is equivalent to making a cost function such as the
negative log-likelihood smaller at every step. The update is iterated in the same way to
find the optimal value.
Pseudo code for the gradient ascent
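One possible rendering of this procedure in Python is sketched below. It is a batch gradient ascent on the log-likelihood, not necessarily the authors' exact routine: the function names, the default `alpha` and `iterations`, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    # Overflow-safe form of g(z) = 1 / (1 + e^(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient_ascent(X, y, alpha=0.1, iterations=200):
    # X: list of feature rows, each starting with the constant x_0 = 1.0.
    # y: list of 0/1 labels. Returns the learned weight vector w.
    n = len(X[0])
    w = [0.0] * n
    for _ in range(iterations):
        # error_j = y_j - g(w^T x_j) for every training instance
        errors = [
            yj - sigmoid(sum(wi * xi for wi, xi in zip(w, xj)))
            for xj, yj in zip(X, y)
        ]
        # Move each weight uphill along the log-likelihood gradient.
        for i in range(n):
            grad_i = sum(e * xj[i] for e, xj in zip(errors, X))
            w[i] += alpha * grad_i
    return w


# Toy one-feature data set (hypothetical): negative x rejected, positive x admitted.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
w = gradient_ascent(X, y)
predictions = [
    1 if sigmoid(sum(wi * xi for wi, xi in zip(w, row))) > 0.5 else 0 for row in X
]
print(w, predictions)
```

Real features such as GRE or TOEFL scores have much larger magnitudes than this toy input, so they would likely need to be rescaled before a fixed learning rate of this size behaves well.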
Figure 1 shows the GRE score and the rate
of acceptance. The most common scores lie between 315 and 330, and applicants with GRE scores higher than 315
have a higher probability of acceptance.
Figure 2 TOEFL result with acceptance rate
The TOEFL results show more clearly that the higher the TOEFL score, the higher the admission rate.
Figure 3 ROC
and ROC Curve
To measure the impact of GRE and TOEFL grades on the admission rate, we use the predictive
function, which returns the label value: 0 represents failure and 1
represents success. The function also returns the probability value as the predicted probability.
The first term is Precision = TP/(TP+FP), where TP, FP and FN are the numbers of true positives, false positives and false negatives. Precision tells us the fraction of
records that were positive from the group that the classifier predicted to be
positive. The second term we care about is Recall = TP/(TP+FN). Recall measures the fraction of positive records that the classifier identified correctly.
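The two definitions can be checked on a small example. The sketch below counts TP, FP, FN and TN directly from label lists; the function names and the sample labels are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def confusion_counts(y_true, y_pred):
    # Count the four outcomes for binary labels (1 = admitted, 0 = rejected).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    # Fall back to 0.0 when nothing was predicted or labeled positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: actual results and the classifier's predictions.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
print(counts, precision, recall)
```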
The ROC curve can be used to compare classifiers and to analyze the costs
and benefits of decisions [5]. Different classifiers may perform better for different
thresholds, and combining them in some way may make sense. In this model, we
plot GRE score with acceptance rate as the orange curve in Figure 3, and TOEFL score
with acceptance rate as the orange curve in Figure 3.
The x-axis in Figure 3 is the false positive rate FP/(FP+TN), and the y-axis
is the true positive rate TP/(TP+FN). The ROC curve shows how the two
rates change with the threshold. The leftmost point corresponds to classifying every instance as the negative
category, and the rightmost point corresponds to classifying every instance as the positive category.
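A short sketch of how such a curve can be traced: sort the instances by predicted probability and lower the threshold one instance at a time, recording the false positive rate FP/(FP+TN) and the true positive rate TP/(TP+FN) at each step. The function name and sample scores are hypothetical, and the code assumes distinct scores and at least one instance of each class.

```python
def roc_points(y_true, scores):
    # Returns (false positive rate, true positive rate) pairs, starting at
    # (0, 0) where everything is classified negative and ending at (1, 1)
    # where everything is classified positive.
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold admits instances in order of decreasing score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical predicted probabilities and true admission results.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3]
points = roc_points(y_true, scores)
print(points)
```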
Logistic regression finds the best-fit parameters of a
nonlinear function called the sigmoid. Although logistic regression is used
for classification, the algorithm is still built on linear regression:
a sigmoid mapping is added when
mapping the features to the result. The features are first combined in a linear sum, and a
sigmoid function is then applied to the sum to make the prediction. Optimization methods can be used to find the
best-fit parameters. Among the optimization algorithms, one of the most
commonly used is gradient ascent. Gradient ascent can
be simplified by stochastic gradient ascent.