Questions tagged [machine-learning]
The machine-learning tag has no usage guidance.
165
questions
2
votes
1
answer
40
views
Non-linear transforms of RKHS question
I was reading the paper Norm Inequalities in Nonlinear Transforms (referenced in this question) but ran into difficulties, so I was wondering if anyone could help?
I think I follow the paper until I ...
52
votes
10
answers
6k
views
A clear map of mathematical approaches to Artificial Intelligence
I have recently become interested in Machine Learning and AI as a student of theoretical physics and mathematics, and have gone through some of the recommended resources dealing with statistical ...
1
vote
0
answers
59
views
Approximation of a continuous function by a multilayer ReLU neural network
For a continuous/Hölder function $f$ defined on a compact set $K$, a fixed $L$, and widths $m_1,m_2,\dots,m_L$, can we find a multilayer ReLU fully connected network $g$ with depth $L$ whose $i$-th layer has width ...
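For concreteness, a minimal numpy sketch (illustrative random weights, my own example, not from the question) of the object being asked about: a depth-$L$ fully connected ReLU network whose $i$-th hidden layer has width $m_i$.

    import numpy as np

    def relu_net(x, weights, biases):
        # depth-L fully connected ReLU network; hidden layer i has width m_i
        h = x
        for W, b in zip(weights[:-1], biases[:-1]):
            h = np.maximum(W @ h + b, 0.0)   # ReLU activation
        return weights[-1] @ h + biases[-1]  # linear output layer

    rng = np.random.default_rng(0)
    d, widths = 3, [8, 8, 8]                 # input dim and m_1, m_2, m_3 (L = 3)
    dims = [d] + widths + [1]
    weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    biases = [rng.normal(size=dims[i + 1]) for i in range(len(dims) - 1)]
    print(relu_net(np.ones(d), weights, biases))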
1
vote
2
answers
191
views
Beating the $1/\sqrt n$ rate of uniform-convergence over a linear function class
Let $P$ be a probability distribution on $\mathbb R^d \times \mathbb R$, and let $(x_1,y_1), \ldots, (x_n,y_n)$ be an iid sample of size $n$ from $P$. Fix $\epsilon,t\gt 0$. For any unit-vector $w \in ...
1
vote
0
answers
100
views
Matrix valued word embeddings for natural language processing
In natural language processing, an area of machine learning, one would like to represent words as objects that can easily be understood and manipulated using machine learning. A word embedding is a ...
3
votes
1
answer
145
views
Why is the logistic regression model good? (and its relation with maximizing entropy)
Suppose we're trying to train a classifier $\pi$ for $k$ classes that takes as input a feature vector $x\in\mathbb{R}^n$ and outputs a probability vector $\pi(x)\in\mathbb{R}^k$ such that $\sum_{v=1}^...
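A minimal numpy sketch of the softmax classifier this setup leads to (illustrative random data and weights, my own example; the maximum-entropy connection is what the question probes):

    import numpy as np

    def softmax(Z):
        # row-wise softmax with the usual max-subtraction for stability
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def pi(W, b, X):
        # pi(x) = softmax(Wx + b): each row is a probability vector over k classes
        return softmax(X @ W + b)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))     # 5 feature vectors, n = 3
    W = rng.normal(size=(3, 4))     # k = 4 classes
    b = np.zeros(4)
    print(pi(W, b, X).sum(axis=1))  # each row sums to 1, as required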
9
votes
1
answer
283
views
Who introduced the term hyperparameter?
I am trying to find the earliest use of the term hyperparameter. Currently, it is used in machine learning but it must have had earlier uses in statistics or optimization theory. Even the multivolume ...
2
votes
0
answers
77
views
Equivalence of score function expressions in SDE-based generative modeling
I am studying the paper "Score-Based Generative Modeling through Stochastic Differential Equations" (arXiv:2011.13456) by Song et al. The authors use the following loss function (Equation 7 ...
8
votes
1
answer
480
views
Geometric formulation of the subject of machine learning
Question:
What is the geometric interpretation of the subject of machine learning and/or deep learning?
Being "forced" to have a closer look at the subject, I have the impression that it ...
1
vote
0
answers
95
views
Solutions to the problems in "Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning" [closed]
Where can I find the solutions to the problems in the book "Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning"?
3
votes
0
answers
38
views
Prove the convergence of the LASSO model under restricted eigenvalues
I am researching the properties of the Lasso model $\hat \beta:= \operatorname{argmin} \{\|Y-X\beta\|_2^2/n+\lambda\|\beta\|_1\}$, specifically its convergence when the data satisfies restricted ...
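For concreteness, a small proximal-gradient (ISTA) sketch for this exact objective on synthetic data of my own choosing (a standard solver sketch, not the restricted-eigenvalue convergence argument being asked about):

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, iters=500):
        n, p = X.shape
        beta = np.zeros(p)
        step = n / (2 * np.linalg.norm(X, 2) ** 2)    # 1/L for the smooth part
        for _ in range(iters):
            grad = 2 * X.T @ (X @ beta - y) / n       # gradient of ||y - X beta||^2 / n
            beta = soft_threshold(beta - step * grad, step * lam)
        return beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    beta_true = np.zeros(10); beta_true[:3] = [2.0, -1.0, 0.5]
    y = X @ beta_true + 0.1 * rng.normal(size=50)
    print(np.round(lasso_ista(X, y, lam=0.1), 2))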
8
votes
0
answers
119
views
Worst margin when halving a hypercube with a hyperplane
Consider the $n$-cube $C_n=\lbrace-1,1\rbrace^n$ and the problem of partitioning it into halves with hyperplanes through the origin that avoid all its points. We can parameterize the hyperplanes by ...
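A brute-force numpy sketch of the quantity in play (small $n$ only; random Gaussian normals, which avoid all vertices with probability 1, are my own illustrative choice):

    import itertools
    import numpy as np

    n = 6
    pts = np.array(list(itertools.product([-1, 1], repeat=n)))  # all of C_n

    def worst_margin(w):
        # minimum distance from a vertex of C_n to the hyperplane w.x = 0
        return np.abs(pts @ w).min() / np.linalg.norm(w)

    rng = np.random.default_rng(0)
    print(max(worst_margin(rng.normal(size=n)) for _ in range(1000)))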
1
vote
0
answers
49
views
Curve fitting with "rough" loss functions
Many real-valued classification and regression problems can be framed as minimization in the following way.
Setup:
Let $\Theta \subseteq \mathbb{R}^p$ be the parameter space that we are searching over.
For ...
2
votes
0
answers
423
views
Mathematics research relating to machine learning
Which branch(es) of math are most relevant to enhancing machine learning (mostly in terms of practical use as opposed to theoretical/possible use)? Specifically, I want to know about math research used ...
1
vote
1
answer
97
views
Adjoint sensitivity analysis for a cost functional under an ODE constraint
I am trying to recover the result given by equation 10 in the article here. I am unable to get rid of the integral; any help would be much appreciated. To keep the description as self-contained as ...
2
votes
0
answers
50
views
Convergence of minimiser of empirical risk to minimiser of population risk
Let $X_1, \dots, X_n \sim \mu$ be some random elements of a space $\mathcal{X}$. Let $H$ be a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ with norm $\|\cdot\|_H$.
Let $\|f^*\|_{L_2(\mu)} < \infty$ ...
2
votes
0
answers
42
views
Can we get a family of classifiers $\{f_n\}_{n \in \mathbb{N}}$ such that $\lim_{n\to\infty} \left(\mathbb{E}_{(X_1, Y_1), \ldots, (X_n, Y_n) \sim \rho}[R(f_n)] - R(f_B)\right) = 0$?
For a given classifier $f: \mathbb{R}^d \mapsto\{0,1,2\}$, let
$$
R(f):=\mathbb{E}_{(X, Y) \sim \rho}\left[\mathbb{1}_{f(X) \neq Y}\right]
$$
and let $f_B$ be the Bayes classifier.
Can we get a family of ...
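For reference, the Bayes classifier $f_B$ mentioned here is the standard one,
$$
f_B(x) := \operatorname*{arg\,max}_{y \in \{0,1,2\}} \mathbb{P}(Y = y \mid X = x),
$$
which minimizes $R(f)$ over all measurable classifiers.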
3
votes
0
answers
48
views
How to prove the empirical risk converges to the expected risk as $n\to \infty$?
For example, for a classical binary classification:
$x \in \mathbb{R}^d$ and $y \in\{0,1\}$
let empirical risk be
$R_{\ell}^n(f):=\frac{1}{n} \sum_{i=1}^n \ell\left(f\left(X_i\right), Y_i\right)$
and ...
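For a single fixed classifier this is just the law of large numbers; a quick simulation with a toy distribution of my own choosing illustrates the convergence:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # toy data: X ~ N(0,1), Y = 1{X > 0} with 10% label noise
        X = rng.normal(size=n)
        Y = ((X > 0) ^ (rng.random(n) < 0.1)).astype(int)
        return X, Y

    f = lambda x: (x > 0.5).astype(int)   # a fixed classifier
    for n in [100, 10_000, 1_000_000]:
        X, Y = sample(n)
        print(n, np.mean(f(X) != Y))      # empirical risk settles to its limit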
2
votes
1
answer
81
views
VC-based risk bounds for classifiers on finite set
Let $X$ be a finite set and let $\emptyset\neq \mathcal{H}\subseteq \{ 0,1 \}^{X}$. Let $\{(X_n,L_n)\}_{n=1}^N$ be i.i.d. random variables on $X\times \{0,1\}$ with law $\mathbb{P}$. ...
4
votes
1
answer
264
views
Perceptron / logistic regression accuracy on the n-bit parity problem
$\DeclareMathOperator{\sgn}{sign}$The perceptron (similarly, logistic regression) of the form $y=\sgn(w^T \cdot x+b)$ is famously known for its inability to solve the XOR problem, meaning it can get ...
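A quick empirical check (a plain perceptron training loop of my own, not the poster's exact setup) that a linear model sits near chance accuracy on parity:

    import itertools
    import numpy as np

    n = 4
    X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    y = 2 * (X.sum(axis=1).astype(int) % 2) - 1   # parity labels in {-1, +1}

    w, b = np.zeros(n), 0.0
    for _ in range(1000):                          # perceptron updates
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w += yi * xi
                b += yi

    print("accuracy:", (np.sign(X @ w + b) == y).mean())  # near 0.5 on parity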
1
vote
0
answers
32
views
Convergent gradient-type scheme for solving smooth nonconvex constrained optimization problem
Let $x_1,\ldots,x_n \in \mathbb R^d$ and $y_1,\ldots,y_n \in \{\pm 1\}$, and $\epsilon, h \gt 0$. Define $\theta(t) := Q((t-\epsilon)/h)$, where $Q(z) := \int_{z}^\infty \phi(u)\,\mathrm{d}u$ is the ...
3
votes
0
answers
126
views
What is the meaning of big-O of a random variable?
I encountered this problem in a book "Pattern Recognition and Machine Learning" by Christopher M. Bishop. I excerpt it below:
[screenshot of the book omitted]
In the excerpt, the big-O notation $O(\xi^...
2
votes
0
answers
70
views
Training an energy-based model (EBM) using MCMC
I'm reading this paper about training energy-based models (EBMs) and don't understand which parameters we are training. The part relevant to the question is on pages 1-4. Here is the ...
1
vote
0
answers
145
views
How to maximize a certain function of hundreds of variables related to correlations between sets of vectors? (and win Kaggle :))
It might be helpful for data science/bioinformatics challenges.
Consider for simplicity three rectangular matrices $Y_{true}$, $Y_{predict0}$, $Y_{predict1}$ of the same size, say $70000 \times 140$.
Let us ...
2
votes
0
answers
80
views
Nuclear norm minimization of a convolution matrix (circulant matrix) with the fast Fourier transform
I am reading a paper, Recovery of Future Data via Convolution Nuclear Norm Minimization, where the convolution matrix is defined as follows.
Given any vector $\boldsymbol{x}=(x_1,x_2,\ldots,x_n)^...
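A small numpy check of the FFT fact the paper exploits (my own illustration): the singular values of the convolution (circulant) matrix of $\boldsymbol{x}$ are $|\widehat{x}_k|$, so its nuclear norm equals $\sum_k |\widehat{x}_k|$.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=8)

    # circulant convolution matrix: column j is x cyclically shifted by j
    C = np.column_stack([np.roll(x, j) for j in range(len(x))])

    svals = np.linalg.svd(C, compute_uv=False)
    fvals = np.abs(np.fft.fft(x))
    print(np.allclose(np.sort(svals), np.sort(fvals)))   # True
    print("nuclear norm:", svals.sum(), "=", fvals.sum())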
1
vote
0
answers
75
views
Distribution-free learning vs distribution-dependent learning
I came across some papers studying the problem of distribution-free learning, and I am interested in knowing the exact definition of distribution-free learning.
I have searched some literature:
In ...
4
votes
0
answers
116
views
Progress on "Un-Alching" ML?
So, a couple of years ago I watched both Ali Rahimi's NIPS speech "Machine Learning is Alchemy",
(where he talks about how the field lacks a solid, overarching, theoretical foundation) and ...
2
votes
0
answers
41
views
Combining SVD subspaces for low dimensional representations
Suppose we have matrix $A$ of size $N_t \times N_m$, containing $N_m$ measurements corrupted by some (e.g. Gaussian) noise. An SVD of this data $A = U_AS_A{V_A}^T$ can reveal the singular vectors $U_A$...
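A minimal numpy sketch of the single-matrix case (synthetic low-rank data plus noise, my own illustrative setup), before any question of combining subspaces arises:

    import numpy as np

    rng = np.random.default_rng(0)
    Nt, Nm, r = 200, 50, 3
    A = rng.normal(size=(Nt, r)) @ rng.normal(size=(r, Nm)) \
        + 0.05 * rng.normal(size=(Nt, Nm))    # rank-r signal + Gaussian noise

    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = (U[:, :r] * S[:r]) @ Vt[:r]         # rank-r reconstruction
    print(S[:6].round(2))                     # sharp drop after the r-th value
    print(np.linalg.norm(A - A_r) / np.linalg.norm(A))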
1
vote
0
answers
102
views
Can I minimize a mysterious function by running gradient descent on its neural net approximations? [closed]
A cross post from AI StackExchange.
So I have this function, call it $F:[0,1]^n \rightarrow \mathbb{R}$, and say $10 \le n \le 100$. I want to find some $x_0 \in [0,1]^n$ such that $F(x_0)$ is ...
1
vote
0
answers
52
views
How to calculate the uniform entropy or VC dimension of the following class of functions?
When dealing with U-processes I encounter the following uniform entropy calculation.
For any $\eta>0$, function class $\mathcal{F}$ containing functions $f=\left(f_{i, j}\right)_{1 \leq i \neq j \leq n}: \...
3
votes
1
answer
229
views
Independent input feature z can be removed: if y=f(x+z,z), then y=g(x)?
Let $y\in \mathbb{R}$ be a random variable and $\mathbf{x},\mathbf{z}\in\mathbb{R}^p$ be random vectors. Assume $y=f(\mathbf{x}+\mathbf{z},\mathbf{z})$ for some function $f$.
Is the following statement ...
1
vote
0
answers
46
views
Sample Complexity/PAC-Learning Notation
In PAC Learning, Sample Complexity is defined as:
The function $m_\mathcal{H} : (0,1)^2 \rightarrow \mathbb{N}$ determines the sample complexity of learning $\mathcal{H}$:
that is, how many examples ...
1
vote
0
answers
139
views
Stochastic Gradient Descent
I am not really sure how to approach this question, as I am a beginner in optimisation.
Consider the function
$f : B_1 \to \mathbb{R}$ with $f(x) = \lVert x \rVert_2^2$ and $B_1 := \{$...
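Not the assignment's exact scheme, but a minimal projected stochastic gradient sketch for $f(x) = \lVert x \rVert_2^2$ on the unit ball (the noise level and $1/k$ steps are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))          # start inside B_1

    for k in range(1, 2001):
        g = 2 * x + 0.1 * rng.normal(size=d)  # noisy gradient of ||x||_2^2
        x -= g / k                            # decreasing step size 1/k
        nrm = np.linalg.norm(x)
        if nrm > 1.0:                         # project back onto B_1
            x /= nrm

    print(np.linalg.norm(x))                  # approaches 0, the minimizer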
5
votes
2
answers
274
views
Entropy & difference between max and min values of probability mass
Let $X$ be a random variable with probability mass function $p(x) = \mathbb{P}[X = x]$.
I know entropy $H(X)$ of $X$ measures the uncertainty of $X$ and
a large value of $H(X)$ means $p(x)$ is nearly ...
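A tiny numpy illustration of the two quantities side by side (my own toy pmfs): as the mass concentrates, the entropy falls and the max-min gap grows.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                          # 0 log 0 = 0 convention
        return -(p * np.log2(p)).sum()

    for p in ([0.25, 0.25, 0.25, 0.25],       # uniform: max entropy, zero gap
              [0.70, 0.10, 0.10, 0.10],
              [0.97, 0.01, 0.01, 0.01]):
        print(entropy(p), max(p) - min(p))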
1
vote
1
answer
177
views
Using Hoeffding inequality for risk / loss function
I've got a question about the Hoeffding inequality, which states that for data points $X_1, \dots, X_n \in X$ that are i.i.d. according to a probability measure $P$ on $X$, we find an upper bound for:
$...
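For a single fixed hypothesis $f$ and a loss $\ell$ taking values in $[0,1]$, the bound being invoked is the standard two-sided Hoeffding inequality (generic statement, not necessarily the poster's exact constants):
$$
\mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^n \ell(f(X_i)) - \mathbb{E}_P[\ell(f(X))] \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}.
$$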
20
votes
3
answers
3k
views
How can Machine Learning help “see” in higher dimensions?
The news that DeepMind had helped mathematicians in research (one in representation theory, and one in knot theory) certainly got many thinking, what other projects could AI help us with? See MO ...
2
votes
0
answers
196
views
Covering/Bracketing number of monotone functions on $\mathbb{R}$ with uniformly bounded derivatives
I am interested in the $\| \cdot \|_{\infty}$-norm bracketing number or covering number of some collection of distribution functions on $\mathbb{R}$.
Let $\mathcal{F}$ consist of all distribution ...
1
vote
0
answers
95
views
Limit cycles or stable solutions for k-dimensional piece-wise linear ODEs
As a branch of reinforcement learning, restless multi-armed bandits have been shown to be PSPACE-hard, but Whittle has offered an implementable solution called the Whittle Index Policy. Weber and Weiss ...
1
vote
0
answers
86
views
If two functions are close, can I prove that the difference of their empirical losses is also small?
I am trying to understand the proof of Theorem 3 in the paper "A Universal Law of Robustness via Isoperimetry" by Bubeck and Sellke.
Basically, there exists at least one $w_{L,e}$ in $\...
2
votes
0
answers
43
views
Convergent algorithm for minimizing nonconvex smooth function
Let $\Phi$ be the Gaussian CDF and for $\gamma\ge 0$ and $h>0$, define a loss function $\ell_{\gamma,h}:\{\pm 1\} \times \mathbb R \to \mathbb R$ by
$$
\ell_{\gamma,h}(y,y') := \phi_{\gamma,h}(yy') := \Phi((yy'-\gamma)/h)...
0
votes
0
answers
32
views
Normalizing a parameter in a regression
I am thinking about the possibility of constraining a parameter in my regression, let's say the $\lambda$ in a ridge regression, to lie in a range: $\lambda \in [0,1]$. Do you have any ideas how I ...
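One common trick (a suggestion of mine, not taken from the question): optimize an unconstrained $t \in \mathbb{R}$ and map it through a sigmoid, so the effective $\lambda$ always lands in $(0,1)$.

    import numpy as np

    def to_unit_interval(t):
        # sigmoid: any real t maps to lambda in (0, 1)
        return 1.0 / (1.0 + np.exp(-t))

    def from_unit_interval(lam, eps=1e-12):
        # inverse (logit), handy for initializing t from a target lambda
        lam = np.clip(lam, eps, 1.0 - eps)
        return np.log(lam / (1.0 - lam))

    t = from_unit_interval(0.3)
    print(to_unit_interval(t))   # 0.3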
0
votes
0
answers
70
views
Shattering of a set of binary classifiers
Let $S$ be a set, and let $\mathcal{F}_{S}=\{f:S\to\{-1,+1\}\}$ be a set of different label assignments. Show that $\mathcal{F}_{S}$ shatters at least $|\mathcal{F}_{S}|$ subsets of $S$.
Here is what ...
1
vote
0
answers
72
views
Converting an indexed equation to a matrix one
I am helping a friend with a project involving neural networks and he wants to convert this equation into matrix notation:
$$w_{ij} = \sum_{n=1}^N\left[\sum_{i=1}^I(r_{in}-y_{in})v_{ih}\right](1-z_{hn}...
3
votes
0
answers
194
views
What is the VC-dimension of regular convex k-gons in the plane?
Recall the relevant definitions:
Let $H$ be a family of sets in $\mathbb{R}^d$. The intersection of $H$ with a point set $C$ is defined as $H\cap C = \{h\cap C\mid h\in H\}$. The VC-dimension of $H$ (...
2
votes
1
answer
149
views
Derive equation for regularized logistic regression with batch updates
I am trying to understand this paper by Chapelle and Li, "An Empirical Evaluation of Thompson Sampling" (2011). In particular, I am failing to derive the equations in Algorithm 3 (page 6). ...
4
votes
1
answer
609
views
The ODE modeling for gradient descent with decreasing step sizes
The gradient descent (GD) with constant stepsize $\alpha^{k}=\alpha$ takes the form
$$x^{k+1} = x^{k} -\alpha\nabla f(x^{k}).$$
Then, by constructing a continuous-time version of GD iterates ...
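For context, the constant-stepsize case gives the standard gradient-flow limit: writing $x(k\alpha) \approx x^{k}$ and letting $\alpha \to 0$,
$$
\frac{x^{k+1}-x^{k}}{\alpha} = -\nabla f(x^{k}) \quad\longrightarrow\quad \dot{x}(t) = -\nabla f(x(t)).
$$
With decreasing stepsizes $\alpha^{k}$, a sketch of the usual approach is to rescale time via $t_k := \sum_{j \le k} \alpha^{j}$, which recovers the same gradient flow in the rescaled clock.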
2
votes
1
answer
148
views
Representer theorem for a loss / functional of the form $L(h) := \sum_{i=1}^n (|h(x_i)-y_i|+t\|h\|)^2$
Let $K:X \times X \to \mathbb R$ be a (positive-definite) kernel and let $H$ be the induced reproducing kernel Hilbert space (RKHS). Fix $(x_1,y_1),\ldots,(x_n,y_n) \in X \times \mathbb R$. For $t \ge ...
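For context, the classical representer theorem covers functionals of the form $\sum_i \ell(h(x_i), y_i) + \Omega(\|h\|)$ with $\Omega$ nondecreasing, and yields minimizers of the finite form
$$
h^{\star} = \sum_{i=1}^{n} \alpha_i K(x_i, \cdot), \qquad \alpha \in \mathbb{R}^{n};
$$
whether the mixed terms $(|h(x_i)-y_i| + t\|h\|)^2$ still fit this template is exactly what the question asks.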
1
vote
0
answers
33
views
Correlating two matrices $A,B$ with stochastic dependency structure imposed by cross-validation
Consider a labelled data set
$$D = \{(x_1, y_1),...,(x_n, y_n)\} $$
on which we want to evaluate a machine learning algorithm using $k$-fold cross validation with $m$ different random seeds. This ...
2
votes
1
answer
83
views
How to fit a set of parametrized data to a parametrized distribution?
I have a time series $d_i(a)$ which depends on the parameter $a$. On the other hand, I have a sequence of normal distributions $\mathcal{N}(0,Q_i(a))$, where the variance $Q_i$ depends on time and ...
2
votes
0
answers
36
views
Stochastic gradient descent in 'stronger' settings
I am minimizing a function $F(x) = \mathbb E(f(x,\Xi))$, where $\Xi$ is some random variable, by a stochastic gradient descent that generates a random number $\xi$ from the distribution of $\Xi$ at each ...