Questions tagged [machine-learning]
The machine-learning tag has no usage guidance.
165
questions
2
votes
1
answer
40
views
Non-linear transforms of RKHS question
I was reading the paper Norm Inequalities in Nonlinear Transforms (referenced in this question) but ran into difficulties, so I was wondering if anyone could help?
I think I follow the paper until I ...
52
votes
10
answers
6k
views
A clear map of mathematical approaches to Artificial Intelligence
I have recently become interested in Machine Learning and AI as a student of theoretical physics and mathematics, and have gone through some of the recommended resources dealing with statistical ...
1
vote
0
answers
59
views
Approximation of a continuous function by a multilayer ReLU neural network
For a continuous/Hölder function $f$ defined on a compact set $K$, a fixed $L$, and widths $m_1,m_2,\dots,m_L$, can we find a multilayer ReLU fully connected network $g$ with depth $L$ whose $i$-th layer has width ...
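For concreteness, a minimal numpy sketch (illustrative random weights, my own example, not from the question) of the object being asked about: a depth-$L$ fully connected ReLU network whose $i$-th hidden layer has width $m_i$.

    import numpy as np

    def relu_net(x, weights, biases):
        # depth-L fully connected ReLU network; hidden layer i has width m_i
        h = x
        for W, b in zip(weights[:-1], biases[:-1]):
            h = np.maximum(W @ h + b, 0.0)   # ReLU activation
        return weights[-1] @ h + biases[-1]  # linear output layer

    rng = np.random.default_rng(0)
    d, widths = 3, [8, 8, 8]                 # input dim and m_1, m_2, m_3 (L = 3)
    dims = [d] + widths + [1]
    weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    biases = [rng.normal(size=dims[i + 1]) for i in range(len(dims) - 1)]
    print(relu_net(np.ones(d), weights, biases))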
1
vote
2
answers
191
views
Beating the $1/\sqrt n$ rate of uniform-convergence over a linear function class
Let $P$ be a probability distribution on $\mathbb R^d \times \mathbb R$, and let $(x_1,y_1), \ldots, (x_n,y_n)$ be an iid sample of size $n$ from $P$. Fix $\epsilon,t\gt 0$. For any unit-vector $w \in ...
1
vote
0
answers
100
views
Matrix valued word embeddings for natural language processing
In natural language processing, an area of machine learning, one would like to represent words as objects that can easily be understood and manipulated using machine learning. A word embedding is a ...
3
votes
1
answer
145
views
Why is the logistic regression model good? (and its relation with maximizing entropy)
Suppose we're trying to train a classifier $\pi$ for $k$ classes that takes as input a feature vector $x\in\mathbb{R}^n$ and outputs a probability vector $\pi(x)\in\mathbb{R}^k$ such that $\sum_{v=1}^...
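A minimal numpy sketch of the softmax classifier this setup leads to (illustrative random data and weights, my own example; the maximum-entropy connection is what the question probes):

    import numpy as np

    def softmax(Z):
        # row-wise softmax with the usual max-subtraction for stability
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def pi(W, b, X):
        # pi(x) = softmax(Wx + b): each row is a probability vector over k classes
        return softmax(X @ W + b)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))     # 5 feature vectors, n = 3
    W = rng.normal(size=(3, 4))     # k = 4 classes
    b = np.zeros(4)
    print(pi(W, b, X).sum(axis=1))  # each row sums to 1, as required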
9
votes
1
answer
283
views
Who introduced the term hyperparameter?
I am trying to find the earliest use of the term hyperparameter. Currently, it is used in machine learning but it must have had earlier uses in statistics or optimization theory. Even the multivolume ...
2
votes
0
answers
77
views
Equivalence of score function expressions in SDE-based generative modeling
I am studying the paper "Score-Based Generative Modeling through Stochastic Differential Equations" (arXiv:2011.13456) by Song et al. The authors use the following loss function (Equation 7 ...
8
votes
1
answer
480
views
Geometric formulation of the subject of machine learning
Question:
What is the geometric interpretation of the subject of machine learning and/or deep learning?
Being "forced" to have a closer look at the subject, I have the impression that it ...
1
vote
0
answers
95
views
Solutions to the problems in "Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning" [closed]
Where can I find the solutions to the problems in the book "Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning"?
3
votes
0
answers
38
views
Prove the convergence of the LASSO model under restricted eigenvalues
I am researching the properties of the Lasso model $\hat \beta:= \operatorname{argmin} \{\|Y-X\beta\|_2^2/n+\lambda\|\beta\|_1\}$, specifically its convergence when the data satisfies restricted ...
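For concreteness, a small proximal-gradient (ISTA) sketch for this exact objective on synthetic data of my own choosing (a standard solver sketch, not the restricted-eigenvalue convergence argument being asked about):

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, iters=500):
        n, p = X.shape
        beta = np.zeros(p)
        step = n / (2 * np.linalg.norm(X, 2) ** 2)    # 1/L for the smooth part
        for _ in range(iters):
            grad = 2 * X.T @ (X @ beta - y) / n       # gradient of ||y - X beta||^2 / n
            beta = soft_threshold(beta - step * grad, step * lam)
        return beta

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    beta_true = np.zeros(10); beta_true[:3] = [2.0, -1.0, 0.5]
    y = X @ beta_true + 0.1 * rng.normal(size=50)
    print(np.round(lasso_ista(X, y, lam=0.1), 2))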
8
votes
0
answers
119
views
Worst margin when halving a hypercube with a hyperplane
Consider the $n$-cube $C_n=\lbrace-1,1\rbrace^n$ and the problem of partitioning it into halves with hyperplanes through the origin that avoid all its points. We can parameterize the hyperplanes by ...
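A brute-force numpy sketch of the quantity in play (small $n$ only; random Gaussian normals, which avoid all vertices with probability 1, are my own illustrative choice):

    import itertools
    import numpy as np

    n = 6
    pts = np.array(list(itertools.product([-1, 1], repeat=n)))  # all of C_n

    def worst_margin(w):
        # minimum distance from a vertex of C_n to the hyperplane w.x = 0
        return np.abs(pts @ w).min() / np.linalg.norm(w)

    rng = np.random.default_rng(0)
    print(max(worst_margin(rng.normal(size=n)) for _ in range(1000)))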
1
vote
0
answers
49
views
Curve fitting with "rough" loss functions
Many real-valued classification and regression problems can be framed as minimization in the following way.
Setup:
Let $\Theta \subseteq \mathbb{R}^p$ be the parameter space that we are searching over.
For ...
2
votes
0
answers
423
views
Mathematics research relating to machine learning
Which branch(es) of math are most relevant to enhancing machine learning (mostly in terms of practical use as opposed to theoretical/possible use)? Specifically, I want to know about math research used ...
1
vote
1
answer
97
views
Adjoint sensitivity analysis for a cost functional under an ODE constraint
I am trying to recover the result given by equation 10 in the article here. I am unable to get rid of the integral; any help would be much appreciated. To keep the description as self-contained as ...
2
votes
0
answers
50
views
Convergence of minimiser of empirical risk to minimiser of population risk
Let $X_1, \dots, X_n \sim \mu$ be some random elements of a space $\mathcal{X}$. Let $H$ be a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$ with norm $\|\cdot\|_H$.
Let $\|f^*\|_{L_2(\mu)} < \infty$ ...
2
votes
0
answers
42
views
Can we get a family of classifiers $\{f_n\}_{n \in \mathbb{N}}$ such that $\lim_{n\to\infty} \left(\mathbb{E}_{(X_1, Y_1), \ldots, (X_n, Y_n) \sim \rho}[R(f_n)] - R(f_B)\right) = 0$?
For a given classifier $f: \mathbb{R}^d \mapsto\{0,1,2\}$, let
$$
R(f):=\mathbb{E}_{(X, Y) \sim \rho}\left[\mathbb{1}_{f(X) \neq Y}\right]
$$
and let $f_B$ be the Bayes classifier.
Can we get a family of ...
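For reference, the Bayes classifier $f_B$ mentioned here is the standard one,
$$
f_B(x) := \operatorname*{arg\,max}_{y \in \{0,1,2\}} \mathbb{P}(Y = y \mid X = x),
$$
which minimizes $R(f)$ over all measurable classifiers.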
3
votes
0
answers
48
views
How to prove the empirical risk converges to the expected risk as $n\to \infty$?
For example, for a classical binary classification:
$x \in \mathbb{R}^d$ and $y \in\{0,1\}$
let empirical risk be
$R_{\ell}^n(f):=\frac{1}{n} \sum_{i=1}^n \ell\left(f\left(X_i\right), Y_i\right)$
and ...
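For a single fixed classifier this is just the law of large numbers; a quick simulation with a toy distribution of my own choosing illustrates the convergence:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # toy data: X ~ N(0,1), Y = 1{X > 0} with 10% label noise
        X = rng.normal(size=n)
        Y = ((X > 0) ^ (rng.random(n) < 0.1)).astype(int)
        return X, Y

    f = lambda x: (x > 0.5).astype(int)   # a fixed classifier
    for n in [100, 10_000, 1_000_000]:
        X, Y = sample(n)
        print(n, np.mean(f(X) != Y))      # empirical risk settles to its limit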
2
votes
1
answer
81
views
VC-based risk bounds for classifiers on finite set
Let $X$ be a finite set and let $\emptyset\neq \mathcal{H}\subseteq \{ 0,1 \}^{X}$. Let $\{(X_n,L_n)\}_{n=1}^N$ be i.i.d. random variables on $X\times \{0,1\}$ with law $\mathbb{P}$. ...
4
votes
1
answer
264
views
Perceptron / logistic regression accuracy on the n-bit parity problem
$\DeclareMathOperator{\sgn}{sign}$The perceptron (similarly, logistic regression) of the form $y=\sgn(w^T \cdot x+b)$ is famously known for its inability to solve the XOR problem, meaning it can get ...
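A quick empirical check (a plain perceptron training loop of my own, not the poster's exact setup) that a linear model sits near chance accuracy on parity:

    import itertools
    import numpy as np

    n = 4
    X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    y = 2 * (X.sum(axis=1).astype(int) % 2) - 1   # parity labels in {-1, +1}

    w, b = np.zeros(n), 0.0
    for _ in range(1000):                          # perceptron updates
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w += yi * xi
                b += yi

    print("accuracy:", (np.sign(X @ w + b) == y).mean())  # near 0.5 on parity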
1
vote
0
answers
32
views
Convergent gradient-type scheme for solving smooth nonconvex constrained optimization problem
Let $x_1,\ldots,x_n \in \mathbb R^d$ and $y_1,\ldots,y_n \in \{\pm 1\}$, and $\epsilon, h \gt 0$. Define $\theta(t) := Q((t-\epsilon)/h)$, where $Q(z) := \int_{z}^\infty \phi(u)\,\mathrm{d}u$ is the ...
3
votes
0
answers
126
views
What is the meaning of big-O of a random variable?
I encountered this problem in a book "Pattern Recognition and Machine Learning" by Christopher M. Bishop. I excerpt it below:
[screenshot of the book omitted]
In the excerpt, the big-O notation $O(\xi^...
2
votes
0
answers
70
views
Training an energy-based model (EBM) using MCMC
I'm reading this paper about training energy-based models (EBMs) and don't understand which parameters we are training. The part relevant to the question is on pages 1-4. Here is the ...
1
vote
0
answers
145
views
How to maximize a certain function of hundreds of variables related to correlations between sets of vectors? (and win Kaggle :))
It might be helpful for data science/bioinformatics challenges.
Consider for simplicity three rectangular matrices $Y_{true}$, $Y_{predict0}$, $Y_{predict1}$ of the same size, say $70000 \times 140$.
Let us ...
2
votes
0
answers
80
views
Nuclear norm minimization of a convolution matrix (circulant matrix) with the fast Fourier transform
I am reading a paper, Recovery of Future Data via Convolution Nuclear Norm Minimization, where the convolution matrix is defined as follows.
Given any vector $\boldsymbol{x}=(x_1,x_2,\ldots,x_n)^...
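A small numpy check of the FFT fact the paper exploits (my own illustration): the singular values of the convolution (circulant) matrix of $\boldsymbol{x}$ are $|\widehat{x}_k|$, so its nuclear norm equals $\sum_k |\widehat{x}_k|$.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=8)

    # circulant convolution matrix: column j is x cyclically shifted by j
    C = np.column_stack([np.roll(x, j) for j in range(len(x))])

    svals = np.linalg.svd(C, compute_uv=False)
    fvals = np.abs(np.fft.fft(x))
    print(np.allclose(np.sort(svals), np.sort(fvals)))   # True
    print("nuclear norm:", svals.sum(), "=", fvals.sum())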
1
vote
0
answers
75
views
Distribution-free learning vs distribution-dependent learning
I came across some papers studying the problem of distribution-free learning, and I am interested in knowing the exact definition of distribution-free learning.
I have searched some literature:
In ...
4
votes
0
answers
116
views
Progress on "Un-Alching" ML?
So, a couple of years ago I watched both Ali Rahimi's NIPS speech "Machine Learning is Alchemy",
(where he talks about how the field lacks a solid, overarching, theoretical foundation) and ...
2
votes
0
answers
41
views
Combining SVD subspaces for low dimensional representations
Suppose we have matrix $A$ of size $N_t \times N_m$, containing $N_m$ measurements corrupted by some (e.g. Gaussian) noise. An SVD of this data $A = U_AS_A{V_A}^T$ can reveal the singular vectors $U_A$...
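A minimal numpy sketch of the single-matrix case (synthetic low-rank data plus noise, my own illustrative setup), before any question of combining subspaces arises:

    import numpy as np

    rng = np.random.default_rng(0)
    Nt, Nm, r = 200, 50, 3
    A = rng.normal(size=(Nt, r)) @ rng.normal(size=(r, Nm)) \
        + 0.05 * rng.normal(size=(Nt, Nm))    # rank-r signal + Gaussian noise

    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = (U[:, :r] * S[:r]) @ Vt[:r]         # rank-r reconstruction
    print(S[:6].round(2))                     # sharp drop after the r-th value
    print(np.linalg.norm(A - A_r) / np.linalg.norm(A))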
1
vote
0
answers
102
views
Can I minimize a mysterious function by running gradient descent on its neural net approximations? [closed]
A cross post from AI StackExchange.
So I have this function, call it $F:[0,1]^n \rightarrow \mathbb{R}$, and say $10 \le n \le 100$. I want to find some $x_0 \in [0,1]^n$ such that $F(x_0)$ is ...
1
vote
0
answers
52
views
How to calculate the uniform entropy or VC dimension of the following class of functions?
When dealing with U-processes I encounter the following uniform entropy calculation.
For any $\eta>0$, function class $\mathcal{F}$ containing functions $f=\left(f_{i, j}\right)_{1 \leq i \neq j \leq n}: \...
3
votes
1
answer
229
views
Independent input feature z can be removed: if y=f(x+z,z), then y=g(x)?
Let $y\in \mathbb{R}$ be a random variable and $\mathbf{x},\mathbf{z}\in\mathbb{R}^p$ be random vectors. Assume $y=f(\mathbf{x}+\mathbf{z},\mathbf{z})$ for some function $f$.
Is the following statement ...
1
vote
0
answers
46
views
Sample Complexity/PAC-Learning Notation
In PAC Learning, Sample Complexity is defined as:
The function $m_\mathcal{H} : (0,1)^2 \rightarrow \mathbb{N}$ determines the sample complexity of learning $\mathcal{H}$:
that is, how many examples ...
1
vote
0
answers
139
views
Stochastic Gradient Descent
I am not really sure how to approach this question, as I am a beginner in optimisation.
Consider the function
$f : B_1 \to \mathbb{R}$ with $f(x) = \lVert x \rVert_2^2$ and $B_1 := \{$...
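Not the assignment's exact scheme, but a minimal projected stochastic gradient sketch for $f(x) = \lVert x \rVert_2^2$ on the unit ball (the noise level and $1/k$ steps are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))          # start inside B_1

    for k in range(1, 2001):
        g = 2 * x + 0.1 * rng.normal(size=d)  # noisy gradient of ||x||_2^2
        x -= g / k                            # decreasing step size 1/k
        nrm = np.linalg.norm(x)
        if nrm > 1.0:                         # project back onto B_1
            x /= nrm

    print(np.linalg.norm(x))                  # approaches 0, the minimizer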
5
votes
2
answers
274
views
Entropy & difference between max and min values of probability mass
Let $X$ be a random variable with probability mass function $p(x) = \mathbb{P}[X = x]$.
I know entropy $H(X)$ of $X$ measures the uncertainty of $X$ and
a large value of $H(X)$ means $p(x)$ is nearly ...
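A tiny numpy illustration of the two quantities side by side (my own toy pmfs): as the mass concentrates, the entropy falls and the max-min gap grows.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                          # 0 log 0 = 0 convention
        return -(p * np.log2(p)).sum()

    for p in ([0.25, 0.25, 0.25, 0.25],       # uniform: max entropy, zero gap
              [0.70, 0.10, 0.10, 0.10],
              [0.97, 0.01, 0.01, 0.01]):
        print(entropy(p), max(p) - min(p))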
1
vote
1
answer
177
views
Using Hoeffding inequality for risk / loss function
I've got a question about the Hoeffding inequality, which states that for data points $X_1, \dots, X_n \in X$ that are i.i.d. according to a probability measure $P$ on $X$, we find an upper bound for:
$...
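For a single fixed hypothesis $f$ and a loss $\ell$ taking values in $[0,1]$, the bound being invoked is the standard two-sided Hoeffding inequality (generic statement, not necessarily the poster's exact constants):
$$
\mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^n \ell(f(X_i)) - \mathbb{E}_P[\ell(f(X))] \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}.
$$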
20
votes
3
answers
3k
views
How can Machine Learning help “see” in higher dimensions?
The news that DeepMind had helped mathematicians in research (one in representation theory, and one in knot theory) certainly got many thinking, what other projects could AI help us with? See MO ...
2
votes
0
answers
196
views
Covering/Bracketing number of monotone functions on $\mathbb{R}$ with uniformly bounded derivatives
I am interested in the $\| \cdot \|_{\infty}$-norm bracketing number or covering number of some collection of distribution functions on $\mathbb{R}$.
Let $\mathcal{F}$ consist of all distribution ...
1
vote
0
answers
95
views
Limit cycles or stable solutions for k-dimensional piece-wise linear ODEs
As a branch of reinforcement learning, restless multi-armed bandits have been shown to be PSPACE-hard, but Whittle has offered an implementable solution called the Whittle Index Policy. Weber and Weiss ...
1
vote
0
answers
86
views
If two functions are close, can I prove that the difference of their empirical losses is also small?
I am trying to understand the proof of Theorem 3 in the paper "A Universal Law of Robustness via Isoperimetry" by Bubeck and Sellke.
Basically, there exists at least one $w_{L,e}$ in $\...
2
votes
0
answers
43
views
Convergent algorithm for minimizing nonconvex smooth function
Let $\Phi$ be the Gaussian CDF and for $\gamma\ge 0$ and $h>0$, define a loss function $\ell_{\gamma,h}:\{\pm 1\} \times \mathbb R \to \mathbb R$ by
$$
\ell_{\gamma,h}(y,y') := \phi_{\gamma,h}(yy') := \Phi((yy'-\gamma)/h)...
0
votes
0
answers
32
views
Normalizing a parameter in a regression
I am thinking about the possibility of constraining a parameter in my regression, let's say the $\lambda$ in a ridge regression, to lie in a range: $\lambda \in [0,1]$. Do you have any ideas how I ...
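One common trick (a suggestion of mine, not taken from the question): optimize an unconstrained $t \in \mathbb{R}$ and map it through a sigmoid, so the effective $\lambda$ always lands in $(0,1)$.

    import numpy as np

    def to_unit_interval(t):
        # sigmoid: any real t maps to lambda in (0, 1)
        return 1.0 / (1.0 + np.exp(-t))

    def from_unit_interval(lam, eps=1e-12):
        # inverse (logit), handy for initializing t from a target lambda
        lam = np.clip(lam, eps, 1.0 - eps)
        return np.log(lam / (1.0 - lam))

    t = from_unit_interval(0.3)
    print(to_unit_interval(t))   # 0.3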
0
votes
0
answers
70
views
Shattering of a set of binary classifiers
Let $S$ be a set, and let $\mathcal{F}_{S}=\{f:S\to\{-1,+1\}\}$ be a set of different label assignments. Show that $\mathcal{F}_{S}$ shatters at least $|\mathcal{F}_{S}|$ subsets of $S$.
Here is what ...
1
vote
0
answers
72
views
Converting an indexed equation to a matrix one
I am helping a friend with a project involving neural networks and he wants to convert this equation into matrix notation:
$$w_{ij} = \sum_{n=1}^N\left[\sum_{i=1}^I(r_{in}-y_{in})v_{ih}\right](1-z_{hn}...
3
votes
0
answers
194
views
What is the VC-dimension of regular convex k-gons in the plane?
Recall the relevant definitions:
Let $H$ be a family of sets in $\mathbb{R}^d$. The intersection of $H$ with a point set $C$ is defined as $H\cap C = \{h\cap C\mid h\in H\}$. The VC-dimension of $H$ (...
2
votes
1
answer
149
views
Derive equation for regularized logistic regression with batch updates
I am trying to understand this paper by Chapelle and Li, "An Empirical Evaluation of Thompson Sampling" (2011). In particular, I am failing to derive the equations in Algorithm 3 (page 6). ...
4
votes
1
answer
609
views
The ODE modeling for gradient descent with decreasing step sizes
The gradient descent (GD) with constant stepsize $\alpha^{k}=\alpha$ takes the form
$$x^{k+1} = x^{k} -\alpha\nabla f(x^{k}).$$
Then, by constructing a continuous-time version of GD iterates ...
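For context, the constant-stepsize case gives the standard gradient-flow limit: writing $x(k\alpha) \approx x^{k}$ and letting $\alpha \to 0$,
$$
\frac{x^{k+1}-x^{k}}{\alpha} = -\nabla f(x^{k}) \quad\longrightarrow\quad \dot{x}(t) = -\nabla f(x(t)).
$$
With decreasing stepsizes $\alpha^{k}$, a sketch of the usual approach is to rescale time via $t_k := \sum_{j \le k} \alpha^{j}$, which recovers the same gradient flow in the rescaled clock.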
2
votes
1
answer
148
views
Representer theorem for a loss / functional of the form $L(h) := \sum_{i=1}^n (|h(x_i)-y_i|+t\|h\|)^2$
Let $K:X \times X \to \mathbb R$ be a (positive-definite) kernel and let $H$ be the induced reproducing kernel Hilbert space (RKHS). Fix $(x_1,y_1),\ldots,(x_n,y_n) \in X \times \mathbb R$. For $t \ge ...
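For context, the classical representer theorem covers functionals of the form $\sum_i \ell(h(x_i), y_i) + \Omega(\|h\|)$ with $\Omega$ nondecreasing, and yields minimizers of the finite form
$$
h^{\star} = \sum_{i=1}^{n} \alpha_i K(x_i, \cdot), \qquad \alpha \in \mathbb{R}^{n};
$$
whether the mixed terms $(|h(x_i)-y_i| + t\|h\|)^2$ still fit this template is exactly what the question asks.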
1
vote
0
answers
33
views
Correlating two matrices $A,B$ with stochastic dependency structure imposed by cross-validation
Consider a labelled data set
$$D = \{(x_1, y_1),...,(x_n, y_n)\} $$
on which we want to evaluate a machine learning algorithm using $k$-fold cross validation with $m$ different random seeds. This ...
2
votes
1
answer
83
views
How to fit a set of parametrized data to a parametrized distribution?
I have a time series $d_i(a)$ which depends on the parameter $a$. On the other hand, I have a sequence of normal distributions $\mathcal{N}(0,Q_i(a))$, where the variance $Q_i$ depends on time and ...
2
votes
0
answers
36
views
Stochastic gradient descent in 'stronger' settings
I am minimizing a function $F(x) = \mathbb E(f(x,\Xi))$, where $\Xi$ is some random variable, by a stochastic gradient descent that generates a random number $\xi$ from the distribution of $\Xi$ at each ...