new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Dec 30

Enhanced Mortality Prediction In Patients With Subarachnoid Haemorrhage Using A Deep Learning Model Based On The Initial CT Scan

PURPOSE: Subarachnoid hemorrhage (SAH) entails high morbidity and mortality rates. Convolutional neural networks (CNN), a form of deep learning, are capable of generating highly accurate predictions from imaging data. Our objective was to predict mortality in SAH patients by processing the initial CT scan on a CNN based algorithm. METHODS: Retrospective multicentric study of a consecutive cohort of patients with SAH between 2011-2022. Demographic, clinical and radiological variables were analyzed. Pre-processed baseline CT scan images were used as the input for training a CNN using AUCMEDI Framework. Our model's architecture leverages the DenseNet-121 structure, employing transfer learning principles. The output variable was mortality in the first three months. Performance of the model was evaluated by statistical parameters conventionally used in studies involving artificial intelligence methods. RESULTS: Images from 219 patients were processed, 175 for training and validation of the CNN and 44 for its evaluation. 52%(115/219) of patients were female, and the median age was 58(SD=13.06) years. 18.5%(39/219) were idiopathic SAH. Mortality rate was 28.5%(63/219). The model showed good accuracy at predicting mortality in SAH patients exclusively using the images of the initial CT scan (Accuracy=74%, F1=75% and AUC=82%). CONCLUSION: Modern image processing techniques based on AI and CNN make possible to predict mortality in SAH patients with high accuracy using CT scan images as the only input. These models might be optimized by including more data and patients resulting in better training, development and performance on tasks which are beyond the skills of conventional clinical knowledge.

  • 9 authors
·
Aug 25, 2023

Phemenological Modeling of Eclipsing Binary Stars

We review the method NAV (New Algol Variable) first introduced in 2012Ap.....55..536A, which uses the locally-dependent shapes of eclipses in an addition to the trigonometric polynomial of the second order (which typically describes the "out-of-eclipse" part of the light curve with effects of reflection, ellipticity and O'Connell). Eclipsing binary stars are believed to show distinct eclipses only if belonging to the EA type. With a decreasing eclipse width, the statistically optimal value of the trigonometric polynomial s (2003ASPC..292..391A) drastically increases from ~2 for elliptic (EL) variables without eclipses, ~6-8 for EW and up to ~30-50 for some EA with narrow eclipses. In this case of large number of parameters, the smoothing curve becomes very noisy and apparent waves (the Gibbs phenomenon) may be seen. The NAV set of the parameters may be used for classification in the GCVS, VSX and similar catalogs. The maximal number of parameters is m=12, which corresponds to s=5, if correcting both the period and the initial epoch. We have applied the method to few stars, also in a case of multi-color photometry (2015JASS...32..127A), when it is possible to use the phenomenological parameters from the NAV fit to estimate physical parameters using statistical dependencies. We conclude that the NAV approximation is better than the TP one even for the case of EW-type stars with much wider eclipses. It may also be used to determine timings (see 2005ASPC..335...37A for a review of methods) or to determine parameters in the case of variable period, using a complete light curve modeling the phase variations. The method is illustrated on 2MASS J11080447-6143290 (EA-type), USNO-B1.0 1265-0306001 and USNO-B1.0 1266-0313413 (EW-type) and compared to various other methods from the literature.

  • 3 authors
·
Feb 12, 2016

Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

Top-K sparse softmax gating mixture of experts has been widely used for scaling up massive deep-learning architectures without increasing the computational cost. Despite its popularity in real-world applications, the theoretical understanding of that gating function has remained an open problem. The main challenge comes from the structure of the top-K sparse softmax gating function, which partitions the input space into multiple regions with distinct behaviors. By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. When the true number of experts k_{ast} is known, we demonstrate that the convergence rates of density and parameter estimations are both parametric on the sample size. However, when k_{ast} becomes unknown and the true model is over-specified by a Gaussian mixture of k experts where k > k_{ast}, our findings suggest that the number of experts selected from the top-K sparse softmax gating function must exceed the total cardinality of a certain number of Voronoi cells associated with the true parameters to guarantee the convergence of the density estimation. Moreover, while the density estimation rate remains parametric under this setting, the parameter estimation rates become substantially slow due to an intrinsic interaction between the softmax gating and expert functions.

  • 4 authors
·
Sep 24, 2023

Statistical mechanics of continual learning: variational principle and mean-field potential

An obstacle to artificial general intelligence is set by continual learning of multiple tasks of different nature. Recently, various heuristic tricks, both from machine learning and from neuroscience angles, were proposed, but they lack a unified theory ground. Here, we focus on continual learning in single-layered and multi-layered neural networks of binary weights. A variational Bayesian learning setting is thus proposed, where the neural networks are trained in a field-space, rather than gradient-ill-defined discrete-weight space, and furthermore, weight uncertainty is naturally incorporated, and modulates synaptic resources among tasks. From a physics perspective, we translate the variational continual learning into Franz-Parisi thermodynamic potential framework, where previous task knowledge acts as a prior and a reference as well. We thus interpret the continual learning of the binary perceptron in a teacher-student setting as a Franz-Parisi potential computation. The learning performance can then be analytically studied with mean-field order parameters, whose predictions coincide with numerical experiments using stochastic gradient descent methods. Based on the variational principle and Gaussian field approximation of internal preactivations in hidden layers, we also derive the learning algorithm considering weight uncertainty, which solves the continual learning with binary weights using multi-layered neural networks, and performs better than the currently available metaplasticity algorithm. Our proposed principled frameworks also connect to elastic weight consolidation, weight-uncertainty modulated learning, and neuroscience inspired metaplasticity, providing a theory-grounded method for the real-world multi-task learning with deep networks.

  • 4 authors
·
Dec 6, 2022

Exploring HOD-dependent systematics for the DESI 2024 Full-Shape galaxy clustering analysis

We analyse the robustness of the DESI 2024 cosmological inference from fits to the full shape of the galaxy power spectrum to uncertainties in the Halo Occupation Distribution (HOD) model of the galaxy-halo connection and the choice of priors on nuisance parameters. We assess variations in the recovered cosmological parameters across a range of mocks populated with different HOD models and find that shifts are often greater than 20% of the expected statistical uncertainties from the DESI data. We encapsulate the effect of such shifts in terms of a systematic covariance term, C_{rm HOD}, and an additional diagonal contribution quantifying the impact of our choice of nuisance parameter priors on the ability of the effective field theory (EFT) model to correctly recover the cosmological parameters of the simulations. These two covariance contributions are designed to be added to the usual covariance term, C_{rm stat}, describing the statistical uncertainty in the power spectrum measurement, in order to fairly represent these sources of systematic uncertainty. This approach is more general and robust to choices of model free parameters or additional external datasets used in cosmological fits than the alternative approach of adding systematic uncertainties at the level of the recovered marginalised parameter posteriors. We compare the approaches within the context of a fixed LambdaCDM model and demonstrate that our method gives conservative estimates of the systematic uncertainty that nevertheless have little impact on the final posteriors obtained from DESI data.

  • 42 authors
·
Nov 18, 2024

Dense Hebbian neural networks: a replica symmetric picture of supervised learning

We consider dense, associative neural-networks trained by a teacher (i.e., with supervision) and we investigate their computational capabilities analytically, via statistical-mechanics of spin glasses, and numerically, via Monte Carlo simulations. In particular, we obtain a phase diagram summarizing their performance as a function of the control parameters such as quality and quantity of the training dataset, network storage and noise, that is valid in the limit of large network size and structureless datasets: these networks may work in a ultra-storage regime (where they can handle a huge amount of patterns, if compared with shallow neural networks) or in a ultra-detection regime (where they can perform pattern recognition at prohibitive signal-to-noise ratios, if compared with shallow neural networks). Guided by the random theory as a reference framework, we also test numerically learning, storing and retrieval capabilities shown by these networks on structured datasets as MNist and Fashion MNist. As technical remarks, from the analytic side, we implement large deviations and stability analysis within Guerra's interpolation to tackle the not-Gaussian distributions involved in the post-synaptic potentials while, from the computational counterpart, we insert Plefka approximation in the Monte Carlo scheme, to speed up the evaluation of the synaptic tensors, overall obtaining a novel and broad approach to investigate supervised learning in neural networks, beyond the shallow limit, in general.

  • 8 authors
·
Nov 25, 2022

Quasi-periodic pulsations in extreme-ultraviolet brightenings

Context. Extreme-ultraviolet (EUV) observations have revealed small-scale transient brightenings that may share common physical mechanisms with larger-scale solar flares. A notable feature of solar and stellar flares is the presence of quasi-periodic pulsations (QPPs), which are considered a common and potentially intrinsic characteristic. Aims. We investigate the properties of QPPs detected in EUV brightenings, which are considered small-scale flares, and compare their statistical properties with those observed in solar and stellar flares. Methods. We extracted integrated light curves of 22,623 EUV brightenings in two quiet Sun regions observed by the Solar Orbiter/Extreme Ultraviolet Imager and identified QPPs in their light curves using Fourier analysis. Results. Approximately 2.7 % of the EUV brightenings exhibited stationary QPPs. The QPP occurrence rate increased with the surface area, lifetime, and peak brightness of the EUV brightenings. The detected QPP periods ranged from approximately 15 to 260 seconds, which is comparable to the periods observed in solar and stellar flares. Consistent with observations of QPPs in solar and stellar flares, no correlation was found between the QPP period and peak brightness. However, unlike the trend observed in solar flares, no correlation was found between the QPP period and lifetime/length scale. Conclusions. The presence of QPPs in EUV brightenings supports the interpretation that these events may be small-scale manifestations of flares, and the absence of period scaling with loop length further suggests that standing waves may not be the primary driver of QPPs in these events.

  • 8 authors
·
Apr 21

It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs

Despite rapid advancements in the research and deployment of large language models (LLMs), the statistical distribution of model parameters, as well as their influence on initialization, training dynamics, and downstream efficiency, has received surprisingly little attention. A recent work introduced BackSlash, a training-time compression algorithm. It first demonstrated that pre-trained LLM parameters follow generalized Gaussian distributions (GGDs) better. By optimizing GG priors during training, BackSlash can reduce parameters by up to 90\% with minimal performance loss. Building on this foundational insight, we propose a unified, end-to-end framework for LLM optimization based on the GG model. Our contributions are threefold: (1) GG-based initialization scheme that aligns with the statistical structure of trained models, resulting in faster convergence and improved accuracy; (2) DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, improving compressibility with minimized degradation in performance; and (3) RF8, a compact and hardware-efficient 8-bit floating-point format designed for GG-distributed-initialized BackSlash training, enabling low-cost inference without compromising accuracy. Experiments across diverse model architectures show that our framework consistently yields smaller and faster models that match or outperform standard training baselines. By grounding LLM development in principled statistical modeling, this work forges a new path toward efficient, scalable, and hardware-aware AI systems. The code is available on our project page: https://huggingface.co/spaces/shifeng3711/gg_prior.

  • 4 authors
·
May 31

On the Generalization vs Fidelity Paradox in Knowledge Distillation

Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to 10%, with a peak task specific gain of 22%, while providing only marginal benefits (sim 1.3%) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students' performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.

  • 3 authors
·
May 21

Small but Mighty: Enhancing Time Series Forecasting with Lightweight LLMs

While LLMs have demonstrated remarkable potential in time series forecasting, their practical deployment remains constrained by excessive computational demands and memory footprints. Existing LLM-based approaches typically suffer from three critical limitations: Inefficient parameter utilization in handling numerical time series patterns; Modality misalignment between continuous temporal signals and discrete text embeddings; and Inflexibility for real-time expert knowledge integration. We present SMETimes, the first systematic investigation of sub-3B parameter SLMs for efficient and accurate time series forecasting. Our approach centers on three key innovations: A statistically-enhanced prompting mechanism that bridges numerical time series with textual semantics through descriptive statistical features; A adaptive fusion embedding architecture that aligns temporal patterns with language model token spaces through learnable parameters; And a dynamic mixture-of-experts framework enabled by SLMs' computational efficiency, adaptively combining base predictions with domain-specific models. Extensive evaluations across seven benchmark datasets demonstrate that our 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while maintaining 3.8x faster training and 5.2x lower memory consumption compared to 7B-parameter LLM baselines. Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLM. Ablation studies validate that our statistical prompting and cross-modal fusion modules respectively contribute 15.7% and 18.2% error reduction in long-horizon forecasting tasks. By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting. Code and models are available at https://github.com/xiyan1234567/SMETimes.

  • 4 authors
·
Mar 5

A Flexible Parametric Modelling Framework for Survival Analysis

We introduce a general, flexible, parametric survival modelling framework which encompasses key shapes of hazard function (constant, increasing, decreasing, up-then-down, down-then-up), various common survival distributions (log-logistic, Burr type XII, Weibull, Gompertz), and includes defective distributions (i.e., cure models). This generality is achieved using four basic distributional parameters: two scale-type parameters and two shape parameters. Generalising to covariate dependence, the scale-type regression components correspond to accelerated failure time (AFT) and proportional hazards (PH) models. Therefore, this general formulation unifies the most popular survival models which allows us to consider the practical value of possible modelling choices for survival data. Furthermore, in line with our proposed flexible baseline distribution, we advocate the use of multi-parameter regression in which more than one distributional parameter depends on covariates - rather than the usual convention of having a single covariate-dependent (scale) parameter. While many choices are available, we suggest introducing covariates through just one or other of the two scale parameters, which covers AFT and PH models, in combination with a `power' shape parameter, which allows for more complex non-AFT/non-PH effects, while the other shape parameter remains covariate-independent, and handles automatic selection of the baseline distribution. We explore inferential issues in simulations, both with and without a covariate, with particular focus on evidence concerning the need, or otherwise, to include both AFT and PH parameters. We illustrate the efficacy of our modelling framework by investigating differences between treatment groups using data from a lung cancer study and a melanoma study. Censoring is accommodated throughout.

  • 3 authors
·
Jan 10, 2019

Phemenological Modelling of a Group of Eclipsing Binary Stars

Phenomenological modeling of variable stars allows determination of a set of the parameters, which are needed for classification in the "General Catalogue of Variable Stars" and similar catalogs. We apply a recent method NAV ("New Algol Variable") to eclipsing binary stars of different types. Although all periodic functions may be represented as Fourier series with an infinite number of coefficients, this is impossible for a finite number of the observations. Thus one may use a restricted Fourier series, i.e. a trigonometric polynomial (TP) of order s either for fitting the light curve, or to make a periodogram analysis. However, the number of parameters needed drastically increases with decreasing width of minimum. In the NAV algorithm, the special shape of minimum is used, so the number of parameters is limited to 10 (if the period and initial epoch are fixed) or 12 (not fixed). We illustrate the NAV method by application to a recently discovered Algol-type eclipsing variable 2MASS J11080308-6145589 (in the field of previously known variable star RS Car) and compare results to that obtained using the TP fits. For this system, the statistically optimal number of parameters is 44, but the fit is still worse than that of the NAV fit. Application to the system GSC 3692-00624 argues that the NAV fit is better than the TP one even for the case of EW-type stars with much wider eclipses. Model parameters are listed.

  • 3 authors
·
Sep 17, 2015

Machine learning-driven Anomaly Detection and Forecasting for Euclid Space Telescope Operations

State-of-the-art space science missions increasingly rely on automation due to spacecraft complexity and the costs of human oversight. The high volume of data, including scientific and telemetry data, makes manual inspection challenging. Machine learning offers significant potential to meet these demands. The Euclid space telescope, in its survey phase since February 2024, exemplifies this shift. Euclid's success depends on accurate monitoring and interpretation of housekeeping telemetry and science-derived data. Thousands of telemetry parameters, monitored as time series, may or may not impact the quality of scientific data. These parameters have complex interdependencies, often due to physical relationships (e.g., proximity of temperature sensors). Optimising science operations requires careful anomaly detection and identification of hidden parameter states. Moreover, understanding the interactions between known anomalies and physical quantities is crucial yet complex, as related parameters may display anomalies with varied timing and intensity. We address these challenges by analysing temperature anomalies in Euclid's telemetry from February to August 2024, focusing on eleven temperature parameters and 35 covariates. We use a predictive XGBoost model to forecast temperatures based on historical values, detecting anomalies as deviations from predictions. A second XGBoost model predicts anomalies from covariates, capturing their relationships to temperature anomalies. We identify the top three anomalies per parameter and analyse their interactions with covariates using SHAP (Shapley Additive Explanations), enabling rapid, automated analysis of complex parameter relationships. Our method demonstrates how machine learning can enhance telemetry monitoring, offering scalable solutions for other missions with similar data challenges.

  • 6 authors
·
Nov 8, 2024

Preserving Statistical Validity in Adaptive Data Analysis

A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of m adaptively chosen functions on an unknown distribution given n random samples. We show that, surprisingly, there is a way to estimate an exponential in n number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.

  • 6 authors
·
Nov 10, 2014

Search for dark matter subhalos among unassociated Fermi-LAT sources in presence of dataset shift

We search for dark matter (DM) annihilating subhalos of the Milky Way halo among the Fermi Large Area Telescope (LAT) unassociated sources. We construct, for the first time, a statistical model of the unassociated sources at latitudes above 10 degrees. The latter is built as a combination of both DM annihilation subhalos as well as Galactic and extragalactic astrophysical components. The astrophysical components are constructed based on distributions of associated sources, while the distribution of DM subhalos is derived from Monte Carlo simulations. In this model we take into account the differences in the distributions of associated and unassociated sources including both covariate and prior probability shifts (both being forms of ``dataset shifts''). Previous searches of DM subhalos were based on classify-and-count strategies, while the approach adopted in this work is based on quantification learning, which allows one to determine a well-defined statistical interpretation of the contribution of a population of DM subhalos to the unassociated Fermi-LAT sources. In the bb annihilation channel and for a range of DM masses from 10 GeV to 1 TeV, we don't find a significant contribution from DM subhalos and derive a statistical 95% confidence upper limit on the DM annihilation cross section in this channel. While the derived limits are consistent with previous classify-and-count approaches, our generative statistical model opens new avenues for population studies of Fermi-LAT sources and, more generally, for searches of anomalies on top of backgrounds in presence of statistical and systematic uncertainties.

  • 5 authors
·
Mar 18

One-connection rule for structural equation models

Linear structural equation models are multivariate statistical models encoded by mixed graphs. In particular, the set of covariance matrices for distributions belonging to a linear structural equation model for a fixed mixed graph G=(V, D,B) is parameterized by a rational function with parameters for each vertex and edge in G. This rational parametrization naturally allows for the study of these models from an algebraic and combinatorial point of view. Indeed, this point of view has led to a collection of results in the literature, mainly focusing on questions related to identifiability and determining relationships between covariances (i.e., finding polynomials in the Gaussian vanishing ideal). So far, a large proportion of these results has focused on the case when D, the directed part of the mixed graph G, is acyclic. This is due to the fact that in the acyclic case, the parametrization becomes polynomial and there is a description of the entries of the covariance matrices in terms of a finite sum. We move beyond the acyclic case and give a closed form expression for the entries of the covariance matrices in terms of the one-connections in a graph obtained from D through some small operations. This closed form expression then allows us to show that if G is simple, then the parametrization map is generically finite-to-one. Finally, having a closed form expression for the covariance matrices allows for the development of an algorithm for systematically exploring possible polynomials in the Gaussian vanishing ideal.

  • 4 authors
·
Oct 1, 2022

Kernel Density Estimators in Large Dimensions

This paper studies Kernel density estimation for a high-dimensional distribution rho(x). Traditional approaches have focused on the limit of large number of data points n and fixed dimension d. We analyze instead the regime where both the number n of data points y_i and their dimensionality d grow with a fixed ratio alpha=(log n)/d. Our study reveals three distinct statistical regimes for the kernel-based estimate of the density hat rho_h^{D}(x)=1{n h^d}sum_{i=1}^n Kleft(x-y_i{h}right), depending on the bandwidth h: a classical regime for large bandwidth where the Central Limit Theorem (CLT) holds, which is akin to the one found in traditional approaches. Below a certain value of the bandwidth, h_{CLT}(alpha), we find that the CLT breaks down. The statistics of hat rho_h^{D}(x) for a fixed x drawn from rho(x) is given by a heavy-tailed distribution (an alpha-stable distribution). In particular below a value h_G(alpha), we find that hat rho_h^{D}(x) is governed by extreme value statistics: only a few points in the database matter and give the dominant contribution to the density estimator. We provide a detailed analysis for high-dimensional multivariate Gaussian data. We show that the optimal bandwidth threshold based on Kullback-Leibler divergence lies in the new statistical regime identified in this paper. Our findings reveal limitations of classical approaches, show the relevance of these new statistical regimes, and offer new insights for Kernel density estimation in high-dimensional settings.

  • 2 authors
·
Aug 11, 2024

Improve Machine Learning carbon footprint using Nvidia GPU and Mixed Precision training for classification models -- Part I

This is the 1st part of the dissertation for my master degree and compares the power consumption using the default floating point (32bit) and Nvidia mixed precision (16bit and 32bit) while training a classification ML model. A custom PC with specific hardware was built to perform the experiments, and different ML hyper-parameters, such as batch size, neurons, and epochs, were chosen to build Deep Neural Networks (DNN). Additionally, various software was used during the experiments to collect the power consumption data in Watts from the Graphics Processing Unit (GPU), Central Processing Unit (CPU), Random Access Memory (RAM) and manually from a wattmeter connected to the wall. A benchmarking test with default hyper parameter values for the DNN was used as a reference, while the experiments used a combination of different settings. The results were recorded in Excel, and descriptive statistics were chosen to calculate the mean between the groups and compare them using graphs and tables. The outcome was positive when using mixed precision combined with specific hyper-parameters. Compared to the benchmarking, the optimisation for the classification reduced the power consumption between 7 and 11 Watts. Similarly, the carbon footprint is reduced because the calculation uses the same power consumption data. Still, a consideration is required when configuring hyper-parameters because it can negatively affect hardware performance. However, this research required inferential statistics, specifically ANOVA and T-test, to compare the relationship between the means. Furthermore, tests indicated no statistical significance of the relationship between the benchmarking and experiments. However, a more extensive implementation with a cluster of GPUs can increase the sample size significantly, as it is an essential factor and can change the outcome of the statistical analysis.

  • 1 authors
·
Sep 12, 2024

Measuring the Intrinsic Dimension of Objective Landscapes

Many recently trained neural networks employ large numbers of parameters to achieve good performance. One may intuitively use the number of parameters required as a rough gauge of the difficulty of a problem. But how accurate are such notions? How many parameters are really needed? In this paper we attempt to answer this question by training networks not in their native parameter space, but instead in a smaller, randomly oriented subspace. We slowly increase the dimension of this subspace, note at which dimension solutions first appear, and define this to be the intrinsic dimension of the objective landscape. The approach is simple to implement, computationally tractable, and produces several suggestive conclusions. Many problems have smaller intrinsic dimensions than one might suspect, and the intrinsic dimension for a given dataset varies little across a family of models with vastly different sizes. This latter result has the profound implication that once a parameter space is large enough to solve a problem, extra parameters serve directly to increase the dimensionality of the solution manifold. Intrinsic dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning where we conclude, for example, that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as classifying CIFAR-10. In addition to providing new cartography of the objective landscapes wandered by parameterized models, the method is a simple technique for constructively obtaining an upper bound on the minimum description length of a solution. A byproduct of this construction is a simple approach for compressing networks, in some cases by more than 100 times.

  • 4 authors
·
Apr 24, 2018

The implications of stochastic gas torques for asymmetric binaries in the LISA band

Gravitational waves from asymmetric mass-ratio black-hole binaries carry unique information about their astrophysical environment. For instance, the Laser Interferometer Space Antenna (LISA) could potentially measure the amplitude and slope of gas torques in binaries embedded in the accretion disks of Active Galactic Nuclei, helping differentiate competing accretion disk models. However, this relies on simplified analytic models, which do not account for the stochastic variability of torques seen in hydrodynamic simulations. In this work, we use hydrodynamic simulations to create gravitational waveforms for extreme and intermediate mass-ratio inspirals in the LISA band. We then analyze these simulated waveforms using simpler templates that assume analytic torques, without stochastic time variability. By performing realistic Bayesian parameter estimation, we find no bias at 90% confidence in the binary parameters; however, estimates of accretion disk parameters, such as torque amplitude and slope, may be biased. Typically, the posterior distribution is centered around the average value of the torques, but when stochastic variability is large, the posterior can indicate no torques, even though they are present in the simulation. Our results suggest that while simplified analytic torque models work well for estimating binary parameters, caution is needed when using them to infer properties of the accretion disk. This work moves towards a more realistic assessment of one of the LISA science objectives, i.e., probing the properties of the astrophysical environments of black holes.

  • 5 authors
·
Feb 14

KIC 4150611: A quadruply eclipsing heptuple star system with a g-mode period-spacing pattern Asteroseismic modelling of the g-mode period-spacing pattern

In this work, we aim to estimate the stellar parameters of the primary (Aa) by performing asteroseismic analysis on its period-spacing pattern. We use the C-3PO neural network to perform asteroseismic modelling of the g-mode period-spacing pattern of Aa, discussing the interplay of this information with external constraints from spectroscopy (T_{rm eff} and log(g)) and eclipse modelling (R). To estimate the level of uncertainty due to different frequency extraction and pattern identification processes, we consider four different variations on the period-spacing patterns. To better understand the correlations between and the uncertainty structure of our parameter estimates, we also employed a classical, parameter-based MCMC grid search on four different stellar grids. The best-fitting, externally constrained model to the period-spacing pattern arrives at estimates of the stellar properties for Aa of: M=1.51 pm 0.05 M_odot, X_c =0.43 pm 0.04, R=1.66 pm 0.1 R_odot, f_{rm ov}=0.010, Omega_c=1.58 pm 0.01 d^{-1} with rigid rotation to within the measurement errors, log(T_{rm eff})=3.856 pm 0.008 dex, log(g)=4.18 pm 0.04 dex, and log(L)=0.809 pm 0.005 dex, which agree well with previous measurements from eclipse modelling, spectroscopy, and the Gaia DR3 luminosity. We find that the near-core properties of the best-fitting asteroseismic models are consistent with external constraints from eclipse modelling and spectroscopy. Aa appears to be a typical example of a gamma Dor star, fitting well within existing populations. We find that Aa is quasi-rigidly rotating to within the uncertainties, and note that the asteroseismic age estimate for Aa (1100 pm 100 Myr) is considerably older than the young (35 Myr) age implied by previous isochrone fits to the B binary in the literature. Our MCMC parameter-based grid-search agrees well with our pattern-modelling approach.

  • 10 authors
·
Nov 27, 2024

Statistical selection of high-redshift, neutral-hydrogen-rich, lensed galaxies with the Square Kilometre Array

Deep wide spectral line surveys with the Square Kilometre Array (SKA) will expand the cosmic frontiers of neutral atomic hydrogen (HI) in galaxies. However, at cosmologically significant redshifts (z gtrsim 0.5), detections will typically be spatially unresolved and limited to the highest mass systems. Gravitational lensing could potentially alleviate these limitations, enabling lower mass systems to be studied at higher redshift and spatially resolved dynamical studies of some HI discs. Additionally, lensed HI systems would select foreground dark matter haloes using a different, more extended baryonic tracer compared to other lens surveys. This may result in a wider selected range of foreground dark matter halo properties, such as the concentration parameter. This paper uses the distortion of the observed HI mass function (HIMF) produced by strong gravitational lensing to find a flux density criterion for selecting lensed HI sources in future SKA-Mid spectral line surveys. This selection approach could yield lensed HI source densities in the range of sim 0.1--10 galaxies per square degree out to a redshift of z simeq 3 covered by SKA-MID Band 1. Although the sample sizes are modest, even with the proposed SKA-Mid surveys, the selection approach is straightforward and should have a 50% efficiency without any additional information, such as low-impact-factor or lower-redshift massive galaxies. The efficiency of selecting high-redshift, neutral-hydrogen-rich, lensed galaxies should then be greatly enhanced by using SKA-MID data in concert with the Vera C. Rubin Large Survey of Space and Time.

  • 2 authors
·
Feb 11

State and parameter learning with PaRIS particle Gibbs

Non-linear state-space models, also known as general hidden Markov models, are ubiquitous in statistical machine learning, being the most classical generative models for serial data and sequences in general. The particle-based, rapid incremental smoother PaRIS is a sequential Monte Carlo (SMC) technique allowing for efficient online approximation of expectations of additive functionals under the smoothing distribution in these models. Such expectations appear naturally in several learning contexts, such as likelihood estimation (MLE) and Markov score climbing (MSC). PARIS has linear computational complexity, limited memory requirements and comes with non-asymptotic bounds, convergence results and stability guarantees. Still, being based on self-normalised importance sampling, the PaRIS estimator is biased. Our first contribution is to design a novel additive smoothing algorithm, the Parisian particle Gibbs PPG sampler, which can be viewed as a PaRIS algorithm driven by conditional SMC moves, resulting in bias-reduced estimates of the targeted quantities. We substantiate the PPG algorithm with theoretical results, including new bounds on bias and variance as well as deviation inequalities. Our second contribution is to apply PPG in a learning framework, covering MLE and MSC as special examples. In this context, we establish, under standard assumptions, non-asymptotic bounds highlighting the value of bias reduction and the implicit Rao--Blackwellization of PPG. These are the first non-asymptotic results of this kind in this setting. We illustrate our theoretical results with numerical experiments supporting our claims.

  • 5 authors
·
Jan 2, 2023

On the statistical theory of self-gravitating collisionless dark matter flow: Scale and redshift variation of velocity and density distributions

This paper studies the scale and redshift variation of density and velocity distributions in self-gravitating collisionless dark matter flow by a halo-based non-projection approach. All particles are divided into halo and out-of-halo particles for redshift variation of distributions. Without projecting particle fields onto a structured grid, the scale variation is analyzed by identifying all particle pairs on different scales r. We demonstrate that: i) Delaunay tessellation can be used to reconstruct the density field. The density correlation, spectrum, and dispersion functions were obtained, modeled, and compared with the N-body simulation; ii) the velocity distributions are symmetric on both small and large scales and are non-symmetric with a negative skewness on intermediate scales due to the inverse energy cascade at a constant rate varepsilon_u; iii) On small scales, the even order moments of pairwise velocity Delta u_L follow a two-thirds law (-varepsilon_ur)^{2/3}, while the odd order moments follow a linear scaling langle(Delta u_L)^{2n+1}rangle=(2n+1)langle(Delta u_L)^{2n}ranglelangleDelta u_Lrangler; iv) The scale variation of the velocity distributions was studied for longitudinal velocities u_L or u_L^{'}, pairwise velocity (velocity difference) Delta u_L=u_L^{'}-u_L and velocity sum Sigma u_L=u^{'}_L+u_L. Fully developed velocity fields are never Gaussian on any scale, despite that they can initially be Gaussian; v) On small scales, u_L and Sigma u_L can be modeled by a X distribution to maximize the system entropy; vi) On large scales, Delta u_L and Sigma u_L can be modeled by a logistic or a X distribution; vii) the redshift variation of the velocity distributions follows the evolution of the X distribution involving a shape parameter alpha(z) decreasing with time.

  • 1 authors
·
Feb 14, 2022

MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--

For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of K-L divergence between the information projection and the predictive distribution with plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to n^{-2}-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the "p-n criterion" is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the p-n criteria is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.

  • 1 authors
·
May 19, 2021

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing several open questions: (a) What type of ICL estimator is learned by large language models? (b) What is a proper performance metric for ICL and what is the error rate? (c) How does the transformer architecture enable ICL? To answer these questions, we adopt a Bayesian view and formulate ICL as a problem of predicting the response corresponding to the current covariate, given a number of examples drawn from a latent variable model. To answer (a), we show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm, which is proven to be approximately parameterized by the attention mechanism. For (b), we analyze the ICL performance from an online learning perspective and establish a O(1/T) regret bound for perfectly pretrained ICL, where T is the number of examples in the prompt. To answer (c), we show that, in addition to encoding Bayesian model averaging via attention, the transformer architecture also enables a fine-grained statistical analysis of pretraining under realistic assumptions. In particular, we prove that the error of pretrained model is bounded by a sum of an approximation error and a generalization error, where the former decays to zero exponentially as the depth grows, and the latter decays to zero sublinearly with the number of tokens in the pretraining dataset. Our results provide a unified understanding of the transformer and its ICL ability with bounds on ICL regret, approximation, and generalization, which deepens our knowledge of these essential aspects of modern language models.

  • 4 authors
·
May 30, 2023

Standardized Benchmark Dataset for Localized Exposure to a Realistic Source at 10-90 GHz

The lack of freely available standardized datasets represents an aggravating factor during the development and testing the performance of novel computational techniques in exposure assessment and dosimetry research. This hinders progress as researchers are required to generate numerical data (field, power and temperature distribution) anew using simulation software for each exposure scenario. Other than being time consuming, this approach is highly susceptible to errors that occur during the configuration of the electromagnetic model. To address this issue, in this paper, the limited available data on the incident power density and resultant maximum temperature rise on the skin surface considering various steady-state exposure scenarios at 10-90 GHz have been statistically modeled. The synthetic data have been sampled from the fitted statistical multivariate distribution with respect to predetermined dosimetric constraints. We thus present a comprehensive and open-source dataset compiled of the high-fidelity numerical data considering various exposures to a realistic source. Furthermore, different surrogate models for predicting maximum temperature rise on the skin surface were fitted based on the synthetic dataset. All surrogate models were tested on the originally available data where satisfactory predictive performance has been demonstrated. A simple technique of combining quadratic polynomial and tensor-product spline surrogates, each operating on its own cluster of data, has achieved the lowest mean absolute error of 0.058 {\deg}C. Therefore, overall experimental results indicate the validity of the proposed synthetic dataset.

  • 3 authors
·
May 3, 2023

A Three-regime Model of Network Pruning

Recent work has highlighted the complex influence training hyperparameters, e.g., the number of training epochs, can have on the prunability of machine learning models. Perhaps surprisingly, a systematic approach to predict precisely how adjusting a specific hyperparameter will affect prunability remains elusive. To address this gap, we introduce a phenomenological model grounded in the statistical mechanics of learning. Our approach uses temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance. A key empirical result we identify is a sharp transition phenomenon: depending on the value of a load-like parameter in the pruned model, increasing the value of a temperature-like parameter in the pre-pruned model may either enhance or impair subsequent pruning performance. Based on this transition, we build a three-regime model by taxonomizing the global structure of the pruned NN loss landscape. Our model reveals that the dichotomous effect of high temperature is associated with transitions between distinct types of global structures in the post-pruned model. Based on our results, we present three case-studies: 1) determining whether to increase or decrease a hyperparameter for improved pruning; 2) selecting the best model to prune from a family of models; and 3) tuning the hyperparameter of the Sharpness Aware Minimization method for better pruning performance.

  • 4 authors
·
May 28, 2023

A study of a deterministic model for meningitis epidemic

A compartmental deterministic model that allows (1) immunity from two stages of infection and carriage, and (2) disease induced death, is used in studying the dynamics of meningitis epidemic process in a closed population. It allows for difference in the transmission rate of infection to a susceptible by a carrier and an infective. It is generalized to allow a proportion ({\phi}) of those susceptibles infected to progress directly to infectives in stage I. Both models are used in this study. The threshold conditions for the spread of carrier and infectives in stage I are derived for the two models. Sensitivity analysis is performed on the reproductive number derived from the next generation matrix. The case-carrier ratio profile for various parameters and threshold values are shown. So also are the graphs of the total number ever infected as influenced by {\epsilon} and {\phi}. The infection transmission rate (eta), the odds in favor of a carrier, over an infective, in transmitting an infection to a susceptible ({\epsilon}) and the carrier conversion rate ({\phi}) to an infective in stage I, are identified as key parameters that should be subject of attention for any control intervention strategy. The case-carrier ratio profiles provide evidence of a critical case-carrier ratio attained before the number of reported cases grows to an epidemic level. They also provide visual evidence of epidemiological context, in this case, epidemic incidence (in later part of dry season) and endemic incidence (during rainy season). Results from total proportion ever infected suggest that the model, in which {\phi}=0 obtained, can adequately represent, in essence, the generalized model for this study.

  • 2 authors
·
Mar 31, 2023

Blackbox Model Provenance via Palimpsestic Membership Inference

Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice's model to produce text. Can Alice prove that Bob is using her model, either by querying Bob's derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem--in which the null hypothesis is that Bob's model or text is independent of Alice's randomized training run--and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice's model using test statistics that capture correlation between Bob's model or text and the ordering of training examples in Alice's training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice's training data. In the query setting, we directly estimate (via prompting) the likelihood Bob's model gives to Alice's training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model's training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob's text overlapping with spans of Alice's training examples and 2) the likelihood of Bob's text with respect to different versions of Alice's model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob's text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.

  • 6 authors
·
Oct 22

Information Theory and Statistical Mechanics Revisited

The statistical mechanics of Gibbs is a juxtaposition of subjective, probabilistic ideas on the one hand and objective, mechanical ideas on the other. In this paper, we follow the path set out by Jaynes, including elements added subsequently to that original work, to explore the consequences of the purely statistical point of view. We show how standard methods in the equilibrium theory could have been derived simply from a description of the available problem information. In addition, our presentation leads to novel insights into questions associated with symmetry and non-equilibrium statistical mechanics. Two surprising consequences to be explored in further work are that (in)distinguishability factors are automatically predicted from the problem formulation and that a quantity related to the thermodynamic entropy production is found by considering information loss in non-equilibrium processes. Using the problem of ion channel thermodynamics as an example, we illustrate the idea of building up complexity by successively adding information to create progressively more complex descriptions of a physical system. Our result is that such statistical mechanical descriptions can be used to create transparent, computable, experimentally-relevant models that may be informed by more detailed atomistic simulations. We also derive a theory for the kinetic behavior of this system, identifying the nonequilibrium `process' free energy functional. The Gibbs relation for this functional is a fluctuation-dissipation theorem applicable arbitrarily far from equilibrium, that captures the effect of non-local and time-dependent behavior from transient driving forces. Based on this work, it is clear that statistical mechanics is a general tool for constructing the relationships between constraints on system information.

  • 3 authors
·
May 27, 2011

Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates

Researchers are usually interested in examining the impact of covariates when separating heterogeneous samples into latent classes that are more homogeneous. The majority of theoretical and empirical studies with such aims have focused on identifying covariates as predictors of class membership in the structural equation modeling framework. In other words, the covariates only indirectly affect the sample heterogeneity. However, the covariates' influence on between-individual differences can also be direct. This article presents a mixture model that investigates covariates to explain within-cluster and between-cluster heterogeneity simultaneously, known as a mixture-of-experts (MoE) model. This study aims to extend the MoE framework to investigate heterogeneity in nonlinear trajectories: to identify latent classes, covariates as predictors to clusters, and covariates that explain within-cluster differences in change patterns over time. Our simulation studies demonstrate that the proposed model generally estimates the parameters unbiasedly, precisely and exhibits appropriate empirical coverage for a nominal 95% confidence interval. This study also proposes implementing structural equation model forests to shrink the covariate space of the proposed mixture model. We illustrate how to select covariates and construct the proposed model with longitudinal mathematics achievement data. Additionally, we demonstrate that the proposed mixture model can be further extended in the structural equation modeling framework by allowing the covariates that have direct effects to be time-varying.

  • 2 authors
·
Jul 5, 2020

Parameter estimation from the core-bounce phase of rotating core collapse supernovae in real interferometer noise

In this work we propose an analytical model that reproduces the core-bounds phase of gravitational waves (GW) of Rapidly Rotating (RR) from Core Collapse Supernovae (CCSNe), as a function of three parameters, the arrival time tau, the ratio of the kinetic and potential energy beta and a phenomenological parameter alpha related to rotation and equation of state (EOS). To validate the model we use 126 waveforms from the Richers catalog Richers_2017 selected with the criteria of exploring a range of rotation profiles, and involving EOS. To quantify the degree of accuracy of the proposed model, with a particular focus on the rotation parameter beta, we show that the average Fitting Factor (FF) between the simulated waveforms with the templates is 94.4\%. In order to estimate the parameters we propose a frequentist matched filtering approach in real interferometric noise which does not require assigning any priors. We use the Matched Filter (MF) technique, where we inject a bank of templates considering simulated colored Gaussian noise and the real noise of O3L1. For example for A300w6.00\_BHBLP at 10Kpc we obtain a standar deviation of sigma = 3.34times 10^{-3} for simulated colored Gaussian noise and sigma= 1.46times 10^{-2} for real noise. On the other hand, from the asymptotic expansion of the variance we obtain the theoretical minimum error for beta at 10 kpc and optimal orientation. The estimation error in this case is from 10^{-2} to 10^{-3} as beta increases. We show that the results of the estimation error of beta for the 3-parameter space (3D) is consistent with the single-parameter space (1D), which allows us to conclude that beta is decoupled from the others two parameters.

  • 5 authors
·
Apr 3, 2023

Introduction to Machine Learning

This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms that are used in machine learning. It starts with an introductory chapter that describes notation used throughout the book and serve at a reminder of basic concepts in calculus, linear algebra and probability and also introduces some measure theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support to many algorithms that are used in the book, including stochastic gradient descent, proximal methods, etc. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before addressing the description of various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, or neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapter describe the theory of graphical models, an introduction to variational methods for models with latent variables, and to deep-learning based generative models. The next chapters focus on unsupervised learning methods, for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.

  • 1 authors
·
Sep 4, 2024

Skill-Targeted Adaptive Training

Language models often show little to no improvement (i.e., "saturation") when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often they failed to apply each skill in their responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23, etc.) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines. Our code is available at: https://github.com/princeton-pli/STAT.

A Tutorial on Bayesian Optimization

Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate. It is best-suited for optimization over continuous domains of less than 20 dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample. In this tutorial, we describe how Bayesian optimization works, including Gaussian process regression and three common acquisition functions: expected improvement, entropy search, and knowledge gradient. We then discuss more advanced techniques, including running multiple function evaluations in parallel, multi-fidelity and multi-information source optimization, expensive-to-evaluate constraints, random environmental conditions, multi-task Bayesian optimization, and the inclusion of derivative information. We conclude with a discussion of Bayesian optimization software and future research directions in the field. Within our tutorial material we provide a generalization of expected improvement to noisy evaluations, beyond the noise-free setting where it is more commonly applied. This generalization is justified by a formal decision-theoretic argument, standing in contrast to previous ad hoc modifications.

  • 1 authors
·
Jul 8, 2018

Questioning the Survey Responses of Large Language Models

As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models with varying scientific motivations. In this work, we examine what we can learn from a model's survey responses on the basis of the well-established American Community Survey (ACS) by the U.S. Census Bureau. Evaluating more than a dozen different models, varying in size from a few hundred million to ten billion parameters, hundreds of thousands of times each on questions from the ACS, we systematically establish two dominant patterns. First, smaller models have a significant position and labeling bias, for example, towards survey responses labeled with the letter "A". This A-bias diminishes, albeit slowly, as model size increases. Second, when adjusting for this labeling bias through randomized answer ordering, models still do not trend toward US population statistics or those of any cognizable population. Rather, models across the board trend toward uniformly random aggregate statistics over survey responses. This pattern is robust to various different ways of prompting the model, including what is the de-facto standard. Our findings demonstrate that aggregate statistics of a language model's survey responses lack the signals found in human populations. This absence of statistical signal cautions about the use of survey responses from large language models at present time.

  • 3 authors
·
Jun 13, 2023

AI Approaches to Qualitative and Quantitative News Analytics on NATO Unity

The paper considers the use of GPT models with retrieval-augmented generation (RAG) for qualitative and quantitative analytics on NATO sentiments, NATO unity and NATO Article 5 trust opinion scores in different web sources: news sites found via Google Search API, Youtube videos with comments, and Reddit discussions. A RAG approach using GPT-4.1 model was applied to analyse news where NATO related topics were discussed. Two levels of RAG analytics were used: on the first level, the GPT model generates qualitative news summaries and quantitative opinion scores using zero-shot prompts; on the second level, the GPT model generates the summary of news summaries. Quantitative news opinion scores generated by the GPT model were analysed using Bayesian regression to get trend lines. The distributions found for the regression parameters make it possible to analyse an uncertainty in specified news opinion score trends. Obtained results show a downward trend for analysed scores of opinion related to NATO unity. This approach does not aim to conduct real political analysis; rather, it consider AI based approaches which can be used for further analytics as a part of a complex analytical approach. The obtained results demonstrate that the use of GPT models for news analysis can give informative qualitative and quantitative analytics, providing important insights. The dynamic model based on neural ordinary differential equations was considered for modelling public opinions. This approach makes it possible to analyse different scenarios for evolving public opinions.

  • 1 authors
·
May 8

Linear statistics for Coulomb gases: higher order cumulants

We consider N classical particles interacting via the Coulomb potential in spatial dimension d and in the presence of an external trap, at equilibrium at inverse temperature beta. In the large N limit, the particles are confined within a droplet of finite size. We study smooth linear statistics, i.e. the fluctuations of sums of the form {cal L}_N = sum_{i=1}^N f({bf x}_i), where {bf x}_i's are the positions of the particles and where f({bf x}_i) is a sufficiently regular function. There exists at present standard results for the first and second moments of {cal L}_N in the large N limit, as well as associated Central Limit Theorems in general dimension and for a wide class of confining potentials. Here we obtain explicit expressions for the higher order cumulants of {cal L}_N at large N, when the function f({bf x})=f(|{bf x}|) and the confining potential are both rotationnally invariant. A remarkable feature of our results is that these higher cumulants depend only on the value of f'(|{bf x}|) and its higher order derivatives evaluated exactly at the boundary of the droplet, which in this case is a d-dimensional sphere. In the particular two-dimensional case d=2 at the special value beta=2, a connection to the Ginibre ensemble allows us to derive these results in an alternative way using the tools of determinantal point processes. Finally we also obtain the large deviation form of the full probability distribution function of {cal L}_N.

  • 4 authors
·
Oct 25, 2023

Evaluating Binary Decision Biases in Large Language Models: Implications for Fair Agent-Based Financial Simulations

Large Language Models (LLMs) are increasingly being used to simulate human-like decision making in agent-based financial market models (ABMs). As models become more powerful and accessible, researchers can now incorporate individual LLM decisions into ABM environments. However, integration may introduce inherent biases that need careful evaluation. In this paper we test three state-of-the-art GPT models for bias using two model sampling approaches: one-shot and few-shot API queries. We observe significant variations in distributions of outputs between specific models, and model sub versions, with GPT-4o-Mini-2024-07-18 showing notably better performance (32-43% yes responses) compared to GPT-4-0125-preview's extreme bias (98-99% yes responses). We show that sampling methods and model sub-versions significantly impact results: repeated independent API calls produce different distributions compared to batch sampling within a single call. While no current GPT model can simultaneously achieve a uniform distribution and Markovian properties in one-shot testing, few-shot sampling can approach uniform distributions under certain conditions. We explore the Temperature parameter, providing a definition and comparative results. We further compare our results to true random binary series and test specifically for the common human bias of Negative Recency - finding LLMs have a mixed ability to 'beat' humans in this one regard. These findings emphasise the critical importance of careful LLM integration into ABMs for financial markets and more broadly.

  • 2 authors
·
Jan 20