Statistics

Applications for 2024-2025 open on 1 July 2024.

Estimating parameters of a void point process using a Bayesian approach

Project code: SCI186

Supervisor:

Dr Charlotte Jones-Todd  

Discipline: Department of Statistics

Project description

A set of locations in space is a spatial point pattern: earthquake epicentres, locations of trees, animal habitats, etc. The arrangement of these points is generated by a combination of deterministic and stochastic mechanisms and is modelled using a point process.

A void point process has regions devoid of points where you might otherwise expect to see points given the background intensity of the process. These areas are termed ‘voids’ and may represent, for example, regions of a forest where storm damage has resulted in missing trees.
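
As a purely illustrative sketch (not part of the project brief), one simple way to generate a synthetic point pattern with a void is to simulate a homogeneous Poisson process and then delete any points falling inside a disc; the intensity, void centre, and radius below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Homogeneous Poisson process on the unit square with intensity lambda_
lambda_ = 200
n = rng.poisson(lambda_)                    # number of points
pts = rng.uniform(0, 1, size=(n, 2))        # uniform locations given n

# Carve out a circular 'void': delete points within radius r of a centre c
c, r = np.array([0.6, 0.4]), 0.15
keep = np.linalg.norm(pts - c, axis=1) > r
void_pattern = pts[keep]

print(f"{n} points generated, {len(void_pattern)} remain after removing the void")
```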

This project will involve using a Bayesian approach to estimate parameters of a void point process. No prior knowledge of point processes is required; however, a student should be comfortable with the concepts covered in STATS 730 and STATS 731.

Working with Genstat, Jupyter Notebooks, and Quarto

Project code: SCI187

Supervisors:

Simon Urbanek
James Curran

Discipline: Department of Statistics

Project description

Genstat is statistical software that has particular strengths in the field of experimental design, although it can do most things that one would expect from a mainstream data analysis and modelling programme.

Jupyter, and in particular Jupyter Notebooks, is a web application for creating and sharing computational documents. Jupyter is most commonly associated with Python, but it can in fact be extended to accommodate any language.

In this project we want the candidate to write a Jupyter kernel to extend the functionality of Jupyter to the Genstat scripting language. This task involves writing a small Python programme which implements some key functions in a Python class.
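
For orientation, a "wrapper" kernel of the kind described here can be built by subclassing ipykernel's Kernel class and implementing do_execute; the sketch below is a minimal outline only, and the run_genstat helper (which would hand the code block to a Genstat command-line process) is hypothetical, including the command name and flag it calls.

```python
import subprocess
from ipykernel.kernelbase import Kernel


def run_genstat(code: str) -> str:
    """Hypothetical helper: pass a block of Genstat script to a Genstat
    command-line executable and return its text output."""
    result = subprocess.run(["genstat", "-batch"], input=code,
                            capture_output=True, text=True)
    return result.stdout


class GenstatKernel(Kernel):
    implementation = "genstat_kernel"
    implementation_version = "0.1"
    language = "genstat"
    language_version = "unknown"
    language_info = {"name": "genstat", "mimetype": "text/plain",
                     "file_extension": ".gen"}
    banner = "A minimal Jupyter kernel for the Genstat scripting language"

    def do_execute(self, code, silent, store_history=True,
                   user_expressions=None, allow_stdin=False):
        output = run_genstat(code)
        if not silent:
            self.send_response(self.iopub_socket, "stream",
                               {"name": "stdout", "text": output})
        return {"status": "ok", "execution_count": self.execution_count,
                "payload": [], "user_expressions": {}}


if __name__ == "__main__":
    from ipykernel.kernelapp import IPKernelApp
    IPKernelApp.launch_instance(kernel_class=GenstatKernel)
```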

If there is time, we wish to explore using Quarto to render our Jupyter documents, so that we can use the system to produce dynamically updatable documents containing Genstat code.

The ideal student should have some programming experience in Python, and some experience of using either knitr or Quarto. Experience with Genstat is not important as this is a “proof-of-concept” project and we are likely to work with some small pre-written documents. Genstat is commercial software, and a licence will be provided for the duration of the project.

Statistics education – analysing novel assessments aiming to promote academic integrity and prepare students for the workplace

Project code: SCI188

Supervisors:

Stephanie Budgett
Leila Boyle

Discipline: Department of Statistics

Project description

A novel, low-stakes authentic assessment, designed to promote academic integrity and prepare students for the workplace, is being piloted in semester two, 2024. This project will involve working with the supervisors to analyse student data, refine rubrics, and collate reflections from the teaching team to inform future iterations.

This project will suit a student with a keen interest in statistics education.

Skills required:

  • Knowledge of any or all of STATS 100/101/108/150/201/208
  • An interest in assessment practices
  • An enquiring disposition.

Evaluating web-based tools that support “real time” formative assessment of writing

Project code: SCI189

Supervisor:

Dr Anna Fergusson

Discipline: Department of Statistics

Project description

Introductory-level statistics and data science students need to learn how to identify and produce short written communications that are statistically sound. However, there are pedagogical and practical challenges to designing and implementing effective formative assessment of student writing when courses involve hundreds or thousands of students. Automated approaches, such as LLM-based chatbots, can assist, but higher-quality training data is needed to develop pedagogically sound interactions.

This project involves evaluating the design and implementation of two web-based tools for supporting and engaging students with "real time" formative assessment of writing within a large introductory statistics lecture: Quick, write! and Why so judgemental?

This project will be an excellent opportunity to demonstrate skills in analysing very large amounts of unstructured text and user-interaction data, employing statistical models such as the Bradley-Terry model to explore the engagement and effectiveness of a purpose-built educational technology product.
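
For reference, the Bradley-Terry model mentioned above is usually written as follows, where a pairwise comparison (for example, a student judging which of two written responses is better) is modelled in terms of latent "worth" parameters; the notation here is generic rather than taken from the project.

```latex
% Bradley-Terry model for pairwise comparisons (generic notation):
% each item i has a worth parameter \pi_i > 0, and
P(\text{item } i \text{ is preferred to item } j) = \frac{\pi_i}{\pi_i + \pi_j},
% or equivalently, on the log scale with \lambda_i = \log \pi_i,
\operatorname{logit} P(i \succ j) = \lambda_i - \lambda_j .
```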

There is also the possibility of extending the project to the development and testing of rule-based generative models for providing feedback on written statistical communications.

Requirements: A background in data technologies (e.g. STATS 220) and strong skills in data analysis and general statistical methods. An interest in data science and/or statistics education would be advantageous, particularly in the development of pedagogically oriented computational tools.

Computational Bayesian methods to understand glacial melting

Project code: SCI190

Supervisors:

Dr Chaitanya Joshi
Dr Ru Nicholson

Discipline: Department of Statistics

Project description

Approximately 10% of the earth's land area is covered with glacial ice, with about 10% of that in the Greenland ice cap and the remainder in Antarctica. Human activity has led to conditions in which many glaciers are now rapidly melting, retreating on land and undergoing iceberg calving, i.e. large icebergs and chunks of ice, some over a kilometre in height, falling from glaciers (and the ice shelf) into the ocean. This is resulting in a rise in sea level and a decrease in the earth's ability to reflect the sun's heat back into space. It is essential to understand the interrelationship between anthropogenic global warming, iceberg calving, and glacier melting in order to reliably predict their consequences.

A number of factors are known to affect iceberg calving and glacier melting, including temperature, glacier thickness, ice density and crystal structure, base roughness and friction, and water pressure. These (typically uncertain) parameters are related to measurable quantities (such as the velocity of the ice on top of the glacier) through mathematical and computational models. A common approach to enable forecasts and predictions, such as future sea-level change, is to first estimate or infer these uncertain parameters and then run the predictive models. The Bayesian approach provides a natural framework for this estimation problem, as it allows various sources of uncertainty to be incorporated and quantified. However, due to the complexity of the mathematical ice-sheet models, solving the inference problem is computationally prohibitive and in some cases completely infeasible.

The goal of this project is to investigate and compare the accuracy, applicability, and efficiency of several approximate approaches to the inference problem.
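
As a point of orientation only (the notation is assumed, not taken from the project brief), the Bayesian formulation of such an inverse problem typically looks like the following, where theta denotes the uncertain glacier parameters, y the measured quantities, and G the mathematical or computational forward model.

```latex
% Illustrative Bayesian inverse problem (generic notation):
% observations y are linked to uncertain parameters \theta through a forward model G,
y = G(\theta) + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma),
% and Bayes' theorem gives the posterior over the parameters,
\pi(\theta \mid y) \propto \pi(y \mid \theta)\, \pi(\theta).
% Predictions (e.g. future sea-level change) are then posterior expectations of a predictive model Q:
E\bigl[Q(\theta) \mid y\bigr] = \int Q(\theta)\, \pi(\theta \mid y)\, d\theta .
```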

Some exposure to Bayesian statistics (such as STATS 331), mathematical modelling, and good R programming skills are essential. An interest in climate change, the environment, or ecology is desirable but not necessary.

Adaptive Nested Sampling

Project code: SCI191

Supervisor:

Brendon Brewer

Discipline: Department of Statistics

Project description

Nested Sampling is a popular and widely applicable algorithm for performing Bayesian Inference. However, its effectiveness depends on the ability to generate new particles above a given likelihood threshold. In most implementations, a numerical parameter controls how much computational effort is spent doing this. In reality, it would be useful if this could be set adaptively, so that the effort is spent only when it is really needed. In this project you will investigate ideas for how to make this work.
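
To make the role of that effort parameter concrete, here is a highly simplified sketch of the nested sampling loop on an assumed toy problem (this is not the project's code): new particles above the current likelihood threshold are generated by a short constrained random walk, and mcmc_steps is exactly the kind of tuning parameter one would like to set adaptively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: unit-square prior, Gaussian likelihood centred at (0.5, 0.5)
def log_likelihood(theta):
    return -0.5 * np.sum((theta - 0.5) ** 2) / 0.01

n_live, n_iter, mcmc_steps = 100, 1000, 20   # mcmc_steps = the "effort" parameter
live = rng.uniform(0, 1, size=(n_live, 2))
live_logl = np.array([log_likelihood(t) for t in live])
log_z = -np.inf                              # log-evidence accumulator

for i in range(n_iter):
    worst = np.argmin(live_logl)
    logl_star = live_logl[worst]
    # Prior mass assigned to the discarded point shrinks geometrically
    log_weight = logl_star - i / n_live - np.log(n_live)
    log_z = np.logaddexp(log_z, log_weight)

    # Replace the worst particle: constrained random walk from a survivor
    new = live[rng.integers(n_live)].copy()
    for _ in range(mcmc_steps):
        prop = np.clip(new + rng.normal(scale=0.05, size=2), 0, 1)
        if log_likelihood(prop) > logl_star:
            new = prop
    live[worst], live_logl[worst] = new, log_likelihood(new)

print("log evidence estimate:", log_z)
```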

Prerequisites: Good grades in STATS 331 or 731, and strong programming ability.

Genealogies of samples from stochastic populations and biodiversity models (Discovery Centre project)

Project code: SCI192

Supervisors:

Simon Harris
Jesse Goodman

Discipline: Department of Statistics

Project description

Suppose a population has been evolving over time up to the present, with births and ancestries following a random process. Now we sample individuals at random from the current population. What will be the structure of the genealogical tree relating the chosen individuals and their ancestries? This project in probability theory will investigate the reconstructed phylogenetic trees that arise under natural models of population dynamics. The particular focus will be age-dependent branching processes: the stochastic processes that encode population growth when individuals may have varying numbers of offspring over the course of their lives.

The project will look at "spine" techniques and other techniques from exciting recent research; see, for instance, Gernhard (2008), Stadler (2009), Harris, Johnston, and Roberts (2020), and Harris, Palau, and Pardo (2022+). The project will include directed reading for any necessary background material in probability, such as Markov chains, branching processes and age-dependent branching processes, and Poisson processes. Computer simulations may be used to exhibit graphically typical behaviours and theoretical results.
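
Purely as an illustration of the kind of simulation the project might use (the lifetime and offspring distributions below are assumed, not prescribed by the supervisors), this sketch simulates an age-dependent (Bellman-Harris) branching process in which each individual lives for an exponential time and is then replaced by a random number of offspring, recording the population size over time.

```python
import heapq
import numpy as np

rng = np.random.default_rng(2024)

def simulate(t_max=10.0, mean_life=1.0, offspring_probs=(0.2, 0.3, 0.5)):
    """Age-dependent branching process: each individual dies after an
    Exponential(mean_life) lifetime and leaves k offspring with the given
    probabilities for k = 0, 1, 2.  Returns a list of (time, population)."""
    events = []                       # min-heap of (death time, individual id)
    heapq.heappush(events, (rng.exponential(mean_life), 0))
    next_id, pop = 1, 1
    history = [(0.0, pop)]
    while events:
        t, _ = heapq.heappop(events)
        if t > t_max:
            break
        k = rng.choice(len(offspring_probs), p=offspring_probs)
        pop += k - 1                  # the parent dies, k children are born
        for _ in range(k):
            heapq.heappush(events, (t + rng.exponential(mean_life), next_id))
            next_id += 1
        history.append((t, pop))
    return history

for t, n in simulate()[-5:]:
    print(f"t = {t:5.2f}  population = {n}")
```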

Requirements: A good background in probability (e.g. STATS 125, STATS 225) and very good mathematics (e.g. proofs, limits, calculus, differential equations) are essential. Some more advanced knowledge of stochastic processes or Markov chains is also strongly recommended (e.g. STATS 325, STATS 320).

Constructing the New Zealand Socio-economic Index 2023 (NZSEI-23)

Project code: SCI193

Supervisors:

Barry Milne
Natalia Boven

Discipline: Department of Statistics

Project description

The New Zealand Socio-economic Index (NZSEI) assigns socio-economic scores to occupations using the ‘returns to human capital’ model, which posits that occupations are the way we transform cultural capital (education) into material rewards (income). The NZSEI has been constructed using census data for the 1991, 1996, 2006, 2013, and 2018 censuses. This project will involve constructing NZSEI scores for the 2023 census. It will also explore the possibility of constructing these scores using occupation, income, and education data from non-census sources (i.e., administrative data).

The ideal student should have some programming experience in SAS and R, and an interest in social statistics and inequality. The project will involve working on-site at one of the Stats NZ datalabs housed at the University of Auckland, so remote working will not be possible.

Best practice regression modelling with iNZight

Project code: SCI194

Supervisors:

Tom Elliott (iNZight Analytics, UoA)
Matt Edwards (UoA)

Discipline: Department of Statistics

Project description

Regression modelling involves a set of basic assumptions that should be checked by analysts. Any unchecked or unsatisfied assumptions should be handled, or at least commented on in text and graphical outputs. This project will involve working with the iNZight development team to develop R software for fitting regression models with interactive diagnostic and assumption checking.
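
The project itself targets R and the iNZight code base; purely to illustrate the sort of assumption checks involved, here is a small sketch using Python's statsmodels and scipy on simulated data (the tests chosen are just two common examples).

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.5 * x + rng.normal(size=200)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Constant-variance assumption: Breusch-Pagan test on the residuals
bp_stat, bp_pvalue, _, _ = het_breuschpagan(fit.resid, fit.model.exog)

# Normality assumption: Shapiro-Wilk test on the residuals
sw_stat, sw_pvalue = stats.shapiro(fit.resid)

print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")
print(f"Shapiro-Wilk p-value:  {sw_pvalue:.3f}")
```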

Requirements: STATS 380, STATS 330, and very good R programming skills.

Bayesian inference with iNZight

Project code: SCI195

Supervisors:

Tom Elliott (iNZight Analytics, UoA)
Matt Edwards (UoA)

Discipline: Department of Statistics

Project description

iNZight currently provides standard (Normal theory) and bootstrap methods for basic inference and hypothesis testing, such as ANOVA and chi-square tests. This project will involve working with the iNZight development team to develop R software for adding Bayesian alternatives to these common methods.

Requirements: STATS 380, STATS 331, and very good R programming skills.

Investigating diabetes burden and related co-morbidities for women from Fiji living in Aotearoa

Project code: SCI196

Supervisors:

Dr Pritika Narayan
Xiaoxu (Tina) Ye

Discipline: Department of Statistics

Project description

A recent paper (De Graaff et al. 2023) demonstrated that the burden of gestational diabetes mellitus (GDM) among South Asian women living in New Zealand is particularly high. Further preliminary analysis demonstrates that the GDM burden, particularly among those born in Fiji, may be associated with increased kidney complications or poor cardiovascular outcomes. To further examine GDM and GDM-associated complications, we are looking for a student who is interested in a summer studentship (and potentially a master's degree) to use administrative health data sets, such as the PHO enrolment minimum dataset and the NMDS diabetes and diabetes-related complications datasets, to examine disease burden in this understudied population.

Skills required: R programming skills, including basic R syntax; proficiency with packages such as ‘tidyverse’ will be advantageous.

Preferred:

  • An interest in pursuing a master's degree
  • A passion for improving evidence-based reporting on health outcomes for minority ethnic groups.

Prerequisites: STATS 380 (Statistical Computing) or equivalent

The impact of disclosure risk processes on the accuracy and precision of official statistics for the Pacific population

Project code: SCI197

Supervisors:

Andrew Sporle
Dr Nicole Satherley

Discipline: Department of Statistics

Project description

Official statistics agencies impose disclosure risk processes on publicly accessible official data in order to reduce the risk of disclosing individually identifiable information. These processes do not result in significant changes to the data when the numbers are large. However, when numbers are small, as with certain sub-populations, the processes can result in relatively large changes to counts and calculation results, as well as the suppression of otherwise useful data. This project involves examining and quantifying the impact of these disclosure risk processes on accuracy and precision when counts are small. This will be done with a combination of existing Pacific health and social data, using standard demographic and epidemiological analyses.
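
To make the small-count issue concrete, here is a tiny invented illustration (the rounding rule and the numbers are chosen for this example only, and are not a description of any agency's actual method): randomly rounding counts to base 3 barely moves a large count, but the same perturbation can noticeably shift a rate computed from small counts.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_round_base3(count, rng):
    """Illustrative random rounding to base 3: round to one of the two
    nearest multiples of 3, with probability proportional to closeness."""
    lower = 3 * (count // 3)
    remainder = count - lower
    if remainder == 0:
        return count
    return lower + 3 if rng.random() < remainder / 3 else lower

# A large count: rounding changes it by at most 2 (a tiny relative error)
print(random_round_base3(10_000, rng))     # e.g. 9999 or 10002

# Small numerator and denominator: the same perturbation can shift a rate a lot
cases, population = 4, 27
rate_true = cases / population
rate_rounded = random_round_base3(cases, rng) / random_round_base3(population, rng)
print(f"true rate {rate_true:.3f} vs rounded rate {rate_rounded:.3f}")
```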

Prerequisites: STATS 705 (completed or currently enrolled).

Skills: Familiarity with Pacific health and social statistics in New Zealand or overseas. Understanding of simple demographic calculations. Data management and R coding skills.

Statistical Meta-Analysis for Cochrane Review: Comparing Two Clinical Interventions for Parkinson's Disease

Project code: SCI198

Supervisor:

Dr Priya Parmar

Discipline: Department of Statistics

Project description

This summer studentship focuses on conducting a statistical meta-analysis for a Cochrane Review. The project aims to compare the efficacy and safety of two clinical interventions for Parkinson's Disease. The selected student will gain hands-on experience with advanced statistical techniques, systematic review methodologies, and evidence-based decision making in clinical research.

Objectives:

1. Extract relevant statistical information (outcome measures, beta-coefficients, standard errors, p-values, sample sizes) and study characteristics (patient demographics, year, intervention details, location) from a pre-set list of journals* which comprise the literature search for the systematic review, and manage these using Excel.
2. Perform statistical meta-analysis in R, including fixed-effects and random-effects models, subgroup analyses, and sensitivity analyses, to compare the efficacy and safety of the two interventions.
3. Assess the risk of bias in the included studies using the Cochrane Risk of Bias tool.
4. Prepare a report summarising the results and interpreting the key findings, considering clinical relevance and statistical significance.
5. Document the methodology and results comprehensively, contributing to the final Cochrane Review.

*A systematic review will have been conducted utilizing databases such as PubMed, MEDLINE, Embase, and the Cochrane Central Register of Controlled Trials to identify relevant studies, with titles, abstracts, and full texts screened against predefined inclusion and exclusion criteria prior to the summer project commencing.
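
As background for objective 2, the standard fixed-effect and random-effects pooling formulas that such a meta-analysis typically builds on are sketched below in generic notation (inverse-variance weighting with a DerSimonian-Laird type estimate of the between-study variance); the exact models used in the review may differ.

```latex
% Fixed-effect (inverse-variance) pooling of k study effects \hat\theta_i
% with within-study variances \sigma_i^2:
w_i = \frac{1}{\sigma_i^2}, \qquad
\hat\theta_{FE} = \frac{\sum_{i=1}^{k} w_i \hat\theta_i}{\sum_{i=1}^{k} w_i}, \qquad
\operatorname{Var}(\hat\theta_{FE}) = \frac{1}{\sum_{i=1}^{k} w_i}.

% Random-effects model: add a between-study variance \tau^2 (DerSimonian-Laird),
% where Q = \sum_i w_i (\hat\theta_i - \hat\theta_{FE})^2 is Cochran's heterogeneity statistic:
\hat\tau^2 = \max\!\left(0,\;
  \frac{Q - (k-1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}\right), \qquad
w_i^{*} = \frac{1}{\sigma_i^2 + \hat\tau^2}, \qquad
\hat\theta_{RE} = \frac{\sum_i w_i^{*} \hat\theta_i}{\sum_i w_i^{*}}.
```
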
The student will work under the guidance of Dr Parmar (epidemiologist and biostatistician) and Professor Maurice Curtis (neuroscientist and Parkinson's disease expert). Additional training resources and workshops on systematic reviews and meta-analysis methods will be available.

This project is ideal for students who have completed stage-two undergraduate statistics, or graduate students in statistics, biostatistics, epidemiology, public health, or a related field. A good knowledge of statistical analysis and experience with R are required. Strong analytical skills and attention to detail are also recommended. A keen interest in neurological health, clinical research, evidence synthesis, and Parkinson's disease is a bonus.

Key learnings for the student include:

1. Gaining practical research experience in conducting systematic reviews and meta-analyses, including data extraction, management, and statistical analysis.
2. Developing technical proficiency in using R and associated packages for meta-analysis.
3. Enhancing critical-thinking skills by assessing the quality and bias of clinical trials.
4. Improving scientific writing and reporting skills by contributing to a publication-quality Cochrane Review illustrating evidence-based decision-making processes.

Procedurally generated CSS & JS R-focussed casual/puzzle games

Project code: SCI199

Supervisor:

Dr Charlotte Jones-Todd  

Discipline: Department of Statistics

Project description

This project will create and implement R-focussed CSS and JS casual/puzzle games (e.g., https://statbiscuit.github.io/mini_games/). The focus of the project will be on the procedural generation of the material so that the content refreshes for each run-through.

This project is suited to students with a strong programming background. A creative flair would also be advantageous.  

Improving the accuracy of the saddlepoint approximation for count data

Project code: SCI200

Supervisor:

Jesse Goodman

Discipline: Department of Statistics

Project description

The saddlepoint approximation is a systematic method for approximating an unknown density function in terms of a known moment generating function. It is useful when each individual in a large population contributes to a single random variable, and has often been used in statistical ecology.
The saddlepoint approximation works best for densities, when the underlying random variable is continuous. For discrete random variables, the traditional saddlepoint approximation works less well, and always fails at the boundary. This project will implement new alternative saddlepoint approximations for some simple models and assess how these proposed alternatives compare to existing methods.
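
For context, the classical (continuous-case) saddlepoint approximation referred to above can be written as follows, where K is the cumulant generating function, i.e. the log of the moment generating function.

```latex
% Saddlepoint approximation to a density f from its cumulant generating
% function K(s) = \log E[e^{sX}]: for each x, solve K'(\hat s) = x, then
\hat f(x) = \frac{1}{\sqrt{2\pi K''(\hat s)}}\,
            \exp\!\bigl\{ K(\hat s) - \hat s\, x \bigr\}.
```
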
Experience with R programming and simulation would be a plus. The mathematical aspects of the saddlepoint approximation are not prerequisites, but mathematical applications could be explored as part of the project depending on the student. Saddlepoint approximations are related to certain contour integrals, so for a student with an interest in complex variables this project could look at complex variable methods and techniques.

Predictive Asset Management (PAM) for industry

Project code: SCI201

Supervisor:

Dr Priya Parmar

Discipline: Department of Statistics

Project description

This summer studentship focuses on Predictive Asset Management (PAM) in the manufacturing sector. The project aims to develop and implement predictive models to forecast equipment and machinery faults, enhancing maintenance strategies and minimizing downtime. The selected student will gain hands-on experience with advanced predictive data analytics.

The data will be provided by the global company TATA-iq.
The student will:
1. Integrate the data from multiple sources.
2. Perform exploratory data analysis to understand characteristics and identify any patterns and outliers.
3. Develop predictive models using key performance indicators (such as maintenance costs and operational efficiency); a minimal sketch follows this list.
4. Validate, test, and evaluate the predictive models using historical data for accuracy and reliability.
5. Visualise and report the findings and predictive insights to TATA-iq
6. Provide a comprehensive report documenting the methodology, results, and recommendations.
7. Propose a strategy for deploying the predictive model in a live manufacturing environment.
The student will be mentored by Dr Parmar and an analytic team from TATA-iq headquarters in Bangalore. Regular check-ins and progress meetings will ensure guidance and support throughout the project.

This project would suit a student from data science, computer science, engineering, statistics, or a related field. Basic knowledge of machine learning and experience with a programming language such as Python or R are required. Familiarity with data visualization tools and frameworks would be useful.

Adaptive control of queueing systems with bursty arrivals

Project code: SCI202

Supervisor:

Azam Asanjarani

Discipline: Department of Statistics

Project description

The evolution of queueing systems is often random, with key variables and parameters either unknown or only partially observable. Developing algorithmic methods for these systems, aimed at improving efficiency, forecasting, and enabling online control, can significantly reduce customer waiting times, enhance server utilization, and ensure system stability. This project's primary goal is to devise an optimal model for queueing systems with bursty arrivals that is applicable to practical scenarios in fields such as health services, energy, manufacturing, transportation, and communication networks. A basic knowledge of stochastic processes (such as Markov chains and queueing systems), along with strong analytical and programming skills, is required for this project.
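
As a toy illustration of what bursty arrivals can do to waiting times (all parameters below are invented), this sketch simulates a single-server queue whose arrival rate switches between a quiet and a bursty phase, and compares the mean wait with a steady arrival stream of the same average rate.

```python
import numpy as np

rng = np.random.default_rng(3)

def lindley_waits(interarrivals, services):
    """Waiting times in a single-server FIFO queue via Lindley's recursion."""
    w, waits = 0.0, []
    for a, s in zip(interarrivals, services):
        waits.append(w)
        w = max(0.0, w + s - a)
    return np.array(waits)

def bursty_interarrivals(n, rate_quiet=0.5, rate_burst=5.0, p_switch=0.05):
    """Interarrival times from a two-phase process: the arrival rate jumps
    between a quiet phase and a bursty phase, switching after each arrival
    with probability p_switch."""
    gaps, rate = [], rate_quiet
    for _ in range(n):
        gaps.append(rng.exponential(1.0 / rate))
        if rng.random() < p_switch:
            rate = rate_burst if rate == rate_quiet else rate_quiet
    return np.array(gaps)

n = 50_000
bursty = bursty_interarrivals(n)
steady = rng.exponential(bursty.mean(), size=n)          # same average rate, no bursts
services = rng.exponential(0.6 * bursty.mean(), size=n)  # identical service demand

print("mean wait, bursty arrivals:", lindley_waits(bursty, services).mean())
print("mean wait, steady arrivals:", lindley_waits(steady, services).mean())
```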

Reaching for the stars: Searching for gravitational waves from collapsing stars

Project code: SCI203

Supervisors:

Matt Edwards
Avi Vajpeyi

Discipline: Department of Statistics

Project description

A Core-Collapse Supernova (CCSN) is a spectacular explosion marking the death of a massive star. Traditional observations using light can't reveal what happens at the star's core, but Gravitational Waves (GWs) can let us look inside. Our project aims to help search for GWs from supernovae, using an AI GW model made by Matt. The search algorithm works like the “Shazam” app, but instead of songs, we search for GWs from stars.
Skills required:

  • Python
  • Interest in astronomy :)