Find a problem that you are excited to spend time implementing and releasing, develop high-quality code, and release it as a well-structured and documented GitHub repository.
There are a few heuristics that you might use to identify a good problem:
Start with a problem we saw in the course (e.g. an implementation of a single algorithm like a spectral PDE solver) and then branch out into related extensions (e.g. generalize the solver to different coordinate systems or geometries).
Look for creative but underappreciated older methods papers in your field, especially where the author’s code is unavailable, where the implementation is in a very different language like FORTRAN or Perl, or where the provided code is poorly documented. A clean, easy-to-use implementation of an uncommon method often proves very valuable.
Find a very common task or method in your field (e.g., stacking microscopy images, or synchronizing experimental recordings), and use tricks from vectorization, dynamic programming or linear algebra to speed up the task.
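As a toy illustration of the kind of speedup meant here (synthetic signals, not a real dataset): aligning two recordings by cross-correlation can be done in O(n log n) with an FFT instead of an O(n^2) loop over candidate lags. A minimal sketch:

```python
import numpy as np

def circular_lag(a, b):
    """Integer circular lag d such that a ~ np.roll(b, d), found via an
    FFT-based cross-correlation instead of a Python loop over all lags."""
    c = np.fft.irfft(np.fft.rfft(a) * np.conj(np.fft.rfft(b)), len(a))
    return int(np.argmax(c))

# synthetic demo: a noise trace and a circularly shifted copy of it
rng = np.random.default_rng(0)
sig = rng.standard_normal(4096)
lag = circular_lag(np.roll(sig, 137), sig)   # recovers 137
```

For real recordings you would typically zero-pad to avoid wrap-around and work with the linear (non-circular) correlation, but the vectorization idea is the same.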
It’s okay if you pick a hard problem and ultimately can’t solve it. The goal is to write good code with good documentation and structure.
A few common pitfalls to avoid¶
Don’t pick a problem that is so hard or open-ended that it prevents getting started writing code. Your goal should be to get started writing code as soon as possible. You can always get something simple working and then add more features later.
Please don’t copy or only lightly modify code that is primarily taken from a blog post, a GitHub repo you found, or purely generated by an LLM, without using best practices (i.e., unit testing) to check for correctness. Some LLM-generated code will run, but it can be excessively complicated, with unnecessary abstraction, excessive edge-case checking, and indirect logic.
Don’t do an ML project unless someone in your group is either experienced with it, or you are willing to do a lot of background reading to ensure you can use best statistical practices. For example, if you want to train a deep learning model, first make sure it’s appropriate for your problem scale. Simpler methods, like boosted forests or ridge regression, might be more appropriate. Be extremely careful about validation, hyperparameter tuning, and train-test splits.
Project Learning Goals¶
Beyond providing a setting to try out some of the ideas we are learning in this class, I am hoping that this project will have residual value to you after the course is over. Having prior experience with open-source development and visible existing code examples may prove useful to you in your graduate research, and potentially on the academic and industry job market. By posting the code publicly on GitHub, your code will help others in the future who are trying to solve similar problems.
Parameters¶
I would prefer groups of 3–4 people, for a total of 8 projects per course.
You’re allowed (and encouraged) to work on something relevant to your research group’s work, but please make the GitHub repo self-contained. You should plan to write substantial new code for this project (although re-factoring a “rough” implementation is okay), so please use your best judgment to ensure that you get the most value out of this project.
If there is a method you’d like to add to a large, existing package that is widely used in your field (e.g. Biopython, Astropy, scikit-learn, sktime) check with me about submitting a well-structured PR to the main repo instead of implementing a standalone package. Please check with the repo maintainers that your feature would be welcome, and about the format and testing in order to get accepted. Generally, the larger the repo’s userbase, the smaller the addition should be, and the more testing it will need to pass. However, the potential impact could be huge.
Grading Rubric¶
Problem scope: 20%
Contains an interesting and challenging problem, and makes a good-faith effort to approach it.
Creativity: An unexpected application or novel algorithm or interpretation of an algorithm is exciting and appreciated.
Thoroughness: Makes a thorough attempt to solve the problem, even if ultimately unsuccessful
Code quality: 40%
Logical structure, minimal redundancy or repeated code
Variables and objects have appropriate scope
Use of appropriate abstractions
Code legibility and style
Unit tests or other tests to ensure correctness
Documentation: 20%
README contains Installation instructions
README contains example usage and minimum working example
I’m not requiring a written report this term, so if you have benchmarks or results, please put them in a section of your repo’s README.md file. Please use best practices for publication-quality writing and figure-making.
Major functions and classes have documentation
Talk: 20%
Only one group member needs to present, though you are welcome to structure this however you’d like.
These will be ~10 minutes + 2 min questions during the last few sessions of the course (5 talks per class).
Please be ready to present the class session before you are scheduled, just in case someone can’t come on their scheduled presentation day.
You can organize these however you want, but if you would prefer a template: 3–4 minutes of background, 2–3 minutes on problem formulation, 4–5 minutes on your solution and any pitfalls or dead ends, and remaining time on future directions, applications, and connections to other interesting ideas.
Project ideas¶
These are just suggestions; feel free to pick anything that interests you. I have included informal estimates of the difficulty of the base problem; however, generalizations or modifications can make many of these problems more advanced.
Implement the Ising, Potts, or lattice gas model and study the properties of the phase transition, when it occurs, and how it depends on the lattice geometry (e.g. square vs triangular vs hexagonal, or even a random graph). Difficulty level: Easy to Medium
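To give a sense of the starting point, here is a minimal single-spin-flip Metropolis sketch for the square-lattice case (default sizes and sweep counts are my own choices, and a real project would add observables, equilibration checks, and other lattice geometries):

```python
import numpy as np

def metropolis_ising(L=16, beta=0.6, n_sweeps=100, seed=0):
    """Metropolis sampling of the 2D Ising model on an L x L periodic
    square lattice, with beta = 1/T in units where J = 1."""
    rng = np.random.default_rng(seed)
    spins = rng.choice([-1, 1], size=(L, L))
    for _ in range(n_sweeps):
        for _ in range(L * L):
            i, j = rng.integers(0, L, size=2)
            # sum of the four nearest neighbours, periodic boundaries
            nb = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                  + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2 * spins[i, j] * nb          # energy cost of flipping (i, j)
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                spins[i, j] *= -1
    return spins, abs(spins.mean())

spins, m = metropolis_ising(beta=0.6)   # beta above beta_c ~ 0.4407: ordered phase
```

Sweeping beta through the critical value and tracking the magnetization and susceptibility is the natural first study; changing the neighbour stencil gives the other lattice geometries.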
Implement the Kramer-Marder model of river formation. This model bears some similarities to the sandpile model. Difficulty level: Medium to Hard
Implement the Fermi–Pasta–Ulam–Tsingou model. Simulate a weakly nonlinear mass–spring chain; track modal energies and demonstrate lack of rapid equipartition plus near-recurrence. As a stretch, obtain results confirming frequency-space normal-form fit or compare alpha vs beta formulations of the model. Difficulty level: Easy to Medium
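A sketch of the core simulation loop for the alpha variant, using velocity-Verlet and projecting onto the normal modes of the linear chain (parameter choices here are illustrative, not the historical ones):

```python
import numpy as np

def fput_alpha(N=32, alpha=0.25, dt=0.05, n_steps=20_000, record_every=100):
    """Velocity-Verlet integration of the alpha-FPUT chain (fixed ends),
    starting with all energy in the lowest normal mode. Returns the modal
    energies sampled every `record_every` steps."""
    j = np.arange(1, N + 1)
    q = np.sin(np.pi * j / (N + 1))              # lowest mode as initial condition
    p = np.zeros(N)

    def force(q):
        qe = np.concatenate(([0.0], q, [0.0]))   # clamped boundary masses
        d = np.diff(qe)                          # spring extensions
        f = d + alpha * d**2                     # harmonic + quadratic force law
        return f[1:] - f[:-1]

    modes = np.sqrt(2 / (N + 1)) * np.sin(np.pi * np.outer(j, j) / (N + 1))
    omega = 2 * np.sin(np.pi * j / (2 * (N + 1)))  # normal-mode frequencies
    energies, a = [], force(q)
    for step in range(n_steps):
        p += 0.5 * dt * a
        q += dt * p
        a = force(q)
        p += 0.5 * dt * a
        if step % record_every == 0:
            Q, P = modes @ q, modes @ p          # normal-mode coordinates
            energies.append(0.5 * (P**2 + (omega * Q)**2))
    return np.array(energies)

E = fput_alpha()
```

Plotting the first few columns of `E` against time is the classic way to see the slow energy exchange and near-recurrences.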
Implement the Chirikov standard map and study the transition to chaos in the model. Difficulty level: Easy
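The map itself is only a few lines; the work is in the phase-space study. A minimal sketch (initial conditions and K here are arbitrary demo values):

```python
import numpy as np

def standard_map(p0, theta0, K, n_steps):
    """Iterate the Chirikov standard map; both variables are kept mod 2*pi."""
    p, theta = p0, theta0
    traj = np.empty((n_steps, 2))
    for n in range(n_steps):
        p = (p + K * np.sin(theta)) % (2 * np.pi)
        theta = (theta + p) % (2 * np.pi)
        traj[n] = p, theta
    return traj

traj = standard_map(0.5, 1.0, K=1.5, n_steps=1000)
```

Scatter-plotting many trajectories for K below and above the critical value K_c ≈ 0.9716 shows the breakup of the last invariant tori.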
Implement the Kuramoto model of coupled oscillators and study the onset of synchronization as a function of coupling strength. Then, implement the Abrams–Strogatz coupling matrix, and observe chimera states. How do these states change with the number of coupled oscillators? Difficulty level: Medium to Hard
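For the first part, the mean-field form makes each step O(N). A minimal sketch (Euler stepping and parameter values are my own simplifications; for N(0, 1) frequencies the mean-field critical coupling is K_c = 2·sqrt(2/pi) ≈ 1.6):

```python
import numpy as np

def kuramoto_order(K, N=500, dt=0.05, n_steps=2000, seed=0):
    """Euler integration of N mean-field Kuramoto oscillators with standard
    normal natural frequencies; returns the order parameter r averaged over
    the second half of the run."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal(N)
    theta = rng.uniform(0, 2 * np.pi, N)
    rs = []
    for step in range(n_steps):
        z = np.exp(1j * theta).mean()            # complex order parameter
        r, psi = np.abs(z), np.angle(z)
        theta += dt * (omega + K * r * np.sin(psi - theta))
        if step >= n_steps // 2:
            rs.append(r)
    return float(np.mean(rs))

r_low, r_high = kuramoto_order(0.5), kuramoto_order(4.0)  # below vs above K_c
```

Sweeping K and plotting r(K) exhibits the synchronization transition; the chimera study replaces the uniform coupling with the two-population coupling matrix.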
Implement the Vicsek model of flocking. Try changing the noise model, the interaction radius, or add time delay to the interactions, in order to see how these changes affect the flocking behavior. Consider how you might extend the model to larger swarms, such as by using the kernelized approach described in this paper. Difficulty level: Easy
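A vectorized sketch of the base model (the noise convention, box size, and density below are my own demo choices; the O(N^2) neighbour search is the part that would need the kernelized or cell-list trick to scale up):

```python
import numpy as np

def vicsek(N=300, L=10.0, v0=0.3, radius=1.0, eta=0.3, n_steps=150, seed=0):
    """Minimal Vicsek model in a periodic L x L box; noise is eta times a
    uniform angle in (-pi, pi). Returns the final polar order parameter."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0, L, size=(N, 2))
    angle = rng.uniform(-np.pi, np.pi, N)
    for _ in range(n_steps):
        d = pos[:, None, :] - pos[None, :, :]
        d -= L * np.round(d / L)                 # minimum-image displacements
        neigh = (d**2).sum(-1) < radius**2       # boolean adjacency (includes self)
        # mean heading of neighbours via vector averaging, plus angular noise
        angle = (np.arctan2(neigh @ np.sin(angle), neigh @ np.cos(angle))
                 + eta * rng.uniform(-np.pi, np.pi, N))
        pos = (pos + v0 * np.column_stack([np.cos(angle), np.sin(angle)])) % L
    vx, vy = np.cos(angle).mean(), np.sin(angle).mean()
    return float(np.hypot(vx, vy))

r = vicsek(eta=0.1)   # low noise: the flock should order
```

Plotting the order parameter against eta locates the flocking transition.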
Implement the Barnes-Hut algorithm using a quad-tree data type and use it to simulate a large N-body gravitational system. Compare the results to a direct N^2 simulation for small N, and then scale up to larger N. Difficulty level: Hard
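Before writing the tree code, it helps to have the direct O(N^2) reference in hand, since that is what the Barnes-Hut result gets validated against. A vectorized sketch (the Plummer-style softening `eps` is my own choice, not part of the algorithm):

```python
import numpy as np

def direct_accel(pos, mass, G=1.0, eps=1e-3):
    """Direct-summation O(N^2) gravitational accelerations with softening;
    the reference a Barnes-Hut tree code should reproduce to within its
    opening-angle error."""
    d = pos[None, :, :] - pos[:, None, :]       # displacement from i toward j
    r2 = (d**2).sum(-1) + eps**2                # softened squared distances
    np.fill_diagonal(r2, np.inf)                # exclude self-interaction
    inv_r3 = r2 ** -1.5
    return G * (d * (mass[None, :, None] * inv_r3[:, :, None])).sum(axis=1)

# sanity check: two unit masses one unit apart attract with |a| = G m / r^2 = 1
pos = np.array([[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]])
mass = np.ones(2)
acc = direct_accel(pos, mass)
```

A useful acceptance test for the tree code is that total momentum stays conserved and that its accelerations agree with `direct_accel` as the opening angle shrinks.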
Implement the Nagel–Schreckenberg traffic CA model and study the onset of traffic jams as a function of car density. Difficulty level: Easy
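The four NaSch update rules translate almost directly into array operations. A minimal sketch of one parallel update on a ring (array sizes and the demo density are arbitrary):

```python
import numpy as np

def nasch_step(pos, vel, road_len, vmax=5, p_slow=0.3, rng=None):
    """One parallel update of the Nagel-Schreckenberg CA on a circular road.
    pos must be ordered around the ring (order is preserved: no overtaking)."""
    rng = np.random.default_rng() if rng is None else rng
    gaps = (np.roll(pos, -1) - pos) % road_len - 1           # empty cells ahead
    vel = np.minimum(vel + 1, vmax)                          # 1. accelerate
    vel = np.minimum(vel, gaps)                              # 2. brake
    slow = (rng.random(len(vel)) < p_slow) & (vel > 0)
    vel = np.where(slow, vel - 1, vel)                       # 3. random slowdown
    pos = (pos + vel) % road_len                             # 4. move
    return pos, vel

# demo: 30 cars on a 100-cell ring, starting bunched and at rest
rng = np.random.default_rng(0)
pos, vel = np.arange(0, 90, 3), np.zeros(30, dtype=int)
for _ in range(200):
    pos, vel = nasch_step(pos, vel, road_len=100, rng=rng)
```

Measuring the flow (mean velocity times density) while sweeping the car density traces out the fundamental diagram and the jamming transition.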
Implement the Eden or ballistic deposition model and study the fractal dimension of the resulting clusters. Measure the roughness exponents and check for finite-size scaling effects. Difficulty level: Easy to Medium
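The ballistic-deposition growth rule fits in a few lines; the science is in the scaling analysis of the interface. A sketch (system size and particle count are demo values):

```python
import numpy as np

def ballistic_deposition(width=200, n_particles=20_000, seed=0):
    """Drop particles onto random columns of a periodic substrate; each
    sticks at the first site where it touches the aggregate. Returns the
    height profile and the interface width (std of heights)."""
    rng = np.random.default_rng(seed)
    h = np.zeros(width, dtype=int)
    for col in rng.integers(0, width, n_particles):
        # particle lands on top of its own column or sticks to a taller neighbour
        h[col] = max(h[col] + 1, h[(col - 1) % width], h[(col + 1) % width])
    return h, float(h.std())

h, w = ballistic_deposition()
```

Tracking the interface width versus time and versus system size gives the growth and roughness exponents, to be compared against the KPZ universality class.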
Implement a minimal Lattice-Boltzmann fluid solver in 2D, and use it to simulate flow around an obstacle as a function of Reynolds number. For example, the von Karman vortex street that arises due to flow past a cylinder. You may find this example implementation for the lid-driven cavity useful as a starting point. Difficulty level: Medium to Hard
Implement the orthogonality-constrained optimizer of Edelman et al.
Implement a minimal finite-element solver for a PDE like the diffusion equation in 2D, and compare its performance with the finite-difference method. Difficulty level: Medium
Recreate the key results of Kauffman’s random Boolean circuits paper. Difficulty level: Medium
Recreate the key results of Lenski et al.'s digital organisms paper. Difficulty level: Hard
Using the logistic map or another minimal system, implement the Ott, Grebogi, and Yorke method for controlling chaos. Difficulty level: Hard
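A sketch of the OGY idea specialized to the logistic map (the feedback gain comes from linearizing the map at the unstable fixed point; the window size `eps` and starting points are my own demo choices):

```python
import numpy as np

def ogy_control(x0, r0=3.9, eps=0.2, n_steps=1000):
    """OGY-style control of the logistic map x -> r x (1 - x): small feedback
    perturbations of r stabilize the unstable fixed point x* = 1 - 1/r0.
    Control is applied only when the required perturbation is small."""
    x_star = 1 - 1 / r0
    lam = r0 * (1 - 2 * x_star)       # df/dx at the fixed point (|lam| > 1)
    g = x_star * (1 - x_star)         # df/dr at the fixed point
    x = x0
    traj = np.empty(n_steps)
    for n in range(n_steps):
        dr = -lam * (x - x_star) / g  # feedback that cancels the linear deviation
        if abs(dr) > eps:             # outside the control window: leave r alone
            dr = 0.0
        x = (r0 + dr) * x * (1 - x)
        traj[n] = x
    return traj, x_star

# once the chaotic orbit wanders into the control window, it gets captured
traj, x_star = ogy_control(x0=1 - 1 / 3.9 + 0.01)
```

The full OGY method generalizes this to higher-dimensional maps via the stable and unstable eigendirections, which is where the "Hard" rating comes from.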
Implement a minimal version of Havok or Dynamic Mode Decomposition, two data-driven methods for discovering the underlying dynamics of a system based on time series data. Difficulty level: Medium to Hard
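For orientation, exact DMD reduces to a short linear-algebra pipeline; the synthetic check below (my own construction) plants known decay rates in a low-dimensional subspace and recovers them:

```python
import numpy as np

def dmd(X, Y, rank):
    """Exact DMD: given snapshots X = [x_0 ... x_{m-1}] and their successors
    Y = [x_1 ... x_m], return eigenvalues and modes of the best-fit linear
    map A with Y ~ A X, via a rank-truncated SVD of X."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s     # A projected onto the POD basis
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ Vh.conj().T / s) @ W / eigvals    # exact DMD modes
    return eigvals, modes

# synthetic check: dynamics confined to a 2D subspace with decay rates 0.9, 0.5
rng = np.random.default_rng(1)
P = rng.standard_normal((10, 2))
C = np.array([[0.9**k, 0.5**k] for k in range(11)]).T
X, Y = P @ C[:, :-1], P @ C[:, 1:]
eigvals, modes = dmd(X, Y, rank=2)
```

HAVOK builds on the same machinery but applies it to a Hankel (time-delay) matrix of a single measured coordinate.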
Project resources¶
It’s not necessary to include everything in this guide, but it gives a great picture of what the research community thinks good, reusable code should look like.
Example Final Projects from this course¶
Neural System Identification by Training Recurrent Neural Networks
Assimilating a realistic neuron model onto a reduced-order model
Testing particle phenomenology beyond the Standard Model with Bayesian classification
Simulating Anderson localization and Hofstadter butterflies
Example code repositories¶
An example pull request to the widely-used sklearn machine learning package, which implements varimax PCA.
References¶
- Kramer, S., & Marder, M. (1992). Evolution of river networks. Physical Review Letters, 68(2), 205–208. 10.1103/physrevlett.68.205
- Abrams, D. M., Mirollo, R., Strogatz, S. H., & Wiley, D. A. (2008). Solvable Model for Chimera States of Coupled Oscillators. Physical Review Letters, 101(8). 10.1103/physrevlett.101.084103
- Miranda-Filho, L. H., Sobral, T. A., de Souza, A. J. F., Elskens, Y., & Romaguera, A. R. de C. (2022). Lyapunov exponent in the Vicsek model. Physical Review E, 105(1). 10.1103/physreve.105.014213
- Ott, E., Grebogi, C., & Yorke, J. A. (1990). Controlling chaos. Physical Review Letters, 64(11), 1196–1199. 10.1103/physrevlett.64.1196