The plan was to go to WACV25, held early March in Tucson, Arizona. The desert biome there has been on my wishlist for a long time, so naturally expectations were set high, not for the conference but for the trip itself. It ain't often that a little European poverino gets to travel to the US, especially in such "glorious" times as the present, so I decided to make the most of it. Here are some random details from this satisfying jaunt.

Day -2. The original plan was to travel through Munich, then San Francisco, then Tucson. Unbeknownst to us, the ver.di union in Germany had different plans. It turned out they were organizing a two-day strike at Munich airport, and our initial flight was caught up in it. The flight was later cancelled, but fortunately we were rebooked onto an altogether different outbound itinerary, one going through Istanbul, then again San Francisco (SF), and then Tucson. The downside was that we'd arrive one day later, missing the first day of the conference.

Day -1. When I tried to check in for the next day's flight, it became apparent that the PNRs we had were invalid. Calling the tour agency revealed that they couldn't get the correct PNRs either. They advised us to just check in directly at the counter. But this is risky, because if the flight is overbooked, it becomes a race over who gets to check in first. So I went to the airport a day early to check in ahead of the flight. There were many problems there, but eventually I managed to get a PNR and select the seats, thinking they were now secured...

Day 0. The journey begins. I go to the airport and see a horrendously long queue. After 40 minutes they print my boarding pass (just for that leg of the trip, not for all), according to which I've been put on "stand-by" even though I had checked in the previous day?! Anyway, at the gate I manage to board. After landing in Istanbul I had to quickly board the flight to SF, yet I didn't have a boarding pass. So I had to run across the airport, find a transfer desk, have them print the boarding pass, go through security, and find the correct gate within this absolute behemoth of an airport. For flights to the USA, you then go through extra security checks. It really feels like you're entering a vault. Border control is way stricter than when travelling between the carefree EU countries.

The flight from Istanbul to SF was long and tiring. Luckily, the Trump-Zelensky meeting was going on precisely at that time, and given that we had live television during the flight, this kept most people occupied. As soon as you get to San Francisco you can immediately sense the difference from European cities. It feels different in the best way possible. Everything is so efficient. Things are where you expect them to be. The last flight, from SF to Tucson, was in a small plane and comfortable. And seeing the city lights from above at night - a vast sea of colored dots - it's almost as if you see a beautiful visualization of all the complicated interconnectedness, web of transactions, and economic activity. We arrived in Tucson late in the evening, then headed to the Marriott.

Day 1. Day 1 began with a massive over-the-top American full buffet breakfast, after which the presentations started. I presented my poster on 3D object detection from Bird's-Eye-View (BEV) and was pleased with all the attention and interaction it got. I met interesting people working on BEV estimation, DETR variants, and sensor fusion.

The keynote presentation was about computer vision for remote sensing applications. This includes, for example, recognizing and segmenting objects within huge satellite images. Some people argue that satellite imagery has to be treated as an altogether different modality from ordinary images due to its unique characteristics [1]. For example, satellite ML can be concerned with modeling phenomena that span logarithmic scales in the spatial and temporal dimensions, using many spectral channels with greater diversity and precision than standard 8-bit RGB. Also, satellite datasets are huge, often at petabyte scale.

Consider the task of localizing a small object in a huge satellite image. What a person would probably do is iteratively focus on a single region, zoom into it, and inspect it for the target object. If it doesn't contain the target, they zoom out and focus on a different area, kind of like how human saccades work. This is an iterative search-based approach. People are trying to mimic this in vision algorithms. It's also similar to searching for a particular scene within a long video. You make an educated guess roughly where the frame should be and search around it. Depending on the frame content you may jump to another frame.
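To make the zoom-in/zoom-out idea a bit more concrete, here is a minimal sketch of a best-first search over image tiles. It is not the method from the keynote, just my own toy illustration: `coarse_score` is a hypothetical stand-in for a cheap low-resolution "glance" model, `contains_target` stands in for an expensive full-resolution detector, and the image is assumed to be a square grid with a power-of-two side.

```python
import heapq

def coarse_score(image, r0, c0, size, samples=4):
    """Cheap glance at a region: max over a sparse grid of pixels (toy saliency)."""
    step = max(1, size // samples)
    return max(image[r][c]
               for r in range(r0, r0 + size, step)
               for c in range(c0, c0 + size, step))

def contains_target(image, r0, c0, size, threshold):
    """Expensive full-resolution inspection of one small tile."""
    for r in range(r0, r0 + size):
        for c in range(c0, c0 + size):
            if image[r][c] >= threshold:
                return (r, c)
    return None

def find_target(image, leaf_size=8, threshold=0.9):
    """Best-first search: always zoom into the most promising region so far.

    Popping a region from another branch of the tree is the "zoom out and
    look elsewhere" step. Returns (location or None, number of tiles inspected).
    """
    n = len(image)
    heap = [(-coarse_score(image, 0, 0, n), 0, 0, n)]
    inspected = 0
    while heap:
        _, r0, c0, size = heapq.heappop(heap)
        if size <= leaf_size:
            inspected += 1
            hit = contains_target(image, r0, c0, size, threshold)
            if hit is not None:
                return hit, inspected
            continue
        half = size // 2
        for dr in (0, half):
            for dc in (0, half):
                score = coarse_score(image, r0 + dr, c0 + dc, half)
                heapq.heappush(heap, (-score, r0 + dr, c0 + dc, half))
    return None, inspected

# Toy 64x64 "satellite image": dim background with one bright 8x8 object.
image = [[0.1] * 64 for _ in range(64)]
for r in range(40, 48):
    for c in range(16, 24):
        image[r][c] = 1.0

hit, inspected = find_target(image)
print(hit, inspected)
```

The coarse score only decides the order in which tiles are visited. If the glance misses the object, the search still falls back to inspecting the remaining tiles, so a bad guess costs time rather than correctness.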

In the evening we went to Tumamoc Hill. We were supposed to get an Uber to the bottom of the hill and climb from there, but erroneously went to an altogether different hill close by. It was very windy. We caught the sunset over the Saguaro National Park, and before we knew it, it was pitch-black. We headed down the road to reach a good Uber pickup spot.

Day 2. Day 2 started with a keynote by Phillip Isola, speaking about language as a camera. The idea is that, similar to how an image contains a visual description of the world, so does language, except that this description is much shorter and more ambiguous. Based on it, it's entirely possible for purely text-driven LLMs to understand graphics and draw by simply generating tokens in an SVG-like format. That being said, I'm not convinced that studying this direction is worthwhile. Does the question "What is the color space of language?" really need to be asked? It almost feels like this line of research is only aimed at understanding the phenomena, without using the knowledge for some particular practical benefit. What good is all your understanding if it cannot be put into the engineering that will improve your life?

What is kind of interesting, though, is that vision and language features seem to be becoming more and more aligned [2]. A text description and a camera image of the same object both aim to capture the same underlying thing, and hence one can expect their features to become more similar. Different modalities can be considered different manifestations of the same statistical model of reality.

In the afternoon we went to the Arizona-Sonora Desert Museum, which was very cool. You get to walk across desert plains and caves, observing all kinds of marvellous creatures. The famous venomous Gila monster, many kinds of lizards, Mojave rattlesnakes, toads, scorpions, tarantulas, and desert tortoises are all on display. There is a huge abundance of cactus-like plants. Hills and plains are usually covered with Saguaro, which are heavily protected by law. You need to have a permit if you want to move or remove one. Chain fruit cholla (choy-uh), also called jumping cholla, is interesting in that its spines detach at even the slightest brush and are incredibly painful to remove. The resurrection plant shrivels up during dry periods but can literally come back to life after contact with water.

Day 3. Day 3 started with one of the most majestic hikes of my life. I was terribly jetlagged and had woken up at 5 am. Around 7:20 am, I decided to go for a quick walk along the Bowen trail. It was nothing short of amazing - you're walking along a dusty little path surrounded by huge Saguaro, cholla, and creosote bushes on both sides. That early in the morning it was still cool and windy, so the weather conditions were perfect. The scenery was stunning, so different from anything in Europe. The best thing to do is to periodically stop and listen. In these moments of absolute silence your perception heightens and you start hearing the desert sounds - twigs rattling, wind howling, birds chirping, insects buzzing, and eventually the faintest of footsteps. On four separate occasions I saw desert rabbits hopping around the shrubs.

Figure 1: Walking through the Sonora desert.

Further down, Bowen trail connects with Yetman's trail. I headed back and purposefully turned left onto what's called the Hidden Canyon trail. The path is narrow and winding, and climbs up the hill, providing excellent vantage points. While climbing up, I saw no fewer than five mule deer, observing me from a distance of about 20 meters at most. It seems they are accustomed to people. The rest of the hike was relatively uneventful, apart from encountering a small fat round bird eating a cactus and periodically emitting a low rumbling droning sound.

Academic-wise, there was an interesting presentation by Alvaro Velasquez, a program manager at DARPA. He talked about how neurosymbolic methods could improve training and reduce model size. He made the following interesting point. We know that a bigger network and more training data have a smoothing effect on the loss landscape, making it less jagged and rough. This happens for tasks which benefit from pattern recognition, e.g. vision and NLP, where you need to see many examples in order to learn the patterns. Yet for combinatorial tasks, like the travelling salesman problem, you can't really extract any general patterns for the solutions, even if you collect a dataset of millions of problem instances and their optimal paths. The loss landscape of a reasonably defined continuous approximation to the problem always remains very rough, which makes neural approaches unusable.

Ultimately, we'd like to be able to detect when we're in a sharp local region (e.g. using approximations of the Hessian) and jump out of it. This could arguably be done using neurosymbolic techniques. The idea is to detect when we're in a sharp region, lift the parameter state into some higher symbolic representation, perform a symbolic update there, and reproject it back to the parameter space, where gradient descent continues the optimization.
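The detection half of this is easy to sketch. Below is a toy example (my own, not from the talk) that estimates the largest Hessian eigenvalue with power iteration, using finite differences of the gradient as Hessian-vector products so the full Hessian is never formed. The quadratic loss, the function names, and the threshold of 10 are all assumptions for illustration, and the symbolic "jump" itself is not shown.

```python
import math

def grad(theta):
    """Gradient of a toy loss 0.5 * (100 * x^2 + y^2): sharp in x, flat in y."""
    return [100.0 * theta[0], 1.0 * theta[1]]

def hvp(grad_fn, theta, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def top_curvature(grad_fn, theta, iters=50):
    """Power iteration: estimates the largest-magnitude Hessian eigenvalue."""
    n = len(theta)
    v = [1.0 / math.sqrt(n)] * n   # fixed start vector, keeps the sketch deterministic
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        lam = sum(a * b for a, b in zip(v, hv))   # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(h * h for h in hv))
        if norm == 0.0:
            break
        v = [h / norm for h in hv]
    return lam

theta = [0.3, -0.2]
sharpness = top_curvature(grad, theta)
is_sharp = sharpness > 10.0   # hypothetical threshold for "we are in a sharp region"
print(sharpness, is_sharp)
```

In a real network the gradient function would come from autodiff, and the start vector would normally be random rather than fixed.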

As another application, consider the task of training a controller that must always respect some physical constraints. The way to do it is to implement the constraints within the training loop itself, so that the model satisfies them by design. This also typically reduces training time by a huge margin. It's also related to differentiable simulation. Any physical parameters that are unknown, e.g. drag coefficients, are estimated as additional parameters.
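Here is a minimal sketch of the last point, estimating an unknown drag coefficient by differentiating through a simulator. Everything in it is an assumption made for illustration: a falling object with linear drag (dv/dt = -g - k*v), explicit Euler integration, synthetic "observations" generated with k = 0.5, and a hand-written forward sensitivity s = dv/dk in place of an autodiff framework. It covers only the parameter-estimation part, not the constrained controller.

```python
def simulate(k, steps=200, dt=0.01, g=9.81):
    """Euler rollout of dv/dt = -g - k*v, also propagating s = dv/dk."""
    v, s = 0.0, 0.0
    velocities, sensitivities = [], []
    for _ in range(steps):
        # Both updates use the old (v, s); the s update is the derivative
        # of the v update with respect to k.
        v, s = v + dt * (-g - k * v), s + dt * (-v - k * s)
        velocities.append(v)
        sensitivities.append(s)
    return velocities, sensitivities

def fit_drag(observed, k_init=0.1, lr=0.005, iters=500):
    """Gradient descent on the mean squared velocity error with respect to k."""
    k = k_init
    for _ in range(iters):
        vs, ss = simulate(k, steps=len(observed))
        grad = sum(2.0 * (v - o) * s
                   for v, o, s in zip(vs, observed, ss)) / len(observed)
        k -= lr * grad
    return k

# Synthetic measurements from a "true" drag coefficient of 0.5 (assumed).
observed, _ = simulate(0.5)
k_hat = fit_drag(observed)
print(k_hat)
```

With a framework like JAX or PyTorch the sensitivity bookkeeping disappears: the rollout is written once and the gradient with respect to k comes from autodiff.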

Day 4. Day 4 was the bomb. It started with a two-hour hike within the Sonora desert. The path passed through hills and valleys, dense and sparse vegetation, wilderness, dry creeks, dusty trails, rocky slopes, winding curves, and finally the asphalt road to the hotel. Not a single human was seen, only rabbits.

In the afternoon we got an Uber and went to the Titan Missile Museum. I never thought I'd find myself within the decommissioned launch silo of the USA's largest and most powerful intercontinental ballistic missile (ICBM). Yes, that's right. It's a museum where you get to learn about the Titan II missile. To say that it's fascinating is an understatement - located in the middle of nowhere, you enter an underground facility and learn about military deterrence theory, retaliation, nuclear warfare, and the Cold War. The guide shows you the command center, where the commander and deputy were in charge of obeying the president's orders and launching a 9-megaton nuclear warhead. Then you walk through a tunnel to reach the actual underground silo where the rocket is positioned. It's an amazing place. It gives off very strong Los Alamos vibes. It also reminds me of Half-Life's On A Rail level.

Titan II is a weapon that can level entire cities. It can cause immense destruction. It is able to launch within a minute of the president's order. And during the Cold War there were 54 missiles like it. Moreover, ICBMs are only one leg of the nuclear triad, the others being strategic bomber aircraft and submarine-launched ballistic missiles (SLBMs). Bomber planes have the benefit of moving the bomb closer to the enemy, but can be shot down on the way. Submarines have the benefit of being entirely stealthy, but could only accommodate smaller warheads. ICBMs provided the most powerful weapons, but their silo locations were fixed and easily knowable to enemy intelligence. So, do you realize the might of the military-industrial complex? Do you realize that AI could be used, and is used, to improve and optimize the tens of thousands of little decisions and parameters within the whole launch pipeline? Scary.

In the late afternoon we met with some other people and went to downtown Tucson, where by sheer chance we encountered an anti-Trump protest. Streets were blocked. Police officers were guiding the crowds. Overall, quite an authentic American experience. After walking around we entered a local bar to get some cerveza and to discuss politics. There was a sign on the door saying "No firearms", which was unusual and interesting to me. But yes, Arizona allows people to carry guns - concealed or otherwise - so such signs are not unreasonable. After that we went to another bar and drank IPA. Overall, an amazing day and a great memorable birthday.

Day 5. End of the journey. It feels hard to leave. The flights were from Tucson to San Francisco and then on to Munich. Overall, the trip was amazing - beautiful landscapes, out-of-distribution events, a strong conference, interesting people. Looking forward to coming back to the land of the free.

Figure 2: The shrublands.

References

[1] Rolf, Esther, et al. Position: mission critical–satellite data is a distinct modality in machine learning. Forty-first International Conference on Machine Learning. 2024.
[2] Huh, Minyoung, et al. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987 (2024).