Against recognition

June 28, 2021
(This post was originally part of an email update to my GitHub
sponsors. I'm making it public because I like it. It's a little less
formal than most things on my site.)


at Dynamicland, I had this idea for an 'audio recorder' or 'audio
note' device -- it would be a thing that has a push button, a
microphone, and a receipt printer:
- you hold the button down to record1
- say something / make whatever sound you want
- let go of the button
- the device immediately (or even in real time, while you're
  speaking?) prints out a little receipt that 'contains' (that is)
  the audio that it just recorded2
- then you can put that sound (receipt) down to play it3
1. that is, it's push-to-talk / spring-loaded
2. It's strange -- the detail that the printer is a receipt
printer feels very important. that the printout comes out right
at hand, like a Polaroid (rather than out of a printer somewhere
else in the room), and that the printout is fast (not a laser
printer churning for 30 seconds before spitting out a page). it
helps create this sense of lightness and 'post-it-ness'
3. dj microbeads proposed implementing this as a standalone
project, with QR codes on the receipts & a phone app that you
point at a QR code to play the sound, and that immediately didn't
feel right to me. it feels like it doesn't preserve enough of the
lightness, the fact that you could put this receipt down anywhere on
your desk and it'd play, like an object that has its own magic.
(maybe the receipt would have a dot frame around it, like any other
page in Dynamicland, although I don't like to think about that too
hard; it would be better if it felt even lighter than that)
I think you at least need to have an always-on well on your desk
where you can put a receipt to play it (like a slot where you'd
place a Yu-Gi-Oh card, but alive -- maybe it'd have a webcam /
document camera / cheap phone mounted overhead), even if you don't
have Dynamicland-esque coverage of the whole desktop; the app on
your existing phone that you have to switch into is not enough
like I feel like there is no audio equivalent of jotting something down
on a post-it note (or even writing on a whiteboard), and that's kinda
what I wanted: a cheap physical object, made of sound instead of
writing, where both consuming it and creating it feel immediately at
hand.
It'd be a mix of the nice things about audio (oral communication style
instead of written, can bring in other noise-making objects, background
sounds can make it in without your explicitly intending to include them,
music, kids/nonliterate people can participate, etc) and the nice things
about writing a note (cheap to make, disposable, objectness, can point
to and reuse later).
(ideas: you narrate a project you're working on, then literally stick
your narration onto the project / you hum a bunch of little sound
samples and play with them to make some musical composition / you build
a game where the game pieces are spoken words... / i don't know. you
can make non-visual systems, where you only use your hands and ears and
mouth, that would be accessible even to someone who can't read or can't
see)
but -- and I think this is important -- the thing would not try to
recognize your speech and turn it into text. the receipt represents only
the sound itself. (from the computer's point of view, it's just an
opaque blob of audio.)4
4. maybe the receipt would have the audio waveform printed on it or
something, just to give it a unique appearance. Plus, there's this
Dynamicland aspiration that I'd want to maintain -- that a computer
could in theory look at the receipt and regenerate the audio
completely from what it sees (even if in practice it usually
'cheats' by keeping the audio file on disk)
(The Dynamicland aspiration is that you should be able to look at
the situation in the real world & completely derive the computer
state from that; there shouldn't be invisible 'virtual state'
that lives in RAM or on a hard disk.)
(Ideally, if the power cut out and the computers in the ceiling at
Dynamicland all restarted, nothing would be lost, because the
physical arrangement of stuff in the real world completely
determines the behavior of the computer anyway, and that physical
arrangement remains intact.) 
The computer is what stores the audio data for the receipt, and when you
put down the receipt, the computer is what plays the audio through the
speakers in the ceiling. So the computer's role here is to store
(audio) information and to orchestrate hardware, but its role is not
really to digest and index that information.
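(here's a tiny sketch of that 'store and orchestrate, don't digest'
role. it's only an illustration, not how anything at Dynamicland is
built: `ReceiptStore`, `record`, `put_down`, and the `play` callback
standing in for the ceiling speakers are all names I'm making up. the
thing to notice is that the audio is only ever handled as bytes.)

```python
import uuid
from typing import Callable, Dict


class ReceiptStore:
    """Keeps recorded audio as opaque bytes, keyed by receipt id (hypothetical)."""

    def __init__(self, play: Callable[[bytes], None]) -> None:
        self._audio: Dict[str, bytes] = {}
        self._play = play  # stand-in for the speakers; an assumption

    def record(self, audio: bytes) -> str:
        """Store the bytes untouched; return an id that a receipt could carry."""
        receipt_id = uuid.uuid4().hex
        self._audio[receipt_id] = audio
        return receipt_id

    def put_down(self, receipt_id: str) -> None:
        """Called when a receipt is seen on the desk: play its audio as-is."""
        self._play(self._audio[receipt_id])


# Usage with a fake speaker that just collects what it was asked to play.
played = []
store = ReceiptStore(play=played.append)
hum = store.record(b"\x00\x01 some humming \x02")
store.put_down(hum)
```

nothing in there ever looks inside the audio; speech, humming, and
background noise all get exactly the same treatment.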
if you have the computer recognize the speech inside these audio
receipts and turn it into text, you're now privileging that one kind of
audio (speech), which feels weird to me. you're going to push people to
make sound in the ways that are legible to the computer. you may start
to think that the text is the 'canonical' format, and the audio is
only a temporary stop on the way to figuring out the text, and it's a
Problem if that text comes out wrong, and you'll treat everything in
the world only in terms of how you can convert it to text...
(and to what end? none of the ideas or aspirations I've described so
far require the computer to understand what's in the audio receipts.)
(sure, maybe it'd be nice to have, like, search, but should that
nice-to-have dominate the design? You can't search physical books or
post-it notes, and that's fine. I want to play with this fresh thing of
audio notes, which doesn't even exist yet, before trying to add an
extra layer of recognition and textuality and legibility, a layer which
may overwhelm some of the things about audio that were interesting to me
in the first place.)
I have this reMarkable 2 tablet I got last year. I really like it. I use
it a lot. Its UI looks like this
(source):




and that frustrates me a little. I feel like you could do so much more;
I feel like that interface doesn't really take the tablet and pen
seriously.5
5. it's such a waste! I have a pet theory that a lot of the
stagnation in programming and in GUIs in general is downstream
of a stagnation in consumer I/O devices. hard to do, say,
graphical programming that feels good, if you're stuck with a
mouse and keyboard. and tablets aren't stuck with that; they have
a chance to do something new, and it feels like they've squandered
it so far
There's so much typed text on that screen -- there are so many straight
lines and buttons -- it feels like it's a UI built around tapping and
clicking and maybe keyboard shortcuts, not around the pen. Shouldn't as
much of the interface as possible be handwritten text, and hand-drawn
lines, and weird-shaped regions?6
6. I almost feel like I should get to draw the interface that I want
for myself. like how Acme lets you grow your own palette of commands
as you work. that would also make me feel a lot more comfortable and
committed to the interface, if it was something that came from my
own hand 
A small example: titles. This file is called "Chapter 4":


if you have a tablet, with a pen, I feel like you should be able to
just... write/draw that "Chapter 4" in. you shouldn't have to Make
a New File, then type C-H-A-P-T-E-R-SPACE-4 in a text field, then click
OK... there shouldn't even really be a 'text field' at all on a
tablet like that; at minimum, it should be an open blank field where
you can draw or write anything with your pen.
even if the computer can't figure out what some little scribble in that
field means, it is still useful, since the title is there to serve me,
to help me spot my files on sight. and in fact, even if the tablet does
manage to recognize the text in my scrawled title, I'd prefer that it
show my original handwriting there, because then the title feels like
it's mine, it's my writing from my hand, with my weird quirks and
imperfections. it's not dead 'text' from some font built into the OS
why is there any typed text on the tablet UI? why am I typing a title
by mashing soft keys on a software keyboard? shouldn't it all be
handwritten? not only would it be more comfortable and more fun (the
tablet hardware is great for handwriting and terrible for typing), it
would be more open: I could draw little smiley faces or stars or
whatever I want. (I could 'star' a document by literally drawing a
star on it!)7
7. and why are there files and folders at all? it feels like a naive,
traditional-PC-derivative take on how to deal with information. why
can't I just have endless pages, where I mark out and wire regions
of them together with my magic pen? 
anyway, if we want legibility, we could imagine techniques to navigate
and visualize that space of titles/title-drawings that aren't just
about converting everything to text and treating it as text:




it's weird that tablets like the reMarkable or iPad don't use this
more in their interfaces. like Dynamicland, they have a new form of
input, a form that goes way beyond traditional mouse and keyboard or
touch, that is far more open-ended, and I think they should use it!
pervasively!
and I don't mean treating handwriting as just an input to a
recognition system that slots into an old interface paradigm:


(I understand why that form of text recognition is useful -- existing
systems are big, and they do a lot of stuff you can't replicate easily,
and some form of pen compatibility with them is important -- but it
doesn't excite me, and I feel like it limits our imagination if we
think too much about it)
A couple of years ago, we wanted to have a better idea of what projects
people were doing (and had done already) at Dynamicland, so we started
making this 'research gallery' application.8
8. and we wanted to think about how you would make a 'database' or
querying interface that takes advantage of the unique properties of
Dynamicland, and we wanted more applications in DL that actually got
regularly used in a real context. 
It was built around this dynamic 'scrapbook'. The idea was that when
you make something, you'd add a new page to the scrapbook about it,
with photos, videos, text description, maybe a little embedded demo of
the thing, and so on.9
9. reminds me a bit of the Lisa Polaroids :-) 


You can see two such scrapbook pages below -- on the left, a page about
the "Animation" project, and on the right, a page about the "DNA
Kit" project:


Each project at Dynamicland would get a big page (or two or three) in
the scrapbook.
You can see (your eyes were probably drawn to) all the iconic
Dynamicland dot-framed pages that are glued into the scrapbook. These
pages are a few different things:
- photos that have been printed out via the Dynamicland system (for
  example, the yellow-backed areas in Animation): the photo is
  actually printed on the paper, but the computer also knows where
  and what the photo is and can transclude it onto the wall or as a
  thumbnail in search results or whatever
- demo videos that play on the scrapbook (the purple "History" in
  Animation)
- an embedded, live instance of the project itself (bottom-left
  corner of Animation, which animates between the 3 hand drawings
  along the bottom center of the page)
But I don't actually want to talk about those dot-frames, or about
their behavior; they're not the part that I find interesting. I'm much
more interested in how you can really put anything you want on the
scrapbook page (it is, after all, just a big piece of paper).
Look at everything on the Animation scrapbook page that isn't framed
by colored dots:


These are things -- post-it notes, bits of text floating around that
were written by different people, handwritten headings -- where the
computer doesn't even know they're there. But they mean something to
you and me. And, unlike 'text' in a computer, they can vary in human
ways; they can be written or typeset differently, set at different
sizes, with different colors, and so on, without software needing to
implement any of those features.
It's like how I can yell or cry or laugh in an audio recording, but all
that meaning gets destroyed when it gets 'recognized' into text.
When we made the scrapbook, we struggled a lot with the tension between
the freedom of this unrecognized 'open input space' and the utility of
a scrapbook that is indexable/legible to the computer. The scrapbook you
see above was a sort of compromise, with both structured (dot-framed)
and unstructured (everything else) elements. I wrote a bit about this
tension at the time, in an unpublished draft:
Let's say we only had a physical scrapbook of projects, with no
structured data -- no computer at all, basically. There's something
freeing about that open space format: you can put whatever you want
into the book. Photos, handwritten notes, hand-drawn diagrams,
booklets stapled in, paper inserts that fold out, whatever.
But in exchange for that freedom, you get a book which is in some ways
profoundly illegible and unsearchable. How would you search for
projects that involve 'music'? How would you search for projects
'made by Omar'?
Unless the book had an index or table of contents for that specific
kind of query (and you kept that index updated by hand, every time you
added or changed a project), you'd have to scan through the book from
cover to cover to answer the query.
So when we made the research gallery, we wanted its users to have the
power of a computer to process structured data:
- search for projects,
- see many views of projects (by date, by author, by subject, by
  capabilities used...),
- filter only for what you're interested in,
- see connections to related work,
- quickly render a list of results.
We want the computer to understand some things about each project. But
once we make a project format that is easy for the computer to read,
we also circumscribe how authors can describe a project. The system
wants authors to say the kinds of structured things that it can
understand, like:
- a project is described by strings of (QWERTY-keyboard-typeable)
  text, not diagrams, or icons, or singing, or other languages, or
  handwriting where the author used a different pen and pressed
  harder on some parts to emphasize them
- a project was created on a particular date on the calendar,
  not 'some time from Omar here and some time from Paula there,
  with low-level thinking and tweaking all along as part of a
  different project X, and then a big overhaul 6 months later'
- a project is demonstrated in action by individual photos and
  videos, not a big collage of photos with stuff pasted in like
  handwritten captions and supplementary drawings and extra people
  who weren't captured in the original photos
To be honest -- I would have liked to not have any dot frames on the
scrapbook page at all -- I would have liked for that 'open input
space' to be the primary thing, and the recognition system to come
later (if at all).10 (and if that means the computer has to give up
some legibility for a while, if that means you can't automatically
figure out a thumbnail photo for a scrapbook page, if that means you
can't search through the scrapbook quickly, that might have been OK by
me.)
10. and the dots take so much space! to me, the fact that they
dominate your visual field feels so wrong, so misleading about
what's really important. and it constrains the number of
'objects' that can fit on the page so much when each 'object'
has this thick dot-frame around it. it constrains your imagination
of what the scrapbook can be. i wish objects could be small and free
and unrecognized
the idea that a lot of things in the computer are not really about
computation; they are artifacts made by people for people. it should be
normal to put things inside the computer that the computer itself cannot
digest, but that it can pass on to your future self or to other people.
this is sort of the idea behind literate programming, too -- the idea
that the stuff for humans should be the default context, and the
highly constrained stuff parsed by the computer should be an exceptional
mode within that.
It's not that I think that recognition is always bad -- but I do think
it is more interesting to err against it when we're designing new
systems.
(and I think framing your problems in terms of 'recognition' is
actually risky; it may result in really ossified, unimaginative systems.
You end up with some expert who works on the 'recognition module' of
your system, and their job is to take some fixed form of input and
deliver some fixed form of output, and they can improve that module as
much as they want, but they don't think about the interface context
around the recognition. Even if you want recognition, I think there
should be someone who thinks about the whole system at once and can
come up with an end-to-end design that includes both pattern
recognition and user interface.)
We have all these computer systems that love lowest-common-denominator
formats like plain text, and they push programmers to normalize
everything into those formats, so the computer can 'understand' them.




But I feel like as much as possible, the computer should be leaving
things the way they are!




If you have recognition, it should be a sort of overlay you put on the
thing (maybe one of many such overlays); you shouldn't destroy the
thing and replace it with its ashes.
If it has to exist, the text recognizer should attach an overlay to the
image that says 'it might have this text in it'; the image shouldn't
itself be transformed into text. (and ideally, that overlay would be
rich with context and provenance; it wouldn't just be a blob of plain
text; it would know what image it's from, admit other texts that it
could potentially be, talk about how likely each word of it is to be
correct, say as much as possible about the recognizer's process and
thinking)
The original thing is still around and is still the source of truth.
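(to make the overlay idea a bit more concrete, here's a rough sketch
of what such a data structure might look like. every name in it --
`Original`, `WordGuess`, `Reading`, `RecognitionOverlay`, `Item`, and
the recognizer name "hypothetical-ocr" -- is invented for
illustration, and the confidence numbers are made up. it's one
possible shape, not a real recognizer's API.)

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Original:
    """The thing itself (a scribble, a photo, a recording). Never modified."""
    name: str
    data: bytes


@dataclass(frozen=True)
class WordGuess:
    text: str
    confidence: float  # the recognizer's own estimate, 0.0 to 1.0


@dataclass(frozen=True)
class Reading:
    """One candidate reading of the whole original."""
    words: Tuple[WordGuess, ...]


@dataclass
class RecognitionOverlay:
    source: Original         # provenance: which original this is a guess about
    recognizer: str          # who made the guess
    readings: List[Reading]  # best guess first, then the other candidates

    def best_text(self) -> str:
        return " ".join(word.text for word in self.readings[0].words)


@dataclass
class Item:
    original: Original  # stays the source of truth
    overlays: List[RecognitionOverlay] = field(default_factory=list)

    def attach(self, overlay: RecognitionOverlay) -> None:
        """Add a guess alongside the original; the original is left alone."""
        if overlay.source is not self.original:
            raise ValueError("overlay is about a different original")
        self.overlays.append(overlay)


# Usage: a handwritten title keeps its pixels and gains a hedged guess.
scrawl = Original("title-scribble", b"...raw pixels...")
item = Item(scrawl)
item.attach(RecognitionOverlay(
    source=scrawl,
    recognizer="hypothetical-ocr",  # made-up name
    readings=[
        Reading((WordGuess("Chapter", 0.93), WordGuess("4", 0.61))),
        Reading((WordGuess("Chapter", 0.93), WordGuess("9", 0.30))),
    ],
))
```

search or thumbnails could read `best_text()`, but anything that
displays the item would still show `item.original`, and you could
attach a second overlay from a different recognizer without touching
the first.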




mostly, this email is not an argument; i don't know if it really makes
any sense; it's a feeling
i feel like the computer should give you more space to play. like you
should be able to play and doodle and dream by default, wherever you
happen to be on the computer, without your first concern being whether
the computer will recognize it...




my computer's first job is not to recognize things; it's to hold onto
what I put in it. it's a medium, not an intelligence; i want it to be
good at being a sheet of paper before it tries to be anything else.
(it's a versatile sheet of paper: it can hold video, and sound, and
links, and computations, and ...)