Screaming in the Cloud
Leaving Chemistry and Becoming a Data Nerd with Yulan Lin
Episode Summary
Yulan Lin is a former developer advocate for Google’s Data Studio, a position she held for two-plus years, and has since gone on to become a software engineer for Google Chrome. Prior to joining Google, Yulan worked as a software engineer for Valador Inc. at the Johnson Space Center in Houston. She also served as a registration analyst for the InterVarsity Christian Fellowship and was a self-employed musician for a bit, working as an accompanist, voice coach, and assistant choir conductor.

Join Corey and Yulan as they discuss how Yulan went from studying chemistry and researching bioinformatics to becoming a developer advocate at Google and a self-described data nerd, how organizations tend to be good at collecting data but not always at making sense of it, why the definition of “big data” changes from one use case to the next, what Google’s Data Studio is and how it supports data visualization, what Yulan does in her developer advocacy role, how data visualizations change depending on the audience, some of the most egregious examples of misusing data visualizations, and more.
Episode Show Notes and Transcript
About Yulan Lin

Yulan is a data nerd with experience in everything from bioinformatics to NLP. She’s currently working as a Developer Advocate for Google’s Data Studio. Prior to Google, she worked in research, event management, and government data science. When not computerating, she can be found reading, hosting dinner parties, and making music.

Links Referenced
  • DigitalOcean: https://www.digitalocean.com/
  • CHAOSSEARCH.io
  • http://do.co/screaming
  • DataStax: http://www.datastax.com
  • https://www.datastax.com/accelerate
  • Twitter: https://twitter.com/y3l2n


Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is brought to you by DigitalOcean, the cloud provider that makes it easy for startups to deploy and scale modern web applications with, and this is important to me, no billing surprises. With simple, predictable pricing that’s flat across 12 global data center regions and UX developers around the world love, you can control your cloud infrastructure costs and have more time for your team to focus on growing your business. See what businesses are building on DigitalOcean and get started for free at do.co/screaming. That’s D-O-Dot-C-O-slash-screaming and my thanks to DigitalOcean for their continuing support of this ridiculous podcast.


Corey: This episode has been sponsored by CHAOSSEARCH. If you have a log analytics problem, consider CHAOSSEARCH. They do sensible things like separating out the compute from the storage in your log analysis environment. You store the data in S3 in your account. You know where it lives, you know what it costs, and then they compress it heavily while indexing it, and then they query that data using a separately scalable fleet of containers. Therefore, the amount of data you’re storing no longer is bounded to how much compute you throw at it, as well. It’s broken that relationship, leading to over 80 percent cost savings in most environments, and being a sensible scaling strategy while still being able to access it through the APIs you’ve come to know and tolerate. To learn more, visit CHAOSSEARCH.io.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. I’m joined this week by Yulan Lin, a developer advocate at a small company called Google. Yulan, welcome to the show.

Yulan: Thanks, Corey. Thanks for having me.

Corey: Of course. So, you describe yourself as a data nerd, with experience in everything from bioinformatics to NLP. I have to look up what some of those words even mean. So, backing up a sec, what do you do and how did you get there?

Yulan: Yeah, so I’m a developer advocate for a product called Data Studio at Google, which is a business intelligence and dashboarding product that we have. And, how did I get here? Well, I studied chemistry. I actually thought I was going to go down the research route, and I had a bioinformatics research project, which is basically like computational genomics kind of stuff. And I was looking at RNA sequencing and macular degeneration. But the interesting part of that was, I had a dataset that crashed Excel, and I was like, “I don’t know what to do.” And so, what ended up happening was in the process of learning to analyze that data, one, I learned all sorts of statistical techniques that were completely new to me, but I also learned how Python scripting worked, R scripting worked, learned a little bit of SQL along the way, and I realized that I was much better at picking up those data analysis skills that were transferable than I was at keeping cells alive. And kind of went from there.

Corey: Yeah, you found your way through to Google of all places which, if you’re working with data, it seems like a decent place to go. I’ve been told they have a bit of that there. But your path sounds like it diverged from mine almost immediately. When I wound up early on in my career with data sets that had problems in Excel, I threw the computer aside in a huff, and instead of doing data analytics, I just figured I would talk qualitatively instead and tell interesting stories and indulge my ongoing love affair with the sound of my own voice. It didn’t occur to me that there might be better ways to solve these problems. The path not taken, as it were.

Yulan: I actually think there are really incredible stories to be told with and about datasets which is, I think, what compelled me about data analytics in the first place. So, in research, when I was looking either at chemistry education, or at bioinformatics stuff, what I always loved was that I could ask a question, get some data about it and tell a compelling story that, I think, I would argue mattered to the world. And I think the same is true for a lot of data sets. Because after Google, I ended up at a nonprofit doing event management and ops work. And so, I was doing a lot of the statistics around a large, about 15 thousand person event, and so in the process, I learned a lot about what different stakeholders wanted to get out of the stories inside our data sets about who was attending and how registration was going—

Corey: Out of all our talks, which irritated people the most. I mean, fun things like that.

Yulan: [laughing] I mean yes, but also really interesting things like you find users who take different paths through the registration system. And so, you end up with really interesting technical issues, like that the tags associated with someone’s registration don’t make any logical sense, because they found a bug that allowed them to take multiple routes through it, all these data management and data quality issues as well. And I loved going in and figuring out what didn’t look right, why it didn’t look right, was there something we could do about it that would stop that problem?

Corey: That sounds like it’s, first, incredibly difficult. And secondly, it sounds like it’s the sort of thing that no one knows exists. No one pays attention to that sort of thing at all. At a conference, it magically happens, it sprang fully formed. And sure, they started setting this up yesterday evening or something. People don’t realize all of the heavy lifting and tremendous amount of work that goes into putting something like that on. I’ve talked about that previously with other folks on this show. But it never occurred to me to figure out the other end of that: all right, with all the data that’s thrown off and stuff like that, how do you analyze that? How do you turn that into something that is usable by, you know, humans?

Yulan: Yes. And I think the same question of, “Given the data that you have, how do you turn it into something usable by humans?” is a question that applies across a lot of organizations in really small ways. So, everyone’s talking about big data. But I think a lot of quick wins are to be found in the spreadsheets that are on people’s local devices, or that just one analyst or person is maintaining month after month and converting into some kind of presentation or doc or report. Because those are often human-curated, often a little bit messy. But if they were regularized, if they were shared with the right people, if you could link them to different data sets, all of a sudden, you have a wealth of information at your disposal, and context as well. And the ability to present things to different stakeholders and tell the right story. And that’s really cool to me.

Corey: One of the things that I always found somewhat challenging was the idea of when do you have a big data problem? The rule of thumb was if it’s on a thumb drive, it’s not big data. It’s little data, medium data at absolute most if you squint hard enough. And then the other argument became, “Oh, if it fits in RAM, it’s not big data.” And then I started seeing instances in cloud and whatnot with many terabytes of RAM in there. So, is there, I guess, a clear line differentiating what separates big data from medium data? Or is it more of an -ish type of soft boundary?

Yulan: I think in general, it’s an -ish boundary. But I think the framework I use is less why do you care about how big the data is? Do you care about it for reasons of data engineering, and you want to know what the best kind of technical ways to manage and process your data analysis pipelines are? Or are you interested in what statistical techniques are valid on the data? Because the definitions of quote-unquote, “big data” differ across those things.

Corey: That doesn’t occur to me to think of it in terms of domain-specific. I mean, on some level, log data could be enormous if you log everything forever from just a simple web service. But it also winds up being awfully repetitive. Oh, wow. 98% of our data in the logs is the load balancer checking to see if the thing is still okay. Maybe there’s a transformation that makes this a little bit more usable as you start filtering that through. And again, I am not a data person at all. It turns out that stateless stuff is way more aligned with how I tend to operate because if I break that, I can push a button, build a new one and no one notices or cares. When you lose the data, very often you don’t really have the company anymore after that.
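
The filtering Corey describes can be sketched in a few lines of Python. This is a minimal illustration, not anything discussed on the show: the log format, the sample lines, and the `ELB-HealthChecker` user-agent marker are all assumptions chosen for the example.

```python
# Hypothetical access-log lines. The format and the "ELB-HealthChecker"
# user agent are assumptions for illustration, not from the episode.
SAMPLE_LOG = [
    '10.0.0.1 "GET /health HTTP/1.1" 200 "ELB-HealthChecker/2.0"',
    '10.0.0.1 "GET /health HTTP/1.1" 200 "ELB-HealthChecker/2.0"',
    '203.0.113.7 "GET /index.html HTTP/1.1" 200 "Mozilla/5.0"',
    '10.0.0.1 "GET /health HTTP/1.1" 200 "ELB-HealthChecker/2.0"',
    '198.51.100.4 "POST /login HTTP/1.1" 302 "Mozilla/5.0"',
]


def is_health_check(line: str) -> bool:
    """Treat any line carrying the assumed health-checker marker as noise."""
    return "ELB-HealthChecker" in line


def split_log(lines):
    """Return (lines worth keeping, number of health-check lines dropped)."""
    kept = [line for line in lines if not is_health_check(line)]
    dropped = len(lines) - len(kept)
    return kept, dropped


kept, dropped = split_log(SAMPLE_LOG)
print(f"dropped {dropped} of {len(SAMPLE_LOG)} lines as health checks")
for line in kept:
    print(line)
```

In a real pipeline the marker would depend on whichever load balancer is in use, and the dropped count is worth keeping as a metric of its own rather than discarding silently.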

Yulan: Yeah. I think the other thing too, with longitudinal data or data over time is that definitions can change too. And so, within the same organization, even if it’s been collecting a particular piece of data forever, the original reason might have been to answer question X, and then at some point, they realized question Y might also be kind of relevant to this data set, so I’m going to add a couple other fields to capture those things, as well. And tracking that metadata and the evolution of the whys behind why a database exists or why a field exists in a table, I think really can inform the questions that are valid to ask about the data set.

Corey: I think one of the challenges with data, at least one that I experience myself, is that I don’t know what questions to ask that the data can effectively answer. I mean, so from that perspective, it’s always challenging to figure out what questions data visualization solves for me.

Yulan: I think jumping to shiny visualizations before understanding the data set and the domain is actually going too quickly. At my last job, I sometimes described it as I was playing data therapist, because I talked to different people about what data sets they had and what questions they wanted to answer. And whether or not those datasets could effectively answer those questions. We also talked about what are the best ways to answer those questions? Is it some kind of analysis? Is it some kind of visual dashboard? And so, that’s something, I think, that has to be done in partnership with a domain expert. And also just time spent in the data, right? What’s the distribution of things, what do null values look like? What are things I should know about what different codes mean? All of these questions really should be thought through, in partnership with a domain expert who then also has a better idea what are the things that they want to track, or that would impact their day to day work?
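
One way to spend that "time in the data" before drawing anything is a quick profile of each column: which values appear, how often, and how many are missing. The sketch below uses only the Python standard library; the registration export, its column names, and its codes are hypothetical and invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical registration export. Columns and codes are made up.
RAW = """attendee_id,ticket_code,region
1,STU,West
2,GEN,
3,STU,East
4,,West
5,GEN,West
"""


def profile(rows, column):
    """Return (value counts, number of missing values) for one column."""
    counts = Counter()
    missing = 0
    for row in rows:
        value = (row.get(column) or "").strip()
        if value:
            counts[value] += 1
        else:
            missing += 1
    return counts, missing


rows = list(csv.DictReader(io.StringIO(RAW)))
for column in ("ticket_code", "region"):
    counts, missing = profile(rows, column)
    print(column, dict(counts), "missing:", missing)
```

The output is only a starting point for the conversation with a domain expert, who can say what a code like `STU` means and whether a blank region is a bug or a legitimate answer.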

Corey: This episode is sponsored in part by DataStax. The NoSQL event of the year is DataStax Accelerate in San Diego this May from the 11th through the 13th. I’ve given a talk previously called the myth of multi-cloud, and it’s time for me to revisit that with… a sequel! Which is funny given that it’s a NoSQL conference, but there you have it. To learn more, visit datastax.com that’s D-A-T-A-S-T-A-X.com and I hope to see you in San Diego this May.


Corey: Well, let’s back up a second here just to clarify something that I may not be entirely clear on. One of Google’s core competencies is taking words and putting them after the word Google as a product. In this case, they’ve done that with Google Data Studio. What is Google Data Studio?

Yulan: Yeah, so Data Studio is an in-browser data visualization, BI kind of dashboarding product that connects to all sorts of data sources. So, the way we describe it is, if it has an internet-accessible API, you can probably get the data into Data Studio. So, it allows people to integrate data from different data sources into the same place so that it’s easy to have an at-a-glance look or analysis of whatever metrics you care about. And it’s also really easy to make sure that it’s shared with the right stakeholders.

Corey: So, it winds up visualizing data for human consumption, not machine consumption?

Yulan: Yes, it’s for human consumption, and it’s also structured in such a way that it’s relatively easy to get started with it because the product itself, it’s a click-and-drag kind of product. It’s a GUI-based thing, even though I work on the developer features, which is kind of this separate box.

Corey: So, I guess my question for you then becomes, as a developer advocate for something like this, what does developer advocacy around data visualization look like? Who are the people you’re talking to? And what challenges do they have?

Yulan: Yeah, I think to answer the question, it might be useful to talk a little bit more about my job. So, my job is to actually support this API called Community Visualizations, which allows people to build their own custom visualizations and integrate them into Google Data Studio dashboards. And so, the reason to have a Developer Advocate around data visualization is really to show people the power of different kinds of visualizations, or different kinds of solutions, and storytelling around their datasets, and how to build them with Data Studio. And so, it’s things like, is there a chart that somebody made in an academic paper that actually would be really great for your use case, but it’s super specific and you have to have everything configured a certain way. And when does that chart work, when does it not? Is it for a particular dashboard or infographic, or is it something that’s generalizable? I think these are all questions that I’m hoping that my work helps people to answer a little bit.

Corey: It’s always difficult, I guess, from my perspective, to figure out how to structure any sort of visualization of reasonable data. It’s easy once you have a dashboard, or something that shows the relationship you’re looking at. Oh, yeah, that’s incredibly valuable and helpful. For whatever reason, I don’t know if it’s just who I am, or this is something a lot of people struggle with, but I personally have trouble figuring out even how to begin structuring what I might represent data as in a visual context. Is that common? Am I just crap at this thing and I should accept that? What is the, I guess—what are you seeing in the world as far as people’s level of comfort with this sort of thing?

Yulan: Yeah, that’s a great question. I think that it’s actually a really hard problem, and it’s deceptively hard. And the reason is because I think the right visualization or the right structure of a dashboard depends so heavily on what you want that dashboard to do. Because there’s a difference between some of the key metrics you want to have on a TV display, in your lobby or in an open office, than something that you want an analyst to be able to interact with and find trends or interesting things in, and it’s different than another dashboard that summarizes particular metrics for an executive. And so, everyone cares about different things, so I think my first question is always, what metric do you care about? Who is looking at it, and is it meant to be kind of a display kind of thing? So, a dashboard in a lounge, or an infographic kind of thing, or is it meant to be something you can interact with, and a means of exploration and analysis? Because that tends to help me start deciding how complicated things should be. Should they be scorecards? Should they be pie charts or bar charts? Do I want to bring in something really complex because it actually represents something like the number of people transferring in and out of certain regions well, or the genome data well? Should I be bringing in domain-specific things like that?

Corey: That’s an area where it seems to be extraordinarily challenging to, I think, articulate to folks who aren’t steeped in areas of this. I mean, it becomes the popular question that I think a lot of us who work in anything that even remotely touches technology have to answer when we deal with folks who are not in that space, usually at holidays with family, explaining what you do for a living to people who have no touchpoints for it. Do you have a go-to that you wind up using for that?

Yulan: Yeah, I talk about the New York Times data visualization team, partly because it was their work that inspired me to care about data visualization and see how powerful it was in the first place. And because that tends to be a good point of reference, so even if people aren’t familiar with that team, if I pull up a map that they’ve created, or pull up some charts that they’ve created to go with some of their stories, people immediately understand like, “Oh, seeing this in a chart instead of a table actually makes it click in a different way, or I ask different questions.” And that starts the conversation around data visualization.

Corey: Let’s go down a path that I love to explore that most people often don’t, generally because it’s a terrible way to teach people things, but I find it entertaining. What are some of the most egregious misuses of data visualization that you’ve seen? Or, I guess, bad data visualization. This is an audio podcast, so showing people crappy charts is not going to be as compelling when you’re just describing a crappy chart, but have you seen anything that is horrifying?

Yulan: It’s hard to say things are actually horrifying, but I think there are some cases where there’s just lines everywhere, it’s incredibly complicated, and there’s no explanation or walkthrough of what the different icons mean, and why lines are moving in certain directions, and whether or not things were stylized, or whether every angle and motion of the line or color variants means something and what that maps to, because I think at some point of complexity, my brain personally just kind of shuts down. The other thing I found, and I’m guilty of this, too, is just making decisions that look kind of pretty but have no meaning. So, arbitrary color changes because it matches a particular palette, even though colors have absolutely no meaning, that ends up being very confusing. So, those are some of my own pet peeves. Oh, that and I also really dislike low contrast color palettes, for accessibility reasons but also just readability reasons. It’s like, “Cool, you used this very uniform palette that looks great with your branding, and I cannot tell the difference between your different categories.”
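
The low-contrast complaint can be checked numerically. The sketch below computes the contrast ratio defined in the WCAG guidelines (relative luminance of each color, then `(lighter + 0.05) / (darker + 0.05)`) for neighboring colors in a palette. The palette of similar blues is hypothetical, and applying the ratio to adjacent chart categories is a rough proxy for "can I tell these apart", an assumption of this example rather than something WCAG prescribes for category colors.

```python
def _linearize(channel: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color, from 0 (black) to 1 (white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(a: str, b: str) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


# Hypothetical brand palette: three similar blues used for three categories.
palette = ["#1f77b4", "#2a7fbf", "#3588c9"]
for first, second in zip(palette, palette[1:]):
    print(first, second, round(contrast_ratio(first, second), 2))
```

Ratios close to 1 mean two categories are nearly indistinguishable by lightness alone, which is the situation Yulan describes. Hue differences are not captured by this measure, so it complements a visual check rather than replacing it.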

Corey: One of the things I’ve always found is, for whatever reason, and I see this periodically in various state-of-the-cloud-style reports, where they’ll have a whole bunch of different providers or services or offerings that they’ll wind up trying to visualize. And this isn’t even a data visualization issue as such, but it’s always we’re going to represent each one of these different things, five or 10 of them, in different shades of blue. Maybe there’s another color or million that you could use that would show a little bit more contrast. At some point I look at that and wonder if I’d suddenly gone colorblind. No, it’s just graphic design is hard for everyone.

Yulan: Yes. Yeah, and I think there’s also the sense of making something clear and easily readable might be at odds with some kind of sleek visual identity that certain infographics or reports want to attempt to follow. So, it’s this like, do you pick readability? Do you pick your brand palette? What if they’re at odds?

Corey: And that’s always a question. I dare not tread down that path. I found that it is best not to walk down into the den of corporate communications, and branding and, oh, no, no, no, no, you wound up not quite centering that, or the font isn’t quite right, throw it away, start over. And if you do it, again, you’re being censured. I may deal with big companies too much at this point in that context. So, changing gears slightly, you are a developer advocate. What exactly does that look like in your particular scenario? Very often, I’ll find that developer advocates spend the bulk of their time arguing with other developer advocates about what developer advocacy is.

Yulan: Yeah, that’s a great question. I will say that I think a definition most of my colleagues and peers can agree on is that we want technical practitioners to be successful with our products. Ultimately, if that happens, then I feel like I have succeeded. In my particular case, I think my goal is to build an ecosystem around this API. I want people to know what’s possible with it, and I want to help people solve problems with it. And so, to understand, you know, why they should care and also have a clear path to success once they figure out, “Oh, I want to build something.” So, it involves, for me, everything from talking to developers and understanding their use cases so that I can make sure their concerns are addressed as I write the documentation, or make videos, or give talks. And that tends to be the bulk of my work is just thinking through, how do I make somebody successful who thinks, or wants to build something using this API?

Corey: Do you find that the bulk of your developer advocacy work, it looks like blog posts, like one-on-one conversations with customers or developers in the community? Are you giving conference talks? Are you writing API examples and documentation-style stuff? Or other things entirely. There’s so many different expressions of the whole [inaudible 00:21:36] world that I learn something new every time I talk to someone who does this full time.

Yulan: Yeah, so the answer to your question is yes, I do most of those things if not all of them. It is a lot less speaking than I thought it would be. So, my time, at least right now, is spent split between a couple things. One is content creation. So, making sure the documentation is there, making sure there are examples, some blog posts, some social things. I also spend some time talking to developers and companies who are developing against this API. And the other thing I do is I am writing API examples and developer tooling that makes the developer experience easier. And as I’m writing these examples, I’m also collecting my feedback and other people’s feedback about the API and bringing it to our internal teams, and saying, “Here are things I think would help the future developer experience. These are ways I think we could make it easier,” and then trying to address it either from my end, or talking to our internal teams to see, can we solve this problem for future developers?

Corey: I remember back when we first met, you had given a talk at a conference and we wound up catching up at the event afterwards and got to talking. A few speakers started gathering together, and I think you were even asking me, “How can you start doing more conference talks as a part of a career path?” And I think the default response from everyone who’s done that was, “No, don’t do that. It’s awful. It’s drudgery, and misery, and horrible.” And, as I recall, you hadn’t gone through to that side yet and thought that it was going to be fun and amazing and worth doing. Where do you stand on public speaking now, now that you have found the job where that is part and parcel of what you do?

Yulan: There’s a couple things. One is, I still absolutely love public speaking. And I do wish I did it a little bit more because there is something, especially with small- to medium-size audiences, about being in front of people and sharing things, but also just reading and reacting to the energy of the room, and helping people understand something new, or hopefully learn about something that they hadn’t heard about before or thought about before. At the same time, I think there is a sense of, maybe not the talks themselves, but travel for conferences, has become less shiny to me, even though I actually do it less as a developer advocate than I did before. And part of it might just be because it’s part of my job, it seems less shiny to do it for fun. And I think part of it, too, is I think it’s different to nerd out as Yulan being a data nerd because data is super cool, versus when I’m representing a company or a product because there are just different considerations. And that’s not an aspect of it I had ever thought about.

Corey: Yeah, it’s always very interesting to me seeing how people’s evolution as they walked down the path of doing whatever it is that they’re involved in tends to modify itself and, I guess, express itself in different forms. It’s strange when you think of someone who has a background in data engineering, and the path that you talked about going down, the idea of, oh, and then pivoting to becoming a speaker and someone who helps other people understand these things, it’s always interesting seeing the different routes people take to get there. I mean, you see people who look an awful lot alike on stage sometimes, but the paths they took to get there are incredibly varied.

Yulan: Yeah. And it’s also, I think, that the same skills and things that people enjoy can be expressed in so many different ways throughout a career or throughout a job. And so, when I was not a developer advocate, I still loved helping to organize and speak at meetups because I just loved seeing people learn more and seeing knowledge sharing within the community. And also, I was really excited to just show people things I thought were cool. I’m still very excited to show people things I think are cool. There’s a part of it, too, where, when it’s my actual job, I’d have to think not only about, like, how do I tell people about something I think is cool, but also this question of, how am I good at telling people about that thing? And kind of content creation and technical communication is its own set of skills in addition to kind of having the technical knowledge of whatever it is I’m trying to communicate.

Corey: Do you have any advice for people who are looking to get started with data visualization to where they can go to learn more? How can people dip a toe in this water if it’s something that they’re unfamiliar with and want to learn more?

Yulan: Oh, so many places. One is I would just start looking for the places that are building charts that you respect and trying to figure out what you like about them. So, there’s several news outlets that are really good about that. And there’s also some independent data visualization experts, designers that I really respect, people like Shirley Wu and Nadieh Bremer (I hope I’m pronouncing her name right), and so, that’s one piece, to learn about the design aspect. And then technically, I would say, get started with either Python or JavaScript, and just get into a data set and figure out, how do I put something on a page? How do I make a chart? How do I explore this data? Don’t worry about the bells and whistles, that takes time, and that will come, but just try to figure out, what does that conversion from data to pixels on a page look like? And oh, also draw data visualizations on graph paper, because it’s a fantastic way to get an intuition for what you’re trying to map out and why.
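
The "conversion from data to pixels" Yulan mentions usually comes down to a scale function that maps a data range onto a pixel range. Here is a minimal sketch in plain Python that builds a bar chart as an SVG string with no charting library; the data, chart dimensions, and helper names are all hypothetical, chosen only to show the mapping.

```python
def linear_scale(domain_min, domain_max, range_min, range_max):
    """Return a function mapping [domain_min, domain_max] onto [range_min, range_max]."""
    span = domain_max - domain_min  # assumed non-zero in this sketch

    def scale(value):
        return range_min + (value - domain_min) / span * (range_max - range_min)

    return scale


# Hypothetical data: (label, value) pairs.
data = [("Mon", 12), ("Tue", 30), ("Wed", 18)]
WIDTH, HEIGHT, BAR_GAP = 300, 150, 10


def bar_chart_svg(data, width=WIDTH, height=HEIGHT, gap=BAR_GAP):
    """Render one <rect> per data point; bar height is proportional to the value."""
    max_value = max(value for _, value in data)
    to_height = linear_scale(0, max_value, 0, height)
    bar_width = (width - gap * (len(data) + 1)) / len(data)
    bars = []
    for i, (label, value) in enumerate(data):
        bar_height = to_height(value)
        x = gap + i * (bar_width + gap)
        y = height - bar_height  # SVG's y axis grows downward
        bars.append(
            f'<rect x="{x:.1f}" y="{y:.1f}" width="{bar_width:.1f}" '
            f'height="{bar_height:.1f}"><title>{label}: {value}</title></rect>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(bars)
        + "</svg>"
    )


svg = bar_chart_svg(data)
print(svg)
```

Saving the printed string to a `.svg` file and opening it in a browser shows three bars. Working out the `x`, `y`, and height of each bar by hand is close to the graph-paper exercise she recommends.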

Corey: That’s a great starting point that I think people will appreciate. If people want to learn more about what you’re doing and the various things that you have to say about a variety of topics, where can they find you?

Yulan: They can find me usually on Twitter, I’m @y3l2n, and I talk about all sorts of things from women-in-tech rants, to data visualization, to posting videos I make for work, so that’s a great place to look.

Corey: We will definitely throw a link to that in the show notes. Yulan, thank you so much for taking the time to speak with me today. I appreciate it.

Yulan: Yeah. Thanks for having me, this has been great.

Corey: Yulan Lin, developer advocate at Google, specifically on Google Data Studio. I’m Corey Quinn. This is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star rating in Apple Podcasts. If you’ve hated this podcast, please leave a five-star rating on Apple Podcasts and tell me what my problem is.

Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at ScreamingintheCloud.com or wherever fine snark is sold.

This has been a HumblePod production. Stay humble.