Hashing It Out

Episode 34 · 3 years ago

Hashing It Out #34: Storj - Shawn Wilkinson & JT Olio

ABOUT THIS EPISODE

File storage has long been a centralized affair. Storj seeks to change this with their decentralized file storage and retrieval architecture. We go over their design, engineering goals, and challenges, and dive deep into how erasure codes can mitigate replication requirements, how trust is managed on their decentralized platform, and how participants are incentivized to contribute their storage space and bandwidth to the network.

**Links:** https://storj.io/, https://github.com/Storj/

Welcome to Hashing It Out, a podcast where we talk to the tech innovators behind blockchain infrastructure and decentralized networks. We dive into the weeds to get at why and how people build this technology and the problems they face along the way. Come listen and learn from the best in the business so you can join their ranks. Episode thirty-five of Hashing It Out. I'm Dr. Corey Petty. Thirty-five? Is it not? What were we in the last one? Thirty-three, pretty sure. Thirty-four? Some episode thirty-something of Hashing It Out; we'll see which one it is when we put it through post-production. As always, I'm Dr. Corey Petty, and my trusty cohost is Collin. Say hello, everybody. Hello, everybody. And today we are with JT and Shawn from Storj, a decentralized file storage company. They've been coming out with a lot of really cool stuff lately in their newest release, and we wanted to get them on to talk generally about the difficulties of decentralized file storage, something Collin and I have always found fascinating and useful, as well as what their new features are and how they got there. So welcome to the show, guys. Do you want to give yourselves a quick introduction: who you are, how you got into the space, and what Storj is? Yeah, Shawn first. Hey, I'm Shawn Wilkinson, founder of Storj. I got involved in Bitcoin and cryptocurrencies in 2012 and fell in love, and throughout my journey I explored different parts of the technology and thought, hey, decentralized storage and Bitcoin and blockchain make sense together, and kind of went down that rabbit hole and never came out.
And at the end was Storj, which is a decentralized and distributed cloud storage platform that allows anyone to rent out their extra hard drive space, while on the other side we allow people to store data on this network. We're focused on developers, and they can do it a lot cheaper, a lot faster, and in a way that's more private and secure. So that's us, that's me. And I'm JT. I only joined Storj just this year, so I'm grateful to be able to join Shawn on this. Previously, I was one of the early engineers at Mozy, which was an online backup company, back before people really believed that cloud storage was even a viable option; we helped push the narrative there at Mozy. I also interned at Google. I worked at Space Monkey, which was a Kickstarted distributed storage platform with little home devices to compete with Dropbox, and that was sort of a six-year journey, after an acquisition, into some other distributed storage platforms, and then I joined Storj just this year. So I got into cryptocurrency early, but for the most part my experience with distributed storage actually has nothing to do with cryptocurrency; it's more on the academic side, through my graduate research. Cool, sounds very useful for this particular position. So I want to open up this interview with a question that I think is really important to clarify right off the bat. Most of my research in application development has involved IPFS, and I think it's important to differentiate what you're doing versus what IPFS does, because it is drastically different, but kind of in the same category of needs, if that makes sense. So could you please do that? Sure. So there are a lot of people in the space who have played around with IPFS, and it's worth highlighting the differences between the two.
They both allow you to store data, but to my understanding, with IPFS there's no guarantee that the data will still be around once you store it. That's very different on Storj: you can store the data and have guarantees that the data is going to be there and available for you. So that's probably the largest difference. I think of IPFS as more of a decentralized way to find and address files, but there's still a missing piece underneath, which is keeping the data around over time, which is what we provide. So maybe JT can expand on that, but I think that's probably one of the largest differences between the two. I do think it's a good point that IPFS and Storj are both decentralized storage, but pretty different. Yeah, I think that's a great point. Ultimately my position, just off the bat, is that I'm really hopeful for and interested in IPFS's success, and Filecoin's, even MaidSafe's. Certainly in this space, a rising tide raises all boats, and the more mainstream decentralized storage gets, the more uptake and adoption it's going to see in more professional settings. So I'm really excited about the whole space and doing what we can to help. But I do think that Storj is targeting a different market segment than IPFS and Filecoin, the IPFS extension. Storj's primary objective is to take the existing centralized customers of cloud storage and help move the needle a little bit more toward decentralized storage. So our primary focus right now is an S3-compatible cloud storage platform, and S3 compatibility actually imposes a lot of restrictions and makes a lot of design decisions for us. And of course, if you've used IPFS, you're familiar with the immutable hashes that files are stored with.
IPFS is something where you can pin files on certain computers to help contribute those files to the network; it's a little more community-driven, I think, in some sense. For us, what we're trying to do is this: people pay us to store their data, and so we need to turn around and make sure that data is reliable. There's no pinning, there's no mechanism for people to choose which files get more redundancy. It's just a matter of: we want to store the data with S3-level durability, S3-level performance, S3-level compatibility, and do it in a way where there isn't actually a data center involved. And so, yeah, we've made a lot of different design decisions. At the core, files are mutable; it's a path-based system where you can change things. We don't have hashes that identify files that you can address. And obviously one of the big issues right now with IPFS is, if you go look at the IPFS hashes subreddit, about half of them don't work, and I think part of that is just that it's kind of a volunteer platform. People are contributing their space and their uptime to IPFS as volunteers, and not every IPFS node is getting paid, right? Filecoin hopefully will help address some of that. But that's just the get-go for us: Storj storage node operators get paid. They get paid for good uptime, they get paid for reliability, they get paid for storing and returning data, and that's what we need. We need to incentivize people to do the right thing so that we can provide this S3-level type of storage. So yeah, kind of a different market segment. Ultimately, I think that if and when Filecoin launches, or MaidSafe gets their storage platform to a further level, that's going to be kind of a different storage product.
There are already differences within cloud storage. There's S3, there's Glacier, Google has their Nearline and Coldline storage, there's Wasabi, Backblaze. All of these have slightly different tradeoffs, and ultimately I think some of the other players in the decentralized storage market are going to be at the forefront of a new type of storage that I'm really excited about; it's just not what Storj is focusing on right now. Storj is focusing on helping bring decentralization to the existing exabytes and yottabytes of data that are ultimately going to be stored in cloud data centers unless someone does something. Okay. Well, before we go deeper, because we can easily and quickly go deeper into what you just said, I'd like to back up just a little bit and provide some context as to why there's a need for decentralization in storage: where the problems currently exist in how things are stored in data centers, how decentralization helps mitigate those problems, and what the tradeoffs are between the two. Yeah, so there are a lot of ways you can attack that. One way that people attack it is in terms of ideology. You have these large companies, Amazon, Microsoft, Google, that store the majority of the world's data, and we as users want a lot more privacy and security and control over our data. That's an interesting case for some people and users, but it doesn't really move the larger needle. At the end of the day, companies and people are spending roughly a hundred billion dollars on cloud storage every year, and that keeps going up every year. No one is saying, "I need less cloud storage, please." And there are a lot of problems associated with the traditional approach. You obviously want faster speeds to transit that data, you want to spend as little as possible, and you want security and privacy with it; you assume that. But the problem is, if you're building out these six-hundred-million-dollar data centers, that's a lot of capital you have to invest into building them out, and the way we access and use the data is not centralized. Everyone is not in rural Nevada, where that data center might be.
The Internet is a distributed and decentralized network, and so we really started from that point: hey, if we store this data on a distributed, decentralized Internet, the architecture just looks more like how people use and access that data, and then, as if by magic, you just start solving some of the problems. If you want to bring cost down, well, you don't have to buy that fire suppression system and that expensive heating and cooling and the parking lot of the data center; all those costs go away when you're renting out people's excess hard drive space, but they exist in the data center. You want it to be more performant? Well, if you're storing the data and it happens to be three blocks down rather than three states away, then you can get a lot better latency and performance out of that. And then, from a position of privacy and security, since we're storing data on many untrusted devices all over the world, we have to encrypt the data before it even touches the network. So it's taking a very fundamentally different approach to the problem, and what you find, again, is that since the way we access and use the Internet is decentralized and distributed, you end up solving a lot of these traditional problems that people have. So we've really focused on utilizing those extreme advantages and boiling them down into a simple package and a simple service that people can use. And so we can start attacking that large, hundred-billion-dollar market by saying: hey, you're most likely using Amazon S3 and paying this much, but if you take five minutes to replace a couple of lines in that config, we can save you half that amount of money with better performance.
That's a no-brainer, and so that's what we're really focusing on: using the advantages of the technology, in a simple enough package, to make a big impact on the traditional cloud market. So you mentioned S3 early on as kind of part of your solution. You're not depending on S3, though, so where can people store these files other than just S3? Are you competing with S3? What's going on there? Could you clarify that? Yeah, so I'll make a comment on that, and then JT will probably follow up with some additional useful information. It really started with V2 of the network. We've been around for a while and have launched many iterations of the network. Version two of the network, which we launched in early 2017, scaled up to about a hundred and fifty petabytes of data, and one petabyte is a thousand terabytes. We had about a hundred and fifty thousand people renting out their hard drive space across over a hundred countries. So, a huge network, and we learned a lot from it. But one of the things we realized and learned from actually having this live network in production is that we had libraries people could use to integrate with, but they were Storj-specialized libraries, so it might take someone hours or days to figure out how to integrate. If you look at the traditional market, most applications are using something like Amazon S3 to store their data. It's kind of a standard that everyone uses, for good or for bad. And so, looking toward the V3 network and what we really wanted to change and make better, we thought: hey, let's make this Amazon S3 compatible, not dependent. It's compatible with those APIs, so people who are building applications can literally spend a couple of minutes to get something working. And so we made that change, and it's really paid off so far. We've had partners and customers play around with the early version of the V3 Storj network, and they're just like, it takes a few minutes to integrate. That's what's really important, because if you look at the decentralized and distributed landscape, there's a lot of technology that's really impactful but hard to use. Try to get someone to set up a Bitcoin wallet or some of these things. It's certainly come a long way from the early days, but it's still very difficult and hard.
And so what we really wanted to say is: hey, we want to impact the traditional market and have a big impact, and the easiest way to do that is not making it hard for people; getting them going in minutes is important. So that's one design thesis we adopted that's very, very different from many of the other players in the space. So, to answer your question directly: with S3, we mimic the S3 API; we don't use S3. We are a drop-in replacement for S3, and that's a really good way to lower the barrier to entry for people who are already dependent on that standard, like you talked about, so they can switch an entire portion of their back end by changing just a few lines in the actual code. Yeah, maybe even just configuration, no code at all. That's, in my opinion, a really good way to get people to use your network versus others. It lets people intuit what's actually going on, or not even need to intuit what's going on, but still want the potential performance or cost benefits of using your network versus S3. I'm curious, though. Earlier, and I'm going to use IPFS as the example here because everyone's familiar with it, you said the hash you receive is based on the content you put forward, so you know that if the hash changes, the content changed, and you have guarantees about the data you're actually getting. Since you said files are mutable within your system, how do you give guarantees like that, if you can? Yeah, so this actually gets into a much wider discussion about our overall architecture and some of the decisions we've made.
Ultimately, some of the decisions we've made are potentially a little surprising, since the only thing we use a blockchain for is background settlement and payments. A lot of these decentralized cloud storage products spend a lot of time talking about their consensus protocol, or they do things like use a Merkle tree to construct the file system, or content-addressable storage, where they hash the data to make sure you know what data you're getting, because, honestly, these platforms are built in ways where you can't trust anybody. You just can't trust any computer you're storing any of the data on. We've made a slightly different, incremental decision. Our roadmap and our plan and our goal is to be a little more Promethean and take the fire down from Olympus to the masses of decentralization. We want to bring people steps closer instead of making it a huge leap. The reason I say it like that is because one of the things we've done is say: we can probably get a significant improvement in performance and a significant reduction in complexity by making some tradeoffs. One of them is: okay, there are computers that we want people to choose to trust within the system. By that, I mean there are three different actors in our system: there are storage nodes, there are satellites, and there are uplinks. We talk about this a lot in our recently released ninety-page white paper. That took a while to read. Yeah, and way too many months of our time to write. So: the storage nodes are untrusted. Those are the nodes that make up the vast majority of actors in our system; storage node operators provide their hard drive space and store data. Then there are satellites, and satellites are run potentially by you. You can run your own satellite as a customer of Storj, or you can use a satellite that someone you trust has set up. This is the tradeoff we made: we believe it's possible for people to still get most of the benefit of decentralized storage while being comfortable having an account on some specific server somewhere, or some specific set of servers. A satellite isn't necessarily a single server; for uptime, it could be a small cluster of servers. But the point is that it's a trust boundary: a satellite is a small trust boundary that you are comfortable giving some metadata to.
And so that's how we handle that. The white paper talks quite a bit about how one of the ways we get a significant increase in performance is by avoiding coordination where possible. There's a growing body of academic distributed systems research showing that avoiding coordination is one of the easiest things you can do to get your system to scale and continue to perform. What I mean by coordination is this: Bitcoin is a great example of coordination. Everything is coordinating, because everything has to agree on a single global ledger, so adding more computers doesn't get you more throughput. The number of miners in Bitcoin has increased significantly, but the number of transactions Bitcoin can process hasn't increased. On the flip side, the way Amazon has designed S3 is very coordination-avoidant, in the sense that it scales horizontally, significantly. Other things that scale horizontally are Cassandra, CockroachDB, or Google's Spanner.
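To make the coordination-avoidance idea concrete, here is a toy sketch, not Storj's actual mechanism. In the real design users choose a satellite they trust; the hash-based assignment and satellite names here are made up purely for illustration of how partitioning metadata into independent trust zones means no single operation ever needs a global ledger:

```python
# Toy sketch: each user's metadata lives entirely on one satellite
# (trust zone), so no two satellites ever need to agree on shared
# global state, and adding satellites adds capacity.
# Satellite names and the hash assignment are illustrative only.
import hashlib

satellites = ["satellite-a", "satellite-b", "satellite-c"]

def satellite_for(user_id: str) -> str:
    # Deterministic assignment standing in for "the satellite this
    # user chose": every operation for a user touches exactly one zone.
    digest = hashlib.sha256(user_id.encode()).digest()
    return satellites[int.from_bytes(digest[:8], "big") % len(satellites)]

# Independent per-satellite metadata stores; no cross-zone coordination
# is required to record any user's files.
metadata = {s: {} for s in satellites}
for user in ["alice", "bob", "carol"]:
    metadata[satellite_for(user)][user] = {"files": []}
```

Contrast this with a single global ledger, where every write from every user would have to pass through one agreement protocol; here, throughput grows simply by appending more entries to the `satellites` list.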
Spanner scales horizontally, and horizontal scaling requires that by adding more computers you get more throughput. The way you do that is that you can't have every operation go through a single global ledger. By having everyone choose their own satellite to store metadata on, we've partitioned the network into these little trust zones that allow us to avoid coordination, and we talk about that a lot in the white paper. And then, of course, one of the main things we want is for people to be able to use Storj, to access it directly, from very lightweight clients: their mobile phones, their web browsers, their desktops. So the uplink is the third actor in our system: an uplink coordinates directly with storage nodes, and with satellites for the metadata that you have. It's a little bit of a surprising design. It's maybe not what you'd expect if you've read about IPFS or Filecoin, but it actually ends up being almost exactly the overall design that a lot of other distributed storage systems that aren't decentralized use. I think people have different definitions of "distributed" and "decentralized," so I probably should use different words, but Lustre is a very well-known system in academic supercomputing. Lustre runs on something like fifty of the top one hundred fastest computers in the world. It's a storage platform that's distributed, and the overall architecture of Lustre has three components: there are clients, there are metadata servers, and there are storage nodes. So the design of Storj is very much inspired by systems like Lustre. It makes a lot of sense. Yeah, and just to put it into more concrete terms: JT comes with a lot of experience, both in the literature and hands-on, with many distributed and general storage systems, and we've run our V2 network, which outclasses pretty much all the other decentralized storage networks combined in terms of scale and magnitude, and we learned a lot from that. It's all about choosing a design that works and scales. I won't name any names, but there is a particular distributed storage platform that relies on consensus, and users have experienced issues where they're uploading a cat-picture-sized file, a couple of megs, and getting six hours of upload time. Now, you might have the nice ideological components in there, but if it takes six hours to upload a cat picture, that's not great. So we really wanted to come out with something that works and scales off the bat, that lets you get the performance you expect or better, and then over time we can make incremental improvements on that design to make it more trustless and more robust. So, for example, if immutability was really important to you, you could write a wrapper like IPFS on top of Storj and have your cake and eat it too. We really want to make sure there's a solid base layer to build upon that works well. So it sounds like you're focusing a lot on the user experience.
So maybe you could step us through the user experience of getting storage on the system for the three peer classes you've outlined in your white paper. If I wanted to be an uplink node, I guess "node" might be the appropriate term, I'm not certain, what would be my path? If I wanted to be a satellite, what would be my path? If I just want to be a storage client, what would be my path? And what's the incentive model around each one of those peer classes? It's a good question. For an uplink: an uplink is actually a peer class that we're really describing to match almost any application that uses Storj. We have an uplink service in our Storj repo right now, but we're working on releasing libuplink, which is just a library that processes can link against and use. We're going to re-release our V2 libstorj backed by libuplink, and all of our language bindings will be just libuplink, so anything that uses that library is, for the purposes of the white paper, called an uplink. So there's actually nothing to do to become an uplink node or an uplink peer; it's just something that naturally happens if you're using Storj. We do have an S3 gateway that allows a computer to pretend to be the S3 API; it serves the S3-compatible endpoints, and so you can run an uplink gateway that way. But there's not really anything anyone would need to do to be an uplink node; you wouldn't be an uplink node for someone else, if that makes sense. It's only if you're using the network. To be a storage node: that's actually our next release, coming up a little bit later this year, or, sorry, at the beginning of next year, I should say.
Our storage node release is our next big release, where people will be able to join. Currently we have a waitlist, and all the waitlist is is that we are currently using a certificate authority to sign certificates to join the network. We will strip that out eventually, but for now we want to make sure that the network grows at a non-bursty rate; we want to meter how quickly the network grows. But our next release is the storage node release. You'll join the network and create an identity, a long-lived identity that identifies you across multiple upgrades and reboots, and then you'll just configure it with where you want to get paid, and that storage node will start running and start advertising itself to the network as a recipient of storage. You'll be able to configure how much bandwidth and how much space you want it to use, and it'll kind of just do its thing; it's sort of set-it-and-forget-it. It is really important that storage node operators choose computers or servers that will have good uptime. If a node doesn't have high-quality uptime or high-quality availability, it will ultimately no longer get chosen for new data and will stop earning income. Running a satellite is probably the most intensive process, because the satellite does have a number of responsibilities. But running a satellite, hopefully at some point, will be as simple as installing our binary. We use Go for all of our programs, so these are all statically linked binaries that are easy to install. But if a Docker container is your jam, we'll have a Docker container that makes it easy to point at a little bit of configuration and a database and let it run. It will have a web admin console interface that you can poke around in and see how things are going, so running it will probably be a bit more like running some sort of service. You'll want it to have a domain name that people can access it over, and a few other things, but otherwise it should be as simple as a DNS entry, running a service, and pointing it at a database. That's it. So that's how you would be a satellite operator and how you would be a storage node operator. Why would you be a satellite operator? What is your incentive around that? Every operator gets paid. A storage node operator gets paid for the storage, but storage isn't the only thing that needs to happen in the network.
So the white paper talks a lot about repair and auditing. And one of the things we do, and I know you mentioned you'd like to talk more about erasure codes, there's actually a lot to talk about there. We've chosen erasure codes instead of replication, and there's actually a really deep argument for why that's critical and vital. Quick request: can you please define erasure codes for those listeners who don't understand them? Great point. So, briefly: when you store data, the question is, how do you deal with nodes disappearing? Let's say I store data on the Storj network, and some storage node loses power, or an asteroid hits it, or the storage node operator decides he hates us and leaves, or, there are a bunch of different potential outcomes, a storage node operator might just decide she's had it with the software and wants to uninstall and leave. So at any point we might lose data, and the question is: what do we do to make sure that all of the data people have given us can still be returned when they want it? Replication is a common choice: you just make more copies. It's kind of the most obvious thing to do; for any incoming data, you make three copies, five copies, ten copies. The V2 network, which, as Shawn pointed out, we learned a lot from, used a mixture of replication and erasure codes, and other systems use only replication, but our V3 system uses only erasure codes. What an erasure code is: instead of storing data as copies, we use a pretty interesting math trick, which I can explain in a second, to break the file up into multiple pieces. Let's say a file comes in and we break it into forty pieces; we only need any twenty of those pieces to recover the file. Any twenty: it could be the last twenty, the first twenty, every even piece, every odd piece, it doesn't matter.
Any twenty of those forty pieces will be able to recover the original file. So that's in a racier code and there's a number of different ratire Code Algorithms, but the most common one that's used and you know, making it so you can scratch your CD and it still plays music, satellite communication, bunch of stuff, is an algorithm called read Solomon, and so read Solomon kind of the way it works, just sort of like at a high level, is, you know, if you remember from math class in high school or Early College, if you if you do, you know any two points will identify a line right, and so if you put more points on that line, it's still the same line. So any two points will uniquely determine a line, regardless of what points where those points are on that line. And then the same way in a quite a quadratic equation, any three points will uniquely determine a quadratic equation, any four points will uniquely determine a cubic equation, and so the...

...same way, any twenty points will uniquely determine a degree-nineteen polynomial. And so we take the file, break it into twenty points, twenty math points, where we just treat them as numbers, because it's just ones and zeros after all. We take the file, we break it into twenty pieces, we treat those pieces as points on a degree-nineteen polynomial, and then we oversample the polynomial. We generate another twenty points on that polynomial. Now any twenty of those forty points will allow us to regenerate the original file. And so what we do is we actually do this on a kilobyte-by-kilobyte level, so we can stream data through the network. And so this is how we do video streaming and streaming storage of log files while we still get this property. So we choose some storage nodes to upload to, we take the file, we break it up into all these pieces using Reed-Solomon, and then we store it on those nodes in a way that now we can lose any twenty of those nodes and still recover it. We don't actually use twenty-of-forty currently. The numbers are actually something that, you know, we determine based on the durability and the current characteristics of the network. But that's kind of the idea behind erasure codes, and I guess the most important thing about erasure codes to point out is the durability of the data is much higher using erasure codes, and the reason for that is because, when we talk about replication, if you just make copies, let's say you take the data and you store it on a bunch of nodes. If you are only doing copies and you want to be able to survive, you know, four node failures, you have to have five copies, right? And so now what that means is there's five times as much hard drive space being used as the file that you care about, because there are five copies. With Reed-Solomon, with this twenty-of-forty scheme, each piece is one-twentieth of the file.
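The polynomial-oversampling trick described above can be sketched in a few lines. This is a toy over exact rationals, not the finite-field arithmetic a real Reed-Solomon implementation uses, and the function names are just for illustration:

```python
from fractions import Fraction

def interpolate(points, x):
    """Evaluate the unique polynomial through `points` at x (Lagrange form)."""
    total = Fraction(0)
    for xi, yi in points:
        term = Fraction(yi)
        for xj, _ in points:
            if xj != xi:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

def encode(data, n):
    """Treat the k data values as points (0, d0)..(k-1, dk-1) on a
    degree-(k-1) polynomial, then oversample it at x = k..n-1."""
    k = len(data)
    points = [(x, Fraction(y)) for x, y in enumerate(data)]
    extra = [(x, interpolate(points, x)) for x in range(k, n)]
    return points + extra

def decode(any_k_points, k):
    """Recover the original k data values from ANY k surviving points."""
    return [int(interpolate(any_k_points, x)) for x in range(k)]

pieces = encode([7, 3, 9], n=6)   # a 3-of-6 code: any 3 pieces recover the data
assert decode([pieces[1], pieces[4], pieces[5]], k=3) == [7, 3, 9]
assert decode([pieces[0], pieces[2], pieces[3]], k=3) == [7, 3, 9]
```

Production codes do the same interpolation over GF(2^8) rather than rationals, and, as described, apply it kilobyte by kilobyte so the stream property holds.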
So that actually means, since there are forty pieces each one-twentieth the size, it's only two times the space. So that's called the expansion factor, and the expansion factor is significantly less with erasure codes. So what that means is it uses way less bandwidth, it uses way less disk space, it makes it so that we can afford to pay more to storage node operators per byte, and it actually gets much higher durability than replication would. In short, it basically gives the same guarantees but more efficiently. Yes. That's a good summary for the listeners on that side. Reed-Solomon codes are old as hell. Yes, they're from the sixties, and there have been a lot of new erasure code systems out there. One of the more interesting ones that just came out of patents is tornado codes. And one of the downsides to Reed-Solomon, I think, is the patents, right? All right, well, the original fountain codes come out of patent in February, and I'm not sure about tornado codes. Tornado codes should be out of patent, if I recall from my research from last year. I should look into that again. Yeah, because, like, Reed-Solomon, its encoding time is tremendous. It's, like, I think it's O(n squared) to do encoding, or n log n with a good encoder, and to decode it's O(n squared), whereas tornado codes are pretty much just n log n. It's pretty quick. So, like, a sixteen-megabyte file would probably take, I don't know, thirty seconds to encode and maybe, like, thirteen to decode with Reed-Solomon, and it's, like, four seconds to one second with tornado codes. And so, like, the tradeoff there is, of course, the length of the erasure code is higher with tornado codes, but, like you said, it's still way less than replication and way less than doubling the size of the file. So I'm kind of interested in what made you go with the Reed-Solomon route rather than some of the more modern erasure codes that are out there. So I think that's a great question.
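The durability claim above is easy to check numerically. A minimal sketch, assuming independent piece loss with a made-up 10% per-node failure probability (the real network's parameters are tuned from measured churn):

```python
from math import comb

def survival(n, k, p_fail):
    """P(at least k of n independent pieces survive), each lost w.p. p_fail."""
    p_live = 1 - p_fail
    return sum(comb(n, i) * p_live**i * p_fail**(n - i) for i in range(k, n + 1))

# 5x replication: the file survives if at least 1 of 5 copies does.
replication = survival(5, 1, 0.10)   # expansion factor 5x
# 20-of-40 Reed-Solomon: survives if at least 20 of 40 pieces do.
erasure = survival(40, 20, 0.10)     # expansion factor only 2x

assert erasure > replication         # higher durability at less than half the space
```

So even with two and a half times less storage overhead, the 20-of-40 scheme is the more durable of the two, which is exactly the argument made above.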
For the most part, I've kind of just avoided fountain codes due to patent encumbrance, and kind of my personal philosophy is to not go read patents, because I would rather preserve plausible deniability. But in terms of Reed-Solomon, it is old. It could be faster, but it's not like it's slow. A good Reed-Solomon encoder, a good decoder, can encode at three hundred megabytes a second, which is usually faster than the links that you're talking on. So, you know, we think that overall the major performance requirements for us are actually in terms of bandwidth and throughput and latency, and so the...

...Reed-Solomon thing just isn't the lowest-hanging fruit, like, you know, replacing it with a better erasure code scheme. This one works. We're conveniently configured in such a way that it's easy to replace Reed-Solomon with something better as soon as it becomes our main bottleneck. But for now, I mean, like I said, three hundred megabytes a second is certainly not as fast as it could be, but it's not the main bottleneck for competing with S3. I hear that. Okay, so, just so the audience is clear, what is the maximal loss that you can suffer on a file and still be able to recover it using the erasure codes? It depends on your configuration with your erasure code. Right now I think our code base defaults to eighty-five, or maybe ninety, or something like that. It's not quite the same thing as a 2x expansion factor, it's a little more than 2x, obviously, but what that means is, if you have eighty-five pieces, you only need twenty-nine of them. So, what's that, eighty-five, twenty-nine, and the full erasure code? So, even though you need twenty-nine pieces, you also need the full erasure code in order to recover those twenty-nine pieces, correct? What do you mean by full erasure code? In other words, when you have the file itself and the erasure code is basically tagged onto the end of the file. The erasure-coded pieces are actually all you need. So when I talk about these pieces, you only need twenty-nine of the eighty-five pieces. Those twenty-nine pieces are the output of the erasure code function. So that is all you need. You only need those twenty-nine pieces to recover the original file. Is this configurable for the end user based on, like, their preferences, on how much they need their data available?
Yeah, so our intention is to make it so that people can choose their durability, and then we'll have an estimator for a given durability of what the best Reed-Solomon choices are. Now, the tradeoff for that is going to be price, because the more storage nodes we spread your data across, the more it's going to cost. But yeah, we do want to have a sliding scale where you can say, look, you know, I don't really care about durability so much, this is something that I'm only storing for ten days, so I don't want to pay a bunch for durability. Or you can say, I'm planning on saving this for, you know, the next fifty years, it's, you know, my kid's baby photos, so I want really high durability. So yeah, there are a number of different choices you can make on that spectrum. Cool. So let's talk about privacy then. So basically, you pitch that privacy is one of the key things, using a certificate authority to get people online. How are you handling encryption, and who handles that? Is it the uplink node that's doing all the encrypting? And what is your scheme for handling and delegating access to a file? Sure, great. So, right, the uplink is what does all of the encryption, and we have a really stringent policy, which is that none of the other systems have any access to any unencrypted data. So the satellite doesn't have access to any unencrypted data. The satellite only stores encrypted metadata, and it doesn't have the keys. If you lose your keys with Storj, you lose your data, and so that's kind of a weird user experience tradeoff for people who are used to password recovery screens. But we think on balance it's probably the better choice from a privacy perspective. The encryption scheme that we use is configurable. We default to AES-GCM, which is one of the newer authenticated encryption schemes.
I guess it's not terribly new, but honestly, my preference is actually one of the Daniel J. Bernstein encryption methods, which is the secretbox encryption method, which is Poly1305 and ChaCha encryption. And anyway, what you can do is you can choose your encryption key, you configure your uplink with your encryption settings, and then it encrypts all of the data before it ever gets Reed-Solomon encoded, before it ever gets sent anywhere, and so the only way you can retrieve the data is if you have that key with the uplink that's doing the retrieval. Because, and this is important, you know, we want to be able to support a lot of the functionality that S3 does, which is, like, you know, being able to share a file with someone. You might want to be able to share, you know, certain different delegated access patterns, right, and S3 has a bunch of rich permissions. So the goal...

...for us is to actually make it so that we have encryption that's hierarchical encryption. So our encryption system is based on, I think, BIP32 hierarchical encryption. I mean, Bitcoin, you know, hierarchical deterministic wallets, so that you can have a sub-wallet that you share with someone, but you only have a master key. We're doing the same kind of thing, but based on encrypted path names. So every path element, and by a path element, I mean, if you have, like, you know, music/Ariana Grande/thank u, next.mp3, right, the path elements are "music" and then "Ariana Grande" and then "thank u, next.mp3". Those are all separate path elements, and they're separated by slashes. And so the goal is, maybe I want to share with you just my music folder, but I don't want to share with you the full bucket. So for the encrypted path for the music folder, I would give you that prefix, and I would give you an encryption key that allows you to access a hierarchically encrypted key derivation scheme for just everything under that music folder. So there's an issue with BIP32 in terms of how you can use a specific subset of, like, subpaths to then regenerate parent paths. Is that an issue that can come up, like, maybe potentially exposing access to files you don't want people to have? Well, so, I mean, I guess what I'd say to that is we're not actually using that. Oh, you just wanted to give, like, an analogy about how things work. Right, our scheme is inspired by BIP32. Okay, so there have to be some differences. And so, ultimately, in terms of, like, okay, just general security vulnerabilities and stuff like that, we think that we've kind of got it covered, but that's really hard to say for certain. Yeah, you need a lot of, you need the wild exposure and people trying to break things before you have stronger guarantees around the security model and how you do things.
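The per-path-component derivation described above can be illustrated with plain HMAC. This is not Storj's actual scheme (which is BIP32-inspired and also encrypts the path names themselves), just a hypothetical sketch of the hierarchical idea: handing someone the key for a folder prefix lets them derive keys for everything under it, but nothing above it, because each step is one-way:

```python
import hashlib
import hmac

def child_key(parent_key: bytes, component: str) -> bytes:
    """One derivation step per path element; HMAC is one-way, so a child
    key reveals nothing about its parent."""
    return hmac.new(parent_key, component.encode(), hashlib.sha256).digest()

def key_for(root: bytes, path: str) -> bytes:
    """Walk the path one slash-separated component at a time."""
    key = root
    for component in path.split("/"):
        key = child_key(key, component)
    return key

root = hashlib.sha256(b"demo master key").digest()
music_key = key_for(root, "music")   # share only this to share the folder

# The recipient derives song keys under music/ without ever seeing the root:
song_key = child_key(child_key(music_key, "Ariana Grande"), "thank u, next.mp3")
assert song_key == key_for(root, "music/Ariana Grande/thank u, next.mp3")
```

The sub-wallet analogy maps directly: `root` is the master key, `music_key` is the shared sub-wallet, and derivation only ever flows downward.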
We need a lot of eyeballs, and we're also getting a security audit. So we're getting Least Authority, which, you know, is Zooko's company prior to Zcash, to do a security audit for us after we're a little bit farther along. Speaking about audits, your system does audit files, correct? So it makes sure that you have some sort of uptime guarantee and has some ability to guarantee the file being served is actually the file that's supposed to be served. Is that correct? Yes. If so, how do you do that? What is your scheme for verifying that a file is the right file, if you're not hash-addressing it, although you might be, I just don't know the details on that yet? And how do you know that the files are actually being served to anybody who requests them and not just the person who's doing the validating? That's a great question. I think, well, let me just start from the beginning. So that hierarchical encryption scheme, we are using authenticated encryption, which does both encryption and signing, right? So the data is written in such a way that, even though the file isn't named after a hash of the contents, we have a hash of the contents, so we know that the data is correct, because the hashes are included all the way through the encryption scheme. So that's how you know that the data that you're getting is right. In terms of, you know, auditing and repair, with our auditing and repair system we've made a number of different tradeoffs on both of those fronts. Auditing is written in such a way where our goal for auditing, and this is actually, I thought, kind of an interesting sort of tradeoff that we made. Most of the auditing systems that are written about for decentralized storage, and I remember Vitalik wrote a blog post about, you know, kind of proof of retrievability,
I think, where basically you store a Merkle tree of pregenerated challenges, and there are actually a number of different papers about proof of retrievability that include ways of generating challenges ahead of time, and then you have a finite number of challenges. There are also some homomorphic techniques that allow you to have an unlimited number of challenges. And what we're doing, actually, is, you know, we decided that our auditing system's primary goal, and this is kind of a big shortcut for performance reasons, we're actually not using our auditing system to find out if files are bad, and so that's actually a really interesting difference.

Most of these proof-of-retrievability schemes are trying to figure out if the file is retrievable, and they do it by random sampling. So it's assuming that you're not going to audit the entire file every time. What you're going to do is audit little, small ranges of the file, and probabilistically you can be pretty confident that the file is there, completely intact, because the storage node can't predict what your audit will be until it gets it, and so this sort of sampling kind of incentivizes the storage node to keep the entire file, because it can't predict what the audit's going to be about. And so it's the sampling process that ends up allowing us to be pretty confident, in a general proof-of-retrievability scheme, that the file is there without doing a lot of work. And so we're actually going one step further, and we're saying the question is, is the storage node good? Is the storage node playing by the rules? And so we have a list of all of the files that we believe a storage node has on the satellite. The satellite knows what files a storage node should have based on the satellite's metadata. So the satellite is going to consider all those files, and then consider all the ranges within those files, and then do random sample audits on randomly selected files. And so the goal of auditing is not to determine if files are bad. The goal of auditing is to determine if a node is bad, and if the node is no longer playing by the rules and keeping the data it's supposed to, then the node is penalized and ultimately ejected. So auditing actually doesn't care so much about data correctness. The thing that auditing cares about is catching bad nodes with basically just sampling, spot checking. The auditing system is just spot checking nodes to make sure each one is good. The thing that actually checks to make sure that the data is correct is our repair system.
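The spot-check idea, a challenger sampling a range the node could not have predicted, might look like this in miniature (the class, function names, and 64-byte challenge size are all hypothetical illustrations, not Storj's protocol):

```python
import hashlib
import random

class StorageNode:
    """Holds piece_id -> bytes; a cheater silently keeps only part of a piece."""
    def __init__(self, pieces):
        self.pieces = pieces

    def respond(self, piece_id, offset, length):
        data = self.pieces.get(piece_id, b"")
        return hashlib.sha256(data[offset:offset + length]).digest()

def audit(node, piece_id, original, offset, length=64):
    """One challenge: hash of a specific byte range, checked by the satellite."""
    expected = hashlib.sha256(original[offset:offset + length]).digest()
    return node.respond(piece_id, offset, length) == expected

def spot_check(node, piece_id, original, rng, rounds=10):
    """Random offsets each round: keeping only part of the piece gets caught
    with probability growing in the number of rounds."""
    return all(audit(node, piece_id, original,
                     rng.randrange(0, len(original) - 64))
               for _ in range(rounds))

piece = bytes(range(256)) * 8                  # a 2 KiB test piece
honest = StorageNode({"p1": piece})
cheater = StorageNode({"p1": piece[:1024]})    # quietly dropped the second half
assert audit(honest, "p1", piece, offset=1500)
assert not audit(cheater, "p1", piece, offset=1500)
assert spot_check(honest, "p1", piece, random.Random(0))
```

Note the satellite never transfers the piece during an audit, only a small hash, which is what makes cheap, frequent spot checks of node behavior possible.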
So our repair system, if it determines that a node is bad or a node has been offline too much, which is an interesting separation. We think that most of the data loss that's going to happen in our system is going to be due to nodes going offline, and not due to nodes being corrupted or mutating data. I'd imagine churn is much more of a problem than, like, standard data corruption. Exactly right. Churn is going to be our biggest problem by far, and so we have a very high incentive, a very strong motivation, to care a lot about churn more than anything else. And so our auditing and repair system is kind of all predicated around: we need to be able to quickly determine which nodes are online, and how long they've been online, and if they're likely to come back. And what happens is, if it turns out that, you know, ten of the forty nodes that store a specific file are offline, or have been marked bad by the spot-check auditing system, then we need to repair that file. And repairing the file is going to involve actually a Merkle tree of hashes, and so that's where we're actually able to confirm the file is correct. We have the pieces that we need, we can do the Reed-Solomon without doing any decryption, and recover the original pieces and place the missing pieces onto new storage nodes. So that's something that the satellite does to make sure the data is good. The last question you asked was how do we confirm the storage node is playing by the rules for uplinks as well as satellites, and I think that's an area where we still have a little bit more research and a little bit more to explore, but the plan so far, if that becomes a problem, is to just consider statistical reports from uplinks about storage nodes not playing by the rules.
The goal is, if you only need twenty-nine out of eighty-five pieces, it's going to be very easy to recover your file, and so the nodes that are, you know, malicious will be quickly detected by a bunch of uplinks complaining. I think what's interesting about the way that we've, you know, done this Reed-Solomon thing is it turns out we don't actually depend on any specific piece, and so what that means is that our speed now only depends on the fastest responders to any request. So when we do a download, we actually over-request pieces, more than we need, and then we're able to return the data to the user as soon as the fastest responses have come back. And so even though our storage nodes have highly variable performance, we actually get the best of the storage nodes, so that high variability we're able to turn into a big strength based on that architecture decision. So let's talk about some of the common attack vectors on kind of decentralized systems here. Like, what if somebody throws up a ton of nodes and gets good reputation on them and then starts manipulating the network? How do you mitigate some of these, like, Sybil attacks or something, or eclipse,...
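The "fastest responders win" download can be sketched as a tiny simulation. The latency numbers below are invented; the point is that over-requesting makes the completion time the k-th fastest response rather than the slowest node you asked:

```python
import heapq

def download_time(latencies_ms, k, over_request):
    """Ask `over_request` nodes concurrently for pieces; the download
    completes when the fastest k responses are in."""
    asked = latencies_ms[:over_request]
    return max(heapq.nsmallest(k, asked))   # time of the k-th fastest reply

# 30 reachable nodes holding pieces of a 20-of-40 file; one node has stalled.
latencies = [10 * i for i in range(1, 31)]
latencies[5] = 10_000                       # a 10-second straggler

assert download_time(latencies, k=20, over_request=30) == 210     # stall hidden
assert download_time(latencies, k=20, over_request=20) == 10_000  # stall blocks
```

Asking exactly k nodes makes the slowest of them the bottleneck; asking more than k turns node variability from a liability into tail-latency insurance, which is the architectural point being made.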

...like. These are all common things. What kind of threat models are you considering, which ones are still kind of in areas of research, and which ones do you think are not an issue? Yeah, so that's a great question. We, off the bat, decided that our nodes would actually do at the very least a proof of work to join the network. It's kind of a dumb side thing that we're just making nodes do. It simply removes a bunch of simple attacks, but it doesn't really do anything about a motivated attacker. So if someone has a bunch of, you know, time, months' worth of time, they can generate these IDs for joining the network, they join the network, they just kind of hang out on the network for a long time, and then start trying to manipulate the network. For our original design, that was actually kind of a problem. Yeah, you're attacking Amazon's business model, that's S3, kind of, a little bit here. So, like, they have the power, they have the ability, state actors do as well, if they don't like what you're doing for one reason or another. That's right. So the question really is, how do we handle that? It's not really something that we can, I mean, again, this is the same thing with Bitcoin. Bitcoin has the fifty-one percent attack, which is, you know, just something that hopefully is too expensive for anyone to do. That's kind of the best solution that you can have. Even then, it's identifiable by the whole network because of its public consensus model. You know, there's been a fifty-one percent attack on Bitcoin, and it was successful, and it lasted for a little while, and it didn't actually have any negative consequences. Nobody was able to capitalize on it. But in this particular system, they can actually attack your straight-up business model here. So, yeah, the question is, how much work would someone have to do to be able to attack our network?
And ultimately, a state actor is going to be able to, if they are motivated. There are a lot of attacks that are going to be very difficult to deal with, right? Okay, if we're storing eighty-five pieces of anyone's file, and, you know, looking at our statistics, storage nodes have a certain, you know, failure rate, there's a certain churn rate, but all of a sudden the entire Internet is shut down? Yeah, the data is not coming back until the network's back up, right? So there are a number of different things that certainly motivated, powerful actors can do. But so the question is, in terms of our threat model, we want to make sure that we are robust enough against the things that are within reach, I think, of most attackers. And so one of the things is, we were kind of vulnerable to this problem of someone spinning up a bunch of nodes in our early draft period of the white paper. We sent our white paper off to a bunch of reviewers, and one of our reviewers, graciously, was actually the author of Kademlia, Petar Maymounkov, and so he did a fantastic job reviewing our white paper and came up with an improvement, actually, which is part of our system with Kademlia. The goal is to make it so that the storage nodes have to provide, I mean, it's kind of easy to spin up some storage nodes that just sit on the network and then start messing with things. It's kind of hard to provide petabytes of hard drive space and then start messing with things, right? Like, that's actually a really expensive thing to do. If you're that motivated, to provide our network with petabytes of good hard drive space, then, yeah, that's kind of really hard to defend against.
So that's just sort of a problem in general with these decentralized storage platforms, and we believe that that's unlikely enough of a problem to not worry about too much, someone being able to, you know, supply petabytes of data to the network only to just mess around with people. And so we actually made a change in our architecture based on the feedback that we got, which was that there are actually two tiers with Kademlia, the DHT, that distributed hash table we use, and you can't participate in Kademlia until you've passed enough audits, until you've proven that you have hard drive space on that node. And so that's a way of making sure that this isn't just a Sybil attack. If that type of thing happens, we're immune to Sybil attacks, or immune to these types of things, because it's actually really expensive to get a node that has enough reputation in the network to be able to end up manipulating or denying service to someone with their data. And it's also important to get, you know, the scale and get as large as possible, because as the network grows larger, it becomes much more difficult to pull off these attacks. Definitely. And so I guess the question then is, there are...
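The join-time proof of work mentioned above can be sketched as grinding for a node ID whose hash has enough leading zero bits. The difficulty value and function names here are hypothetical, not Storj's actual identity scheme:

```python
import hashlib
import itertools

DIFFICULTY = 12  # leading zero bits required of a node ID (hypothetical)

def node_id(public_key: bytes, nonce: int) -> bytes:
    return hashlib.sha256(public_key + nonce.to_bytes(8, "big")).digest()

def leading_zero_bits(digest: bytes) -> int:
    return 256 - int.from_bytes(digest, "big").bit_length()

def mine_node_id(public_key: bytes):
    """Grind nonces until the ID meets the difficulty: joining once is cheap,
    but minting thousands of Sybil identities costs real compute."""
    for nonce in itertools.count():
        digest = node_id(public_key, nonce)
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce, digest

nonce, nid = mine_node_id(b"example public key")
assert leading_zero_bits(nid) >= DIFFICULTY
```

At this difficulty, minting one identity costs on the order of 2^12 hashes while verifying it costs one, which is the asymmetry that deters bulk identity creation without burdening honest operators.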

...certain major systems, so you say, like, replacing data centers, six hundred computers, whatever. But, like, most of those systems that require that kind of storage space also require a certain amount of guarantee and a certain amount of security against attacks, like we just said, to an extremely high, five-nines degree. Do you feel like Storj will be a direct replacement for them, or is this more tuned towards a casual small business audience? Small and mid-sized, I think. I think this is an observation just sort of from my experience at Mozy. The hardest part at Mozy initially was, so, Mozy was an online backup company, and it was one of the first online backup companies that said, we're going to store your most important data in the cloud, and so it was actually really challenging to convince companies that what they wanted to do is take their, you know, their most important data and store it off premise. You know, this was back in the time when, like, Iron Mountain and some of these, like, huge backup companies, you know, would bring tapes and then they'd, like, truck your tapes off to make your backups, just because people really cared about the backups, and they weren't really sure if they should trust someone to just store their, you know, your backups on someone else's servers. Is that safe to do? Is that good? And kind of Mozy's whole premise was, yes, it's fine, encryption is good, we can store the data. You've got to trust somebody, like you're trusting your tape provider. And so, ultimately, it was kind of a fight, and it's kind of laughable now thinking about how every business has moved to the cloud. Everything's running on Google Cloud Platform or Azure or AWS. It's been such a dramatic change in the last ten years in terms of what people are willing to accept and willing to adopt. And so, ultimately, for Storj, you know, this shift to decentralization.
Off the bat, it's not going to be something that we'll be able to convince everyone is a good idea. We're definitely going to be talking with early adopters at the beginning, but the goal is to make something that is high-quality enough and performant enough and reliable enough and secure enough that people will, you know, it'll turn heads, and people will start to really open up to the idea of decentralized storage and not say the sorts of things that, you know, I hear all the time, like, oh wait, so you're going to just store my data under my neighbor's bed? Well, it's encrypted, right? And so it's the same kind of conversation that, you know, Mozy had ten years ago. This is a new paradigm, the paradigm has enough benefits that we should consider it, and it really just is going to take some people jumping in. And again, it just goes right back to what Shawn said, which is, the larger the network, the better this is going to be. I'd be curious about, so part of your early adoption, as always with brand-new technology that distributes things, is going to be some type of illicit media. How do you approach that subject? Everything's encrypted, of course, but are you breaking up the files, like, do you store all of the erasure codes in a single place, or is it distributed across people, so that a file isn't completely stored on a single node, so that people who are using your service, just, like, hosting Storj's media, basically can't ever be complicit in any type of storage of illicit media? Well, this is a great question. Yeah, we got this question all the time at Mozy, actually, whether or not we were storing, you know, what we were storing. We got this question at Space Monkey a lot.
Yeah, it basically comes down to, you know, saying we use, you know, the best encryption possible to protect your privacy with your text documents and your baby photos and, you know, the sorts of things that are important to you. It cuts both ways. Us saying that we can't access your data means we can't access your data, and that means that we don't know what people are storing. In terms of, you know, where a file is stored, because of the Reed-Solomon, no, there isn't a place where the file is stored in its entirety anywhere. It's broken up with this mathematical trick after being encrypted, and then stored in little tiny pieces across all these nodes. I think it was either Shawn or Ben or our CEO who said that, I guess I don't remember, but someone said, what we're doing is we're making encrypted sand and spreading it across the network. What even is a file? It's just, it's all math, and from this math you're able to get back your...

...data. Yeah, I think John said that one, John Quinn. Okay, cool. He loves that analogy. But yeah, I mean, there are many different aspects to that. But we really want to be a platform that gives users, you know, control over their data, and where, you know, the data is encrypted, the users have the keys, and where, you know, if we have a breach or get hacked, it doesn't matter, because we don't have access to any of the data, which is very different from many other, you know, cloud storage providers, where someone could get access to that data, whether it's just some malicious entity or, you know, a government. I mean, we know the stories of how the FBI wanted, you know, backdoors and data from Apple and other companies. And, you know, we take a very different approach, right? We want to have a secure platform in this space. We, you know, open-source all our code so you can verify that there's nothing fishy going on in there, and we just kind of want to be the neutral, you know, Switzerland of data providers that, you know, does our best to protect you and your data, and, like JT said, you know, sometimes that cuts both ways. But, you know, we really want to be on the side of our users and their customers first. Yeah, that's great. So, speaking of users, and I know we're running a little long here, but I actually have two more questions I'd like to bring up. We can be brief about this. How do you see people integrating your system with decentralized applications, specifically Ethereum smart contracts? Ooh, that's a good question. So, like we said before, you know, our focus right now is more in terms of the traditional cloud platforms, you know, that already exist and people use at scale.
But we do have many other people that are, you know, in the build stages of their technology, that are looking to use this as kind of a distributed data platform, like Doc AI is one, Son Am is another, and so these are, you know, still in the very early stages of their development process, but a lot of people are looking to use us as kind of that data layer for applications. I've seen, you know, a couple of examples of people who have built distributed applications, adding air quotes here, and, you know, the application will be running and then suddenly, which, you know, happens every couple of years, you know, Amazon S3 will go down and take a quarter of the Internet with it, and that application as well, and everyone's just like, oh, why did this decentralized and distributed application, you know, go down? And then you realize they were just storing all of the data on a centralized, you know, cloud storage provider. So we really want people to use the platform for kind of these distributed, decentralized applications. But still, we're very early in terms of development, so a lot of people are integrating and learning the toolsets and building their platforms, but I think it's very early on in the stage of development. Yeah, so specifically, one of the useful things about IPFS is it's content-addressed, and I can fit that content address within a 256-bit register in the EVM. Is there an analogous thing that I could do with that on Storj? Yeah, so if, hypothetically, you know, in terms of that example, since there are a lot of integrations of IPFS already, you could just essentially use Storj as a back end for IPFS, and then you would be able to reuse some of those same integrations, and you'd have some guarantees that the files would be available in there, and they would just be stored on Storj.
So you can use kind of existing layers to essentially, you know, have your cake and eat it too with it. Yeah, because IPFS is becoming kind of like an open standard almost, de facto I guess you could call it, IPNS specifically. So something like that wrapper, like you said earlier, would be extremely valuable to anybody, because really what you are is a back end for the storage mechanism, and the addressing system should be independent of that to some degree. Yeah, well, we have, if people are...

...interested in contributing to our platform, some of our open tickets are descriptions of the IPFS gateway that we have planned. It's not on our immediate road map for our next release, but an IPFS gateway, in a number of different directions actually, is something that we are very interested in doing. Yeah, I mean, that's the cool thing about, you know, distributed systems and open source, right? You have all these, you know, cool projects in the space like IPFS or Filecoin, and, you know, we're really trying to make a dent, you know, a bit for ideological reasons, in the large cloud providers, Amazon, Google, Microsoft, and we're kind of all idealistically aligned to, you know, store people's data in a better way, a more secure way, a more private way, a more performant way. But, you know, all this code is open source, and so we can write, you know, integrations and tools together to really have a unified front in code against these, you know, large providers, who, you know, exist and people use, but maybe don't have all the users' interests at heart and the focus on the users that the platforms we and others in the space are trying to build really do. So I guess the last thing: we've kind of skirted past this issue, and you've mentioned it a few times, and I feel like it's something that does deserve at least a few sentences. You say people get paid. What does that mean? So if I'm using your service, I'm getting paid for putting up a node. What is your model for that? Just clarify that for the audience. You know, am I going to take this one, or do you want to take it, JT? No, go ahead, that's what you've been working on.
Yeah, so the basic, you know, premise that makes this work is people have extra, you know, hard drive space on their computer that they're not using, and they, you know, already have a stable Internet connection, and so they can, you know, download this program, take a few minutes to set it up, just let it run in the background, and earn money over time for storing that data and serving that data when it's requested. So users are essentially getting paid in our native token, STORJ, for essentially providing, you know, hard drive space and bandwidth. So it's kind of variable what they would earn, based on, you know, the stability and uptime of their node, how much data they're storing, how much bandwidth they're providing. But, you know, we're providing a direct incentive for people to rent out their hard drive space and actually have some benefit from, you know, participating in this network, which is ultimately what makes things work, and so that's really important. This is a departure from some, you know, other cloud storage platforms, peer-to-peer cloud storage platforms in the past, where people have said, okay, you know, if you share out your space, we'll give you more space on the network. And that's just kind of, you have apples, and you trade them to see more apples. You know, you don't come out any better. But if you can, you know, take those apples and trade them for something of tangible value, then, you know, it works out a lot better. So it's highly variable in terms of what you really can make, based on kind of your setup and your node, but at the end of the day, it's going to be a market for people to participate in. I think that's a good place to wrap this episode up. As always, I'd like to ask our guests: is there something that we should have asked you, or you hoped we would ask you, that we didn't get around to?
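The payout variables Shawn lists, data stored, bandwidth served, and node uptime, can be sketched as a back-of-the-envelope estimator. The rates below are entirely hypothetical placeholders, not Storj's actual compensation schedule; the point is only that earnings scale with all three factors.

```python
def estimate_monthly_payout(stored_gb: float, egress_gb: float,
                            uptime_fraction: float,
                            storage_rate: float = 0.0015,
                            egress_rate: float = 0.02) -> float:
    """Rough node-operator payout estimate.

    storage_rate and egress_rate are hypothetical per-GB rates in
    token-equivalent units; real rates are set by the network/market.
    """
    base = stored_gb * storage_rate + egress_gb * egress_rate
    # Poor uptime proportionally reduces what a node can earn.
    return base * uptime_fraction

# E.g. a node holding 500 GB, serving 100 GB of egress, at 99% uptime:
payout = estimate_monthly_payout(stored_gb=500, egress_gb=100,
                                 uptime_fraction=0.99)
```

In practice an unreliable node would likely lose more than linearly, since the network would route data away from it, but the sketch captures why "set it and let it run" stability matters to operators.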
No, I mean, personally, I thought this was a great conversation. Wonderful questions, very insightful. I appreciated the conversation. Yeah, that's great. One thing I'd really recommend to your listeners: go to storj.io, that's S-T-O-R-J dot I-O, and take a look at the white paper, become a storage node operator, and play around with our tools and libraries, or maybe make some contributions towards that IPFS gateway we talked about. So there's plenty of stuff for people to look around at and do if you didn't get enough during this conversation. I'll certainly put those links in our show notes, and I'm happy to have y'all on, as always. If you...

...liked this episode, audience, please link it, share it on Twitter, share it on whatever media platform you enjoy, contact us, join the Slack and tell us what you liked, what you didn't like, anything. Just share it as much as possible so people can understand a little bit more about how distributed storage or decentralized storage works, why it's useful, and how Storj is solving that. Thanks, guys. Yeah, and maybe you guys will be storing your data on it soon. Yeah. Also, as always, follow us on Twitter at hashingitout, and Corey is at corpetty and I am at CollinCusce. That's C-O-L-L-I-N C-U-S-C-E. Thanks, guys. It was great. Great episode.
